<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Sébastien Han]]></title>
  <link href="http://sebastien-han.fr/atom.xml" rel="self"/>
  <link href="http://sebastien-han.fr/"/>
  <updated>2013-05-13T11:51:23+02:00</updated>
  <id>http://sebastien-han.fr/</id>
  <author>
    <name><![CDATA[Sébastien Han]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Deploy a Ceph MDS server]]></title>
    <link href="http://sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/"/>
    <updated>2013-05-13T16:15:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server</id>
    <content type="html"><![CDATA[<p>How to quickly deploy an MDS server.</p>

<!--more -->


<p>Assuming that <code>/var/lib/ceph/mds/mds.$id</code> is the MDS data directory (so <code>/var/lib/ceph/mds/mds.0</code> for the daemon <code>mds.0</code>).</p>

<p>Edit <code>ceph.conf</code> and add an MDS section like so:</p>

<pre><code>[mds]
  mds data = /var/lib/ceph/mds/mds.$id
  keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring

[mds.0]
  host = {hostname}
</code></pre>
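
<p>To make the <code>$id</code> metavariable more concrete, here is a minimal Python sketch that renders the sections above for one daemon and expands <code>$id</code> by hand. The function names are made up for this illustration, and the hostname <code>ceph</code> is simply the one that shows up in the start-up output later in this post; Ceph itself performs the expansion when it reads the file.</p>

```python
def mds_conf_section(mds_id: str, hostname: str) -> str:
    """Render the [mds] and [mds.ID] sections shown above for one daemon."""
    lines = [
        "[mds]",
        "  mds data = /var/lib/ceph/mds/mds.$id",
        "  keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring",
        "",
        f"[mds.{mds_id}]",
        f"  host = {hostname}",
    ]
    return "\n".join(lines) + "\n"


def expand_id(template: str, mds_id: str) -> str:
    """Substitute the $id metavariable with the daemon ID (illustration only)."""
    return template.replace("$id", mds_id)


print(mds_conf_section("0", "ceph"))
# Path where the daemon mds.0 will look for its keyring.
print(expand_id("/var/lib/ceph/mds/mds.$id/mds.$id.keyring", "0"))
```

<p>The second <code>print</code> gives the path the keyring has to be written to in the next step.</p>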

<p>Create the authentication key (<strong>only if you use cephX</strong>):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo ceph auth get-or-create mds.0 mds <span class="s1">&#39;allow&#39;</span> osd <span class="s1">&#39;allow *&#39;</span> mon <span class="s1">&#39;allow rwx&#39;</span> &gt; /var/lib/ceph/mds/mds.0/mds.0.keyring
</span></code></pre></td></tr></table></div></figure>


<p>Finally, start the service:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo service ceph start mds.0
</span><span class='line'><span class="o">===</span> mds.0 <span class="o">===</span>
</span><span class='line'>Starting Ceph mds.0 on ceph...
</span><span class='line'>starting mds.0 at :/0
</span></code></pre></td></tr></table></div></figure>


<p>Check the status of the cluster:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph -s
</span><span class='line'>   health HEALTH_OK
</span><span class='line'>   monmap e3: 1 mons at <span class="o">{</span><span class="nv">1</span><span class="o">=</span>192.168.251.100:6790/0<span class="o">}</span>, election epoch 1, quorum 0 1
</span><span class='line'>   osdmap e318: 2 osds: 2 up, 2 in
</span><span class='line'>    pgmap v8214: 280 pgs: 280 active+clean; 3818 MB data, 9545 MB used, 10432 MB / 19978 MB avail
</span><span class='line'>   mdsmap e70: 1/1/1 up <span class="o">{</span><span class="nv">0</span><span class="o">=</span><span class="nv">0</span><span class="o">=</span>up:active<span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Note that if you add more MDSs, the extra daemons appear as standby, like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph -s
</span><span class='line'>   health HEALTH_OK
</span><span class='line'>   monmap e3: 1 mons at <span class="o">{</span><span class="nv">1</span><span class="o">=</span>192.168.251.100:6790/0<span class="o">}</span>, election epoch 1, quorum 0 1
</span><span class='line'>   osdmap e318: 2 osds: 2 up, 2 in
</span><span class='line'>    pgmap v8214: 280 pgs: 280 active+clean; 3818 MB data, 9545 MB used, 10432 MB / 19978 MB avail
</span><span class='line'>   mdsmap e70: 1/1/1 up <span class="o">{</span><span class="nv">0</span><span class="o">=</span><span class="nv">0</span><span class="o">=</span>up:active<span class="o">}</span>, 1 up:standby
</span></code></pre></td></tr></table></div></figure>
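
<p>If you want to check the MDS state from a script, the <code>mdsmap</code> line is easy to pick apart. The following Python sketch is only an illustration: the function name is mine, and it assumes the line format shown above, where each entry between the braces reads as rank, name and state joined by <code>=</code> signs.</p>

```python
import re


def parse_mdsmap(line: str) -> dict:
    """Extract the epoch, the active entries and the standby count from an mdsmap line."""
    epoch = int(re.search(r"mdsmap e(\d+):", line).group(1))
    # Entries such as "0=0=up:active" sit between the braces, separated by commas.
    inside = re.search(r"\{(.*?)\}", line)
    entries = inside.group(1).split(",") if inside else []
    active = [entry for entry in entries if entry.endswith("up:active")]
    # The standby part is only printed when at least one standby daemon exists.
    standby = re.search(r"(\d+) up:standby", line)
    return {
        "epoch": epoch,
        "active": active,
        "standby": int(standby.group(1)) if standby else 0,
    }


print(parse_mdsmap("   mdsmap e70: 1/1/1 up {0=0=up:active}, 1 up:standby"))
```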




<br />


<blockquote><p>Easy, isn&#8217;t it? FYI, filesystem metadata live in the RADOS cluster, so MDS servers are quite ephemeral daemons. Don&#8217;t be surprised if you don&#8217;t find anything (except the MDS key) inside the mds data directory.</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Use existing RBD images and put it into Glance]]></title>
    <link href="http://sebastien-han.fr/blog/2013/05/07/use-existing-rbd-images-and-put-it-into-glance/"/>
    <updated>2013-05-07T17:15:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/05/07/use-existing-rbd-images-and-put-it-into-glance</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/glance-location-rbd.jpg" title="Use existing RBD images and put it into Glance" ></p>

<p>The title of this article is not very explicit; I actually had trouble finding a proper one, so let me clarify a bit. Here is the context: I was wondering whether Glance was capable of converting images within its store. The quick answer is no, but I think such a feature is worth implementing: Glance could, for instance, convert a QCOW2 image to RAW. Today, if you already have an image in, let&#8217;s say, a Ceph cluster (RBD), you have to download it (since you probably don&#8217;t have the source image file anymore), convert it manually with qemu-img (QCOW2 to RAW) and finally import it into Glance. Enough talk about this; I&#8217;ll address it in a future article. For now, let&#8217;s stick to the first matter. Imagine that you have a KVM cluster backed by a Ceph cluster and your CTO wants you to migrate the whole environment to OpenStack because it&#8217;s trendy (joking, OpenStack just rocks!). You&#8217;re not going to back up all your images and then build a new cluster or something like that; you want OpenStack (Glance) to be aware of your Ceph cluster. Generally speaking, you <em>just</em> have to connect Glance to one of your image pools. After this, the only thing left to do is to create the images in Glance (it&#8217;s more a matter of registering the image IDs and metadata than of creating new images). No worries, here&#8217;s the explanation. Longest introduction ever.</p>

<!--more-->


<p>In this article, I&#8217;m assuming that Glance is already connected to Ceph and to the proper RBD pool. Before starting anything, please understand that <strong>within the current Grizzly stable branch, registering an existing image by its <code>rbd://</code> location is not implemented</strong>. Funnily enough, it does not take much to implement it. The bug report is on <a href="https://bugs.launchpad.net/glance/+bug/1176994">launchpad</a> and the proposed feature is under review on <a href="https://review.openstack.org/#/c/28325/">Gerrit</a>.</p>

<p>However, if you want to apply the fix now:</p>

<ul>
<li>Go to line <strong>278</strong> of <code>/opt/stack/glance/glance/api/v1/images.py</code></li>
<li>Then simply edit the line like so:</li>
</ul>


<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='python'><span class='line'><span class="k">for</span> <span class="n">scheme</span> <span class="ow">in</span> <span class="p">[</span><span class="s">&#39;s3&#39;</span><span class="p">,</span> <span class="s">&#39;swift&#39;</span><span class="p">,</span> <span class="s">&#39;http&#39;</span><span class="p">,</span> <span class="s">&#39;rbd&#39;</span><span class="p">]:</span>
</span></code></pre></td></tr></table></div></figure>


<h2>Let&#8217;s test this!</h2>

<p>Get the image size from the rbd client:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>rbd -p images info ubuntu-raw
</span><span class='line'>rbd image <span class="s1">&#39;ubuntu-raw&#39;</span>:
</span><span class='line'>size 2048 MB in 512 objects
</span><span class='line'>order 22 <span class="o">(</span>4096 KB objects<span class="o">)</span>
</span><span class='line'>block_name_prefix: rb.0.3ded.2eb141f2
</span><span class='line'>format: 1
</span></code></pre></td></tr></table></div></figure>
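
<p>Glance expects the size in bytes, while rbd prints it in MB, so a small conversion is needed before registering the image. Here is a minimal Python sketch of that arithmetic; the function name is made up for the example and it assumes the <code>size N MB</code> wording shown in the output above.</p>

```python
import re


def rbd_size_bytes(info_output: str) -> int:
    """Return the image size in bytes from `rbd info` text that reports the size in MB."""
    match = re.search(r"size (\d+) MB", info_output)
    if match is None:
        raise ValueError("no 'size N MB' line found")
    # rbd reports binary megabytes here: 1 MB = 1024 * 1024 bytes.
    return int(match.group(1)) * 1024 * 1024


sample = "rbd image 'ubuntu-raw':\nsize 2048 MB in 512 objects\norder 22 (4096 KB objects)\n"
print(rbd_size_bytes(sample))  # 2147483648
```

<p>As a sanity check, 512 objects of 4096 KB each also add up to 2048 MB.</p>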


<p>Finally, create/register the new image:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>glance image-create --size 2147483648 --name ubuntu-rbd-hack --store rbd --disk-format raw --container-format ovf --location rbd://ubuntu-raw
</span><span class='line'>+------------------+--------------------------------------+
</span><span class='line'>| Property         | Value                                |
</span><span class='line'>+------------------+--------------------------------------+
</span><span class='line'>| checksum         | None                                 |
</span><span class='line'>| container_format | ovf                                  |
</span><span class='line'>| created_at       | 2013-05-06T15:29:26                  |
</span><span class='line'>| deleted          | False                                |
</span><span class='line'>| deleted_at       | None                                 |
</span><span class='line'>| disk_format      | raw                                  |
</span><span class='line'>| id               | 0d47c421-b079-44ff-bcc5-ee711d500512 |
</span><span class='line'>| is_public        | False                                |
</span><span class='line'>| min_disk         | 0                                    |
</span><span class='line'>| min_ram          | 0                                    |
</span><span class='line'>| name             | ubuntu-rbd-hack                      |
</span><span class='line'>| owner            | 19292b3b597b4ecc9a41103cc312a42f     |
</span><span class='line'>| protected        | False                                |
</span><span class='line'>| size             | 2147483648                           |
</span><span class='line'>| status           | active                               |
</span><span class='line'>| updated_at       | 2013-05-06T15:29:26                  |
</span><span class='line'>+------------------+--------------------------------------+
</span></code></pre></td></tr></table></div></figure>


<p><span class="text_quote">R </span>A note about the URI passed to the <code>--location</code> option: there are 2 ways to build it. It can be:</p>

<ul>
<li><code>rbd://&lt;fsid&gt;/&lt;pool&gt;/&lt;image&gt;/&lt;snapshot&gt;</code></li>
<li><code>rbd://&lt;image-name&gt;</code> ; Glance will figure out the pool since it is set in the Glance configuration.</li>
</ul>


<p><strong>The URI has either 1 or 4 field(s).</strong></p>
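
<p>To illustrate the 1-or-4 fields rule, here is a small Python sketch that splits such a URI. This is not Glance's own parser, just a hypothetical helper written for this post; the <code>FSID</code> and <code>snap</code> values in the usage line are placeholders.</p>

```python
def parse_rbd_location(uri: str) -> dict:
    """Split an rbd:// location into its fields; only 1 or 4 fields are accepted."""
    prefix = "rbd://"
    if not uri.startswith(prefix):
        raise ValueError("not an rbd:// URI")
    fields = uri[len(prefix):].split("/")
    if "" in fields:
        raise ValueError("empty field in URI")
    if len(fields) == 1:
        # Short form: only the image name, the pool comes from the Glance configuration.
        return {"fsid": None, "pool": None, "image": fields[0], "snapshot": None}
    if len(fields) == 4:
        return dict(zip(("fsid", "pool", "image", "snapshot"), fields))
    raise ValueError("expected 1 or 4 fields")


print(parse_rbd_location("rbd://ubuntu-raw"))
print(parse_rbd_location("rbd://FSID/images/ubuntu-raw/snap"))
```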

<br />


<blockquote><p>Of course, the example only used one image, but the method will definitely work for a whole Ceph cluster with tons of images!</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[HA from DevOops side: OpenStack summit video]]></title>
    <link href="http://sebastien-han.fr/blog/2013/05/06/ha-from-devoops-side-openstack-summit-video/"/>
    <updated>2013-05-06T16:52:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/05/06/ha-from-devoops-side-openstack-summit-video</id>
    <content type="html"><![CDATA[<p>It&#8217;s a bit late, but I&#8217;m happy to share our talk (with <a href="http://my1.fr/">Emilien Macchi</a>) at the OpenStack Summit. OK, that was my first talk, so please be gentle ^^. In the meantime, here is the video. In this presentation, we shared 2 HA reference architectures for OpenStack.</p>

<div class="embed-video-container"><iframe src="http://www.youtube.com/embed/HJaLvid0X9U "></iframe></div>


<p>Slides are available on <a href="http://www.slideshare.net/enovance/summit-portland-ha-from-dev-ops-side">Slideshare</a></p>

<iframe src="http://www.slideshare.net/slideshow/embed_code/19054325 " width="595" height="446" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen></iframe>


<p></p>

<br />


<blockquote><p>See you in Hong Kong! Cheers!</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ceph and Cinder multi-backend]]></title>
    <link href="http://sebastien-han.fr/blog/2013/04/25/ceph-and-cinder-multi-backend/"/>
    <updated>2013-04-25T12:03:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/04/25/ceph-and-cinder-multi-backend</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/ceph-cinder-multi-backed.jpg"></p>

<p>Grizzly brought multi-backend functionality and tons of new drivers to Cinder. The main purpose of this article is to demonstrate how we can take advantage of the tiering capability of Ceph.</p>

<!--more-->


<h1>I. Ceph</h1>

<p>To configure Ceph to use different storage devices see my previous article: <a href="http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/">Ceph 2 speed storage with CRUSH</a>.</p>

<br />


<h1>II. Cinder</h1>

<p>Assuming your 2 pools are called:</p>

<ul>
<li>cinder-sata points to the SATA rack</li>
<li>cinder-ssd points to the SSD rack</li>
</ul>


<h2>II.1 Configuration</h2>

<p>Cinder configuration file:</p>

<pre><code># Multi backend options

# Define the names of the groups for multiple volume backends
enabled_backends=rbd-sata,rbd-ssd

# Define the groups as above
[rbd-sata]
volume_driver=cinder.volume.driver.RBDDriver
rbd_pool=cinder-sata
volume_backend_name=RBD_SATA
# if cephX is enabled
#rbd_user=cinder
#rbd_secret_uuid=&lt;None&gt;

[rbd-ssd]
volume_driver=cinder.volume.driver.RBDDriver
rbd_pool=cinder-ssd
volume_backend_name=RBD_SSD
# if cephX is enabled
#rbd_user=cinder
#rbd_secret_uuid=&lt;None&gt;
</code></pre>

<p>Unfortunately the rbd driver doesn&#8217;t support this variable yet. This feature has been submitted here: <a href="https://review.openstack.org/#/c/27535/">https://review.openstack.org/#/c/27535/</a>.</p>
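
<p>To see how the backend groups tie a <code>volume_backend_name</code> to a pool, here is a minimal Python sketch that reads a trimmed copy of the configuration above with the standard <code>configparser</code> module. Two assumptions: the sample adds an explicit <code>[DEFAULT]</code> header for <code>enabled_backends</code>, and the function name is made up for this illustration; it is not how Cinder loads its configuration.</p>

```python
import configparser

# Trimmed copy of the configuration above; the [DEFAULT] header is an assumption.
SAMPLE = """
[DEFAULT]
enabled_backends=rbd-sata,rbd-ssd

[rbd-sata]
rbd_pool=cinder-sata
volume_backend_name=RBD_SATA

[rbd-ssd]
rbd_pool=cinder-ssd
volume_backend_name=RBD_SSD
"""


def backend_pools(conf_text: str) -> dict:
    """Map each volume_backend_name to the rbd_pool of its enabled backend group."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    groups = [g.strip() for g in parser["DEFAULT"]["enabled_backends"].split(",")]
    return {parser[g]["volume_backend_name"]: parser[g]["rbd_pool"] for g in groups}


print(backend_pools(SAMPLE))
```

<p>A volume type whose <code>volume_backend_name</code> is <code>RBD_SSD</code> should therefore end up in the <code>cinder-ssd</code> pool.</p>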

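<p>Then create the pointers between the volume types and the backends (the <code>ssd</code> and <code>sata</code> volume types must exist first; their creation is shown at the end of this section):</p>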

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>cinder <span class="nb">type</span>-key ssd <span class="nb">set </span><span class="nv">volume_backend_name</span><span class="o">=</span>RBD_SSD
</span><span class='line'><span class="nv">$ </span>cinder <span class="nb">type</span>-key sata <span class="nb">set </span><span class="nv">volume_backend_name</span><span class="o">=</span>RBD_SATA
</span><span class='line'><span class="nv">$ </span>cinder extra-specs-list
</span><span class='line'>+--------------------------------------+------+---------------------------------------+
</span><span class='line'>|                  ID                  | Name |              extra_specs              |
</span><span class='line'>+--------------------------------------+------+---------------------------------------+
</span><span class='line'>| b1522968-e4fa-4372-8ac4-3925b7c79ee1 | ssd  |  <span class="o">{</span>u<span class="s1">&#39;volume_backend_name&#39;</span>: u<span class="s1">&#39;RBD_SSD&#39;</span><span class="o">}</span> |
</span><span class='line'>| b50bf5a3-6044-4392-beeb-432302f6421c | sata | <span class="o">{</span>u<span class="s1">&#39;volume_backend_name&#39;</span>: u<span class="s1">&#39;RBD_SATA&#39;</span><span class="o">}</span> |
</span><span class='line'>+--------------------------------------+------+---------------------------------------+
</span></code></pre></td></tr></table></div></figure>


<p>Then restart cinder services:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo restart cinder-api ; sudo restart cinder-scheduler ; sudo restart cinder-volume
</span></code></pre></td></tr></table></div></figure>


<p>For reference, here is how the 2 volume types, one for each backend, are created (run this before the <code>type-key</code> commands above):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>cinder <span class="nb">type</span>-create ssd
</span><span class='line'>+--------------------------------------+------+
</span><span class='line'>|                  ID                  | Name |
</span><span class='line'>+--------------------------------------+------+
</span><span class='line'>| b1522968-e4fa-4372-8ac4-3925b7c79ee1 | ssd  |
</span><span class='line'>+--------------------------------------+------+
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>cinder <span class="nb">type</span>-create sata
</span><span class='line'>+--------------------------------------+------+
</span><span class='line'>|                  ID                  | Name |
</span><span class='line'>+--------------------------------------+------+
</span><span class='line'>| b50bf5a3-6044-4392-beeb-432302f6421c | sata |
</span><span class='line'>+--------------------------------------+------+
</span></code></pre></td></tr></table></div></figure>




<br />


<h2>II.2. Play with it</h2>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>cinder create --volume_type ssd --display_name vol-ssd 1
</span><span class='line'>+---------------------+--------------------------------------+
</span><span class='line'>|       Property      |                Value                 |
</span><span class='line'>+---------------------+--------------------------------------+
</span><span class='line'>|     attachments     |                  <span class="o">[]</span>                  |
</span><span class='line'>|  availability_zone  |                 nova                 |
</span><span class='line'>|       bootable      |                <span class="nb">false</span>                 |
</span><span class='line'>|      created_at     |      2013-04-22T14:54:53.917580      |
</span><span class='line'>| display_description |                 None                 |
</span><span class='line'>|     display_name    |               vol-ssd                |
</span><span class='line'>|          id         | 4c777d96-66e4-4f85-815c-92d4503c5c8c |
</span><span class='line'>|       metadata      |                  <span class="o">{}</span>                  |
</span><span class='line'>|         size        |                  1                   |
</span><span class='line'>|     snapshot_id     |                 None                 |
</span><span class='line'>|     source_volid    |                 None                 |
</span><span class='line'>|        status       |               creating               |
</span><span class='line'>|     volume_type     |                 ssd                  |
</span><span class='line'>+---------------------+--------------------------------------+
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>cinder create --volume_type sata --display_name vol-sata 1
</span><span class='line'>+---------------------+--------------------------------------+
</span><span class='line'>|       Property      |                Value                 |
</span><span class='line'>+---------------------+--------------------------------------+
</span><span class='line'>|     attachments     |                  <span class="o">[]</span>                  |
</span><span class='line'>|  availability_zone  |                 nova                 |
</span><span class='line'>|       bootable      |                <span class="nb">false</span>                 |
</span><span class='line'>|      created_at     |      2013-04-22T14:54:58.831327      |
</span><span class='line'>| display_description |                 None                 |
</span><span class='line'>|     display_name    |               vol-sata               |
</span><span class='line'>|          id         | 8e347bd1-2044-40a2-ae87-ee9a23cddd71 |
</span><span class='line'>|       metadata      |                  <span class="o">{}</span>                  |
</span><span class='line'>|         size        |                  1                   |
</span><span class='line'>|     snapshot_id     |                 None                 |
</span><span class='line'>|     source_volid    |                 None                 |
</span><span class='line'>|        status       |               creating               |
</span><span class='line'>|     volume_type     |                 sata                 |
</span><span class='line'>+---------------------+--------------------------------------+
</span></code></pre></td></tr></table></div></figure>


<p>Does it work?</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>rbd -p cinder-ssd ls
</span><span class='line'>volume-4c777d96-66e4-4f85-815c-92d4503c5c8c
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>rbd -p cinder-sata ls
</span><span class='line'>volume-8e347bd1-2044-40a2-ae87-ee9a23cddd71
</span></code></pre></td></tr></table></div></figure>
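
<p>The same check can be scripted. Here is a minimal Python sketch, assuming the <code>volume-</code> prefix followed by the volume ID that appears in the listings above; the function names are made up for the example.</p>

```python
def image_name(volume_id: str) -> str:
    """RBD image name for a Cinder volume, following the volume-ID pattern in the listings."""
    return "volume-" + volume_id


def volume_in_pool(rbd_ls_output: str, volume_id: str) -> bool:
    """True when the expected image name is one of the names printed by `rbd -p POOL ls`."""
    return image_name(volume_id) in rbd_ls_output.split()


# Output of `rbd -p POOL ls` pasted as text, one image name per line.
listing = "volume-4c777d96-66e4-4f85-815c-92d4503c5c8c\n"
print(volume_in_pool(listing, "4c777d96-66e4-4f85-815c-92d4503c5c8c"))
```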




<br />


<blockquote><p>It&#8217;s nice that multi-backend support came to Cinder; we are gradually getting to enjoy the full power of Ceph!</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Play with Ceph - Vagrant Box]]></title>
    <link href="http://sebastien-han.fr/blog/2013/04/22/play-with-ceph-vagrant-box/"/>
    <updated>2013-04-22T11:14:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/04/22/play-with-ceph-vagrant-box</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/vagrant-logo.png" title="Play with Ceph - Vagrant Box" ></p>

<p>Materials to start playing with Ceph. This Vagrant box contains an all-in-one Ceph installation.</p>

<!--more-->


<h1>I. Setup</h1>

<p>First <a href="http://downloads.vagrantup.com/">Download</a> and <a href="http://docs.vagrantup.com/v2/installation/index.html">Install</a> Vagrant.</p>

<p>Download the Ceph box: <a href="https://www.dropbox.com/s/hn28qgjn59nud6h/ceph-all-in-one.box">here</a>. This box contains:</p>

<ul>
<li>One Ceph VM with 2 OSDs (1 disk each), 1 MDS, 1 MON and 1 RGW. It uses a modified CRUSH map that simply represents a full datacenter and applies a replica per OSD</li>
<li>A Vagrantfile for both VMs (client and ceph)</li>
<li>Other include files</li>
</ul>


<p>Download an extra VM for the client <a href="http://dl.dropbox.com/u/1537815/precise64.box">here</a>. Note that Debian and Red Hat based systems work perfectly, so the choice is up to you:</p>

<ul>
<li>Client: just an Ubuntu installation</li>
</ul>


<p>Initialize the Ceph box:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>wget https://www.dropbox.com/s/hn28qgjn59nud6h/ceph-all-in-one.box
</span><span class='line'>...
</span><span class='line'>...
</span><span class='line'><span class="nv">$ </span>vagrant box add ceph-all-in-one ceph-all-in-one.box
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Downloading with Vagrant::Downloaders::File...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Copying box to temporary location...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Extracting box...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Verifying box...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Cleaning up downloaded box...
</span></code></pre></td></tr></table></div></figure>


<p>Initialize the Client box:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>wget http://dl.dropbox.com/u/1537815/precise64.box
</span><span class='line'><span class="nv">$ </span>vagrant box add ubuntu-12.04.1 precise64.box
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Downloading with Vagrant::Downloaders::File...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Copying box to temporary location...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Extracting box...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Verifying box...
</span><span class='line'><span class="o">[</span>vagrant<span class="o">]</span> Cleaning up downloaded box...
</span></code></pre></td></tr></table></div></figure>


<p>Check your boxes:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>vagrant box list
</span><span class='line'>ceph-all-in-one
</span><span class='line'>ubuntu-12.04.1
</span></code></pre></td></tr></table></div></figure>


<p>Import all the files from the box:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>mkdir setup
</span><span class='line'><span class="nv">$ </span>cp ~/.vagrant.d/boxes/ceph-all-in-one/include/* setup/
</span><span class='line'><span class="nv">$ </span>mv setup/_Vagrantfile Vagrantfile
</span></code></pre></td></tr></table></div></figure>


<p>To keep the setup simple, I assume that your working directory is <code>$HOME/ceph</code>. At the end, your directory tree looks like this:</p>

<pre><code>.
├── Vagrantfile
├── ceph-all-in-one.box
├── precise64.box
└── setup
    ├── ceph.conf
    ├── ceph.sh
    └── keyring
</code></pre>
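
<p>Before starting the machines, you can check that nothing is missing from this tree. Here is a minimal Python sketch, with a made-up function name, that lists the expected files that are absent from a working directory:</p>

```python
from pathlib import Path

# Files expected in the working directory, taken from the tree above.
EXPECTED = [
    "Vagrantfile",
    "ceph-all-in-one.box",
    "precise64.box",
    "setup/ceph.conf",
    "setup/ceph.sh",
    "setup/keyring",
]


def missing_files(workdir: str) -> list:
    """Return the expected files that do not exist under workdir."""
    root = Path(workdir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# An empty list means the layout matches the tree above.
print(missing_files("."))
```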

<br />


<h1>II. Start it!</h1>

<p>Check the state of your virtual machines:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>vagrant status
</span><span class='line'>Current VM states:
</span><span class='line'>
</span><span class='line'>client                   poweroff
</span><span class='line'>ceph                     poweroff
</span><span class='line'>
</span><span class='line'>This environment represents multiple VMs. The VMs are all listed
</span><span class='line'>above with their current state. For more information about a specific
</span><span class='line'>VM, run <span class="sb">`</span>vagrant status NAME<span class="sb">`</span>.
</span></code></pre></td></tr></table></div></figure>


<p>Finally, start them:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>vagrant up ceph <span class="o">&amp;&amp;</span> vagrant up client
</span><span class='line'>...
</span><span class='line'>...
</span></code></pre></td></tr></table></div></figure>


<p>The next time you start the client, run it this way to avoid re-provisioning the machine:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>vagrant up --no-provision client
</span></code></pre></td></tr></table></div></figure>


<p>Finally, SSH into your client:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>vagrant ssh client
</span><span class='line'>...
</span><span class='line'>
</span><span class='line'>vagrant@ceph:~<span class="nv">$ </span>sudo ceph -s
</span><span class='line'>   health HEALTH_OK
</span><span class='line'>   monmap e3: 1 mons at <span class="o">{</span><span class="nv">1</span><span class="o">=</span>192.168.251.100:6790/0<span class="o">}</span>, election epoch 1, quorum 0 1
</span><span class='line'>   osdmap e179: 2 osds: 2 up, 2 in
</span><span class='line'>    pgmap v724: 96 pgs: 96 active+clean; 9199 bytes data, 2071 MB used, 17906 MB / 19978 MB avail; 232B/s wr, 0op/s
</span><span class='line'>   mdsmap e54: 1/1/1 up <span class="o">{</span><span class="nv">0</span><span class="o">=</span><span class="nv">0</span><span class="o">=</span>up:active<span class="o">}</span>
</span><span class='line'>
</span><span class='line'>vagrant@ceph:~<span class="nv">$ </span>sudo ceph osd tree
</span><span class='line'><span class="c"># id  weight  type name up/down reweight</span>
</span><span class='line'>-1  2 root default
</span><span class='line'>-4  2   datacenter dc
</span><span class='line'>-5  2     room laroom
</span><span class='line'>-6  2       row larow
</span><span class='line'>-3  2         rack lerack
</span><span class='line'>-2  2           host ceph
</span><span class='line'>0 1             osd.0 up  1
</span><span class='line'>1 1             osd.1 up  1
</span></code></pre></td></tr></table></div></figure>




<br />


<h1>III. Bonus: upgrade to Cuttlefish</h1>

<p>It&#8217;s fairly easy to upgrade the box to the latest stable version, Cuttlefish. Simply edit <code>/etc/apt/sources.list.d/ceph.list</code> so that it contains the following:</p>

<pre><code>deb http://ceph.com/debian-cuttlefish/ precise main
</code></pre>

<p>Then run:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo apt-get update <span class="o">&amp;&amp;</span> sudo apt-get install ceph
</span><span class='line'><span class="nv">$ </span>sudo service ceph restart
</span><span class='line'><span class="nv">$ </span>sudo ceph -v
</span><span class='line'>ceph version 0.61 <span class="o">(</span>237f3f1e8d8c3b85666529860285dcdffdeda4c5<span class="o">)</span>
</span></code></pre></td></tr></table></div></figure>




<br />


<p><span class="text_quote">R </span>Note: if for some reason you get a status where only 1/2 OSDs are up, just restart the mon. This should do the trick :-).</p>

<br />


<blockquote><p>I use this box every day for all my tests; it&#8217;s quite handy to destroy and rebuild it within a minute. Build, destroy, build, destroy, I think you got it! Hope it helps ;-)</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Some Ceph experiments]]></title>
    <link href="http://sebastien-han.fr/blog/2013/04/17/some-ceph-experiments/"/>
    <updated>2013-04-17T22:22:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/04/17/some-ceph-experiments</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/ceph-experiment.png" title="Some Ceph experiments" ></p>

<p>Sometimes it&#8217;s just fun to put the theory to the test, if only to notice &#8220;oh well, it works as expected&#8221;. This is why today I&#8217;d like to share some experiments with 2 really specific flags: <code>noout</code> and <code>nodown</code>. The behaviors described in this article are well known because of the design of Ceph, so don&#8217;t yell at me &#8216;Tell us something we don&#8217;t know!&#8217;; simply see this article as a set of exercises that demonstrate some Ceph internals :-).</p>

<!--more-->


<h1>I. What do they do?</h1>

<p>Flags definitions:</p>

<ul>
<li><code>noout</code>: an OSD marked as <code>out</code> might still be running, but it doesn&#8217;t receive any data because CRUSH no longer maps placement groups to it (the opposite of being marked <code>in</code>). Thus the <code>noout</code> flag prevents OSDs from being marked <code>out</code> of the cluster.</li>
<li><code>nodown</code>: an OSD marked as <code>down</code> is unresponsive to the health checks of its peers, so a weight of 0 is applied and the OSD won&#8217;t receive any data; this prevents clients from writing to it. Note, however, that the OSD is still part of the CRUSH map. In the end, the <code>nodown</code> flag prevents OSDs from being marked <code>down</code>: they all keep a weight of 1 (or whatever other value was set; Ceph won&#8217;t change it).</li>
</ul>


<br />


<h1>II. Experiments</h1>

<h2>II.1. noout on a running cluster</h2>

<p>It&#8217;s interesting to look at the PG behavior. Here is an example with PG 3.4:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph pg dump | egrep ^3.4
</span><span class='line'>3.4 0 0 0 0 0 0 0 active+clean  2013-03-27 18:20:33.847308  0<span class="s1">&#39;0 57&#39;</span>10 <span class="o">[</span>1,4<span class="o">]</span> <span class="o">[</span>1,4<span class="o">]</span> 0<span class="s1">&#39;0 2013-03-27 18:20:33.847246  0&#39;</span>0 2013-03-27 18:20:33.847246
</span></code></pre></td></tr></table></div></figure>


<p>We can see that the primary OSD of PG 3.4 is <code>osd.1</code> thanks to the acting set [1,4] (the second bracketed field), where the first OSD listed is the primary.</p>
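
<p>As a rough illustration of reading that line, the shell sketch below pulls the primary out of the second bracketed list. The sample line is copied from the dump above; the helper name <code>acting_primary</code> is made up for this example, and it assumes the two bracketed lists appear in the order shown here (the column layout may differ between Ceph versions).</p>

```shell
#!/bin/sh
# Sample line copied from the pg dump output above. On a live cluster you
# would pipe the matching line of `ceph pg dump` into the helper instead.
line="3.4 0 0 0 0 0 0 0 active+clean 2013-03-27 18:20:33.847308 0'0 57'10 [1,4] [1,4] 0'0 2013-03-27 18:20:33.847246 0'0 2013-03-27 18:20:33.847246"

# Hypothetical helper: print the id of the first OSD in the second
# bracketed list, i.e. the primary OSD of the PG.
acting_primary() {
  # Splitting on '[' leaves the second list at the start of the third
  # chunk ("1,4] 0'0 ..."); trim at ']' and keep the first id.
  cut -d'[' -f3 | cut -d']' -f1 | cut -d',' -f1
}

printf 'primary of pg 3.4: osd.%s\n' "$(printf '%s\n' "$line" | acting_primary)"
```

<p>Feeding it a degraded line whose lists are both <code>[4]</code> prints 4 instead, which is a quick way to watch the primary move during the experiment.</p>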

<p>Then apply the flag:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd <span class="nb">set </span>noout
</span><span class='line'><span class="nb">set </span>noout
</span></code></pre></td></tr></table></div></figure>


<p>The CLI notifies you that the flag is set:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph -s
</span><span class='line'>   health HEALTH_WARN noout flag<span class="o">(</span>s<span class="o">)</span> <span class="nb">set</span>
</span><span class='line'><span class="nb">   </span>monmap e3: 3 mons at <span class="o">{</span><span class="nv">0</span><span class="o">=</span>192.168.252.10:6789/0,1<span class="o">=</span>192.168.252.11:6789/0,2<span class="o">=</span>192.168.252.12:6789/0<span class="o">}</span>, election epoch 14, quorum 0,1,2 0,1,2
</span><span class='line'>   osdmap e63: 6 osds: 6 up, 6 in
</span><span class='line'>   pgmap v285: 200 pgs: 200 active+clean; 20480 KB data, 391 MB used, 23452 MB / 23844 MB avail
</span><span class='line'>   mdsmap e1: 0/0/1 up
</span></code></pre></td></tr></table></div></figure>


<p>Just recall that an OSD can play a different role depending on the object:</p>

<ul>
<li><p>primary: it manages</p>

<ul>
<li>replication to secondary OSDs</li>
<li>data re-balancing</li>
<li>recovery from failure</li>
<li>data consistency (scrubbing operations)</li>
</ul>
</li>
<li><p>secondary:</p>

<ul>
<li>acts as a slave of the primary and receives orders from it</li>
</ul>
</li>
</ul>


<p>Then stop the primary OSD:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo service ceph stop osd.1
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>ceph pg dump | egrep ^3.4
</span><span class='line'>3.4 0 0 0 0 20971520  595 595 active+degraded 2013-03-27 18:29:52.215491  61<span class="s1">&#39;5  60&#39;</span>20 <span class="o">[</span>4<span class="o">]</span> <span class="o">[</span>4<span class="o">]</span> 0<span class="s1">&#39;0 2013-03-27 18:20:33.847246  0&#39;</span>0 2013-03-27 18:20:33.847246
</span></code></pre></td></tr></table></div></figure>


<p>Now only OSD 4 is active; it has switched to primary for this PG and will receive all the IO operations. Under normal circumstances (and because the init script does it), the stopped OSD would be marked as out automatically, and a new secondary OSD would then be elected. With <code>noout</code> set this doesn&#8217;t happen, which is why the PG stays degraded.</p>

<p>Create an object and put it into RADOS:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>seb <span class="nv">bs</span><span class="o">=</span>10M <span class="nv">count</span><span class="o">=</span>2
</span><span class='line'>2+0 records in
</span><span class='line'>2+0 records out
</span><span class='line'>20971520 bytes <span class="o">(</span>21 MB<span class="o">)</span> copied, 0.0386746 s, 542 MB/s
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>sync
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>rados put seb seb
</span></code></pre></td></tr></table></div></figure>


<p>Yes it&#8217;s there inside osd.4 (as expected):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo ls /var/lib/ceph/osd/osd.4/current/3.4_head/
</span><span class='line'>seb__head_3E715054__3
</span></code></pre></td></tr></table></div></figure>


<p>Obviously you won&#8217;t find anything in the old primary OSD (1). Of course a pg dump confirms that we have one object (second field):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph pg dump | egrep ^3.4
</span><span class='line'>3.4 1 0 1 0 20971520  595 595 active+degraded 2013-03-27 18:29:52.215491  61<span class="s1">&#39;5  60&#39;</span>20 <span class="o">[</span>4<span class="o">]</span> <span class="o">[</span>4<span class="o">]</span> 0<span class="s1">&#39;0 2013-03-27 18:20:33.847246  0&#39;</span>0 2013-03-27 18:20:33.847246
</span></code></pre></td></tr></table></div></figure>


<p>Now restart the OSD process and unset the noout flag:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo service ceph start osd.1
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>ceph osd <span class="nb">unset </span>noout
</span><span class='line'><span class="nb">unset </span>noout
</span></code></pre></td></tr></table></div></figure>


<p>OSD 1 will get re-promoted as primary, OSD 4 will become secondary again, and the object will be replicated from osd.4 to osd.1. A new pg dump confirms this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph pg dump | egrep ^3.4
</span><span class='line'>3.4 1 0 0 0 20971520  595 595 active+clean  2013-03-27 18:41:50.970358  61<span class="s1">&#39;5  62&#39;</span>20 <span class="o">[</span>1,4<span class="o">]</span> <span class="o">[</span>1,4<span class="o">]</span> 0<span class="s1">&#39;0 2013-03-27 18:20:33.847246  0&#39;</span>0 2013-03-27 18:20:33.847246
</span></code></pre></td></tr></table></div></figure>




<br />


<blockquote><p>Stop and think, ok, what did we learn from this exercise? Well, you might already have guessed: this flag is ideal for performing maintenance operations. Ok, you can&#8217;t satisfy the desired number of replicas, but this is a temporary procedure and I really like the flexibility that Ceph brings here. Now let&#8217;s switch to the <code>nodown</code> option, which is a completely different story.</p></blockquote>
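
<p>The maintenance sequence walked through in this section can be sketched as a small script. It only chains the commands already used above; the <code>run</code> and <code>maintain_osd</code> helpers and the <code>DRY_RUN</code> switch are assumptions made for this example. By default it only prints what it would do.</p>

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Export DRY_RUN=0 to actually run them against your cluster.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: take one OSD down for maintenance without letting
# the cluster mark it out (and therefore without triggering re-balancing).
maintain_osd() {
  osd="$1"
  run ceph osd set noout
  run sudo service ceph stop "$osd"
  # ... do the actual maintenance on the node here ...
  run sudo service ceph start "$osd"
  run ceph osd unset noout
}

maintain_osd osd.1
```

<p>Keeping the set and unset calls in the same function makes it harder to forget the flag and leave the cluster in HEALTH_WARN.</p>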

<br />


<h2>II.2. nodown on a running cluster</h2>

<p>Apply the flag:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd <span class="nb">set </span>nodown
</span><span class='line'><span class="nb">set </span>nodown
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>ceph -s
</span><span class='line'>   health HEALTH_WARN nodown flag<span class="o">(</span>s<span class="o">)</span> <span class="nb">set</span>
</span><span class='line'><span class="nb">   </span>monmap e3: 3 mons at <span class="o">{</span><span class="nv">0</span><span class="o">=</span>192.168.252.10:6789/0,1<span class="o">=</span>192.168.252.11:6789/0,2<span class="o">=</span>192.168.252.12:6789/0<span class="o">}</span>, election epoch 14, quorum 0,1,2 0,1,2
</span><span class='line'>   osdmap e66: 6 osds: 6 up, 6 in
</span><span class='line'>   pgmap v294: 200 pgs: 200 active+clean; 20480 KB data, 395 MB used, 23448 MB / 23844 MB avail
</span><span class='line'>   mdsmap e1: 0/0/1 up
</span></code></pre></td></tr></table></div></figure>


<p>Create an object and put it into RADOS:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>baba <span class="nv">bs</span><span class="o">=</span>10M <span class="nv">count</span><span class="o">=</span>2
</span><span class='line'>2+0 records in
</span><span class='line'>2+0 records out
</span><span class='line'>20971520 bytes <span class="o">(</span>21 MB<span class="o">)</span> copied, 0.0428323 s, 490 MB/s
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>sync
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>rados put baba baba
</span></code></pre></td></tr></table></div></figure>


<p>Annnnnd the cluster HANGS, hanging hanging hanging&#8230; !!!!!!</p>

<p>Simply because, one way or another, the synchronous and atomic nature of the write request can&#8217;t be satisfied: the primary or secondary OSD keeps a weight that makes it look available to receive data from clients.</p>

<p>You end up with the following warnings in the logs:</p>

<pre><code>osd.4 [WRN] 1 slow requests, 1 included below; oldest blocked for &gt; 30.211296 secs
osd.4 [WRN] slow request 30.211296 seconds old, received at 2013-03-27 19:01:58.127010: osd_op(client.4757.0:1 baba [writefull 0~4194304] 3.f9c3dd2e) v4 currently waiting for subops from [1]
osd.4 [WRN] 1 slow requests, 1 included below; oldest blocked for &gt; 60.235452 secs
osd.4 [WRN] slow request 60.235452 seconds old, received at 2013-03-27 19:01:58.127010: osd_op(client.4757.0:1 baba [writefull 0~4194304] 3.f9c3dd2e) v4 currently waiting for subops from [1]
</code></pre>

<p>That&#8217;s normal since the OSD (osd.1 according to the logs) is down but still appears as up, so Ceph keeps trying to write the object to it&#8230;</p>
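
<p>To spot which OSD a blocked write is waiting on, the warning can be filtered with a bit of <code>sed</code>. This is only a sketch: the log line is copied from the output above, the helper name <code>blocked_on</code> is made up, and the message format is assumed to match what this Ceph version prints.</p>

```shell
#!/bin/sh
# One of the slow request warnings shown above.
log='osd.4 [WRN] slow request 30.211296 seconds old, received at 2013-03-27 19:01:58.127010: osd_op(client.4757.0:1 baba [writefull 0~4194304] 3.f9c3dd2e) v4 currently waiting for subops from [1]'

# Hypothetical helper: print the comma-separated ids of the OSDs a slow
# request is still waiting on; lines without that message print nothing.
blocked_on() {
  sed -n 's/.*waiting for subops from \[\([0-9,]*\)].*/\1/p'
}

printf 'write blocked on osd(s): %s\n' "$(printf '%s\n' "$log" | blocked_on)"
```

<p>Here it reports 1, the OSD that is really down while the map still shows it as up.</p>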

<p>However, it&#8217;s interesting to note that the striping model is easy to observe. Here the primary was osd.4, so the first 4M was written:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>-rw-r--r--  1 root root 4.0M Mar 27 19:10 baba__head_F9C3DD2E__3
</span></code></pre></td></tr></table></div></figure>


<p>From this, it&#8217;s fairly easy to determine how Ceph writes objects: it writes one 4M block at a time and waits for each one to complete. See the following process:</p>

<pre><code>--&gt; first 4M osd.primary journal --&gt; osd.primary --&gt; osd.secondary journal --&gt; osd.secondary --&gt; second 4M osd.primary journal --&gt; osd.primary --&gt; and so on...
</code></pre>

<p>This command changes the write size to 8M:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>rados -b 8388608 put baba baba
</span></code></pre></td></tr></table></div></figure>
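
<p>A quick back-of-the-envelope check of what this means for the 20971520-byte test file: the number of write operations is the file size divided by the write size, rounded up. The helper name <code>write_ops</code> is made up for this example; the sizes come from the commands and logs above.</p>

```shell
#!/bin/sh
# Hypothetical helper: number of write operations needed to push an
# object of the given size (bytes) with the given write size (bytes).
write_ops() {
  size="$1"
  block="$2"
  # Integer division rounded up.
  echo $(( (size + block - 1) / block ))
}

echo "4M writes: $(write_ops 20971520 4194304)"   # prints "4M writes: 5"
echo "8M writes: $(write_ops 20971520 8388608)"   # prints "8M writes: 3"
```

<p>With 8M writes the last operation only carries the remaining 4M, hence 3 operations rather than 2.5.</p>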




<br />


<blockquote><p>Stop and think, ok, what did we learn from this exercise? Well, I might have missed something, so if someone from Inktank is around, I&#8217;ll be happy to learn the idea behind this option, because I simply can&#8217;t think of a proper use for it.</p></blockquote>

<p>Conclusion:</p>

<br />


<blockquote><p>I hope you enjoyed (and maybe learned something from?) these exercises. The main point of this article was to show that you can easily operate in degraded mode with the <code>noout</code> option. For more technical depth, read the <a href="http://ceph.com/docs/master/architecture/#how-ceph-clients-stripe-data">Ceph documentation</a>.</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[See you at the OpenStack summit]]></title>
    <link href="http://sebastien-han.fr/blog/2013/04/11/see-you-at-the-openstack-summit/"/>
    <updated>2013-04-11T09:43:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/04/11/see-you-at-the-openstack-summit</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/openstack-summit-portland.jpg" title="See you at the OpenStack summit" ></p>

<p>Next week is the <a href="http://www.openstack.org/summit/portland-2013/">OpenStack Summit</a> conference. <a href="http://www.enovance.com/fr/blog/5498/enovance-openstack-summit-portland">eNovance&#8217;s</a> team will be present at the event. The convention is going to be awesome: I saw plenty of amazing sessions and I really look forward to attending those talks. Speaking of sessions, <a href="http://my1.fr/">Emilien Macchi</a> and I will give a talk about High Availability in OpenStack. You can check the talk page <a href="http://openstacksummitapril2013.sched.org/event/b94184e87ffaf649fa15952f8553e44e#.UWgoVKtgaZ0">here</a>. Hope to see you there!</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Openstack: quickly fix mirrored queues errors]]></title>
    <link href="http://sebastien-han.fr/blog/2013/04/09/openstack-quickly-fix-mirrored-queues-errors/"/>
    <updated>2013-04-09T22:10:00+02:00</updated>
    <id>http://sebastien-han.fr/blog/2013/04/09/openstack-quickly-fix-mirrored-queues-errors</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/fix-mirrored-queues.jpg" title="Openstack: quickly fix mirrored queues errors" ></p>

<p>Just started with Grizzly and already been through some minor issues :).</p>

<!--more-->


<p>After a fresh installation, while trying to start all the Nova services, I came across the following error in the logs. Here is the scheduler log:</p>

<pre><code>TRACE nova AMQPChannelException: (406, u"PRECONDITION_FAILED - inequivalent arg 'x-ha-policy'for queue 'scheduler' in vhost '/':
received the value 'all' of type 'longstr' but current is none", (50, 10), 'Channel.queue_declare')
</code></pre>

<p>This error is quite explicit and <strong>not related to OpenStack</strong>. Somehow a queue with the same name already existed in this vhost with its x-ha-policy set to <code>none</code>, so when I restarted nova-scheduler, it redeclared the queue with a different policy, <code>x-ha-policy: all</code>. In the end, we need to purge all the queues to avoid any conflict.</p>

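<p>Quick fix (note that <code>reset</code> wipes the whole node, including users, vhosts and permissions, so recreate them afterwards):</p>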

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo rabbitmqctl stop_app
</span><span class='line'><span class="nv">$ </span>sudo rabbitmqctl reset
</span><span class='line'><span class="nv">$ </span>sudo rabbitmqctl start_app
</span></code></pre></td></tr></table></div></figure>


<p>Bonus: this is how to enable mirrored queues in your <code>nova.conf</code>:</p>

<pre><code>rabbit_hosts = &lt;ip-rabbit-server1&gt;:5672,&lt;ip-rabbit-server2&gt;:5672
rabbit_ha_queues = True
</code></pre>

<p>RabbitMQ needs to be configured in clustering mode, see the <a href="http://www.rabbitmq.com/clustering.html">official RabbitMQ documentation</a>.</p>

<br />


<blockquote><p>Enjoy your mirrored queues :D</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ceph Puppet Modules]]></title>
    <link href="http://sebastien-han.fr/blog/2013/03/25/ceph-puppet-modules/"/>
    <updated>2013-03-25T15:58:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/03/25/ceph-puppet-modules</id>
    <content type="html"><![CDATA[<p>Quite recently <a href="https://github.com/fcharlier">François Charlier</a> and I worked together on the <a href="https://puppetlabs.com/">Puppet</a> modules for <a href="ceph.com">Ceph</a> on behalf of our employer <a href="enovance.com">eNovance</a>. In fact, François started working on them last summer and completed the Monitor manifests back then, so together we basically worked on the OSD manifest. The modules are in pretty good shape, so we thought it was important to tell the community about them. That&#8217;s enough talk, let&#8217;s dive into these modules and explain what they do. See below what&#8217;s available:</p>

<ul>
<li>The testing environment is <a href="www.vagrantup.com">Vagrant</a>-ready.</li>
<li>The latest stable Bobtail release for Debian will be installed</li>
<li>The module only supports CephX, at least for now</li>
<li>Generic deployment for 3 monitors based on a template file examples/common.sh which respectively includes mon.sh, osd.sh, mds.sh.</li>
<li>Generic deployment for N OSDs. OSD disks need to be set from the examples/site.pp file (line 71). Puppet will format the specified disks in XFS (the only filesystem implemented) using these options: <code>-f -d agcount=&lt;cpu-core-number&gt; -l size=1024m -n size=64k</code> and mount them with: <code>rw,noatime,inode64</code>. It then appends the appropriate lines to the fstab file of each storage node. Finally the OSDs are added to Ceph.</li>
</ul>
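
<p>To make the OSD disk preparation described in the list above more concrete, here is a small shell sketch that prints the format command and the fstab entry for one disk. It is not taken from the module itself: the <code>xfs_plan</code> helper, the device, the mount point and the core count are placeholders chosen for this example; only the mkfs and mount options come from the list.</p>

```shell
#!/bin/sh
# Hypothetical helper: print the mkfs.xfs command and the fstab entry
# for one OSD disk, using the options listed above.
# Nothing is formatted or mounted; the two lines are only printed.
xfs_plan() {
  dev="$1"     # block device, e.g. /dev/sdb (placeholder)
  mnt="$2"     # mount point of the OSD data directory (placeholder)
  cores="$3"   # number of CPU cores, used as the agcount value
  echo "mkfs.xfs -f -d agcount=$cores -l size=1024m -n size=64k $dev"
  echo "$dev $mnt xfs rw,noatime,inode64 0 0"
}

xfs_plan /dev/sdb /var/lib/ceph/osd/osd.0 4
```

<p>The first printed line is what you would run by hand to format the disk, the second is the kind of line expected in <code>/etc/fstab</code>.</p>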


<br />


<blockquote><p>All the necessary materials (sources and how-to) are publicly available (and for free) under the AGPL license on <a href="https://github.com/enovance/puppet-ceph">eNovance&#8217;s Github</a>. Those manifests do the job quite nicely, although we still need to work on MDS (90% done, it just needs validation), RGW (0% done) and a more flexible implementation (authentication and filesystem support). Obviously comments, constructive criticism and feedback are more than welcome. So don&#8217;t hesitate to drop an email to either François (f.charlier@enovance.com) or me (sebastien@enovance.com) if you have further questions.</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Grizzly availability zones]]></title>
    <link href="http://sebastien-han.fr/blog/2013/03/18/grizzly-availability-zones/"/>
    <updated>2013-03-18T00:09:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/03/18/grizzly-availability-zones</id>
    <content type="html"><![CDATA[<p>Short short update. I previously wrote an article about <a href="http://www.sebastien-han.fr/blog/2013/01/24/openstack-nova-play-with-availability-zones/">OpenStack and availability zones</a>; unfortunately the full potential wasn&#8217;t entirely explored in Folsom: clients, at least, weren&#8217;t able to see the available AZs. The command finally landed in Grizzly.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>nova availability-zone-list
</span><span class='line'>+------+-----------+
</span><span class='line'>| Name | Status    |
</span><span class='line'>+------+-----------+
</span><span class='line'>| nova | available |
</span><span class='line'>| ssd  | available |
</span><span class='line'>| sata | available |
</span><span class='line'>+------+-----------+
</span></code></pre></td></tr></table></div></figure>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ceph: change PG number on the fly]]></title>
    <link href="http://sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/"/>
    <updated>2013-03-12T21:11:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/ceph-increase-pgs-num.jpg" title="Ceph: change PG number on the fly" ></p>

<p>A Placement Group (PG) aggregates a series of objects into a group, and maps the group to a series of OSDs. A common mistake while creating a pool is to use the <code>rados</code> command, which by default creates a pool with 8 PGs. Sometimes you don&#8217;t really know how to set this value, so you use the <code>ceph</code> command but pick an extremely high value. Both cases are bad and can lead to unfortunate situations. In this article, I will explore some methods to work around this major problem.</p>

<!--more-->


<p>Short and <strong>experimental</strong> solution, which you <strong>should not</strong> use on real data: this command has only just been implemented, so running it could result in data loss. Some patches are still under review, so this command <strong>is not stable</strong> yet. If you really want to try it, I suggest running it <strong>only against a test cluster</strong>, so once again <strong>DON&#8217;T TRY THIS ON A PRODUCTION CLUSTER</strong>, merci :-).</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd pool <span class="nb">set</span> &lt;poolname&gt; pg_num &lt;numpgs&gt; --allow-experimental-feature
</span></code></pre></td></tr></table></div></figure>


<p>See the example below:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd pool <span class="nb">set </span>monpool pg_num 512 --allow-experimental-feature
</span><span class='line'><span class="nb">set </span>pool 512 pg_num to 512
</span></code></pre></td></tr></table></div></figure>


<p>Clean and <strong>perfectly safe</strong> workaround:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd pool create &lt;my-new-pool&gt; &lt;pg_num&gt;
</span><span class='line'><span class="nv">$ </span>rados cppool &lt;my-old-pool&gt; &lt;my-new-pool&gt;
</span><span class='line'><span class="nv">$ </span>ceph osd pool delete &lt;my-old-pool&gt;
</span><span class='line'><span class="nv">$ </span>ceph osd pool rename &lt;my-new-pool&gt; &lt;my-old-pool&gt;
</span></code></pre></td></tr></table></div></figure>
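<p>When picking the <code>&lt;pg_num&gt;</code> for the new pool, a commonly cited guideline is to aim for roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two. The sketch below only illustrates that rule of thumb; the function name <code>suggest_pg_num</code> and the 12 OSD / 3 replica cluster are assumptions made for the example, not values from this article.</p>

```shell
#!/bin/sh
# Rule-of-thumb sketch (an assumption, not a Ceph command):
# target = (number of OSDs * 100) / replica count, rounded up to a power of two.
suggest_pg_num() {
    osds=$1
    replicas=$2
    target=$(( osds * 100 / replicas ))
    pg=1
    # Double until we reach or exceed the target.
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# Hypothetical cluster: 12 OSDs, 3 replicas (target 400, so 512 is printed).
suggest_pg_num 12 3
```

<p>Treat the result as a starting point only, and check it against the sizing guidance for your Ceph release before creating the pool.</p>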




<br />


<blockquote><p>This is a feature that really needs to be implemented: since Ceph is designed to scale almost without limit, <code>pg_num</code> should be able to grow as the cluster does.</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[MySQL and general logs]]></title>
    <link href="http://sebastien-han.fr/blog/2013/02/21/mysql-and-general-log/"/>
    <updated>2013-02-21T11:45:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/02/21/mysql-and-general-log</id>
    <content type="html"><![CDATA[<p>Update the <code>general_log</code> variable while MySQL runs.</p>

<!--more-->


<p>Somehow one of my MySQL servers had the variable <code>general_log</code> set to 1, which is pretty annoying because the log file keeps growing and growing. See the symptom below:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo lsof /var/log/mysql
</span><span class='line'>COMMAND  PID  USER   FD   TYPE DEVICE    SIZE/OFF NODE NAME
</span><span class='line'>mysqld  6849 mysql    3u   REG  252,7         160   12 /var/log/mysql/mysql-bin.index
</span><span class='line'>mysqld  6849 mysql   11w   REG  252,7 19042985830   14 /var/log/mysql/mysql.log.1 <span class="o">(</span>deleted<span class="o">)</span>
</span><span class='line'>mysqld  6849 mysql   40w   REG  252,7    52876402   17 /var/log/mysql/mysql-bin.000029
</span><span class='line'>mysqld  6849 mysql   72u   REG  252,7    52876402   17 /var/log/mysql/mysql-bin.000029
</span></code></pre></td></tr></table></div></figure>


<p>Don&#8217;t forget to edit your <code>my.cnf</code> so the setting survives a restart, then change the value of the global variable at runtime:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SET</span> <span class="k">GLOBAL</span> <span class="n">general_log</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span>
</span><span class='line'><span class="n">Query</span> <span class="n">OK</span><span class="p">,</span> <span class="mi">0</span> <span class="k">rows</span> <span class="n">affected</span> <span class="p">(</span><span class="mi">13</span><span class="p">.</span><span class="mi">17</span> <span class="n">sec</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure>


<p>It might take a few seconds for the file to be removed entirely; in the meantime you can <code>watch</code> your <code>lsof</code> command and see the size decreasing.</p>
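<p>To spot this kind of leftover at a glance, the <code>lsof</code> output can be filtered for files that are deleted but still held open. This is a minimal sketch: the helper name <code>deleted_open_files</code> is made up for the example, and it assumes the nine-column <code>lsof</code> layout shown earlier with no spaces in the file name.</p>

```shell
#!/bin/sh
# Hypothetical helper: read lsof output on stdin and print "NAME SIZE"
# for every entry flagged "(deleted)".
# Assumes the column layout COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME.
deleted_open_files() {
    awk '$NF == "(deleted)" { print $9, $7 }'
}

# Sample line copied from the lsof output shown earlier in the article.
printf '%s\n' \
  'mysqld  6849 mysql   11w   REG  252,7 19042985830   14 /var/log/mysql/mysql.log.1 (deleted)' |
  deleted_open_files
```

<p>On a live server you would feed it real data, for instance <code>sudo lsof /var/log/mysql | deleted_open_files</code>.</p>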

<br />


<blockquote><p>As always, I hope it helps :)</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[OpenStack: override DHCP information sent by DNSMASQ to a VM]]></title>
    <link href="http://sebastien-han.fr/blog/2013/02/18/openstack-override-dhcp-information-send-by-dnsmasq-to-the-vm/"/>
    <updated>2013-02-18T11:21:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/02/18/openstack-override-dhcp-information-send-by-dnsmasq-to-the-vm</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/openstack-dnsmasq-dhcp.jpg" title="OpenStack: override DHCP information sent by DNSMASQ to a VM" ></p>

<p>One of the most annoying things is when the <code>resolv.conf</code> of your VM keeps changing because of the information sent by the DNSMASQ process. In this article, I assume that your setup follows some conventions such as:</p>

<ul>
<li>One network per customer, with fixed_ips range with something like <code>10.100.$ID.0/24</code></li>
<li>One ID (number) per tenant</li>
</ul>


<!--more -->


<p>First create a template file for your initial dhclient configuration file. Using a template file makes it easier to build base images.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo cp /etc/dhcp/dhclient.conf /etc/dhcp/dhclient.conf.template
</span></code></pre></td></tr></table></div></figure>


<p>Then append the following lines to <code>/etc/dhcp/dhclient.conf.template</code>:</p>

<pre><code>supersede domain-name-servers 10.100.X.254;
supersede domain-name "template.your-super-cloud.domain";
</code></pre>

<p>The above example shows <strong>my</strong> fixed_ips range. You might have to edit it with <strong>your</strong> own range.</p>

<p>Finally, edit your <code>/etc/rc.local</code> as follows:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="c">#!/bin/sh -e</span>
</span><span class='line'><span class="c">#</span>
</span><span class='line'><span class="c"># rc.local</span>
</span><span class='line'><span class="c">#</span>
</span><span class='line'><span class="c"># This script is executed at the end of each multiuser runlevel.</span>
</span><span class='line'><span class="c"># Make sure that the script will &quot;exit 0&quot; on success or any other</span>
</span><span class='line'><span class="c"># value on error.</span>
</span><span class='line'><span class="c">#</span>
</span><span class='line'><span class="c"># In order to enable or disable this script just change the execution</span>
</span><span class='line'><span class="c"># bits.</span>
</span><span class='line'><span class="c">#</span>
</span><span class='line'><span class="c"># By default this script does nothing.</span>
</span><span class='line'>
</span><span class='line'><span class="c"># Grab the tenant name</span>
</span><span class='line'><span class="nv">TENANT</span><span class="o">=</span><span class="k">$(</span>curl -f http://169.254.169.254/latest/meta-data/hostname | cut -f 1 -d <span class="s1">&#39;-&#39;</span><span class="k">)</span>
</span><span class='line'>
</span><span class='line'><span class="c"># Grab the tenant ID</span>
</span><span class='line'><span class="nv">ID</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$TENANT</span> | cut -c 1-2<span class="k">)</span>
</span><span class='line'>
</span><span class='line'><span class="c"># Update the DHCP conf</span>
</span><span class='line'><span class="se">\c</span>p /etc/dhcp/dhclient.conf.template /etc/dhcp/dhclient.conf
</span><span class='line'>
</span><span class='line'>sed -i s/template/<span class="nv">$TENANT</span>/ /etc/dhcp/dhclient.conf
</span><span class='line'>sed -i s/X/<span class="nv">$ID</span>/ /etc/dhcp/dhclient.conf
</span><span class='line'>
</span><span class='line'>/usr/bin/killall dhclient3
</span><span class='line'>/sbin/dhclient3 -e <span class="nv">IF_METRIC</span><span class="o">=</span>100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -1 eth0
</span></code></pre></td></tr></table></div></figure>
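<p>The substitution step of the script above can be tried on its own, without touching the metadata service or restarting dhclient. The sketch below is only an illustration: the hostname <code>42acme-web01</code> and the function name <code>render_dhclient_lines</code> are hypothetical, and it assumes the convention that the hostname starts with the tenant name, whose first two characters are the tenant ID. It also matches <code>10.100.X.</code> rather than a bare <code>X</code>, so other capital X characters in the file are left alone.</p>

```shell
#!/bin/sh
# Hypothetical helper: render template lines read on stdin for a given hostname.
render_dhclient_lines() {
    hostname=$1
    # Tenant name is everything before the first dash (same cut as in rc.local).
    tenant=$(printf '%s\n' "$hostname" | cut -f 1 -d '-')
    # Tenant ID is the first two characters of the tenant name.
    id=$(printf '%s\n' "$tenant" | cut -c 1-2)
    sed -e "s/template/$tenant/" -e "s/10\.100\.X\./10.100.$id./"
}

# The two template lines from the article, rendered for a made-up hostname.
printf '%s\n' \
  'supersede domain-name-servers 10.100.X.254;' \
  'supersede domain-name "template.your-super-cloud.domain";' |
  render_dhclient_lines 42acme-web01
```

<p>With that hostname the name server becomes <code>10.100.42.254</code> and the domain becomes <code>42acme.your-super-cloud.domain</code>.</p>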




<br />


<blockquote><p>Enjoy!</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Purge some MySQL binary logs]]></title>
    <link href="http://sebastien-han.fr/blog/2013/02/15/purge-mysql-binary-logs/"/>
    <updated>2013-02-15T11:20:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/02/15/purge-mysql-binary-logs</id>
    <content type="html"><![CDATA[<p>The default value of the variable <code>expire_logs_days</code> is 10 days, most of the time this is way too long. This mini how-to shows how to change this value. Fortunately <code>expire_logs_days</code> is a <a href="http://dev.mysql.com/doc/refman/5.5/en/dynamic-system-variables.html">dynamic variable</a> so we can edit it while MySQl runs, we don&#8217;t need to restart the server.</p>

<!--more -->


<p>First check your slave status:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">show</span> <span class="n">slave</span> <span class="n">status</span><span class="err">\</span><span class="k">G</span><span class="p">;</span>
</span><span class='line'><span class="o">***************************</span> <span class="mi">1</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span>
</span><span class='line'>               <span class="n">Slave_IO_State</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="n">master</span> <span class="k">to</span> <span class="n">send</span> <span class="n">event</span>
</span><span class='line'>                  <span class="n">Master_Host</span><span class="p">:</span> <span class="mi">10</span><span class="p">.</span><span class="mi">20</span><span class="p">.</span><span class="mi">1</span><span class="p">.</span><span class="mi">51</span>
</span><span class='line'>                  <span class="n">Master_User</span><span class="p">:</span> <span class="n">replication</span>
</span><span class='line'>                  <span class="n">Master_Port</span><span class="p">:</span> <span class="mi">3306</span>
</span><span class='line'>                <span class="n">Connect_Retry</span><span class="p">:</span> <span class="mi">60</span>
</span><span class='line'>              <span class="n">Master_Log_File</span><span class="p">:</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000029</span>
</span><span class='line'>          <span class="n">Read_Master_Log_Pos</span><span class="p">:</span> <span class="mi">52089474</span>
</span><span class='line'>               <span class="n">Relay_Log_File</span><span class="p">:</span> <span class="n">mysqld</span><span class="o">-</span><span class="n">relay</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000036</span>
</span><span class='line'>                <span class="n">Relay_Log_Pos</span><span class="p">:</span> <span class="mi">52089620</span>
</span><span class='line'>        <span class="n">Relay_Master_Log_File</span><span class="p">:</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000029</span>
</span><span class='line'>             <span class="n">Slave_IO_Running</span><span class="p">:</span> <span class="n">Yes</span>
</span><span class='line'>            <span class="n">Slave_SQL_Running</span><span class="p">:</span> <span class="n">Yes</span>
</span><span class='line'>              <span class="n">Replicate_Do_DB</span><span class="p">:</span>
</span><span class='line'>          <span class="n">Replicate_Ignore_DB</span><span class="p">:</span>
</span><span class='line'>           <span class="n">Replicate_Do_Table</span><span class="p">:</span>
</span><span class='line'>       <span class="n">Replicate_Ignore_Table</span><span class="p">:</span>
</span><span class='line'>      <span class="n">Replicate_Wild_Do_Table</span><span class="p">:</span>
</span><span class='line'>  <span class="n">Replicate_Wild_Ignore_Table</span><span class="p">:</span>
</span><span class='line'>                   <span class="n">Last_Errno</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'>                   <span class="n">Last_Error</span><span class="p">:</span>
</span><span class='line'>                 <span class="n">Skip_Counter</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'>          <span class="n">Exec_Master_Log_Pos</span><span class="p">:</span> <span class="mi">52089474</span>
</span><span class='line'>              <span class="n">Relay_Log_Space</span><span class="p">:</span> <span class="mi">52089820</span>
</span><span class='line'>              <span class="n">Until_Condition</span><span class="p">:</span> <span class="k">None</span>
</span><span class='line'>               <span class="n">Until_Log_File</span><span class="p">:</span>
</span><span class='line'>                <span class="n">Until_Log_Pos</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'>           <span class="n">Master_SSL_Allowed</span><span class="p">:</span> <span class="k">No</span>
</span><span class='line'>           <span class="n">Master_SSL_CA_File</span><span class="p">:</span>
</span><span class='line'>           <span class="n">Master_SSL_CA_Path</span><span class="p">:</span>
</span><span class='line'>              <span class="n">Master_SSL_Cert</span><span class="p">:</span>
</span><span class='line'>            <span class="n">Master_SSL_Cipher</span><span class="p">:</span>
</span><span class='line'>               <span class="n">Master_SSL_Key</span><span class="p">:</span>
</span><span class='line'>        <span class="n">Seconds_Behind_Master</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'><span class="n">Master_SSL_Verify_Server_Cert</span><span class="p">:</span> <span class="k">No</span>
</span><span class='line'>                <span class="n">Last_IO_Errno</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'>                <span class="n">Last_IO_Error</span><span class="p">:</span>
</span><span class='line'>               <span class="n">Last_SQL_Errno</span><span class="p">:</span> <span class="mi">0</span>
</span><span class='line'>               <span class="n">Last_SQL_Error</span><span class="p">:</span>
</span><span class='line'>  <span class="n">Replicate_Ignore_Server_Ids</span><span class="p">:</span>
</span><span class='line'>             <span class="n">Master_Server_Id</span><span class="p">:</span> <span class="mi">1</span>
</span></code></pre></td></tr></table></div></figure>


<p>As we can see, the current <code>Master_Log_File</code> is <code>mysql-bin.000029</code>, so every binary log prior to this one can safely be purged. Then check the binary logs on the master:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SHOW</span> <span class="nb">BINARY</span> <span class="n">LOGS</span><span class="p">;</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">Log_name</span>         <span class="o">|</span> <span class="n">File_size</span> <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000022</span> <span class="o">|</span> <span class="mi">104858289</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000023</span> <span class="o">|</span> <span class="mi">104857671</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000024</span> <span class="o">|</span> <span class="mi">104857898</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000025</span> <span class="o">|</span> <span class="mi">104857690</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000026</span> <span class="o">|</span> <span class="mi">104860121</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000027</span> <span class="o">|</span> <span class="mi">104858120</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000028</span> <span class="o">|</span> <span class="mi">104858421</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000029</span> <span class="o">|</span>  <span class="mi">52091311</span> <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span></code></pre></td></tr></table></div></figure>


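<p>In my case, I want a retention of 5 days, so I can start by deleting <code>mysql-bin.000022</code> through <code>mysql-bin.000024</code>. Note that <code>PURGE BINARY LOGS TO</code> removes every log <em>before</em> the named file, which is why the command targets <code>mysql-bin.000025</code>.</p>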

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="n">PURGE</span> <span class="nb">BINARY</span> <span class="n">LOGS</span> <span class="k">TO</span> <span class="s1">&#39;mysql-bin.000025&#39;</span><span class="p">;</span>
</span><span class='line'><span class="n">Query</span> <span class="n">OK</span><span class="p">,</span> <span class="mi">0</span> <span class="k">rows</span> <span class="n">affected</span> <span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">45</span> <span class="n">sec</span><span class="p">)</span>
</span><span class='line'>
</span><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SHOW</span> <span class="nb">BINARY</span> <span class="n">LOGS</span><span class="p">;</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">Log_name</span>         <span class="o">|</span> <span class="n">File_size</span> <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000025</span> <span class="o">|</span> <span class="mi">104857690</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000026</span> <span class="o">|</span> <span class="mi">104860121</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000027</span> <span class="o">|</span> <span class="mi">104858120</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000028</span> <span class="o">|</span> <span class="mi">104858421</span> <span class="o">|</span>
</span><span class='line'><span class="o">|</span> <span class="n">mysql</span><span class="o">-</span><span class="n">bin</span><span class="p">.</span><span class="mi">000029</span> <span class="o">|</span>  <span class="mi">52313463</span> <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-----------+</span>
</span><span class='line'><span class="mi">5</span> <span class="k">rows</span> <span class="k">in</span> <span class="k">set</span> <span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="n">sec</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure>
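<p>Choosing the file name to pass to <code>PURGE BINARY LOGS TO</code> can be scripted. The following sketch assumes the log names arrive on stdin, oldest first, one per line (as in the <code>Log_name</code> column above); the function name <code>purge_statement</code> is hypothetical. It only prints the statement, it does not run it.</p>

```shell
#!/bin/sh
# Hypothetical helper: given binary log names on stdin (oldest first) and the
# number of files to keep, print the matching PURGE statement.
purge_statement() {
    keep=$1
    # The oldest file we keep is the first line of the last $keep lines.
    target=$(tail -n "$keep" | head -n 1)
    printf "PURGE BINARY LOGS TO '%s';\n" "$target"
}

# The eight logs listed in the article, keeping the five most recent.
printf 'mysql-bin.0000%s\n' 22 23 24 25 26 27 28 29 | purge_statement 5
```

<p>Before running the generated statement, make sure the target is not newer than the <code>Master_Log_File</code> your slaves are still reading.</p>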


<p>Finally, set the new value for <code>expire_logs_days</code> and don&#8217;t forget to edit your <code>my.cnf</code> as well:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SET</span> <span class="k">GLOBAL</span> <span class="n">expire_logs_days</span><span class="o">=</span><span class="mi">5</span><span class="p">;</span>
</span><span class='line'><span class="n">Query</span> <span class="n">OK</span><span class="p">,</span> <span class="mi">0</span> <span class="k">rows</span> <span class="n">affected</span> <span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="n">sec</span><span class="p">)</span>
</span><span class='line'>
</span><span class='line'>
</span><span class='line'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SHOW</span> <span class="n">VARIABLES</span> <span class="k">LIKE</span> <span class="s1">&#39;expire_logs_days&#39;</span><span class="p">;</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">Variable_name</span>    <span class="o">|</span> <span class="n">Value</span> <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-------+</span>
</span><span class='line'><span class="o">|</span> <span class="n">expire_logs_days</span> <span class="o">|</span> <span class="mi">5</span>     <span class="o">|</span>
</span><span class='line'><span class="o">+</span><span class="c1">------------------+-------+</span>
</span><span class='line'><span class="mi">1</span> <span class="k">row</span> <span class="k">in</span> <span class="k">set</span> <span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="n">sec</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure>
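<p>Since the runtime change is lost on restart unless <code>my.cnf</code> agrees with it, a quick check of the file can help. This is a rough sketch under simple assumptions: the helper name <code>mycnf_value</code> is made up, it reads the configuration text on stdin, and it only recognises the underscore spelling of the option written as <code>key = value</code>.</p>

```shell
#!/bin/sh
# Hypothetical helper: print the last value assigned to a key in my.cnf-style
# text read on stdin (simple "key = value" lines only).
mycnf_value() {
    key=$1
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" | tail -n 1
}

# Made-up configuration fragment matching the value set at runtime above.
printf '%s\n' '[mysqld]' 'expire_logs_days = 5' | mycnf_value expire_logs_days
```

<p>Against a real file this would be something like <code>mycnf_value expire_logs_days &lt; /etc/mysql/my.cnf</code>; the path is an assumption and depends on your distribution.</p>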


<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.5/en/dynamic-system-variables.html">http://dev.mysql.com/doc/refman/5.5/en/dynamic-system-variables.html</a></li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Mount a specific pool with CephFS]]></title>
    <link href="http://sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs/"/>
    <updated>2013-02-11T12:20:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/cephfs-mount-pool.jpg" title="Mount a specific pool with CephFS" ></p>

<p>The title of the article is a bit wrong, but it&#8217;s certainly the easiest to understand :-).</p>

<!--more-->


<p>First let&#8217;s create a new pool, and call it <code>webdata</code>. Ideally this pool will store web content.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd pool create webdata 500
</span><span class='line'>successfully created pool webdata
</span></code></pre></td></tr></table></div></figure>


<p>Then grab the pool id:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd dump | grep webdata
</span><span class='line'>pool 5 <span class="s1">&#39;webdata&#39;</span> rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 500 pgp_num 500 last_change 12 owner 0
</span></code></pre></td></tr></table></div></figure>
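<p>If you want the id in a script rather than reading it off the line, the <code>ceph osd dump</code> output can be parsed. A minimal sketch, assuming the <code>pool &lt;id&gt; '&lt;name&gt;' ...</code> line format shown above; the function name <code>pool_id</code> is hypothetical, and the sample line is the one from this article.</p>

```shell
#!/bin/sh
# Hypothetical helper: read "ceph osd dump" output on stdin and print the id
# of the pool whose quoted name matches the first argument.
pool_id() {
    name=$1
    awk -v want="'$name'" '$1 == "pool" && $3 == want { print $2 }'
}

# Sample line copied from the dump output above; prints the pool id.
printf '%s\n' \
  "pool 5 'webdata' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 500 pgp_num 500 last_change 12 owner 0" |
  pool_id webdata
```

<p>On a live cluster this would be used as <code>ceph osd dump | pool_id webdata</code>. Matching the quoted name exactly avoids picking up other pools whose names merely contain <code>webdata</code>, which a plain <code>grep</code> would also return.</p>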


<p>Finally, add the pool to the MDS map so that CephFS can use it as a data pool:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph mds add_data_pool 5
</span><span class='line'>added data pool 5 to mdsmap
</span></code></pre></td></tr></table></div></figure>


<p>Mount the Ceph Filesystem:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo mount -t ceph 172.17.1.7:/ /srv/mds/pools/webdata
</span></code></pre></td></tr></table></div></figure>


<p>Check the layout of the directory. As we can see, the pool with id 0 has been assigned to it by default. This pool corresponds to the default pool called <code>data</code>. By setting a new layout, we will replace the default pool with the one we created before.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>cephfs /srv/mds/pools/webdata/ show_layout
</span><span class='line'>layout.data_pool:     0
</span><span class='line'>layout.object_size:   4194304
</span><span class='line'>layout.stripe_unit:   4194304
</span><span class='line'>layout.stripe_count:  1
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>cephfs /srv/mds/pools/webdata/lol/ set_layout -p 5 -u 4194304 -c 1 -s 4194304
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>cephfs /srv/mds/pools/webdata/lol/ show_layout
</span><span class='line'>layout.data_pool:     5
</span><span class='line'>layout.object_size:   4194304
</span><span class='line'>layout.stripe_unit:   4194304
</span><span class='line'>layout.stripe_count:  1
</span></code></pre></td></tr></table></div></figure>


<p>Unmount and remount the Ceph Filesystem, then create a file:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo umount /srv/mds/pools/webdata
</span><span class='line'><span class="nv">$ </span>sudo mount -t ceph 172.17.1.7:/ /srv/mds/pools/webdata
</span><span class='line'><span class="nv">$ </span>sudo touch /srv/mds/pools/webdata/marche
</span></code></pre></td></tr></table></div></figure>


<p>Oh! There are objects in there!</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>rados --pool<span class="o">=</span>webdata ls
</span><span class='line'>10000000008.00000000
</span><span class='line'>10000000008.00000001
</span><span class='line'>10000000008.00000002
</span><span class='line'>10000000008.00000003
</span><span class='line'>10000000008.00000004
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>ll /srv/mds/pools/webdata
</span><span class='line'>total 18436
</span><span class='line'>drwxr-xr-x 1 root root 18874368 Jan 11 13:51 ./
</span><span class='line'>drwxr-xr-x 3 root root     4096 Jan 11 12:19 ../
</span><span class='line'>-rw-r--r-- 1 root root 18874368 Jan 11 13:51 marche
</span></code></pre></td></tr></table></div></figure>
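
<p>The number of objects is consistent with the layout: with 4 MB objects, an 18874368-byte file needs five of them. A quick way to check the arithmetic, as a sketch that assumes the simple layout used here (<code>stripe_count</code> of 1 and <code>stripe_unit</code> equal to <code>object_size</code>); <code>object_count</code> is a hypothetical helper name:</p>

<pre><code># Number of RADOS objects needed for a file of $1 bytes with objects of $2 bytes.
# Ceiling division; only valid for stripe_count = 1 and stripe_unit = object_size.
object_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

object_count 18874368 4194304   # prints 5
</code></pre>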




<br />


<p>References:</p>

<ul>
<li><a href="http://ceph.com/docs/master/man/8/mount.ceph/">http://ceph.com/docs/master/man/8/mount.ceph/</a></li>
<li><a href="http://ceph.com/docs/master/man/8/cephfs/">http://ceph.com/docs/master/man/8/cephfs/</a></li>
<li><a href="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6148">http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6148</a></li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ceph geo-replication (sort of)]]></title>
    <link href="http://sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/"/>
    <updated>2013-01-28T20:09:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/ceph-geo-replication.jpg" title="Ceph geo-replication" ></p>

<p>It&#8217;s fair to say that geo-replication is one of the features most requested by the community. This article is a draft, a PoC about Ceph geo-replication.</p>

<p><strong>Disclaimer: yes this setup is tricky and I don&#8217;t guarantee that this will work for you.</strong></p>

<!--more-->




<br />


<h1>I. The idea</h1>

<p>The original idea came out of a discussion with a friend of mine, Tomáš Šafranko. The problem was that we wanted to deploy across two (really) close datacenters with very low latency, but we had <em>only</em> 2 datacenters&#8230; The number of Ceph monitors has to be <strong>odd</strong> in order to properly manage the membership. That makes the setup even harder: since there is no asynchronous mode at the moment, we just have to deal with this design. The ideal scenario is to have 3 datacenters, each hosting one monitor, which means that your data is spread across 3 datacenters as well. Another solution could be to store your data in 2 datacenters and use a VPS (close to your 2 DCs) as a monitor. Neither scenario was possible for us: we only had two datacenters, and a VPS would have brought way more latency. So we thought, thought and dug deeper, and all of a sudden the Pacemaker idea came up.</p>

<p>As far as I&#8217;m concerned, I don&#8217;t use Pacemaker to manage Ceph daemons; to be honest, I&#8217;m a bit reluctant to use it. It&#8217;s not that I don&#8217;t trust it, it&#8217;s just that I try to keep as much control as possible over my daemons. Moreover, I&#8217;d like to stay away from the <em>Pacemakerized everything</em> syndrome. That said, I have to admit that Pacemaker can be useful here as an <em>automatic restart solution</em> after a daemon crash, for instance (this is the main purpose of the current RAs). But let&#8217;s be honest: daemons don&#8217;t crash for nothing. They just run, and if they don&#8217;t (because they crashed), they probably have a good reason to behave like this (disk full or whatever). Basically, if something goes wrong, I prefer to act manually: investigate and understand what happened. Anyway, this could be a long debate and it is not the purpose of this introduction. Once again, it&#8217;s my own opinion: I like Pacemaker, I use it every day, simply not for everything ;-).</p>

<p>The following drawing describes the Pacemaker idea: two monitors are fully active, one in each location, and a third one runs on one side of the cluster and moves to the other DC if the first side fails. Sounds easy, right?</p>

<br />


<pre><code>                                                            ,-----------.
                                                            |  clients  |
                                                            `-----------'
                                                 ,---------.     |     ,---------.
                                                 | clients |     |     | clients |
                                                 `---------'     |     `---------'
                                                            \    |    /
                                                             \ ,---. /
                                                             ('     `)
                                                            (   WAN   )
                                                             (.     ,)
                                                             / `---' \
                                                            /         \
                                                 ,---------.           ,---------.
                                                 |  mon.0  |           |  mon.1  |
                                                 `---------'           `---------'
                                                      |                     |
                                                 ,---------.           ,---------.
                                                 |  osd.0  |           |  osd.1  |
                                                 `---------'           `---------'
                                                 |  osd.2  |           |  osd.3  |
                                                 `---------'           `---------'
                                                 |  osd.4  |           |  osd.5  |
                                                 `---------'           `---------'
                                                 |  .....  |           |  .....  |
                                                 `---------'           `---------'
                                                Data Center 1         Data Center 2
                                                           \             /
                                                            \           /
                                                     ,-----------------------.
                                                     |     Floating mon.2    |
                                                     |     Active/Passive    |
                                                     |  Managed by Pacemaker |
                                                     `-----------------------'
</code></pre>
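
<p>Why a third monitor at all? Monitors need a strict majority to form a quorum, so two monitors cannot survive the loss of either one, while three can lose one. A small sketch of that arithmetic (the function names are hypothetical, and this only illustrates the majority rule, not anything Ceph-specific):</p>

<pre><code># Monitors needed to form a majority out of $1 monitors.
mon_majority() {
    echo $(( $1 / 2 + 1 ))
}

# Monitors that can fail while a majority still survives.
mon_tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4; do
    echo "$n monitors: majority $(mon_majority "$n"), tolerates $(mon_tolerated_failures "$n") failure(s)"
done
</code></pre>

<p>With 2 monitors the majority is 2 and no failure is tolerated; with 3 the majority is still 2 and one failure is tolerated. A fourth monitor adds nothing, which is why odd numbers are recommended.</p>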

<br />


<h1>II. How-to</h1>

<p>A bit of a technical overview:</p>

<ul>
<li>DRBD manages the Monitor data directory</li>
<li>Pacemaker manages the Monitor daemon and its IP address</li>
</ul>


<h2>II.1. DRBD setup</h2>

<p>First install DRBD:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo apt-get install drbd-utils -y
</span></code></pre></td></tr></table></div></figure>

<p>Then create a logical volume to back the DRBD device and format it:</p>




<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo lvcreate vg00 -L 5G -n ceph-mon
</span><span class='line'>Logical volume <span class="s2">&quot;ceph-mon&quot;</span> created
</span><span class='line'>
</span><span class='line'><span class="nv">$ </span>sudo mkfs.ext4 /dev/mapper/vg00-ceph--mon
</span><span class='line'>...
</span><span class='line'>...
</span></code></pre></td></tr></table></div></figure>


<p>For the DRBD resource configuration, create a new file <code>/etc/drbd.d/mon.res</code> with the following content:</p>

<pre><code>resource mon {
  device    /dev/drbd0;
  disk      /dev/mapper/vg00-ceph--mon;
  meta-disk internal;
  on floating-mon-01 {
    address   10.20.1.41:7790;
  }
  on floating-mon-02 {
    address   10.20.1.42:7790;
  }
}
</code></pre>

<p>Check your configuration:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo drbdadm dump all
</span><span class='line'>...
</span><span class='line'>...
</span><span class='line'><span class="nv">$ </span><span class="nb">echo</span> <span class="nv">$?</span>
</span><span class='line'>0
</span></code></pre></td></tr></table></div></figure>


<p>Wipe the first 128 MB of the logical volume:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>/dev/mapper/vg00-ceph--mon <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>128
</span><span class='line'>...
</span></code></pre></td></tr></table></div></figure>


<p>Initialize the DRBD metadata for your resource:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo drbdadm -- --ignore-sanity-checks create-md mon
</span><span class='line'>Writing meta data...
</span><span class='line'>initializing activity log
</span><span class='line'>NOT initialized bitmap
</span><span class='line'>New drbd meta data block successfully created.
</span><span class='line'>success
</span></code></pre></td></tr></table></div></figure>


<p>Activate the resource:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo modprobe drbd
</span><span class='line'><span class="nv">$ </span>sudo drbdadm up mon
</span></code></pre></td></tr></table></div></figure>


<p>Promote one node to primary (run this on one node only):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo drbdadm -- --overwrite-data-of-peer primary mon
</span></code></pre></td></tr></table></div></figure>


<p>Now DRBD is syncing blocks, so simply wait until the sync is complete:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo cat /proc/drbd
</span><span class='line'>version: 8.3.11 <span class="o">(</span>api:88/proto:86-96<span class="o">)</span>
</span><span class='line'>srcversion: 71955441799F513ACA6DA60
</span><span class='line'> 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
</span><span class='line'>    ns:69504 nr:0 dw:0 dr:70168 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5173180
</span><span class='line'>    <span class="o">[</span>&gt;....................<span class="o">]</span> sync<span class="err">&#39;</span>ed:  1.5% <span class="o">(</span>5048/5116<span class="o">)</span>Mfinish: 1:04:20 speed: 1,336 <span class="o">(</span>1,284<span class="o">)</span> K/sec
</span></code></pre></td></tr></table></div></figure>


<p>At the end you should have something like this:</p>

<pre><code>version: 8.3.11 (api:88/proto:86-96)
srcversion: 71955441799F513ACA6DA60 
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:5242684 nr:0 dw:0 dr:5243348 al:0 bm:320 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
</code></pre>
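
<p>Rather than re-running <code>cat /proc/drbd</code> by hand, you can poll for the final state. A minimal sketch, assuming a single DRBD resource as in this setup (with several resources the check would succeed as soon as any one of them is in sync); <code>drbd_in_sync</code> is a hypothetical helper name:</p>

<pre><code># Succeed when the /proc/drbd text on stdin reports both disks as UpToDate.
# Assumes a single DRBD resource.
drbd_in_sync() {
    grep -q 'ds:UpToDate/UpToDate'
}

# Poll every 10 seconds until the initial sync has finished:
# until cat /proc/drbd | drbd_in_sync; do sleep 10; done
</code></pre>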

<p>Format your new device and perform a quick test:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo mkfs.ext4 /dev/drbd0
</span><span class='line'><span class="nv">$ </span>sudo mount /dev/drbd0 /mnt/
</span><span class='line'><span class="nv">$ </span>sudo drbd-overview
</span><span class='line'>  0:mon  Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt ext4 5.0G 204M 4.6G 5%
</span><span class='line'><span class="nv">$ </span>sudo umount /mnt
</span></code></pre></td></tr></table></div></figure>


<h2>II.2. Build the third monitor</h2>

<p>For this setup, we are going to build the third monitor from scratch:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo mkdir -p /srv/ceph/mon0
</span><span class='line'><span class="nv">$ </span>sudo mount /dev/drbd0 /srv/ceph/mon0
</span></code></pre></td></tr></table></div></figure>


<p>Retrieve the current monitor key from the cluster:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo ceph auth get mon. -o /tmp/monkey
</span><span class='line'>exported keyring <span class="k">for </span>mon.
</span></code></pre></td></tr></table></div></figure>


<p>Grab the current monitor map:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo ceph mon getmap -o /tmp/lamap
</span><span class='line'>got latest monmap
</span></code></pre></td></tr></table></div></figure>


<p>Examine it (if you&#8217;re curious):</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo monmaptool --print /tmp/lamap
</span><span class='line'>monmaptool: monmap file /tmp/lamap
</span><span class='line'>epoch 6
</span><span class='line'>fsid eb2efd30-64c7-4e8f-b2fd-81c2923e96cd
</span><span class='line'>last_changed 2013-01-22 14:37:00.654689
</span><span class='line'>created 2013-01-11 11:34:03.779220
</span><span class='line'>0: 172.17.1.11:6789/0 mon.1
</span><span class='line'>1: 172.17.1.12:6789/0 mon.2
</span></code></pre></td></tr></table></div></figure>
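
<p>To get just the current members out of the map (handy to confirm that the floating monitor is not in there yet), the output can be filtered. This is a sketch that assumes the <code>rank: address name</code> line format printed above; <code>monmap_members</code> is a hypothetical helper name:</p>

<pre><code># List "name address" for each monitor found in `monmaptool --print` output on stdin.
# Member lines start with a numeric rank followed by a colon, e.g. "0: 172.17.1.11:6789/0 mon.1".
monmap_members() {
    awk '/^[0-9]+: / { print $3, $2 }'
}

# Usage: sudo monmaptool --print /tmp/lamap | monmap_members
</code></pre>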


<p>Initialize the mon data directory:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo ceph-mon -i 0 --mkfs --monmap /tmp/lamap --keyring /tmp/monkey
</span><span class='line'>...
</span><span class='line'>...
</span></code></pre></td></tr></table></div></figure>


<p>Edit the <code>ceph.conf</code> on <code>floating-mon-01</code> and add the third monitor:</p>

<pre><code>[mon.0]
    host = floating-mon-01
    mon addr = 172.17.1.100:6789
</code></pre>

<p>Then do the same on <code>floating-mon-02</code>:</p>

<pre><code>[mon.0]
    host = floating-mon-02
    mon addr = 172.17.1.100:6789
</code></pre>

<p>In the other configuration files, the <code>host</code> flag doesn&#8217;t really matter. It only matters on the node hosting the resource, because this is what the Ceph init script reads first to manage the services. For the clients, the only thing that matters is the IP address:</p>

<pre><code>[mon]
    mon data = /srv/ceph/mon$id
    mon osd down out interval = 60
[mon.0]
    mon addr = 172.17.1.100:6789
[mon.1]
    host = mon-02
    mon addr = 172.17.1.11:6789
[mon.2]
    host = mon-03
    mon addr = 172.17.1.12:6789
</code></pre>

<h2>II.3. Pacemaker setup</h2>

<p>First install Pacemaker:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo apt-get install pacemaker -y
</span></code></pre></td></tr></table></div></figure>


<p>Now open the cluster configuration through the crm shell and edit your cluster properties:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo crm configure edit
</span></code></pre></td></tr></table></div></figure>


<p>Put the following:</p>

<pre><code>    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    pe-warn-series-max="1000" \
    pe-input-series-max="1000" \
    pe-error-series-max="1000" \
    cluster-recheck-interval="5min"
rsc_defaults $id="rsc-options" \
    resource-stickiness="500"
</code></pre>

<h3>II.3.1. Cluster resources</h3>

<p>DRBD resource:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>primitive drbd_mon ocf:linbit:drbd <span class="se">\</span>
</span><span class='line'>        params <span class="nv">drbd_resource</span><span class="o">=</span><span class="s2">&quot;mon&quot;</span> <span class="se">\</span>
</span><span class='line'>        op start <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;90s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op stop <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;180s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op promote <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;180s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op demote <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;180s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op monitor <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;29s&quot;</span> <span class="nv">role</span><span class="o">=</span><span class="s2">&quot;Master&quot;</span> <span class="se">\</span>
</span><span class='line'>        op monitor <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;31s&quot;</span> <span class="nv">role</span><span class="o">=</span><span class="s2">&quot;Slave&quot;</span>
</span><span class='line'>ms ms_drbd_mon drbd_mon <span class="se">\</span>
</span><span class='line'>        meta master-max<span class="o">=</span><span class="s2">&quot;1&quot;</span> master-node-max<span class="o">=</span><span class="s2">&quot;1&quot;</span> clone-max<span class="o">=</span><span class="s2">&quot;2&quot;</span> clone-node-max<span class="o">=</span><span class="s2">&quot;1&quot;</span> <span class="nv">notify</span><span class="o">=</span><span class="s2">&quot;true&quot;</span> target-role<span class="o">=</span><span class="s2">&quot;Started&quot;</span>
</span><span class='line'>colocation col_mon_on_drbd inf: g_ceph_mon ms_drbd_mon:Master
</span><span class='line'>order ord_mon_after_drbd inf: ms_drbd_mon:promote g_ceph_mon:start
</span></code></pre></td></tr></table></div></figure>


<p>Filesystem resource:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>primitive p_fs_mon ocf:heartbeat:Filesystem <span class="se">\</span>
</span><span class='line'>        params <span class="nv">device</span><span class="o">=</span><span class="s2">&quot;/dev/drbd/by-res/mon&quot;</span> <span class="nv">directory</span><span class="o">=</span><span class="s2">&quot;/srv/ceph/mon0&quot;</span> <span class="nv">fstype</span><span class="o">=</span><span class="s2">&quot;ext4&quot;</span> <span class="nv">options</span><span class="o">=</span><span class="s2">&quot;noatime,nodiratime&quot;</span> <span class="se">\</span>
</span><span class='line'>        op start <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;60s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op stop <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;180s&quot;</span> <span class="se">\</span>
</span><span class='line'>        op monitor <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;60s&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;60s&quot;</span>
</span></code></pre></td></tr></table></div></figure>
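
<p>The resource group and the final status output in this article also reference a virtual IP resource, <code>p_vip_mon</code>, whose definition is not shown. A minimal sketch of what it could look like, assuming the floating monitor address <code>172.17.1.100</code> from <code>ceph.conf</code>; the <code>/24</code> netmask and the operation timings are assumptions, so adapt them to your network:</p>

<pre><code>primitive p_vip_mon ocf:heartbeat:IPaddr2 \
        params ip="172.17.1.100" cidr_netmask="24" \
        op monitor interval="10s" timeout="20s"
</code></pre>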


<p>The monitor resource. Note that the Ceph resource agents are a dependency of the <code>ceph</code> package; you can find them in <code>/usr/lib/ocf/resource.d/ceph</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>primitive p_ceph_mon ocf:ceph:mon <span class="se">\</span>
</span><span class='line'>    op start <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;60s&quot;</span> <span class="se">\</span>
</span><span class='line'>    op stop <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;180s&quot;</span> <span class="se">\</span>
</span><span class='line'>    op monitor <span class="nv">interval</span><span class="o">=</span><span class="s2">&quot;10s&quot;</span> <span class="nv">timeout</span><span class="o">=</span><span class="s2">&quot;30s&quot;</span>
</span></code></pre></td></tr></table></div></figure>


<p>Group all the resources together:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>group g_ceph_mon p_fs_mon p_vip_mon p_ceph_mon
</span></code></pre></td></tr></table></div></figure>


<h3>II.3.2. Final design</h3>

<p>At the end, you should see:</p>

<pre><code>============
Last updated: Wed Jan 23 00:15:48 2013
Last change: Wed Jan 23 00:15:48 2013 via cibadmin on floating-mon-01
Stack: openais
Current DC: floating-mon-01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ floating-mon-01 floating-mon-02 ]

 Master/Slave Set: ms_drbd_mon [drbd_mon]
     Masters: [ floating-mon-01 ]
     Slaves: [ floating-mon-02 ]
 Resource Group: g_ceph_mon
     p_fs_mon   (ocf::heartbeat:Filesystem):    Started floating-mon-01
     p_vip_mon  (ocf::heartbeat:IPaddr2):       Started floating-mon-01
     p_ceph_mon (ocf::ceph:mon):        Started floating-mon-01
</code></pre>

<h2>II.4. CRUSH Map</h2>

<p>The setup is the following:</p>

<ul>
<li>2 datacenters</li>
<li>N OSDs, preferably an even number of OSDs on each location</li>
</ul>


<p>Let&#8217;s say we want to end up with the following topology:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd tree
</span><span class='line'>dumped osdmap tree epoch 621
</span><span class='line'><span class="c"># id    weight  type name   up/down reweight</span>
</span><span class='line'>-1  12  root default
</span><span class='line'>-3  12      datacenter dc-1
</span><span class='line'>-2  3           host ceph-01
</span><span class='line'>0   1               osd.0   up  1
</span><span class='line'>1   1               osd.1   up  1
</span><span class='line'>-4  3           host ceph-02
</span><span class='line'>2   1               osd.2   up  1
</span><span class='line'>3   1               osd.3   up  1
</span><span class='line'>-5            datacenter dc-2
</span><span class='line'>-6  3           host ceph-03
</span><span class='line'>4   1               osd.4   up  1
</span><span class='line'>5   1               osd.5   up  1
</span><span class='line'>-9  3           host ceph-04
</span><span class='line'>6   1               osd.6   up  1
</span><span class='line'>7   1               osd.7   up  1
</span></code></pre></td></tr></table></div></figure>


<p>Retrieve your CRUSH map, decompile it, and fill it in with all your hosts and locations:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd getcrushmap -o ma-crush-map
</span><span class='line'><span class="nv">$ </span>crushtool -d ma-crush-map -o ma-crush-map.txt
</span></code></pre></td></tr></table></div></figure>


<p>Your CRUSH Map:</p>

<pre><code># begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-01 {
    id -2       # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
host ceph-02 {
    id -4       # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 1.000
    item osd.3 weight 1.000
}
host ceph-03 {
    id -6       # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0  # rjenkins1
    item osd.4 weight 1.000
    item osd.5 weight 1.000
}   
host ceph-04 {
    id -9       # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0  # rjenkins1
    item osd.6 weight 1.000
    item osd.7 weight 1.000
} 
datacenter dc-1 {
    id -3          # do not change unnecessarily
    # weight 4.000
    alg straw
    hash 0  # rjenkins1
    item ceph-01 weight 2.000
    item ceph-02 weight 2.000
}
datacenter dc-2 {
    id -5          # do not change unnecessarily
    # weight 4.000
    alg straw
    hash 0  # rjenkins1
    item ceph-03 weight 2.000
    item ceph-04 weight 2.000
}

# end crush map
</code></pre>

<br />


<h3>II.4.1. Add a bucket</h3>

<p>Add a root bucket that contains both datacenters:</p>

<pre><code>root default {
    id -1           # do not change unnecessarily
    # weight 8.000
    alg straw
    hash 0  # rjenkins1
    item dc-1 weight 4.000
    item dc-2 weight 4.000
}
</code></pre>
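<p>Before recompiling, it can be worth checking that the <code># weight</code> comments and the item weights of parent buckets agree with what each bucket actually contains. The sketch below only parses the text file produced by <code>crushtool -d</code> and sums the <code>item ... weight</code> lines per bucket; <code>bucket_weights</code> is a hypothetical helper name, not a Ceph command.</p>

```shell
# Sum the item weights of every bucket in a decompiled CRUSH map.
# bucket_weights is a hypothetical helper, not part of Ceph.
bucket_weights() {
    awk '
        # A block opens with "type name {": remember its name, reset counters.
        $3 == "{" { name = $2; total = 0; items = 0 }
        # Inside a bucket: "item NAME weight VALUE".
        $1 == "item" { total += $4; items++ }
        # Closing brace: print the sum, skipping blocks without items (rules).
        $1 == "}" { if (items) printf "%s %.3f\n", name, total; items = 0 }
    ' "$1"
}

# Usage: bucket_weights ma-crush-map.txt
```

<p>Each output line is a bucket name followed by the sum of its item weights, which you can compare with the weight its parent bucket assigns to it.</p>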

<h3>II.4.2. Add a rule</h3>

<p>Add a rule for the newly created bucket:</p>

<pre><code># rules
rule dc {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}
</code></pre>

<p>Finally, recompile and inject the new CRUSH map:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>crushtool -c ma-crush-map.txt -o ma-nouvelle-crush-map
</span><span class='line'><span class="nv">$ </span>ceph osd setcrushmap -i ma-nouvelle-crush-map
</span></code></pre></td></tr></table></div></figure>


<p>Since we assigned <code>ruleset 0</code> to <code>rule dc</code>, every pool uses it by default. Thus we don&#8217;t need to specify a <code>crush_ruleset</code> each time we create a pool :-).</p>

<br />


<h1>III. Break it!</h1>

<p>To simulate the crash of one DC, the machines hosting the monitors on the active side (DC-1) were shut down and the resources migrated to DC-2. In the meantime, <code>ceph -s</code> was run in a loop to measure how long the interruption lasted.</p>

<pre><code>health HEALTH_OK
monmap e7: 3 mons at {0=172.17.1.100:6789/0,1=172.17.1.11:6789/0,2=172.17.1.12:6789/0}, election epoch 16, quorum 0,1,2 0,1,2
osdmap e142: 8 osds: 8 up, 8 in
pgmap v20300: 1576 pgs: 1576 active+clean; 1300 MB data, 2776 MB used, 394 GB / 396 GB avail
mdsmap e61: 1/1/1 up {0=0=up:active}

real    0m0.011s
user    0m0.004s
sys 0m0.004s

health HEALTH_WARN 1 mons down, quorum 0,2 0,1
monmap e7: 3 mons at {0=172.17.1.100:6789/0,1=172.17.1.11:6789/0,2=172.17.1.12:6789/0}, election epoch 18, quorum 0,2 0,1
osdmap e142: 8 osds: 8 up, 8 in
pgmap v20300: 1576 pgs: 1576 active+clean; 1300 MB data, 2776 MB used, 394 GB / 396 GB avail
mdsmap e61: 1/1/1 up {0=0=up:active}

real    0m5.336s
user    0m0.004s
sys 0m0.004s

health HEALTH_WARN 1 mons down, quorum 0,2 0,1
monmap e7: 3 mons at {0=172.17.1.100:6789/0,1=172.17.1.11:6789/0,2=172.17.1.12:6789/0}, election epoch 18, quorum 0,2 0,1
osdmap e142: 8 osds: 8 up, 8 in
pgmap v20300: 1576 pgs: 1576 active+clean; 1300 MB data, 2776 MB used, 394 GB / 396 GB avail
mdsmap e61: 1/1/1 up {0=0=up:active}

real    0m0.011s
user    0m0.000s
sys 0m0.008s

health HEALTH_WARN 1 mons down, quorum 0,2 0,1
monmap e7: 3 mons at {0=172.17.1.100:6789/0,1=172.17.1.11:6789/0,2=172.17.1.12:6789/0}, election epoch 18, quorum 0,2 0,1
osdmap e142: 8 osds: 8 up, 8 in
pgmap v20300: 1576 pgs: 1576 active+clean; 1300 MB data, 2776 MB used, 394 GB / 396 GB avail
mdsmap e61: 1/1/1 up {0=0=up:active}

real    0m0.011s
user    0m0.004s
sys 0m0.004s
</code></pre>
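<p>The polling loop that produced this output can be sketched as follows. This is a minimal sketch, not the exact command used for the test: <code>poll_status</code> is a hypothetical helper, and it measures elapsed time in whole seconds with <code>date</code>, whereas the run above used <code>time</code>.</p>

```shell
# Run a command repeatedly and print how many seconds each call took.
# poll_status is a hypothetical helper: first argument is the number of
# iterations, the remaining arguments are the command to run.
poll_status() (
    count="$1"
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s)
        "$@"
        end=$(date +%s)
        echo "elapsed: $((end - start))s"
        i=$((i + 1))
        sleep 1
    done
)

# Usage: poll_status 60 ceph -s
```

<p>An iteration that takes noticeably longer than the others marks the window during which the monitors were unreachable.</p>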

<br />


<blockquote><p>As you can see, the results are quite encouraging: they show around 5 seconds of downtime. Obviously this wasn&#8217;t a real-life scenario since no writes were running; even so, I assume they would simply have been delayed. As a reminder, <strong>please</strong> keep in mind that this setup is experimental. Some of you might consider it, but I strongly recommend that you perform far more tests than I did. It was a first shot with a pretty encouraging outcome, I believe. As always, feel free to criticize, comment and start interesting discussions in the comments section. Of course such a setup has a downside, notably for the directories that store monitor data: some files, such as the ones in <code>$mon_root/logm/</code> and <code>$mon_root/pgmap/</code>, are actively changing, so the failover might lead to some weird issues&#8230; <strong>So once again this needs to be heavily tested</strong> ;-).</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[OpenStack Nova and availability zones]]></title>
    <link href="http://sebastien-han.fr/blog/2013/01/24/openstack-nova-play-with-availability-zones/"/>
    <updated>2013-01-24T17:22:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/01/24/openstack-nova-play-with-availability-zones</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/nova-az.jpg" title="OpenStack Nova and availability zones" ></p>

<p>Availability zones in OpenStack. The main purpose of this article is to play a bit with them.</p>

<!--more-->


<p>The good thing about availability zones is that you can manage and isolate different entities in your infrastructure. For instance, if some customers need really fast VMs, you can host them on your super expensive compute rack full of SSDs. You can then boot all of that customer&#8217;s instances this way:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>nova boot bla bla bla --availability-zone &lt;zone&gt;:&lt;compute-node&gt;
</span></code></pre></td></tr></table></div></figure>


<p>Everything is managed by the following nova flag; the only thing you have to do is define a name for your zone (obviously, pick something relevant):</p>

<pre><code># Availability zone
node_availability_zone=le_rack_du_seb
</code></pre>

<p>You could end up with output similar to this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo nova-manage service list
</span><span class='line'>Binary           Host                                 Zone                    Status     State Updated_At
</span><span class='line'>nova-cert        c2-controller-01                     le_rack_du_seb          enabled    :-<span class="o">)</span>   2012-12-11 16:22:56
</span><span class='line'>nova-scheduler   c2-controller-01                     le_rack_du_seb          enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-consoleauth c2-controller-01                     le_rack_du_seb          enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-network     c2-compute-01                        le_rack_du_seb-01       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-compute     c2-compute-01                        le_rack_du_seb-01       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-compute     c2-compute-02                        le_rack_du_seb-01       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-network     c2-compute-02                        le_rack_du_seb-01       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-compute     c2-compute-03                        le_rack_du_seb-02       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-network     c2-compute-03                        le_rack_du_seb-02       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-compute     c2-compute-04                        le_rack_du_seb-02       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span><span class='line'>nova-network     c2-compute-04                        le_rack_du_seb-02       enabled    :-<span class="o">)</span>   2012-12-11 16:23:32
</span></code></pre></td></tr></table></div></figure>
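<p>To see which compute nodes ended up in a given zone, the listing above can be filtered with awk. This is only a sketch: <code>hosts_in_zone</code> is a hypothetical helper that reads the listing from standard input and assumes the column layout shown above (binary, host, zone).</p>

```shell
# Print the hosts running nova-compute in the given availability zone.
# hosts_in_zone is a hypothetical helper; it expects the output of
# "nova-manage service list" on standard input.
hosts_in_zone() {
    awk -v zone="$1" '$1 == "nova-compute" { if ($3 == zone) print $2 }'
}

# Usage: sudo nova-manage service list | hosts_in_zone le_rack_du_seb-01
```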


<br />


<blockquote><p>That&#8217;s all! As always I hope it helps ;-)</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ceph and memory profiling]]></title>
    <link href="http://sebastien-han.fr/blog/2013/01/17/ceph-and-memory-profiling/"/>
    <updated>2013-01-17T16:46:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/01/17/ceph-and-memory-profiling</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/ceph-knowledge.jpg" title="Ceph and memory profiling" ></p>

<p>How to use a memory profiler to track memory usage of Ceph daemons!</p>

<!--more-->


<p>To start tracking right away at daemon startup, simply put the following variable in the <code>/etc/init.d/ceph</code> script and start your OSD daemon:</p>

<pre><code>export CEPH_HEAP_PROFILER_INIT=1
</code></pre>

<p>The position doesn&#8217;t really matter ;-).</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo service ceph start osd.0
</span><span class='line'><span class="o">===</span> osd.0 <span class="o">===</span>
</span><span class='line'>Starting Ceph osd.0 on ceph-01...
</span><span class='line'>Starting tracking the heap
</span><span class='line'>starting osd.0 at :/0 osd_data /srv/ceph/osd0 /journal/journal
</span></code></pre></td></tr></table></div></figure>


<p>Start the profiler:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd tell 0 heap start_profiler
</span><span class='line'>ok
</span></code></pre></td></tr></table></div></figure>


<p>Ceph log shows:</p>

<pre><code>osd.0 [INF] osd.0 started profiler 
</code></pre>

<p>Let the profiler run and, after some hours, dump the results into a file:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd tell 0 heap dump
</span><span class='line'>ok
</span></code></pre></td></tr></table></div></figure>


<p>Ceph log shows:</p>

<pre><code>osd.0 [INF] osd.0dumping heap profile now.
osd.0 [INF] ------------------------------------------------
osd.0 [INF] MALLOC:       10810792 (   10.3 MB) Bytes in use by application
osd.0 [INF] MALLOC: +       438272 (    0.4 MB) Bytes in page heap freelist
osd.0 [INF] MALLOC: +       172656 (    0.2 MB) Bytes in central cache freelist
osd.0 [INF] MALLOC: +       165632 (    0.2 MB) Bytes in transfer cache freelist
osd.0 [INF] MALLOC: +      2044136 (    1.9 MB) Bytes in thread cache freelists
osd.0 [INF] MALLOC: +       786432 (    0.8 MB) Bytes in malloc metadata
osd.0 [INF] MALLOC:   ------------
osd.0 [INF] MALLOC: =     14417920 (   13.8 MB) Actual memory used (physical + swap)
osd.0 [INF] MALLOC: +            0 (    0.0 MB) Bytes released to OS (aka unmapped)
osd.0 [INF] MALLOC:   ------------
osd.0 [INF] MALLOC: =     14417920 (   13.8 MB) Virtual address space used
osd.0 [INF] MALLOC:
osd.0 [INF] MALLOC:           2669              Spans in use
osd.0 [INF] MALLOC:             36              Thread heaps in use
osd.0 [INF] MALLOC:           4096              Tcmalloc page size
osd.0 [INF] ------------------------------------------------
osd.0 [INF] Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
osd.0 [INF] Bytes released to the OS take 
</code></pre>
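<p>When you compare several dumps to track a suspected leak, it helps to pull a single figure out of these log lines. The sketch below extracts the &#8220;Bytes in use by application&#8221; value from every dump found in a log file; <code>heap_in_use</code> is a hypothetical helper, and the log path in the usage line is an assumption about where your cluster log lives.</p>

```shell
# Print the "Bytes in use by application" figure of each heap dump in a log.
# heap_in_use is a hypothetical helper; it takes the log file as argument.
heap_in_use() {
    sed -n 's/.*MALLOC: *\([0-9][0-9]*\) .*Bytes in use by application.*/\1/p' "$1"
}

# Usage (log path is an assumption): heap_in_use /var/log/ceph/ceph.log
```

<p>A value that keeps growing from one dump to the next is a hint that the daemon is holding on to memory.</p>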

<p>Stop the profiler:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>ceph osd tell 0 heap stop_profiler
</span><span class='line'>ok
</span></code></pre></td></tr></table></div></figure>


<p>Log shows:</p>

<pre><code>osd.0 [INF] osd.0 stopped profiler
</code></pre>

<p>Read the <code>.heap</code> file with Google&#8217;s pprof tool:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo apt-get install google-perftools -y
</span><span class='line'><span class="nv">$ </span>sudo google-pprof /usr/bin/ceph-osd -gv osd-0001.heap
</span></code></pre></td></tr></table></div></figure>


<p>Hint: if you don&#8217;t have a graphical interface, you can ask for text output instead:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'><span class="nv">$ </span>sudo google-pprof /usr/bin/ceph-osd --text osd-0001.heap
</span></code></pre></td></tr></table></div></figure>




<br />


<blockquote><p>I had to use a memory profiler because I recently noticed some memory leaks from Ceph OSDs. This has been discussed on the <a href="http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11000.html">Ceph Mailing List</a>.</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Sound problems with Chrome 24 on Mac OS X 10.8.2]]></title>
    <link href="http://sebastien-han.fr/blog/2013/01/15/sound-problem-with-chrome-24-on-mac-os-x-10-dot-8-2/"/>
    <updated>2013-01-15T19:26:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/01/15/sound-problem-with-chrome-24-on-mac-os-x-10-dot-8-2</id>
    <content type="html"><![CDATA[<p>Shortest article ever written on my blog&#8230;</p>

<p>From time to time I noticed that switching from my local audio source to my AirPlay station (and the other way around) somehow muted the sound in Chrome. After some googling I found that the problem came from a Flash Player plugin. You just need to disable it.</p>

<p>To do so, go to <code>chrome://plugins</code>, click Show details, and then disable the PPAPI plugin located in:</p>

<pre><code>/Applications/Google Chrome.app/Contents/Versions/24.0.1312.52/Google Chrome Framework.framework/Internet Plug-Ins/PepperFlash/PepperFlashPlayer.plugin
</code></pre>

<br />


<blockquote><p>Hope it helps!</p></blockquote>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Disable CephX for v0.55 and higher]]></title>
    <link href="http://sebastien-han.fr/blog/2013/01/11/disable-cephx-for-v0-dot-55-and-higher/"/>
    <updated>2013-01-11T01:34:00+01:00</updated>
    <id>http://sebastien-han.fr/blog/2013/01/11/disable-cephx-for-v0-dot-55-and-higher</id>
    <content type="html"><![CDATA[<p><img class="center" src="http://sebastien-han.fr/images/disable-cephx.jpg" title="Disable CephX for v0.55 and higher" ></p>

<p>A lot of new features came with version 0.55 of Ceph; one of them is that CephX authentication is <strong>enabled by default</strong>. If you run v0.48 Argonaut without CephX and want to update to the latest Bobtail, you might run into some problems if you don&#8217;t edit your configuration file.</p>

<!--more-->


<p>In previous stable versions, you could simply use the following setting:</p>

<pre><code> auth supported = [cephx | none]
</code></pre>

<p>This option is now deprecated. Bobtail introduces finer-grained authentication and supports 3 new authentication settings:</p>

<ul>
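<li>cluster: daemons authenticating with each other inside the cluster, for instance OSD to OSD connections</li>
<li>service: whether the cluster daemons require clients to authenticate before they can access services</li>
<li>client: client side, a machine that tries to connect to the cluster</li>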
</ul>


<p>By default <strong>daemons</strong> require CephX authentication, which means that OSD, MON and MDS will now use CephX to connect to each other. On the other hand, <strong>clients</strong> will continue to connect with authentication disabled. To disable authentication for the daemons:</p>

<pre><code>[global]
...
auth cluster required = none    
auth service required = none
...
</code></pre>

<p>To disable client authentication as well:</p>

<pre><code>[global]
...
auth client required = none
...
</code></pre>

<br />


<p><span class="text_quote">W </span>Important: if your cluster does not currently have an <code>auth supported</code> line that enables authentication, you must explicitly turn it off in Bobtail using the settings above.</p>
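<p>Before upgrading, a quick way to see which of these settings a configuration file already carries is to grep for them. This is a minimal sketch: <code>auth_settings</code> is a hypothetical helper, it accepts option names written with spaces or underscores, and the path in the usage line assumes the default <code>/etc/ceph/ceph.conf</code>.</p>

```shell
# List the old and new authentication settings found in a ceph.conf file.
# auth_settings is a hypothetical helper; it takes the file as argument.
auth_settings() {
    grep -E '^[[:space:]]*auth[ _](supported|(cluster|service|client)[ _]required)[[:space:]]*=' "$1"
}

# Usage (path is an assumption): auth_settings /etc/ceph/ceph.conf
```

<p>No output means the file sets neither the old option nor the new ones, which is exactly the case the note above warns about.</p>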

<br />


<blockquote><p>Et voilà !</p></blockquote>
]]></content>
  </entry>
  
</feed>
