Another feature preview for Jewel: an NBD driver for RBD that allows librbd to present a kernel-level block device.
NBD has numerous advantages compared to the kernel RBD driver.
The idea of rbd-nbd is to rely on the userspace implementation of librbd, which is robust and stable, through the strong and well-established NBD (Network Block Device) kernel module.
It is pretty simple to map a device with NBD:
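A minimal sketch, assuming an image named rbd/myimage already exists:

$ sudo rbd-nbd map rbd/myimage
/dev/nbd0
# the device can now be used like any other block device
$ sudo mkfs.xfs /dev/nbd0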
TADA!
The RBD mirroring feature will be available in the next stable Ceph release: Jewel.
What we are trying to solve here, or at least overcome, is the synchronous nature of Ceph. This implies that the Ceph block storage solution (called RBD) has trouble working decently across regions. As a reminder, Ceph only considers a write complete when all the replicas of an object are written. That is why setting up a stretched cluster across long distances is generally not a good idea, since latencies are high: the cluster would have to wait until all the writes are completed, and it might take a considerable amount of time for the client to get its acknowledgement.
As a result, we need a mechanism that allows us to replicate block devices between clusters stored in different regions. Such a method will help us for different purposes:
A new daemon, rbd-mirror, will be responsible for synchronising images from one cluster to another.
The daemon will be configured on both sites and it will simultaneously connect to both local and remote clusters.
Initially, with the Jewel release, there will be a one-to-one relationship between daemons; in the future this will expand to one-to-N.
So after Jewel, you will have the ability to configure one site with multiple backup target sites.
As a starting point, it will connect to the other Ceph cluster using a configuration file (to find the monitors), a user and a key.
Using the admin user for this is just fine.
The rbd-mirror daemon uses the cephx mechanism to authenticate with the monitors, the usual and default method within Ceph.
In order to know each other, each daemon will have to register the other cluster as a peer.
This is set at the pool level with the command rbd mirror pool peer add.
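A sketch of the invocation, assuming the local pool is called rbd and the remote cluster is named remote:

$ rbd mirror pool peer add rbd client.admin@remote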
Peer info can be retrieved like so:
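Something like the following, assuming the setup above (the UUID shown is made up):

$ rbd mirror pool info rbd
Mode: pool
Peers:
  UUID                                 NAME   CLIENT
  3ba04dbb-...                         remote client.admin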
Later, the UUID can be used to remove a peer if needed.
The RBD mirroring relies on two new RBD image features:

- journaling: enables journaling for every transaction on the image
- mirroring: explicitly tells the rbd-mirror daemon to replicate this image

There will also be commands and librbd calls to enable and disable mirroring for individual images.

The journal maintains a list of records for all the transactions on the image. It can be seen as another RBD image (a bunch of RADOS objects) that lives in the cluster. In general, a write first goes to the journal, returns to the client, and is then written to the underlying RBD image. For performance reasons, this journal can sit on a different pool from the image it is journaling for. Currently there is one journal per RBD image. This will likely stay this way until we introduce consistency groups in Ceph. For those of you who are not familiar with the concept, a consistency group is an entity that manages a bunch of volumes (i.e. RBD images) so they can be treated as one. This allows you to perform operations such as snapshots on all the volumes in the group, with the guarantee that they are all in the same consistent state. So when consistency groups become available in Ceph, we will use a single journal for all the RBD images that are part of the group.
So now, I know some of you are already thinking: “can I enable journaling on an existing image?” Yes you can! Ceph will simply take a snapshot of your image, do an RBD copy of that snapshot, and start journaling after that. It is just a background task.
The mirroring can be enabled and disabled on an entire pool or on a per-image basis.
If it is enabled on a pool, every image that has the journaling feature enabled will get replicated by the mirror agent.
This can be enabled with the command rbd mirror pool enable.
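For instance, assuming a pool named rbd mirrored in 'pool' mode:

$ rbd mirror pool enable rbd pool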
To configure an image you can run the following:
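Something along these lines (image name and size are examples) creates an image with the required features from the start:

$ rbd create rbd/leseb --size 1024 --image-feature exclusive-lock,journaling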
The features can also be activated on the fly using rbd feature enable rbd/leseb exclusive-lock && rbd feature enable rbd/leseb journaling.
Doing cross-sync replication is possible, and it is even the default way it is implemented. This means that pool names across both locations must be the same, and thus images will get the exact same name on the other cluster. This brings two challenges:
Each image has a mirroring_directory object that contains a tag about the location of the active site.
The images on the local site are promoted to ‘primary’ (with rbd mirror image promote) and are writable, whereas the backup images on the remote site hold a lock.
The lock means that such an image cannot be written to (read-only mode) until it is promoted to primary and the primary demoted (with rbd mirror image demote).
So this is where the promotion and demotion functionality comes in.
As soon as the backup image gets promoted to primary, the original image gets demoted to secondary.
This means the synchronisation happens the other way around: the backup location becomes primary and performs the synchronisation to the site that was originally primary.
If the platform is affected by a split-brain situation, the rbd-mirror daemon will not attempt any sync, so you will have to find the good image by yourself.
That means manually forcing a resync from what you consider to be the most up-to-date image. For this you can run rbd mirror image resync.
For now, the mirroring feature requires running both environments on the same L2 segment, so the clusters can reach each other.
Later, a new daemon could be implemented, possibly called mirroring-proxy, which would be responsible for relaying mirroring requests.
Here again, we will face two new challenges, so it is a bit of a balance between security and performance.
In terms of implementation, the daemon will likely require a dedicated server to operate, as it needs a lot of bandwidth; a server with multiple network cards is ideal. Unfortunately, the initial version of the rbd-mirror daemon will not have any high availability: only a single instance of the daemon can run. The ability to run it multiple times on different servers, providing HA and offloading, will likely appear in a later Jewel release.
The RBD mirroring is extremely useful when implementing a disaster recovery scenario for OpenStack; here is a design example:
If you want to learn more about the RBD mirroring design, you can read Josh Durgin’s design draft discussion and the pad from the Ceph Online Summit.
Start by editing your group_vars/all with the following content:
ceph_dev: true
ceph_dev_branch: v10.1.0
monitor_interface: <your interface>
public_network: <your subnet>
osd_objectstore: bluestore
ceph_conf_overrides:
global:
enable experimental unrecoverable data corrupting features: 'bluestore rocksdb'
bluestore fsck on mount: true
bluestore block db size: 67108864
bluestore block wal size: 134217728
bluestore block size: 5368709120
Just one more step: jump into group_vars/osds and activate the fifth OSD scenario using:
bluestore: true
That’s all! Now run Ansible as usual: ansible-playbook site.yml.
Wait a little bit and you should see the following:
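Roughly something like this (a hypothetical status, your IDs and counts will differ):

$ sudo ceph -s
    cluster 80708ad8-...
     health HEALTH_OK
     monmap e1: 1 mons at {ceph-mon0=192.168.42.10:6789/0}
     osdmap e10: 3 osds: 3 up, 3 in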
A new way to efficiently store objects using BlueStore.
As mentioned in an older article, things are moving fast with Ceph, especially around store optimisations. A new initiative was launched earlier this year, code name: NewStore. But what is it about?
NewStore is an implementation where the Ceph journal is stored in RocksDB but the actual objects remain stored on a filesystem. With BlueStore, objects are now stored directly on the block device, without any filesystem interface.
What was wrong with FileStore? As you may know, Ceph stores objects as files on a filesystem; this is basically what we call FileStore. If you want to learn more, please refer to a previous article where I explained Ceph under the hood. So now, let’s explore why the need for BlueStore emerged.
Ceph is a software-defined storage solution, so its main purpose is to keep your data stored safely, and for this we need atomicity.
Unfortunately, no filesystem provides atomic writes/updates, and O_ATOMIC never made it into the kernel.
An attempt to fix this using Btrfs (since it provides atomic transactions) was made but did not really succeed.
Ceph developers had to find an alternative. This alternative you know pretty well: it is the Ceph journal. However, doing write-ahead journaling has a major performance cost, since it basically halves the performance of your disk (when the journal and OSD data share the same disk).
In the context of Ceph, storing objects as files on a POSIX filesystem is not ideal either.
Ceph stores objects using a hash mechanism, so object names appear in a funky way, such as: rbd\udata.371e5017a72.0000000000000000__head_58D36A14__2.
For various operations such as scrubbing, backfill and recovery, Ceph needs to retrieve objects and enumerate them.
However, POSIX does not offer any good way to read the content of a directory in an ordered fashion.
For this, Ceph developers ended up using a couple of ‘hacks’, such as sharding object directories into tiny subdirectories so they could list the content, sort it, and then use it.
But once again, in the end, it is yet another overhead being introduced.
This is an overview of the new architecture with BlueStore:
In terms of layers of abstraction, the setup and its overhead are quite minimal. This is a deep dive into BlueStore:
As we can see, BlueStore has several internal components, but from a general point of view a Ceph object (the actual ‘data’ in the picture) is written directly to the block device. As a consequence, we do not need any filesystem anymore; BlueStore consumes a raw partition directly. The metadata that comes with an OSD is stored in a RocksDB key/value database. Let’s decrypt the layers:
So what do we store in RocksDB?
Now, the default BlueStore model on your disk:
Basically, we will take a disk and partition it in two:
And then an advanced BlueStore model:
What’s fascinating about BlueStore is how flexible it is. Every component can be stored on a different device. In this picture, the RocksDB WAL and DB can be stored on different devices, or on tiny partitions too.
In order to summarize, let me highlight some of the best features from BlueStore:
As soon as Jewel is released, BlueStore will be available, but let’s consider it a tech preview; I’m not sure yet if we should put it in production. To be sure, carefully read the release changelog as soon as it is available.

As we can see, Ceph is putting more and more intelligence into the drive. BlueStore eliminates the need for a dedicated journal device with the help of RocksDB. I haven’t run any benchmarks on BlueStore yet, but it is possible that we will not need any dedicated device for journaling anymore, which brings awesome perspectives in terms of datacenter management. Each OSD is independent, so it can easily be pulled out of one server, plugged into another, and it will just run. This capability is not new; if I remember correctly it was introduced during the Firefly cycle thanks to udev rules on the system. Basically, when you hotplug a disk containing an OSD into a system, this triggers a udev event which activates the OSD. BlueStore simply strengthens this mechanism, given that it removes the need for a dedicated journal device, depending on what you want to achieve with your cluster. Performance with everything running on the same drive should be decent enough, at least for cost/capacity-optimized scenarios and potentially throughput-optimized ones.
Edit your group_vars/all and uncomment the following:
ceph_dev: true
ceph_dev_branch: v10.0.4
Then simply run vagrant up and wait a bit…
After a couple of minutes you will get everything deployed and an output similar to this one:
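Possibly something like this at the end of the run (a hypothetical Ansible recap, your numbers will differ):

PLAY RECAP ********************************************************************
ceph-mon0    : ok=92   changed=25   unreachable=0   failed=0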
One of the recent use cases I had was to migrate an Ubuntu-based Ceph cluster to RHEL. We had strict requirements and did not want any data to be migrated. This is yet another beauty of Ceph, and particularly of OSDs: they basically have the ability to run on any machine. Let’s say you have an OSD: you can pull out the disk and plug it into another machine seamlessly. The approach taken here was a bit different, but relies on this capability.
I built the automation using Ansible. The procedure is rather simple and can be decomposed as follows. Note that all the tasks are serialized.
For monitors:

- the node is reinstalled with the new operating system and does a reboot.

For OSDs:

- we set the noout flag, so no recovery will be triggered
- the node is reinstalled, keeping its ceph.conf
- the node does a reboot.
- once the PGs are back to active+clean, we start another OSD host

In practice, you simply need to:
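A sketch of what this might look like (the playbook name is an assumption, not the original):

$ git clone https://github.com/ceph/ceph-ansible.git
$ cd ceph-ansible
$ ansible-playbook -i inventory migration.yml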
Happy migration! One last note: the procedure does not trigger any data movement, so this operation runs live, is rolling, and does not impact connected clients.
Next week I will be heading to San Jose to attend the OCP (Open Compute) Summit. I will be presenting and demoing my work on containerizing Ceph. This event is really exciting, as most (not to say all) of the biggest Open Compute representatives (Facebook, Quanta, etc.) will be there. So if you come by the Red Hat booth, don’t forget to say hi ;)
Ceph Ansible started as a personal project; the reason was simple: I wanted to have an in-depth look at Ansible, so I immediately thought, why not try to deploy Ceph with Ansible? Moreover, I have never been a huge fan of Puppet, and ceph-deploy was only a couple of months old, so to me Ansible was the right answer.
After almost 2 years of development (first commit), I am glad to announce that ceph-ansible will now have a release cycle. With the help of git tags, we will be providing point-in-time releases with new features.
Over the past 2 years, ceph-ansible has seen some good contributions.
Now, let me give you some news about ceph-ansible's latest capabilities:

- dnf support
- systemd support
- configuration file overrides (ceph.conf)

And many more!
Since Ansible has an option to run cowsay, and since I like the cartoon ”Cow and Chicken”, I’ve been thinking of using character names for the releases :).
Thus the first one is named ”Chicken”.
There are not that many characters (apparently 20), so this won’t last long.
As a personal project, I really see this as an achievement, so I’d like to thank everyone for their support! You can check out the release on Github.
Last week I was attending the Mobile World Congress with Red Hat and I had the chance to demo my work on containerizing Ceph. You will find my presentation in the article :).
Here is the desk where I did my presentation:
Now you can easily ping them.
The summit is almost here, and it is time to vote for the presentations you want to see :). Here are the presentations my colleagues and I submitted:
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Speakers: Sébastien Han, Sean Cohen, Federico Lucifredi
Persistent Containers for Transactional Workloads
Speakers: Sébastien Han, Kyle Bader
How to seemlessly migrate CEPH with PB’s of data from one OS to other with no impact
Speakers: Sébastien Han, Shyam Bollu, Michael DeSimone
I hope to see you there ;)
This weekend I just pushed a new feature in ceph-ansible, which adds the ability to deploy containerized Ceph daemons. It is quite easy to get going with this; simply do the following:
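A rough sketch of the idea (the variable and playbook names are assumptions based on later ceph-ansible versions; check the repository for the exact ones):

$ git clone https://github.com/ceph/ceph-ansible.git
$ cd ceph-ansible
$ cp site-docker.yml.sample site-docker.yml
# enable the containerized deployment in your group_vars, for example:
#   containerized_deployment: true
$ ansible-playbook -i inventory site-docker.yml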
Now move away from the keyboard and go grab a coffee. The process should not last long; most of the time is spent downloading the container images.
More improvements coming soon :)
Typically, when we build a container image, we have 2 main files:

- Dockerfile is the essence of the container: it is what the container is made of, and it generally contains package installation steps and files
- entrypoint.sh is where we configure the container; during the bootstrap sequence this script gets executed. Usually the entrypoint.sh file contains bash instructions.

So the idea is: instead of relying on bash scripting when writing the container’s entrypoint, we can call Ansible to configure it.
Do not forget to replace all the my-application statements with the name of your application ;).
File example for the base image Dockerfile that your application will be using. We simply install Ansible and our application here:
FROM ubuntu:14.04
MAINTAINER Sébastien Han "seb@redhat.com"
# Install prerequisites
RUN apt-get update && \
apt-get install -y python python-dev python-pip python-yaml && \
apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install Ansible
RUN pip install pyyaml ansible
# Install my application
# note: 'apt-get update' is needed again here, since the package lists were removed above
RUN apt-get update && \
apt-get install -y --force-yes my-application && \
apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN mkdir -p /opt/ansible/my-application/
ADD site.yml /opt/ansible/my-application/site.yml
ADD entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
File example for a site.yml; this file will later be used by Ansible inside your application container:
---
# Defines deployment design and assigns role to server groups
- hosts: my-application
connection: local
sudo: True
roles:
- { role: application1, tags: installation }
- { role: application2, tags: configuration }
File example for an application entrypoint.sh. These are the brief instructions needed to run Ansible:
#!/bin/bash
# group_vars/ does not exist in the image yet, create it first
mkdir -p /opt/ansible/my-application/group_vars
cat >/opt/ansible/my-application/inventory <<EOF
[my-application]
127.0.0.1
EOF
cat >/opt/ansible/my-application/group_vars/all <<EOF
foo: bar
foo1: bar1
EOF
cd /opt/ansible/my-application
ansible-playbook -vvv -i inventory site.yml
Now simply run docker run <image> and your container will get configured by Ansible :).
As always, use docker logs -f <container-id> to check the bootstrap process.
Ansible power! Now it would be interesting to do a bit of profiling, as Ansible might slow things down a little. From some of the tests I ran, the overhead is not much, but it is up to you to decide whether it is acceptable or not.
Get the modification time of an RBD image.
Each RADOS object maintains an mtime that you can get via the rados tool.
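A possible sequence to get to that point (image name and size are examples, not the original commands):

$ rbd create leseb --size 1024
$ sudo rbd map leseb
/dev/rbd0
$ sudo mkfs.xfs /dev/rbd0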
Now check the image properties; we are looking here for the block_name_prefix in order to identify the objects in RADOS:
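Something along these lines (the prefix matches the one used below; the rest of the output is illustrative):

$ rbd info leseb
rbd image 'leseb':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.3b19.74b0dc51
        format: 1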
Ok, so this gives us: rb.0.3b19.74b0dc51.
Given that most of the filesystem structure is built from the start of the device, we do not need to bother looking around for all the filesystem blocks.
The first block is enough, and with the RADOS naming scheme it is easy to figure out that the first block will be named: rbd/rb.0.3b19.74b0dc51.000000000000.
Now let’s look at the properties of that object:
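A sketch with the rados stat command (the timestamp is made up, and the output format varies across releases):

$ rados -p rbd stat rb.0.3b19.74b0dc51.000000000000
rbd/rb.0.3b19.74b0dc51.000000000000 mtime 1457538000, size 4194304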
And oh, surprise! This gives us the mtime of the object.
So basically, every time there is an operation on the filesystem, this first object gets touched and its mtime updated.
For a block-based application this is a bit more tricky, because we do not know which blocks will be accessed.
So we have to go through all the objects and sort their mtime…
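A rough sketch of how that could look (not the original commands; adjust the parsing to your rados stat output format):

$ rados -p rbd ls | grep '^rb\.0\.3b19\.74b0dc51' | while read obj; do
    rados -p rbd stat "$obj"
done | sort -k3 -rn | head -1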
ZA WARUDO! TOKI YO TOMARE!
Following last week's article, here is another CRUSH example. This time we want to store the first replica on SSD drives and the second copy on SATA drives.
Here is the CRUSH rule:
rule ssd-primary-affinity {
    ruleset 0
    type replicated
    min_size 2
    max_size 3
    step take ssd
    step chooseleaf firstn 1 type host
    step emit
    step take sata
    step chooseleaf firstn -1 type host
    step emit
}
Make sure that you configure your OSDs with the primary affinity flag as well; for reference, look at my article about primary affinity.
Quick CRUSH example on how to store 3 replicas: two in rack number 1 and the third one in rack number 2.
Here is the CRUSH rule:
rule 3_rep_2_racks {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 2 type rack
    step chooseleaf firstn 2 type host
    step emit
}
Hope that helps ;-)
Removing an OSD, if not done properly, can result in rebalancing data twice. The best practice to remove an OSD involves changing the CRUSH weight to 0.0 as a first step.
So in the end, this will give you:
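Presumably a single command like this (the OSD id is an example):

$ ceph osd crush reweight osd.7 0.0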
Then you wait for the rebalancing to complete. Eventually, completely remove the OSD:
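The usual removal sequence looks like this (again, osd.7 is an example; run the service command on the OSD's host):

$ ceph osd out osd.7
$ sudo service ceph stop osd.7
$ ceph osd crush remove osd.7
$ ceph auth del osd.7
$ ceph osd rm 7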
Et voilà !
Ceph just moved outside of DevStack in order to comply with the new DevStack plugin policy. The code can be found on Github. We now have the chance to be on OpenStack Gerrit as well, which brings all the good things from the OpenStack infra (a CI).
To use it, simply create a localrc file with the following:
enable_plugin ceph https://github.com/openstack/devstack-plugin-ceph
A more complete localrc file can be found on Github.
When you manage a large cluster, you do not always know where your OSDs are located.
Sometimes you have issues with PGs, such as unclean, or with OSDs, such as slow requests.
While looking at your ceph health detail, you only see where the PGs are acting or on which OSDs you have slow requests.
Given that you might have tons of OSDs located on a lot of nodes, it is not straightforward to find and restart them.
You will find below a simple script that can do this for you.
In this example, I want to restart all the down OSDs on an Ubuntu operating system.
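A rough reconstruction of the idea (not the original script; the 'ceph osd tree' column positions vary between releases, so adjust the awk accordingly):

#!/bin/bash
# restart every OSD currently reported as down
for id in $(ceph osd tree | awk '/down/ && $3 ~ /^osd\./ { sub(/^osd\./, "", $3); print $3 }'); do
  host=$(ceph osd find "$id" | python -c 'import json,sys; print json.load(sys.stdin)["crush_location"]["host"]')
  ssh "$host" "restart ceph-osd id=$id"  # upstart job on Ubuntu
done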
Yes, the awk is ugly; let me know if you come up with an easier/clearer alternative ;)
Infernalis was released a couple of weeks ago, and I have to admit that I am really impressed by the work that has been done. So I am going to present you with 5 really handy things that came out with this new release.
Prior to this, the default unit was MB, so we had to write the image size accordingly.
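Something like this now works (the image name is an example):

$ rbd create leseb --size 10G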
Since images are sparse and discard is available for some virtualization storage controllers, it is nice to have the ability to track the space actually used by an image. Obviously, using object-map is highly recommended here, as it will speed up the calculation.
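This is the new rbd du command; a sketch of the output (names and sizes are made up):

$ rbd du leseb
NAME  PROVISIONED   USED
leseb      10240M   136M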
A new command will connect to the daemon’s socket and show some statistics.
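This is most likely ceph daemonperf (an assumption based on the description), which polls a daemon's admin socket and prints vmstat-style counters:

$ ceph daemonperf osd.0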
Quick example on how to enable the object map after image creation.
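A plausible sequence, assuming an existing format 2 image named leseb (exclusive-lock is a prerequisite of object-map):

$ rbd feature enable leseb exclusive-lock
$ rbd feature enable leseb object-map
$ rbd object-map rebuild leseb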
Thanks to this, we no longer need to specify the format during image creation, nor add a new line to our ceph.conf.
Enjoy!