Another feature preview for Jewel: an NBD driver for RBD that allows librbd to present a kernel-level block device.
NBD has numerous advantages compared to the kernel RBD driver.
The idea of rbd-nbd is to rely on the userspace implementation of librbd, which is robust and stable, through the strong and well-established NBD (Network Block Device) kernel module.
It is pretty simple to map a device with NBD:
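A minimal sketch, assuming an image named rbd/myimage already exists:

$ sudo rbd-nbd map rbd/myimage
/dev/nbd0
# the device can now be used like any other block device
$ sudo mkfs.xfs /dev/nbd0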
TADA!
The RBD mirroring feature will be available in the next stable Ceph release: Jewel.
What we are trying to solve here, or at least overcome, is the synchronous nature of Ceph. This implies that the Ceph block storage solution (called RBD) has trouble working decently across regions. As a reminder, Ceph only considers a write complete when all the replicas of an object are written. That is why setting up a stretched cluster across long distances is generally not a good idea, since latencies are high: the cluster would have to wait until all the writes are completed, and it might take a considerable amount of time for the client to get its acknowledgement.
As a result, we need a mechanism that allows us to replicate block devices between clusters stored in different regions. Such a method will help us for different purposes:
A new daemon, rbd-mirror, will be responsible for synchronising images from one cluster to another.
The daemon will be configured on both sites and it will simultaneously connect to both local and remote clusters.
Initially, with the Jewel release, there will be a one-to-one relationship between daemons; in the future this will expand to one-to-N.
So after Jewel, you will have the ability to configure one site with multiple backup target sites.
As a starting point, it will connect to the other Ceph cluster using a configuration file (to find the monitors), a user and a key.
Using the admin user for this is just fine.
The rbd-mirror daemon uses the cephx mechanism to authenticate with the monitors, the usual and default method within Ceph.
In order to know each other, each daemon will have to register the other cluster as a peer.
This is set at the pool level with the command rbd mirror pool peer add.
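A sketch of the invocation, assuming the local pool is called rbd and the remote cluster is named remote:

$ rbd mirror pool peer add rbd client.admin@remote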
Peer info can be retrieved like so:
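Something like the following, assuming the setup above (the UUID shown is made up):

$ rbd mirror pool info rbd
Mode: pool
Peers:
  UUID                                 NAME   CLIENT
  3ba04dbb-...                         remote client.admin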
Later, the UUID can be used to remove a peer if needed.
The RBD mirroring relies on two new RBD image features:

- journaling: enables journaling for every transaction on the image
- mirroring: explicitly tells the rbd-mirror daemon to replicate this image

There will also be commands and librbd calls to enable and disable mirroring for individual images.

The journal maintains a list of records for all the transactions on the image. It can be seen as another RBD image (a bunch of RADOS objects) that lives in the cluster. In general, a write first goes to the journal, returns to the client, and is then written to the underlying RBD image. For performance reasons, this journal can sit on a different pool from the image it is journaling for. Currently there is one journal per RBD image. This will likely stay this way until we introduce consistency groups in Ceph. For those of you who are not familiar with the concept, a consistency group is an entity that manages a bunch of volumes (i.e. RBD images) so they can be treated as one. This allows you to perform operations such as snapshots on all the volumes in the group, with the guarantee that they are all in the same consistent state. So when consistency groups become available in Ceph, we will use a single journal for all the RBD images that are part of the group.
So now, I know some of you are already thinking: “can I enable journaling on an existing image?” Yes you can! Ceph will simply take a snapshot of your image, do an RBD copy of that snapshot, and start journaling after that. It is just a background task.
The mirroring can be enabled and disabled on an entire pool or on a per-image basis.
If it is enabled on a pool, every image that has the journaling feature enabled will get replicated by the mirror agent.
This can be enabled with the command rbd mirror pool enable.
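For instance, assuming a pool named rbd mirrored in 'pool' mode:

$ rbd mirror pool enable rbd pool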
To configure an image you can run the following:
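Something along these lines (image name and size are examples) creates an image with the required features from the start:

$ rbd create rbd/leseb --size 1024 --image-feature exclusive-lock,journaling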
The features can also be activated on the fly using rbd feature enable rbd/leseb exclusive-lock && rbd feature enable rbd/leseb journaling.
Doing cross-sync replication is possible, and it is even the default way it is implemented. This means that pool names across both locations must be the same, and thus images will get the exact same name on the other cluster. This brings two challenges:
Each image has a mirroring_directory object that contains a tag about the location of the active site.
The images on the local site are promoted to ‘primary’ (with rbd mirror image promote) and are writable, whereas the backup images on the remote site hold a lock.
The lock means that such an image cannot be written to (read-only mode) until it is promoted to primary and the primary demoted (with rbd mirror image demote).
So this is where the promotion and demotion functionality comes in.
As soon as the backup image gets promoted to primary, the original image gets demoted to secondary.
This means the synchronisation happens the other way around: the backup location becomes primary and performs the synchronisation to the site that was originally primary.
If the platform is affected by a split-brain situation, the rbd-mirror daemon will not attempt any sync, so you will have to find the good image by yourself.
That means manually forcing a resync from what you consider to be the most up-to-date image. For this you can run rbd mirror image resync.
For now, the mirroring feature requires running both environments on the same L2 segment, so the clusters can reach each other.
Later, a new daemon could be implemented, possibly called mirroring-proxy, which would be responsible for relaying mirroring requests.
Here again, we will face two new challenges, so it is a bit of a balance between security and performance.
In terms of implementation, the daemon will likely require a dedicated server to operate, as it needs a lot of bandwidth; a server with multiple network cards is ideal. Unfortunately, the initial version of the rbd-mirror daemon will not have any high availability: only a single instance of the daemon can run. The ability to run it multiple times on different servers, providing HA and offloading, will likely appear in a later Jewel release.
The RBD mirroring is extremely useful when implementing a disaster recovery scenario for OpenStack; here is a design example:
If you want to learn more about the RBD mirroring design, you can read Josh Durgin’s design draft discussion and the pad from the Ceph Online Summit.
Start by editing your group_vars/all with the following content:
ceph_dev: true
ceph_dev_branch: v10.1.0
monitor_interface: <your interface>
public_network: <your subnet>
osd_objectstore: bluestore
ceph_conf_overrides:
global:
enable experimental unrecoverable data corrupting features: 'bluestore rocksdb'
bluestore fsck on mount: true
bluestore block db size: 67108864
bluestore block wal size: 134217728
bluestore block size: 5368709120
Just one more step: jump into group_vars/osds and activate the fifth OSD scenario using:
bluestore: true
That’s all! Now run Ansible as usual: ansible-playbook site.yml.
Wait a little bit and you should see the following:
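Roughly something like this (a hypothetical status, your IDs and counts will differ):

$ sudo ceph -s
    cluster 80708ad8-...
     health HEALTH_OK
     monmap e1: 1 mons at {ceph-mon0=192.168.42.10:6789/0}
     osdmap e10: 3 osds: 3 up, 3 in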
A new way to efficiently store objects using BlueStore.
As mentioned in an older article, things are moving fast with Ceph, especially around store optimisations. A new initiative was launched earlier this year, code name: NewStore. But what is it about?
NewStore is an implementation where the Ceph journal is stored in RocksDB but the actual objects remain stored on a filesystem. With BlueStore, objects are now stored directly on the block device, without any filesystem interface.
What was wrong with FileStore? As you may know, Ceph stores objects as files on a filesystem; this is basically what we call FileStore. If you want to learn more, please refer to a previous article where I explained Ceph under the hood. So now, let’s explore why the need for BlueStore emerged.
Ceph is a software-defined storage solution, so its main purpose is to keep your data stored safely, and for this we need atomicity.
Unfortunately, no filesystem provides atomic writes/updates, and O_ATOMIC never made it into the kernel.
An attempt to fix this using Btrfs (since it provides atomic transactions) was made but did not really succeed.
Ceph developers had to find an alternative. This alternative you know pretty well: it is the Ceph journal. However, doing write-ahead journaling has a major performance cost, since it basically halves the performance of your disk (when the journal and OSD data share the same disk).
In the context of Ceph, storing objects as files on a POSIX filesystem is not ideal either.
Ceph stores objects using a hash mechanism, so object names appear in a funky way, such as: rbd\udata.371e5017a72.0000000000000000__head_58D36A14__2.
For various operations such as scrubbing, backfill and recovery, Ceph needs to retrieve objects and enumerate them.
However, POSIX does not offer any good way to read the content of a directory in an ordered fashion.
For this, Ceph developers ended up using a couple of ‘hacks’, such as sharding object directories into tiny subdirectories so they could list the content, sort it, and then use it.
But once again, in the end, it is yet another overhead being introduced.
This is an overview of the new architecture with BlueStore:
In terms of layers of abstraction, the setup and its overhead are quite minimal. This is a deep dive into BlueStore:
As we can see, BlueStore has several internal components, but from a general point of view a Ceph object (the actual ‘data’ in the picture) is written directly to the block device. As a consequence, we do not need any filesystem anymore; BlueStore consumes a raw partition directly. The metadata that comes with an OSD is stored in a RocksDB key/value database. Let’s decrypt the layers:
So what do we store in RocksDB?
Now, the default BlueStore model on your disk:
Basically, we will take a disk and partition it in two:
And then an advanced BlueStore model:
What’s fascinating about BlueStore is how flexible it is. Every component can be stored on a different device. In this picture, the RocksDB WAL and DB can be stored on different devices, or on tiny partitions too.
In order to summarize, let me highlight some of the best features from BlueStore:
As soon as Jewel is released, BlueStore will be available, but let’s consider it a tech preview; I’m not sure yet if we should put it in production. To be sure, carefully read the release changelog as soon as it is available.

As we can see, Ceph is putting more and more intelligence into the drive. BlueStore eliminates the need for a dedicated journal device with the help of RocksDB. I haven’t run any benchmarks on BlueStore yet, but it is possible that we will not need any dedicated device for journaling anymore, which brings awesome perspectives in terms of datacenter management. Each OSD is independent, so it can easily be pulled out of one server, plugged into another, and it will just run. This capability is not new; if I remember correctly it was introduced during the Firefly cycle thanks to udev rules on the system. Basically, when you hotplug a disk containing an OSD into a system, this triggers a udev event which activates the OSD. BlueStore simply strengthens this mechanism, given that it removes the need for a dedicated journal device, depending on what you want to achieve with your cluster. Performance with everything running on the same drive should be decent enough, at least for cost/capacity-optimized scenarios and potentially throughput-optimized ones.
Edit your group_vars/all and uncomment the following:
ceph_dev: true
ceph_dev_branch: v10.0.4
Then simply run vagrant up and wait a bit…
After a couple of minutes you will get everything deployed and an output similar to this one:
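Possibly something like this at the end of the run (a hypothetical Ansible recap, your numbers will differ):

PLAY RECAP ********************************************************************
ceph-mon0    : ok=92   changed=25   unreachable=0   failed=0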
One of the recent use cases I had was to migrate an Ubuntu-based Ceph cluster to RHEL. We had strict requirements and did not want any data to be migrated. This is yet another beauty of Ceph, and particularly of OSDs: they basically have the ability to run on any machine. Let’s say you have an OSD: you can pull out the disk and plug it into another machine seamlessly. The approach taken here was a bit different, but relies on this capability.
I built the automation using Ansible. The procedure is rather simple and can be decomposed as follows. Note that all the tasks are serialized.
For monitors:

- the node is reinstalled with the new operating system and does a reboot.

For OSDs:

- we set the noout flag, so no recovery will be triggered
- the node is reinstalled, keeping its ceph.conf
- the node does a reboot.
- once the PGs are back to active+clean, we start another OSD host

In practice, you simply need to:
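A sketch of what this might look like (the playbook name is an assumption, not the original):

$ git clone https://github.com/ceph/ceph-ansible.git
$ cd ceph-ansible
$ ansible-playbook -i inventory migration.yml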
Happy migration! One last note: the procedure does not trigger any data movement, so this operation runs live, is rolling, and does not impact connected clients.
Next week I will be heading to San Jose to attend the OCP (Open Compute) Summit. I will be presenting and demoing my work on containerizing Ceph. This event is really exciting, as most (not to say all) of the biggest Open Compute representatives (Facebook, Quanta, etc.) will be there. So if you come by the Red Hat booth, don’t forget to say hi ;)
Ceph Ansible started as a personal project; the reason was simple: I wanted to have an in-depth look at Ansible, so I immediately thought, why not try to deploy Ceph with Ansible? Moreover, I have never been a huge fan of Puppet, and ceph-deploy was only a couple of months old, so to me Ansible was the right answer.
After almost 2 years of development (first commit), I am glad to announce that ceph-ansible will now have a release cycle. With the help of git tags, we will be providing point-in-time releases with new features.
Over the past 2 years, ceph-ansible has seen some good contributions.
Now, let me give you some news about ceph-ansible's latest capabilities:

- dnf support
- systemd support
- configuration file overrides (ceph.conf)

And many more!
Since Ansible has an option to run cowsay, and since I like the cartoon ”Cow and Chicken”, I’ve been thinking of using character names for the releases :).
Thus the first one is named ”Chicken”.
There are not that many characters (apparently 20), so this won’t last long.
As a personal project, I really see this as an achievement, so I’d like to thank everyone for their support! You can check out the release on Github.
Last week I was attending the Mobile World Congress with Red Hat and I had the chance to demo my work on containerizing Ceph. You will find my presentation in the article :).
Here is the desk where I did my presentation:
Now you can easily ping them.
The summit is almost here, and it is time to vote for the presentations you want to see :). Here are the presentations my colleagues and I submitted:
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Speakers: Sébastien Han, Sean Cohen, Federico Lucifredi
Persistent Containers for Transactional Workloads
Speakers: Sébastien Han, Kyle Bader
How to seemlessly migrate CEPH with PB’s of data from one OS to other with no impact
Speakers: Sébastien Han, Shyam Bollu, Michael DeSimone
I hope to see you there ;)
This weekend I just pushed a new feature in ceph-ansible, which adds the ability to deploy containerized Ceph daemons. It is quite easy to get going with this; simply do the following:
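A rough sketch of the idea (the variable and playbook names are assumptions based on later ceph-ansible versions; check the repository for the exact ones):

$ git clone https://github.com/ceph/ceph-ansible.git
$ cd ceph-ansible
$ cp site-docker.yml.sample site-docker.yml
# enable the containerized deployment in your group_vars, for example:
#   containerized_deployment: true
$ ansible-playbook -i inventory site-docker.yml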
Now move away from the keyboard and go grab a coffee. The process should not last long; most of the time is spent downloading the container images.
More improvements coming soon :)
Typically, when we build a container image, we have 2 main files:

- Dockerfile is the essence of the container: it is what the container is made of, and it generally contains package installation steps and files
- entrypoint.sh is where we configure the container; during the bootstrap sequence this script gets executed. Usually the entrypoint.sh file contains bash instructions.

So the idea is: instead of relying on bash scripting when writing the container’s entrypoint, we can call Ansible to configure it.
Do not forget to replace all the my-application statements with the name of your application ;).
File example for the base image Dockerfile that your application will be using. We simply install Ansible and our application here:
FROM ubuntu:14.04
MAINTAINER Sébastien Han "seb@redhat.com"
# Install prerequisites
RUN apt-get update && \
apt-get install -y python python-dev python-pip python-yaml && \
apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install Ansible
RUN pip install pyyaml ansible
# Install my application
# note: 'apt-get update' is needed again here, since the package lists were removed above
RUN apt-get update && \
apt-get install -y --force-yes my-application && \
apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN mkdir -p /opt/ansible/my-application/
ADD site.yml /opt/ansible/my-application/site.yml
ADD entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
File example for a site.yml; this file will later be used by Ansible inside your application container:
---
# Defines deployment design and assigns role to server groups
- hosts: my-application
connection: local
sudo: True
roles:
- { role: application1, tags: installation }
- { role: application2, tags: configuration }
File example for an application entrypoint.sh. These are the brief instructions needed to run Ansible:
#!/bin/bash
# group_vars/ does not exist in the image yet, create it first
mkdir -p /opt/ansible/my-application/group_vars
cat >/opt/ansible/my-application/inventory <<EOF
[my-application]
127.0.0.1
EOF
cat >/opt/ansible/my-application/group_vars/all <<EOF
foo: bar
foo1: bar1
EOF
cd /opt/ansible/my-application
ansible-playbook -vvv -i inventory site.yml
Now simply run docker run <image> and your container will get configured by Ansible :).
As always, use docker logs -f <container-id> to check the bootstrap process.
Ansible power! Now it would be interesting to do a bit of profiling, as Ansible might slow things down a little. From some of the tests I ran, the overhead is not much, but it is up to you to decide whether it is acceptable or not.
Get the modification time of an RBD image.
Each RADOS object maintains an mtime that you can get via the rados tool.
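A possible sequence to get to that point (image name and size are examples, not the original commands):

$ rbd create leseb --size 1024
$ sudo rbd map leseb
/dev/rbd0
$ sudo mkfs.xfs /dev/rbd0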
Now check the image properties; we are looking here for the block_name_prefix in order to identify the objects in RADOS:
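Something along these lines (the prefix matches the one used below; the rest of the output is illustrative):

$ rbd info leseb
rbd image 'leseb':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.3b19.74b0dc51
        format: 1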
Ok, so this gives us: rb.0.3b19.74b0dc51.
Given that most of the filesystem structure is built from the start of the device, we do not need to bother looking around for all the filesystem blocks.
The first block is enough, and with the RADOS naming scheme it is easy to figure out that the first block will be named: rbd/rb.0.3b19.74b0dc51.000000000000.
Now let’s look at the properties of that object:
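A sketch with the rados stat command (the timestamp is made up, and the output format varies across releases):

$ rados -p rbd stat rb.0.3b19.74b0dc51.000000000000
rbd/rb.0.3b19.74b0dc51.000000000000 mtime 1457538000, size 4194304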
And oh, surprise! This gives us the mtime of the object.
So basically, every time there is an operation on the filesystem, this first object gets touched and its mtime updated.
For a block-based application this is a bit more tricky, because we do not know which blocks will be accessed.
So we have to go through all the objects and sort their mtime…
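A rough sketch of how that could look (not the original commands; adjust the parsing to your rados stat output format):

$ rados -p rbd ls | grep '^rb\.0\.3b19\.74b0dc51' | while read obj; do
    rados -p rbd stat "$obj"
done | sort -k3 -rn | head -1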
ZA WARUDO! TOKI YO TOMARE!
Following last week's article, here is another CRUSH example. This time we want to store the first replica on SSD drives and the second copy on SATA drives.
Here is the CRUSH rule:
rule ssd-primary-affinity {
    ruleset 0
    type replicated
    min_size 2
    max_size 3
    step take ssd
    step chooseleaf firstn 1 type host
    step emit
    step take sata
    step chooseleaf firstn -1 type host
    step emit
}
Make sure that you configure your OSDs with the primary affinity flag as well; for reference, look at my article about primary affinity.
Quick CRUSH example on how to store 3 replicas: two in rack number 1 and the third one in rack number 2.
Here is the CRUSH rule:
rule 3_rep_2_racks {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 2 type rack
    step chooseleaf firstn 2 type host
    step emit
}
Hope that helps ;-)
Removing an OSD, if not done properly, can result in rebalancing data twice. The best practice to remove an OSD involves changing the CRUSH weight to 0.0 as a first step.
So in the end, this will give you:
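Presumably a single command like this (the OSD id is an example):

$ ceph osd crush reweight osd.7 0.0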
Then you wait for the rebalancing to complete. Eventually, completely remove the OSD:
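The usual removal sequence looks like this (again, osd.7 is an example; run the service command on the OSD's host):

$ ceph osd out osd.7
$ sudo service ceph stop osd.7
$ ceph osd crush remove osd.7
$ ceph auth del osd.7
$ ceph osd rm 7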
Et voilà !
Ceph just moved outside of DevStack in order to comply with the new DevStack plugin policy. The code can be found on Github. We now have the chance to be on OpenStack Gerrit as well, which brings all the good things from the OpenStack infra (a CI).
To use it, simply create a localrc file with the following:
enable_plugin ceph https://github.com/openstack/devstack-plugin-ceph
A more complete localrc file can be found on Github.
When you manage a large cluster, you do not always know where your OSDs are located.
Sometimes you have issues with PGs, such as unclean, or with OSDs, such as slow requests.
While looking at your ceph health detail, you only see where the PGs are acting or on which OSDs you have slow requests.
Given that you might have tons of OSDs located on a lot of nodes, it is not straightforward to find and restart them.
You will find below a simple script that can do this for you.
In this example, I want to restart all the down OSDs on an Ubuntu operating system.
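A rough reconstruction of the idea (not the original script; the 'ceph osd tree' column positions vary between releases, so adjust the awk accordingly):

#!/bin/bash
# restart every OSD currently reported as down
for id in $(ceph osd tree | awk '/down/ && $3 ~ /^osd\./ { sub(/^osd\./, "", $3); print $3 }'); do
  host=$(ceph osd find "$id" | python -c 'import json,sys; print json.load(sys.stdin)["crush_location"]["host"]')
  ssh "$host" "restart ceph-osd id=$id"  # upstart job on Ubuntu
done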
Yes, the awk is ugly; let me know if you come up with an easier/clearer alternative ;)
Infernalis was released a couple of weeks ago, and I have to admit that I am really impressed by the work that has been done. So I am going to present you with 5 really handy things that came out with this new release.
Prior to this, the default unit was MB, so we had to write the image size accordingly.
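Something like this now works (the image name is an example):

$ rbd create leseb --size 10G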
Since images are sparse and discard is available for some virtualization storage controllers, it is nice to have the ability to track the space actually used by an image. Obviously, using object-map is highly recommended here, as it will speed up the calculation.
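This is the new rbd du command; a sketch of the output (names and sizes are made up):

$ rbd du leseb
NAME  PROVISIONED   USED
leseb      10240M   136M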
A new command will connect to the daemon’s socket and show some statistics.
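This is most likely ceph daemonperf (an assumption based on the description), which polls a daemon's admin socket and prints vmstat-style counters:

$ ceph daemonperf osd.0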
Quick example on how to enable the object map after image creation.
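A plausible sequence, assuming an existing format 2 image named leseb (exclusive-lock is a prerequisite of object-map):

$ rbd feature enable leseb exclusive-lock
$ rbd feature enable leseb object-map
$ rbd object-map rebuild leseb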
Thanks to this, we no longer need to specify the format during image creation, nor add a new line to our ceph.conf.
Enjoy!