Sébastien Han

Ceph: how to test if your SSD is suitable as a journal device?

A simple benchmark job to determine if your SSD is suitable to act as a journal device for your OSDs.


I. Testing

To give you a little background: when the OSD writes to its journal, it uses O_DSYNC and O_DIRECT. Writing with O_DIRECT bypasses the kernel page cache, while O_DSYNC ensures that the call won’t return until the write is complete. So yes, basically the OSD forces every write to be flushed before starting the next IO.

First disable the write cache on the disk:

$ sudo hdparm -W 0 /dev/hda

Then disable the controller cache. This example assumes an HP controller in slot 2 and logical drive number 1:

$ sudo hpacucli ctrl slot=2 modify dwc=disable
$ sudo hpacucli controller slot=2 logicaldrive 1 modify arrayaccelerator=disable

Now you can start benchmarking your SSD correctly using two different methods. The FIO way:

$ sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Now it is important to understand the options we passed:

  • --filename: device we want to test
  • --direct: we open the device with O_DIRECT which means that we are bypassing the Kernel page cache
  • --sync: we open the device with O_DSYNC, so a write is not acknowledged until we are sure the IO has been completely written
  • --rw: IO pattern; here we use write for sequential writes, because journal writes are always sequential
  • --bs: block size; here we submit 4K IOs, which is probably the worst-case scenario, so you can change this value if you know your workload
  • --numjobs: number of threads that will be running; think of these as ceph-osd daemons writing to the journal
  • --iodepth: queue depth; with 1, we submit IOs one by one
  • --runtime: job duration in seconds
  • --time_based: run for the specified runtime duration even if the files are completely read or written
  • --group_reporting: If set, display per-group reports instead of per-job when numjobs is specified.
  • --name: name of the run
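
To keep runs comparable, the command above can be wrapped in a small shell function. This is only a minimal sketch: the function name journal_fio_cmd and its two positional parameters are hypothetical, and it deliberately prints the command instead of running it, so nothing is written to the device until you decide to. The fio flags are exactly the ones listed above.

```shell
#!/bin/sh
# journal_fio_cmd DEVICE [NUMJOBS]
# Hypothetical helper: prints (does not run) the journal-test fio command.
# NUMJOBS defaults to 1 when omitted.
journal_fio_cmd() {
    device="$1"
    numjobs="${2:-1}"
    # echo joins its arguments with single spaces, giving one command line.
    echo "fio --filename=$device --direct=1 --sync=1 --rw=write --bs=4k" \
         "--numjobs=$numjobs --iodepth=1 --runtime=60 --time_based" \
         "--group_reporting --name=journal-test"
}

# Print the command for a single job on /dev/sda.
journal_fio_cmd /dev/sda 1
```

Review the printed line, then run it with sudo once you are sure the device name is the SSD you intend to overwrite.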


II. Ramp up

Increase --numjobs with each new run. Here is a small example on an SSD:

  • --numjobs=1 reports bw=23418KB/s or iops=5854
  • --numjobs=2 reports bw=43697KB/s or iops=10924
  • --numjobs=3 reports bw=63592KB/s or iops=15898
  • --numjobs=4 reports bw=68500KB/s or iops=17124. My SSD is maxing out here
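
A quick sanity check on these figures: with a fixed 4K block size, the bandwidth should be roughly IOPS × 4 KB, and dividing each IOPS figure by the 1-job result shows how well the drive scales. The sketch below does that arithmetic on the numbers above. The helper names iops_to_kbps and scaling_factor are hypothetical and are not part of fio.

```shell
#!/bin/sh
# iops_to_kbps IOPS BS_KB: bandwidth in KB/s implied by an IOPS figure.
iops_to_kbps() {
    echo $(( $1 * $2 ))
}

# scaling_factor BASE_IOPS IOPS: speed-up relative to the base run,
# printed with two decimals.
scaling_factor() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", cur / base }'
}

# IOPS figures from the example runs above (numjobs 1 to 4), 4K blocks.
for iops in 5854 10924 15898 17124; do
    echo "iops=$iops -> $(iops_to_kbps "$iops" 4) KB/s, x$(scaling_factor 5854 "$iops") vs 1 job"
done
```

The computed bandwidths land within a few KB/s of what fio reported, and the 4-job run is only about 2.93 times faster than the 1-job run, which is consistent with the drive maxing out at that point.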


III. Interpret the result

Coming soon…


Bonus

If for whatever reason fio is not available, here is the dd way:

$ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
$ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync


What matters most here is how the SSD performs when writing with O_DSYNC. Users have reported some SSDs misbehaving with O_DSYNC, so you should always test your SSD before going into production.



Data aggregation tables

The two tables below gather the results reported in the comments: enterprise drives first, then consumer drives. A value of ??? means that no figure was reported.


Enterprise SSD model                      | Firmware   | 1 job     | 5 jobs    | 10 jobs
------------------------------------------+------------+-----------+-----------+------------
Netlist EV3 16GB                          | ???        | 345 MB/s  | 1439 MB/s | 1766 MB/s
Intel P3700 400GB                         | SSDPEDMD40 | 406 MB/s  | 926 MB/s  | 920 MB/s
Intel P3700 1.6TB                         | SSDPEDMD01 | 360 MB/s  | 985 MB/s  | 1095 MB/s
Intel P3600 800GB                         | 5cd2e4     | 328 MB/s  | 800 MB/s  | 801 MB/s
SanDisk Fusion ioMemory SX300-1300, 1.3TB | ???        | 174 MB/s  | 793 MB/s  | 1101.9 MB/s
Dell Express Flash NVMe XS1715 SSD 400GB  | ???        | 110 MB/s  | 495 MB/s  | 628 MB/s
Samsung PM863                             | GXT3003Q   | 127 MB/s  | 324 MB/s  | 336 MB/s
Intel DC S3610 1.6TB                      | ???        | 96 MB/s   | 208 MB/s  | 241 MB/s
FusionIO IOdrive2 410GB                   | ???        | 85.1 MB/s | ???       | ???
Samsung SM863 240GB                       | GXM1003Q   | 64.7 MB/s | 125 MB/s  | 125 MB/s
400GB SanDisk Lightning II 12Gb SAS SSD   | ???        | 48.9 MB/s | 194 MB/s  | 255 MB/s
HGST Ultrastar SSD1600MM 800 GB           | ???        | 43.9 MB/s | 96 MB/s   | 177 MB/s
Intel DC S3500                            | ???        | 39.1 MB/s | ???       | ???
Intel DC S3700 100GB                      | ???        | 35.2 MB/s | ???       | ???
SanDisk Cloudspeed II Eco, 960GB          | ???        | 34.9 MB/s | 176 MB/s  | 185 MB/s
Micron M500DC 480 GB                      | ???        | 33.6 MB/s | ???       | ???
Intel DC S3700 400GB                      | 5DV10270   | 26 MB/s   | 44.7 MB/s | 68 MB/s
Intel DC S3700 200GB                      | ???        | 22.5 MB/s | ???       | ???
Intel DC S3710 200GB                      | G2010140   | 23.6 MB/s | ???       | ???
Micron p400e 400GB                        | ???        | 3.0 MB/s  | ???       | ???



Consumer SSD model             | Firmware | 1 job     | 5 jobs    | 10 jobs
-------------------------------+----------+-----------+-----------+---------
Intel 750 NVMe 400GB           | ???      | 261 MB/s  | 884 MB/s  | ???
Samsung SSD 950 PRO 512GB NVMe | ???      | 245 MB/s  | 329 MB/s  | 388 MB/s
Kingston v300 120GB            | 603ABBF0 | 98 MB/s   | 181 MB/s  | 200 MB/s
LITEON ECE-200 200GB           | ???      | 15.2 MB/s | ???       | ???
Adata SP900 120GB              | ???      | 11.3 MB/s | ???       | ???
Kingston v300 60GB             | 505ABBF0 | 9.2 MB/s  | 22 MB/s   | 39 MB/s
Intel 520 60GB                 | 400i     | 9 MB/s    | 22.3 MB/s | 40 MB/s
Intel 520 180GB                | 400i     | 8.7 MB/s  | 22 MB/s   | 40 MB/s
SanDisk Ultra II 120G          | X31200RL | 7.6 MB/s  | 28.7 MB/s | 40 MB/s
SanDisk Ultra Plus 256GB       | X2306RL  | 6 MB/s    | 19 MB/s   | 33 MB/s
Intel 510                      | ???      | 4.2 MB/s  | ???       | ???
Crucial MX200                  | ???      | 3.7 MB/s  | ???       | ???
Plextor M6e 120GB              | ???      | 2.7 MB/s  | ???       | ???
PLEXTOR PX-128M5               | ???      | 2.6 MB/s  | ???       | ???
Samsung XP941 256GB            | ???      | 2.5 MB/s  | 5 MB/s    | ???
Adata SP920                    | ???      | 2.2 MB/s  | ???       | ???
Samsung 850 evo 250GB          | ???      | 1.9 MB/s  | ???       | ???
Samsung 840 evo 250GB          | ???      | 1.9 MB/s  | ???       | ???
Samsung 850 Pro 256GB          | ???      | 1.5 MB/s  | 4 MB/s    | 6.7 MB/s
Samsung 850 Pro 128GB          | ???      | 1.2 MB/s  | ???       | ???
Toshiba OCZ VT180 960GB        | ???      | 1.0 MB/s  | 1.7 MB/s  | 3.3 MB/s
Adata SP900                    | ???      | 0.8 MB/s  | ???       | ???
Crucial m550                   | ???      | 0.8 MB/s  | ???       | ???
INTEL 535 SSDSC2BW240H6 240GB  | ???      | 401 kB/s  | ???       | ???

Note: the Intel 520 180GB was connected to a Dell C2100 onboard SATA ICH10 (3Gbps).
