tl;dr - Finally, the results of the benchmarking. You can find the code on GitLab. There are some issues with the benchmarks but there was enough decent data to make a decision for me at least. As far as which storage plugins I’m going to run, I’m actually going to run both OpenEBS Mayastor and Ceph via Rook on LVM. I look forward to emails from users/corporations/devrel letting me know how I misused their products if I did – please file an issue on GitLab!

UPDATE (04/14/2021)

Thanks to /u/joshimoo who noted on reddit that I missed Longhorn's support for RWX volumes along with Harvester which supports live migration with KubeVirt. I've added a section that basically says exactly what's in this update to the post.

NOTE: This is a multi-part blog post!

Part 1 - Intro & Cloud server wrangling
Part 2 - Installing storage plugins
Part 3 - Installing more storage plugins
Part 4 - Configuring the tests
Part 5 - The results (you are here)

Context

In part 4 we worked through getting the testing tools set up, figuring out how to get our results back out, and now we can finally make use of that beautiful data!

Thanks for sticking with me (those of you that did) all the way here – I’m sure it must have felt like I was doing some sort of “growth hack” drip-feed but I just didn’t want to drop 6 posts all at the same time once I realized how big this series was going to have to be. No one wants to see a scroll bar nub the size of an atom.

NOTE: No you’re not crazy, this post series was reduced from 6 parts to 5, I combined Part 6 into this post just so I could be done quicker. I need to actually get back to shipping software.

Results

To keep the tables narrow I used some acronyms, so a legend is required:

RR = Random Read
RW = Random Write
BW = Bandwidth (measured in MiB/s)
RL = Read latency (measured in usec)
WL = Write latency (measured in usec)
??? = I don’t know (depending on the configuration to fio, some values are un-knowable)
SR = Sequential Read (measured in MiB/s)
SW = Sequential Write (measured in MiB/s)

Remember, “RW” does not stand for “read/write”!

No replication (“JBOD”)

This round of testing is on a single node with a single copy of the data – i.e. your data is not safe but it probably got there fast. As you might expect, the theoretical limit for this is a hostPath volume, and the storage plugin that should be closest to the upper bound is OpenEBS LocalPV Hostpath.

`fio`

JBOD fio random read IOPS

JBOD fio random write IOPS

JBOD fio average read latency

JBOD fio sequential read

JBOD fio sequential write

If you’re into tabular data (powered by Simple DataTables):

Plugin (JBOD, no replicas)	RR IOPS	RR BW	RW IOPS	RW BW	Avg RL	Avg WL	SR	SW	Mixed RR IOPS	Mixed RW IOPS
OpenEBS HostPath (`O_DIRECT`)	295,000	1428	318,000	2652	52.69	13.53	3378	2721	252,000	83,900
OpenEBS Mayastor ISCSI (`O_DIRECT`)	128,000	1399	76,800	2028	89.18	78.72	111	2532	78,100	26,300
OpenEBS Mayastor NVMe-oF (`O_DIRECT`)	90,300	1398	78,500	2053	88.55	86.66	2660	2105	64,700	21,600
Rook Ceph LVM (`O_DIRECT`)	51,600	4141	5307	499	203.46	???	5419	2190	4726	1582
OpenEBS Hostpath	19,200	6079	517,000	4135	210.88	???	9943	3718	18,500	6194
LINSTOR drbd9	17,600	6074	473,000	3860	230.79	???	9815	3389	17,200	5708
OpenEBS Mayastor ISCSI	12,300	4582	522,000	5210	323.12	???	105	5645	11,800	3921
OpenEBS Mayastor NVMe-oF	11,800	4640	530,000	5196	341.55	???	10,200	5662	11,700	3666
OpenEBS Jiva (`O_DIRECT`)	6129	235	6883	324	682.68	570.70	409	466	4470	1491
OpenEBS Jiva	5564	218	456,000	4046	738	???	1169	3681	3486	1157
OpenEBS cStor (`O_DIRECT`)	5454	159	4983	180	895.86	991.47	57.3	156	3942	1311
Rook Ceph LVM	4883	268	507,000	4007	828.36	???	7103	3416	4544	1519
OpenEBS cStor	2190	34.2	487,000	4142	1854.37	???	118	3687	923	304

`pgbench`

JBOD pgbench transactions per second

And if you’re into tabular data:

Plugin (JBOD, no replicas)	# clients	# threads	transactions/client	Latency Avg (ms)	tps w/ establish
LINSTOR drbd9	1	1	10	2.482	402.97
OpenEBS Jiva	1	1	10	2.156	463.85
OpenEBS LocalPV HostPath	1	1	10	2.216	451.18
OpenEBS Mayastor ISCSI	1	1	10	2.218	450.82
OpenEBS Mayastor (NVMe-oF)	1	1	10	2.228	448.90
Rook Ceph LVM	1	1	10	2.182	458.23
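
A quick sanity check on the table above: with a single client running transactions back-to-back, TPS and average latency should be roughly reciprocal (tps ≈ 1000 / latency in ms). The sketch below only reuses numbers from the table; the function and variable names are made up for illustration.

```python
# Sanity check: with 1 client, tps should be close to 1000 / avg latency (ms).
# All numbers are copied from the JBOD pgbench table above.
JBOD_PGBENCH = {
    "LINSTOR drbd9": (2.482, 402.97),
    "OpenEBS Jiva": (2.156, 463.85),
    "OpenEBS LocalPV HostPath": (2.216, 451.18),
    "OpenEBS Mayastor ISCSI": (2.218, 450.82),
    "OpenEBS Mayastor (NVMe-oF)": (2.228, 448.90),
    "Rook Ceph LVM": (2.182, 458.23),
}


def implied_tps(latency_ms: float, clients: int = 1) -> float:
    """TPS implied by an average latency, assuming clients never sit idle."""
    return clients * 1000.0 / latency_ms


if __name__ == "__main__":
    for plugin, (latency_ms, reported_tps) in JBOD_PGBENCH.items():
        implied = implied_tps(latency_ms)
        print(f"{plugin:28s} reported={reported_tps:7.2f} implied={implied:7.2f}")
```

The reported and implied values agree to within a fraction of a transaction per second, so the two columns are really one measurement seen from two angles.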

Single copy replication (RAID1)

As previously mentioned this round of testing is single node, so “RAID1” is the traditional meaning of RAID1 – duplicated data – in this case, with 2 disks. The closest we might get to perfection here would be mirrored LVM, but I don’t think we have a benchmark that is quite in that area. The closest thing we have is OpenEBS LocalPV ZFS because the underlying ZFS pool is mirrored but ZFS has a heavier overhead than LVM so let’s just see what the results shape up as.

`fio`

RAID1 fio random read IOPS

RAID1 fio random write IOPS

RAID1 fio average read latency

RAID1 fio sequential read

RAID1 fio sequential write

And if you’re into tabular data:

Plugin (RAID1, 2 replicas 1 node)	RR IOPS	RR BW	RW IOPS	RW BW	Avg RL	Avg WL	SR	SW	Mixed RR IOPS	Mixed RW IOPS
OpenEBS cStor	2205	35.30	493,000	4039	1856.88	???	125	3422	920	303
OpenEBS cStor (`O_DIRECT`)	5441	159	4979	185	910.12	983.81	60.90	167	3979	1315
OpenEBS LocalPV ZFS	148,000	2304	41,500	646	27.51	105	5098	629	46,700	15,600
OpenEBS LocalPV ZFS (`O_DIRECT`)	146,000	2298	30,500	473	27.31	121.63	5239	634	46,800	15,600
Rook Ceph LVM	4837	250	358,000	3404	840.48	???	6978	2555	4730	1581
Rook Ceph LVM (`O_DIRECT`)	50,700	4076	4132	370	206.70	???	5330	1124	2790	939

`pgbench`

RAID1 pgbench transactions per second

And if you’re into tabular data:

Plugin (RAID1, 2 replicas 1 node)	# clients	# threads	transactions/client	Latency Avg (ms)	tps w/ establish
OpenEBS cStor	1	1	10	2.193	455
OpenEBS LocalPV ZFS	1	1	10	3.958	252.65
Rook Ceph LVM	1	1	10	2.257	443.038

I won’t get into too much analysis just yet, but it looks like this is the difference between the speed of LVM RAID1 (at the lower level) and ZFS-based mirrored setups. ZFS comes in at about 55% of the TPS of the other two (252.65 vs. roughly 450), which is what you might expect from having to write twice the amount of data. I wonder if LVM writes data initially as striped then does some settling later? It’s a bit curious that only ZFS has this penalty.
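
To put a number on the ZFS gap, here is the ratio worked out from the RAID1 pgbench table. This is just arithmetic on the values already shown; the helper name is hypothetical.

```python
# TPS values copied from the RAID1 pgbench table above.
RAID1_TPS = {
    "OpenEBS cStor": 455.0,
    "OpenEBS LocalPV ZFS": 252.65,
    "Rook Ceph LVM": 443.038,
}


def relative_tps(results: dict, subject: str) -> dict:
    """Return the subject's TPS as a fraction of every other plugin's TPS."""
    base = results[subject]
    return {name: base / tps for name, tps in results.items() if name != subject}


if __name__ == "__main__":
    for name, ratio in relative_tps(RAID1_TPS, "OpenEBS LocalPV ZFS").items():
        print(f"LocalPV ZFS runs at {ratio:.1%} of {name}")
```

Both ratios land in the 55-57% range, so the gap is consistent whichever of the other two plugins is used as the baseline.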

Note also that the ZFS results (cStor is also ZFS) might be skewed due to how ZFS buffers writes in memory (“asynchronous write”), so the O_DIRECT results are probably a little better to use for eyeball comparisons.

Some light analysis

Here are some of the thoughts that I shared with my mailing list:

JBOD (single disk) results

A few thoughts on the single disk results and anomalies:

No surprise OpenEBS HostPath with direct write was the fastest for random read! NVMe sure is fast.
Mayastor doing better than host path in random write is a bit curious, but could be just noise
It’s a bit weird that Ceph, Jiva and Mayastor did so badly on O_DIRECT random write, not sure what’s happening there
Average read latency is terrible on cStor, and best on hostpath followed by Mayastor
It looks like Mayastor was the general winner in JBOD disk stats
PGBench is a wash which is great – you want to see the database perform the same across all the disks. I don’t think 450 tx/second is particularly good but I haven’t tuned anything and that doesn’t seem terrible, I certainly don’t have any apps doing 400 tx/second anyway.

RAID1 (two disk) results

And a few results on the disk-level RAID1 setup:

LocalPV ZFS absolutely blew away the competition – I think that’s ZFS hitting its in-memory caches more than anything. I’d probably think the same is happening with Ceph as well. ZFS is much better at it though.
cStor evidently somehow decided it was time to shine during the RAID1 random writes. It’s performed like dogshit every other time but this one time it absolutely takes it away. I have no idea what’s happening there. Maybe cStor just lies its ass off about writes and acks them before anything hits any disks?
LocalPV ZFS has some nice read latency – pretty shit at writing though
cStor also does super well on sequential write for some reason, but generally sequential ops are done best by Ceph (I’m not sure I trust cStor’s results…)
PGBench finally showed something interesting! ZFS’s TPS is basically cut in half (which is what you might expect, since it is mirrored after all), but what’s weird is that cStor and LVM don’t seem to change at all… Very peculiar.

There’s a lot of testing I haven’t done

There’s a large amount of things left to test/re-run. Getting all this automated is a good first step (so it will be easier to iterate on testing any individual one of these later), but I’ve left out the following variations, just to start:

fio block sizes
fio worker thread count
resource limits on the storage plugins
tuning on individual storage solutions/tuning levels

All of these things could have a huge impact and could change the skew of these results – it’s a huge matrix of possibilities. What I need is a framework like the one I built back during the PM2 cluster sizing experiments. I may have to reach over and use that in the future, but for now I’ll take these results as just one cubby hole in the matrix.
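
To get a feel for how fast that matrix grows, here is a minimal sketch that enumerates a few fio settings. The specific values are hypothetical (none of them were run for this post); `bs`, `numjobs` and `direct` are the fio option names for block size, worker count and O_DIRECT.

```python
from itertools import product

# Hypothetical values for illustration only; these are not the tested settings.
BLOCK_SIZES = ["4k", "64k", "1m"]
WORKER_COUNTS = [1, 4, 16]
DIRECT = [False, True]


def build_matrix(block_sizes, worker_counts, direct_flags):
    """Yield one dict of fio options per combination of settings."""
    for bs, numjobs, direct in product(block_sizes, worker_counts, direct_flags):
        yield {"bs": bs, "numjobs": numjobs, "direct": int(direct)}


if __name__ == "__main__":
    matrix = list(build_matrix(BLOCK_SIZES, WORKER_COUNTS, DIRECT))
    print(f"{len(matrix)} fio runs per storage plugin")
    for cell in matrix[:3]:
        print(cell)
```

Even this small grid means 18 runs per plugin, before resource limits or per-plugin tuning enter the picture.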

Sidenote: Making graphs is harder than it should be

Repeating it here, but someone needs to make a site where you can paste a table and get a simple bar/line graph out. For these graphs I used the venerable gnuplot. Though I did see some nice options:

Plotly’s Online Chart Maker (Plotly.js is F/OSS)
GNUPlot online
Json2Plot a tool that uses Plotly

Looks like I’ve got a new side project to do at some point – if this was automated it would be a lot easier to run all the tests, and I could spend more time on only analysis.
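
In the meantime, a rough stand-in for "paste a table, get a graph" fits in a few lines of standard-library Python. This is only a sketch of the idea (text bars, not gnuplot output); the sample rows are taken from the RAID1 pgbench table and the function names are invented.

```python
# Sample rows from the RAID1 pgbench table, tab-separated like the tables above.
SAMPLE = (
    "Plugin\ttps\n"
    "OpenEBS cStor\t455\n"
    "OpenEBS LocalPV ZFS\t252.65\n"
    "Rook Ceph LVM\t443.038\n"
)


def parse_number(cell: str) -> float:
    """Turn a table cell such as '295,000' into a float."""
    return float(cell.replace(",", ""))


def bar_chart(tsv: str, column: str, width: int = 40) -> list:
    """Render one numeric column of a tab-separated table as text bars."""
    header, *rows = [line.split("\t") for line in tsv.strip().splitlines()]
    idx = header.index(column)
    values = [(row[0], parse_number(row[idx])) for row in rows]
    top = max(v for _, v in values)
    label_width = max(len(name) for name, _ in values)
    return [
        f"{name:<{label_width}} {'#' * round(width * v / top)} {v:g}"
        for name, v in values
    ]


if __name__ == "__main__":
    print("\n".join(bar_chart(SAMPLE, "tps")))
```

Cells containing `???` would need to be filtered out before charting, since `parse_number` will raise on them.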

Not completed: Clustered deployments

There’s obviously some more to be done here – I need to try out and get some decent numbers on clustered deployments. Multi-node setups are the obvious day 2 for better disaster recovery and availability, but this analysis was only single-node.

What if: btrfs

ZFS and btrfs are really similar – there are differences, but btrfs is actually somewhat easier to manage in some ways (heterogeneous disks, etc), and while people used to often cite questions of reliability, Facebook is well known to have switched to btrfs and btrfs is the default filesystem of Fedora 33. And after all, btrfs has made it into the mainline Linux kernel.

In the past I have run into some posts on ZFS’s NVMe performance that show it’s definitely not a free lunch and some care has to be taken when using ZFS in new places (it seems like performance was bad because trying to write to ZIL/SLOG was actually causing writes to go slower because of just how fast NVMe is). While I don’t really have the problem of NVMe drives on any of the machines I run right now it’s worth keeping an eye on how the area evolves.

NVMe aside, ZFS’s performance penalty seems to be kind of high. There’s a paper out there which shows some pretty favorable numbers for btrfs, though it’s back from 2015.

ZFS has some really important features though, in particular:

Absolute crash-safe data integrity (via WAL and checksumming, as you’d use in a database)
Copy-on-Write (CoW) and the features that enables (“instant” snapshotting/cloning)
Journaling

One thing you might ask is why you’d need Copy-on-Write and journaling – well ZFS has a feature called SLOG that lets you keep the Write-Ahead-Log on a completely separate drive! I may need to test ZFS with and without the ZIL in the future.

But getting back to btrfs – I’ve heard some mixed reviews over the years about btrfs, but it’s important to note that it’s in the kernel already. I found a paper with btrfs doing quite well compared to ext4 and xfs and some levelheaded helpful reviews.

Maybe it’s worth giving it a shot? Are there any provisioners for btrfs PVCs? Should I just format some local loopback mounted disks with btrfs (ex. OpenEBS rawfile LocalPV)? The easiest to start with would likely be the latter, especially for testing – loopback drives shouldn’t suffer too much of a penalty to at least make a comparison.

One thing I’ve always wanted/needed in btrfs is synchronous remote writes (kind of like ceph). There’s quite a performance hit, but being able to ensure writes are persisted on a remote drive before continuing for certain workloads would be really useful. Unfortunately btrfs doesn’t support this either. btrfs does have some very nice features:

btrfs has better disk size flexibility
btrfs has no downtime pool expanding
btrfs has volume shrinking
btrfs has automatic data redistribution

ZFS seems to be better at raidzX but I know for most dedicated servers I’m going to have 2x some drive so I don’t think I’m too worried about RAID5/Z5.

The last thing to worry about is whether btrfs is stable. The FUD is quite persistent on this front, but there’s a whole section on it in the kernel wiki, and Facebook happens to use it. It’s also the default for OpenSUSE’s enterprise distribution so honestly it’s good enough for me.

ZFS: Do we need ZIL?

One area of optimization that we could do on storage is considering whether we need ZIL for certain workloads on ZFS. ZFS is already copy on write, so it makes sense to disable similar features at the application level (and various scrubbing/checksumming features) if we know that we’re getting rock solid data protection from ZFS. It looks like sometimes it’s actually better to have it for performance, but it’s worth considering.

Generate a searchable/static results page

Would be cool to have a simple static results page that we could throw up on GitLab pages and use to share the results. Maybe using some simple markdown-driven site generator would be a good idea/enough. If we wanted to get really fancy, we could even schedule the tests to be run every week or month. I’ll leave that for the next time I touch this project (which will probably be when I add clustered testing as well)

If you’re going to have lots of OSDs, watch out for `aio-max-nr`

While reading some stuff around the internet I took some notes on a presentation given by some people pushing the boundaries of storage setup in Japan. This slide seemed like something worth remembering, maybe even worth setting this value to something high right off the bat on nodes that are going to be storage-focused.

Looks like Ceph has some documentation on this as well
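
For reference, the current usage and limit can be read from `/proc/sys/fs/aio-nr` and `/proc/sys/fs/aio-max-nr` on Linux. The sketch below checks how much headroom is left; the helper names are made up, and what counts as "enough" headroom for a given number of OSDs is not something this post measured.

```python
from pathlib import Path

AIO_NR = Path("/proc/sys/fs/aio-nr")          # async I/O requests currently allocated
AIO_MAX_NR = Path("/proc/sys/fs/aio-max-nr")  # system-wide limit


def aio_headroom(current: int, limit: int) -> float:
    """Fraction of the aio limit still available (0.0 means exhausted)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, 1.0 - current / limit)


def read_counter(path: Path) -> int:
    """Read a /proc file that contains a single integer."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    if AIO_NR.exists() and AIO_MAX_NR.exists():
        current, limit = read_counter(AIO_NR), read_counter(AIO_MAX_NR)
        headroom = aio_headroom(current, limit)
        print(f"aio-nr={current} aio-max-nr={limit} headroom={headroom:.1%}")
    else:
        print("aio counters not found (probably not a Linux host)")
```

The limit itself is the `fs.aio-max-nr` sysctl, so it can be raised on storage-focused nodes ahead of time rather than after OSDs start failing to come up.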

QoS support?

PVC-centric Quality of Service limitations/features are lacking in all the baremetal hobbyist-to-enterprise solutions I’ve discussed here. Generally if you want to scale to platform-level service, you’re probably going to want to make sure you can avoid/limit the noisy neighbor problem. At the time of this post Kubernetes itself doesn’t even support hard-drive QoS quite yet, though there’s a KEP out.

Longhorn? Nope

Longhorn doesn’t support QoS (and neither does OpenEBS Jiva, since it’s based on Longhorn).

ZFS (Ceph/ZFS or OpenEBS localpv-zfs) ? Nope

QoS features are in Oracle ZFS, but you know what they say about Oracle, not even once. OpenZFS doesn’t seem to have this feature right now so unless I’m willing to jump in and implement it (or someone else does) it’s not there for now.

Ceph/Rook? Yes!

QoS is supported by Ceph but not yet supported or easily modifiable via Rook and not by ceph-csi either. It would be possible to set up some sort of admission controller or initContainers to set the information on PVCs via raw Ceph commands after creation though so I’m going to leave this as possible.

After setting up a Block you can clearly see the QOS settings:

QOS settings in Ceph Dashboard
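
As a sketch of what the "raw Ceph commands" route could look like: RBD exposes per-image QoS options such as `rbd_qos_iops_limit`, which can be set with `rbd config image set`. The snippet only builds the command line; the pool and image names are hypothetical, and how (or whether) the limit is enforced for a given client setup is something to check against the Ceph docs for your release.

```python
def rbd_qos_command(pool: str, image: str, iops_limit: int) -> list:
    """Build the argv for capping IOPS on a single RBD image."""
    if iops_limit < 0:
        raise ValueError("iops_limit must be >= 0")
    return [
        "rbd", "config", "image", "set",
        f"{pool}/{image}",
        "rbd_qos_iops_limit", str(iops_limit),
    ]


if __name__ == "__main__":
    # Hypothetical pool/image names; a real setup would look these up
    # from the PersistentVolume that backs the PVC.
    print(" ".join(rbd_qos_command("replicapool", "csi-vol-example", 1000)))
```

An admission controller or initContainer could run something like this (for example via `subprocess.run`) right after a volume is provisioned.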

kernel-level support? Yup (with v2 cgroups)

Maybe we could limit IOPS with containerd & v2 cgroups? We know that v1 cgroups couldn’t limit properly if you use the Linux writeback cache (i.e. you weren’t using O_DIRECT, which postgres doesn’t but mysql does, for example), but v2 looks to have some promise – there’s an awesome blog post out there by Andre Carvalho outlining how to do it with v2 cgroups.

Even though the kernel can do it, looks like Kubernetes doesn’t have the ability just yet:

containerd (always the innovator, I remember when it was the first to get alternative non-runc container engines underneath) looks to support v2 cgroups and io limiting, thanks to hard work from Akihiro Suda and others.

Limiting IOPS is a crucial feature of being able to provide storage as a service. Up until now I thought the best way was to do it at the storage driver layer, but it looks like the kernel subsystems might be just as good a place and might work for everything at the same time. Hopefully sometime in the future I can get around to testing this.
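
For a concrete picture of the kernel side: cgroup v2 takes limits through an `io.max` file, one line per device, keyed by `MAJ:MIN` with `rbps`, `wbps`, `riops` and `wiops` values (`max` removes a limit). The sketch below only formats such a line; the device numbers and limits are hypothetical, and nothing here was tested against a running cluster.

```python
from typing import Optional

IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def io_max_line(major: int, minor: int, **limits: Optional[int]) -> str:
    """Build one line for a cgroup v2 io.max file.

    A limit of None is written as "max", which removes that limit.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    parts = [f"{major}:{minor}"]
    for key in IO_MAX_KEYS:
        if key in limits:
            value = limits[key]
            parts.append(f"{key}={'max' if value is None else value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical device 259:0, capped at 1000 read IOPS and 500 write IOPS.
    print(io_max_line(259, 0, riops=1000, wiops=500))
```

Writing that line into the `io.max` file of a container's cgroup (with the io controller enabled) is what applies the limit; finding the right cgroup path for a given pod is the part that depends on the container runtime.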

Read Write Many (RWX)

Read Write Many is another feature of Kubernetes storage plugins that isn’t talked about so much (since it’s not available that often) but it’s a really nice feature in my mind. A couple reasons you might want this:

Geo-distributed webpages (set up an NGINX instance in every region hooked up to a RWX PVC and theoretically you have an instant CDN!)
Cloud Drive aaS (set up a file viewer in every region with a RWX PVC and theoretically you have an instant GDrive clone)
Live migration of KubeVirt VMs

RWX via Rook + CephFS? Possible!

Ceph is very robust software so it actually has a way to do block storage, object storage and file storage. For Ceph, CephFS is the piece of the puzzle that supports RWX drives. Since Rook is managing “our” Ceph instance, the question of whether it’s supported by Rook also comes up so there are a few issues in the rook repo about this:

https://github.com/rook/rook/issues/543 (always surreal to see yourself in the comments)
https://github.com/rook/rook/issues/5936
https://github.com/rook/rook/issues/1125

It looks like CephFS has been functional since Rook v1.1, though there was one ticket that suggests it may have been broken intermittently between releases.

Someday I’d love to circle back and give this a try – hopefully sometime soon in the future!

RWX via OpenEBS + NFS? Possible

Another way to achieve RWX would be to use OpenEBS in combination with NFS. The OpenEBS documentation details setting up RWX PVCs based on NFS very well (also detailed on their blog back in 2018) and if you want to go straight to the code, openebs/dynamic-nfs-provisioner is the place to start.

This path seems really easy, but a blog post on it would make things crystal clear. I’m not sure if many people have taken this route.

UPDATE: RWX via Longhorn? Yes

Another good option for the RWX problem is Longhorn – as I’ve mentioned this is what OpenEBS Jiva is based on, and it’s worth mentioning that the ease with which you can set up OpenEBS Jiva is definitely attributable in part to Longhorn’s hard work. I personally have interacted with their team and they’re very capable (obviously, longhorn works so well it was extended), and gracious as well, even when I was lodging what might be seen as a complaint.

Originally I missed this point, and for that I have to thank /u/joshimoo on reddit who pointed out two salient points:

Longhorn supports RWX volumes
Harvester supports live migration with KubeVirt

At this point we have enough of these solutions that support RWX that next time I’m going to need to add Longhorn to the stable of solutions, and start running tests on RWX as well, as soon as I find out how you’re supposed to test it in the first place (somehow fio doesn’t seem like the right choice). According to an SO post it looks like I can just check the NFS wiki for decent testing tools, and dbench (I used this last time I did some benchmarking) or bonnie might be good bets. Maybe I’ll be the first one to get some proper NFS vs CephFS performance comparisons. Even dd and iozone seem to be fine for benchmarking NFS so maybe I don’t need to think about it too hard.

Resources on RWX via NFS/CephFS

The question of whether to use NFS with some not-Ceph solution or CephFS is a big one. I’ve only been able to find one resource that seems to compare the two:

https://archive.fosdem.org/2018/schedule/event/cephfs_gateways/ (PDF)

But even this is more about running samba on top of them, which is a bit different from a knock-down drag-out benchmark.

OK, but what are you actually going to run?

The numbers I’ve gotten aren’t all that good. They’re pretty good for seeing some trends, but pretty bad in other ways: some of the various biases of certain providers are leaking through, making it somewhat hard to get a good read. As I get more time to go over the benchmarks (or get some nice contributions) I think I’ll work on them more and may revisit, but for now I’m making the following determination:

Given a system with more than one NVMe disk, I’m going to install both OpenEBS Mayastor and Ceph, and use them both, with the bigger empty disk going to Ceph and the smaller partition going to OpenEBS Mayastor.

There are a few reasons I’ve arrived at this decision:

Same-node disk failure domain is good, but if I have two nodes I’m OK with achieving that over the network instead.
High availability for geographically close nodes is important to me
Mayastor has shown impressive performance in the JBOD (i.e. single disk) tests – I would have expected all of them to do that, but outside of Mayastor ceph was closest
I have more faith in CephFS and Rados Gateway than in my ability to run and maintain NFS and Minio
Ceph is the industry standard (if it’s good enough for CERN, it’s good enough for me)
Mayastor was the easiest to install and maintain, and is very much built for NVMe
Mayastor’s shortcomings (not offering snapshots & clones for example) can be covered by Ceph via Rook
Mayastor’s strong suits (like being able to make memory-pooled disks) are really valuable and they’re an innovative player in the field

I really want to have ZFS underneath/as a part of my system (and maybe I still can by running LocalPV ZFS on top of mayastor or something), and have its amazing data safety, but the drastic drop in write speed (which you can see in the TPS reductions) and the addition of overhead when I can get some pretty decent availability improvements from the other two solutions just makes it a non-starter. We’ll see if I live to regret this decision, because if NVMe is 2x faster than SSD, then maybe it makes sense to use ZFS and go back to ~SSD performance with the benefit of essentially absolute data confidence and a really easy to configure/use system.

Wrapup

Well this started as one blog post and turned into a 5 part series! It’s been a grueling but fun yak shave and hopefully a worthwhile post to put together (along with the development on the back end).

I’m glad I finally got some time to do some proper testing of these solutions and hopefully the analysis will help others out there. If this article ends up being anywhere as useful as the CNI benchmark posts have been, I’ll be more than thrilled.

VADOSWARE

Living in a yak shaver's paradise.

K8s storage provider benchmarks round 2, part 5
