Comparing OpenEBS and Hostpath


tl;dr - I ran some cursory tests (dd, iozone, sysbench) to measure and compare IO performance on OpenEBS-provisioned storage against hostPath volumes in my small Kubernetes cluster. Feel free to skip to the results. OpenEBS’s Jiva-engine-backed volumes have about half the throughput for single large writes (gigabytes) and slightly outperformed hostPath for many small writes. In my opinion, the simplicity and improved ergonomics/abstraction offered by OpenEBS are well worth it. Code is available in the related GitLab repo.

I’ve written a bit in the past about my switch to OpenEBS, but I never took the time to examine performance. While I can’t say I’ve ever really needed to maximize disk performance for a database on my tiny Kubernetes cluster (most applications I maintain are doing just fine with SQLite and decent caching), I did want to eventually get a feel for just how much performance is lost when picking a more robust solution like OpenEBS over something like hostPath (simple but less than ideal from management and security perspectives) or local volumes (safer but still somewhat cumbersome to manage). In this post I’m going to run through some quick tests that will hopefully make the cost of the robustness that OpenEBS provides somewhat clearer.

I use OpenEBS because, for my specific infrastructure, it’s easier to install than Rook (which I used previously). My cluster runs on Hetzner dedicated servers, which do not expose an easy interface for giving Rook a hard drive to manage. I’ve written about it before so feel free to check that out for a more detailed explanation. The short version is that undoing Hetzner’s RAID setup has had disastrous consequences whenever I inadvertently updated GRUB (due to GRUB’s lack of proper RAID support to start with), and it’s much easier for me to just stop trying to undo the setup and go with a robust volume provisioner that works with just space on disk. The Rook/Ceph documentation is a little vague, but I’m fairly convinced that Rook (i.e. Ceph underneath) can’t operate over just a folder on disk. Of course there are the obvious control issues to contend with (Linux filesystems can’t really limit folder size easily), and the only other alternative seems to be creating some loopback virtual drives, but I don’t want to have to do that provisioning step. Rook also does many things these days: it used to be primarily concerned with automating Ceph clusters and providing storage, but now it can run Cassandra, Minio, etc. At this point I’m really only worried about getting “better than hostPath/local volume” robustness on my tiny cluster.

Anyway, all that out of the way, let’s get into it. The first thing I’m going to do is update my installation of OpenEBS.

The OpenEBS set-up

The last time I wrote about OpenEBS, I installed OpenEBS using these versions of the control plane:

Control plane component  Image                                          Version
openebs-provisioner      quay.io/openebs/openebs-k8s-provisioner:0.8.0  0.8.0
snapshot-controller      quay.io/openebs/snapshot-controller:0.8.0      0.8.0
api-server               quay.io/openebs/m-apiserver:0.8.0              0.8.0

As you can see, version 0.8.0 across the board (check out the previous OpenEBS post for more specifics on the resource definitions used).

At the time of this post’s creation, a pre-release version of 0.8.1 is available (check out the OpenEBS releases page for more info). I’m not going to pursue updating to 0.8.1 for this post, but once the release is formally cut, I’ll upgrade and run these tests again (and also detail the upgrade process). The 0.8.1 release has some pretty big fixes and improvements, but we’ll leave testing it until the release actually happens.

Unfortunately, halfway through testing, cStor volumes stopped working properly on 0.8.0 – the following error showed up in the events for any cStor PVCs:

Events:
Type       Reason                Age                     From                                                                                                    Message
----       ------                ----                    ----                                                                                                    -------
Normal     ExternalProvisioning  4m12s (x26 over 9m52s)  persistentvolume-controller                                                                             waiting for a volume to be created, either by external provisioner "openebs.io/provisioner-iscsi" or manually created by system administrator
Normal     Provisioning          2m37s (x6 over 9m52s)   openebs.io/provisioner-iscsi_openebs-provisioner-77dd68645b-tv98t_e39a79e7-5ab5-11e9-a260-2ec80b0e29c2  External provisioner is provisioning volume for claim "storage-testing/1c-2gb-dd-openebs-cstor-data"
Warning    ProvisioningFailed    2m37s (x6 over 9m52s)   openebs.io/provisioner-iscsi_openebs-provisioner-77dd68645b-tv98t_e39a79e7-5ab5-11e9-a260-2ec80b0e29c2  failed to provision volume with StorageClass "openebs-cstor-1r": Internal Server Error: failed to create volume 'pvc-0bdabcc7-5ded-11e9-a847-8c89a517d15e': response: unable to parse requirement: found '<', expected: identifier
Mounted By:  1c-2gb-dd-openebs-cstor-x9s69

I couldn’t find where the configuration could have been wrong (and exactly one job did seem to go through with cStor early on), but to prevent this blog post from getting unnecessarily long I’m going to just skip cStor for now and focus on hostPath vs Jiva (maybe when I upgrade to 0.8.1, I’ll revisit). This is a bit disappointing because cStor has a bunch of features that Jiva doesn’t (and seems to be where the OpenEBS project is going in the future).

Testing Tools

I’m no linux file system/IO subsystems expert, so my first step was to do some googling, and see what’s out there for testing filesystems. In particular I found the following links particularly concise and helpful:

More importantly, these resources led me to the better-than-dd test suites: iozone, sysbench, and filebench. I think sysbench is the most established of these in the Linux space. I’m going to ignore filebench in this case, due to the wiki noting that the pre-included workloads aren’t good for modern RAM amounts, and run the other two alongside dd. Since in my case I really only care about disk IO performance, I think dd, iozone and sysbench are more than enough.

Setup/Methodology

While I won’t be doing much to dampen the load on the system, I will be doing multiple runs to try and lessen the effect of random load spikes on my rented hardware @ Hetzner.

Machine Specs

  • Intel Core i7-990x (~3.46GHz)
  • 2x SSD 240 GB SATA
  • 6x RAM 4096 MB DDR3 (~24GB RAM)

The basic idea is to run a series of Kubernetes Jobs that will run the relevant tests. Between runs, the only thing I’ll change in the Job is the volume: from a hostPath volume to a PersistentVolumeClaim (PVC) backed by an OpenEBS-provisioned PersistentVolume (PV).

To help reproducibility I’ll also be making available a GitLab repository with the mostly automated code I ran – You’ll likely need a Kubernetes cluster to follow along. All runs were performed 10 times and summarized from there (average, std-dev, etc).

dd-based testing

The DD tests are very basic, pieced together from a few reference explanations of decent ways to use dd for this purpose:

We have to be careful to either disable Linux’s in-memory disk cache or write enough data to ensure we go past what can be cached – I’ve chosen the latter here, so we’ll be writing roughly 2x the memory limit given to each pod.

While testing, as you might expect, I ran into OOM-killing on my process:

$ k get pods -n storage-testing
NAME                                                            READY   STATUS      RESTARTS   AGE
1c-1gb-dd-openebs-jiva-fpn4v                                    0/1     OOMKilled   0          4m48s

This has to do with the container not actually flushing to disk fast enough:

Here’s a nice SO post about LXC containers – the idea is that the Linux disk write cache absorbs the single 2GB block (2x the amount of RAM I gave the container), and the container gets OOM killed. There’s also an LXC mailing list thread which explains the problem nicely – but of course, we’re not using LXC, we’re using containerd.

To solve this, we need to tell the kernel to flush more often, but the problem is that the kernel’s /proc/sys file system (where we might normally echo <value> > /proc/sys/vm/dirty_expire_centisecs) is read-only inside a container, which is definitely a good idea security-wise. This means I had to spend some time looking up how to set sysctls for pods in Kubernetes (I also had to reference the kubelet command line reference). Unfortunately, that led to another issue – namely that my kubelet didn’t allow modifying that particular sysctl:

$ k get pods -n storage-testing
NAME                                                            READY   STATUS            RESTARTS   AGE
1c-1gb-dd-openebs-jiva-rxlc8                                    0/1     SysctlForbidden   0          7s

Unfortunately, vm.dirty_expire_centisecs does not look to be namespaced yet, so I’m going to change the strategy and write less than the available RAM to disk instead. This seems like a pretty drastic change, but I guess if I find that all 3 methods have the exact same performance profile it will be pretty easy to discount the results, as I’m expecting OpenEBS to impose some performance penalty at least. Maybe this is a good point to make the disclaimer that these tests will not do anything for people looking to write 16GB files to disk all at the same time.

Since I need to make some decision, the strategy will be to write half the available memory instead – here are the dd commands I ended up with:

  • Throughput => dd if=/dev/zero of=/data/output.img bs=<half of available memory> count=1 oflag=dsync
  • Latency => dd if=/dev/zero of=/data/output.img bs=<half of available memory / 1000> count=1000 oflag=dsync

These commands work on the version of dd in Alpine’s coreutils package (not the Busybox version that is there by default). Here’s an example pod container spec:

      containers:
        - name: test
          image: alpine:latest
          command: ["/bin/ash"]
          args:
            - -c
            - |
              apk add coreutils;
              dd if=/dev/zero of=/data/throughput.img bs=512M count=1 oflag=dsync 2>&1 | tee /results/$(date +%s).throughput.dd.output;
              dd if=/dev/zero of=/data/latency.img bs=512K count=1000 oflag=dsync 2>&1 | tee /results/$(date +%s).latency.dd.output;

To see the full configuration, please check out the GitLab repository.

As you might be able to guess, bs=512M signifies that this was for a 1GB-of-RAM test (there’s only one, so 1CPU/1GB), and I just approximated 512M/1000 (~0.512M) as 512K. I also left the CPU resource limits in, despite the fact that they should not really affect file IO much – I did want it to be as close to a “real” Pod as possible (and setting resource limits is certainly a best practice for pods).
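
The block-size arithmetic can be sketched as a small shell helper. This is only an illustration, not the script from the repo: the function name dd_block_sizes and the choice to pass the memory limit in MiB are hypothetical. It also shows that an exact 1000-way split of 512M comes out nearer 524K, so the 512K approximation writes slightly less than half the memory in total.

```shell
# Hypothetical helper (not from the GitLab repo): derive dd arguments from a
# pod memory limit given in MiB.
dd_block_sizes() {
  mem_mib=$1
  # "One big write": a single block of half the memory limit.
  half_mib=$(( mem_mib / 2 ))
  # "Many small writes": the same amount split into 1000 blocks, in KiB
  # (integer division rounds down).
  small_kib=$(( half_mib * 1024 / 1000 ))
  echo "throughput: bs=${half_mib}M count=1"
  echo "latency: bs=${small_kib}K count=1000"
}

dd_block_sizes 1024
# throughput: bs=512M count=1
# latency: bs=524K count=1000
```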

iozone-based testing

Following an excellent quick reference to iozone, I’m using the following iozone command:

$ iozone -e -s <half-ram> -a > /results/$(date +%s).iozone.output;

Some notes on this setup:

  • -s <half-ram> corresponds to half the available memory (so for the 2c-4gb case that would be 2g).
  • -e includes the fsync timing, which is important since otherwise we’d just be testing the linux disk cache
  • -a is automatic mode (which tests a variety of block sizes)

The iozone tests took over half a day to run in all (all variations, meaning the different machine configs and hostPath/openebs-jiva). This is likely because, for every configuration (ex. 2 cores, 4GB RAM, with an OpenEBS volume), iozone ran through various block sizes for the file size I gave it (<half-ram>).

sysbench-based testing

sysbench is another tool that is pretty commonly used to test file system performance. In trying to figure out a reasonable usage I needed to consult the github documentation as well as calling --help on the command line a few times. I used the severalnines/sysbench docker image, and spent some time in a container trying to figure out how to properly use sysbench.

For example here is what happens if you try to run a fileio test:

root@0058ea7d7215:/# sysbench --threads=1 fileio run
sysbench 1.0.17 (using bundled LuaJIT 2.1.0-beta2)

FATAL: Missing required argument: --file-test-mode

fileio options:
--file-num=N                  number of files to create [128]
--file-block-size=N           block size to use in all IO operations [16384]
--file-total-size=SIZE        total size of files to create [2G]
--file-test-mode=STRING       test mode {seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw}
--file-io-mode=STRING         file operations mode {sync,async,mmap} [sync]
--file-async-backlog=N        number of asynchronous operatons to queue per thread [128]
--file-extra-flags= [LIST,...] list of additional flags to use to open files {sync,dsync,direct} []
--file-fsync-freq=N           do fsync () after this number of requests (0 - don't use fsync ()) [100]
--file-fsync-all [=on|off]     do fsync () after each write operation [off]
--file-fsync-end [=on|off]     do fsync () at the end of test [on]
--file-fsync-mode=STRING      which method to use for synchronization {fsync, fdatasync} [fsync]
--file-merged-requests=N      merge at most this number of IO requests if possible (0 - don't merge) [0]
--file-rw-ratio=N             reads/writes ratio for combined test [1.5]

This led me to the following commands (for the 1C 2GB example):

$ sysbench --threads=1 --file-test-mode=<mode> --file-fsync-all=on --file-total-size=1G fileio prepare
$ sysbench --threads=1 --file-test-mode=<mode> --file-fsync-all=on --file-total-size=1G fileio run > /results/$(date +%s).<mode>.sysbench.output

file-test-mode can take a few different values, so the easiest approach I could come up with was to just run them all. I was also a little unsure about the difference between fsync and fdatasync and came across a very useful blog post that cleared it up. The sysbench results will be used mostly for latency measurements. At this point I’m getting lazy (this post has taken a while to write), so I’m going to just take the min, avg, and max metrics as they’re provided in each test run and compare those; I think we’ve got a good enough idea of throughput from the dd and iozone tests.
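
Running every mode boils down to a loop over the values listed in the --file-test-mode help text. Here is a rough sketch of that loop; the function name sysbench_plan is hypothetical, and the extra "fileio cleanup" step (a standard sysbench command that removes the prepared test files) is an addition of this sketch rather than something from the commands above. The function only prints the commands so that they can be reviewed first.

```shell
# Modes taken from the --file-test-mode help text.
SYSBENCH_MODES="seqwr seqrewr seqrd rndrd rndwr rndrw"

# Hypothetical helper: print (not execute) the prepare/run/cleanup commands
# for every mode. $1 is the total file size, e.g. 1G for the 1C/2GB case.
sysbench_plan() {
  total_size=$1
  for mode in $SYSBENCH_MODES; do
    opts="--threads=1 --file-test-mode=$mode --file-fsync-all=on --file-total-size=$total_size"
    echo "sysbench $opts fileio prepare"
    # \$(date +%s) is left unexpanded so it is evaluated when the plan runs.
    echo "sysbench $opts fileio run > /results/\$(date +%s).$mode.sysbench.output"
    echo "sysbench $opts fileio cleanup"
  done
}
```

Inside the test container, something like `sysbench_plan 1G | sh` would then execute the plan.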

Results

At-a-glance (graphs)

Here are some cherry-picked graphs that should give a rough overview of the results (higher is better except where noted otherwise):

DD: [graph: dd 4c/4gb throughput] [graph: dd 4c/4gb latency]

iozone: [graph: iozone 4c/4gb]

sysbench: [graph: sysbench 4c/4gb]

Overall – it looks like for large writes openebs-jiva only does half as well as hostPath, but it keeps up and slightly improves on hostPath performance for many small writes.

I couldn’t be bothered with gnuplot (though I originally intended to use it) so I used the first bar graph maker I found online. Hopefully someday a future me will go back and automate this.

Below are the tabulated results for each type of test – please refer to the Setup/Methodology section above if you are interested in methodology/how I arrived at these numbers.

dd-based testing

The results for the dd tests are basically the metrics pulled from dd’s output and processed with GNU datamash (which I found out about from an SO question). Here’s an example of the commands I ran:

# cat /var/storage-testing/1c-2gb/dd/hostpath/*throughput* | grep "MB/s" | awk '//{print $10}' | datamash min 1 max 1 mean 1 median 1
195     208     201.54545454545 201
# cat /var/storage-testing/1c-2gb/dd/openebs-jiva/*throughput* | grep "MB/s" | awk '//{print $10}' | datamash min 1 max 1 mean 1 median 1
103     107     104.27272727273 104
# cat /var/storage-testing/1c-2gb/dd/hostpath/*latency* | grep "MB/s" | awk '//{print $10}' | datamash min 1 max 1 mean 1 median 1
30.3    31.7    31.281818181818 31.4
# cat /var/storage-testing/1c-2gb/dd/openebs-jiva/*latency* | grep "MB/s" | awk '//{print $10}' | datamash min 1 max 1 mean 1 median 1
46.9    48.5    47.572727272727 47.6
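
For anyone without datamash installed, roughly the same summary can be sketched with sort and awk. This is an alternative to the datamash calls above, not what I actually ran; the function name summarize is hypothetical, and unlike datamash it rounds the mean to two decimal places.

```shell
# Hypothetical datamash stand-in: read one number per line on stdin and print
# min, max, mean and median separated by tabs.
summarize() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      # Input is sorted, so the median is the middle value (or the average
      # of the two middle values for an even count).
      if (NR % 2) median = v[(NR + 1) / 2]
      else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "%s\t%s\t%.2f\t%g\n", v[1], v[NR], sum / NR, median
    }'
}
```

It slots into the same pipeline in place of the datamash call, e.g. `... | awk '//{print $10}' | summarize`.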

1CPU/2GB

One big write:

Test  Resources  Method        min        max        mean         median
dd    1CPU/2GB   hostPath      195 MB/s   208 MB/s   201.54 MB/s  201 MB/s
dd    1CPU/2GB   openebs-jiva  103 MB/s   107 MB/s   104.27 MB/s  104 MB/s

Many smaller writes:

Test  Resources  Method        min        max        mean         median
dd    1CPU/2GB   hostPath      30.3 MB/s  31.7 MB/s  31.28 MB/s   31.4 MB/s
dd    1CPU/2GB   openebs-jiva  46.9 MB/s  48.5 MB/s  47.57 MB/s   47.6 MB/s

2CPU/4GB

One big write:

Test  Resources  Method        min        max        mean         median
dd    2CPU/4GB   hostPath      207 MB/s   212 MB/s   210 MB/s     210 MB/s
dd    2CPU/4GB   openebs-jiva  103 MB/s   109 MB/s   106.36 MB/s  107 MB/s

Many smaller writes:

Test  Resources  Method        min        max        mean         median
dd    2CPU/4GB   hostPath      53 MB/s    54.1 MB/s  53.76 MB/s   53.9 MB/s
dd    2CPU/4GB   openebs-jiva  66.4 MB/s  67.7 MB/s  67.04 MB/s   66.9 MB/s

4CPU/8GB

One big write:

Test  Resources  Method        min        max        mean         median
dd    4CPU/8GB   hostPath      208 MB/s   214 MB/s   210.72 MB/s  211 MB/s
dd    4CPU/8GB   openebs-jiva  100 MB/s   109 MB/s   106.10 MB/s  107 MB/s

Many smaller writes:

Test  Resources  Method        min        max        mean         median
dd    4CPU/8GB   hostPath      81.7 MB/s  84.1 MB/s  83.41 MB/s   83.6 MB/s
dd    4CPU/8GB   openebs-jiva  83.3 MB/s  88.3 MB/s  87.2 MB/s    87.5 MB/s

Notes

  • In the tests I’ve chosen to scale the amounts being written and chunks with the resources available, which may not be reasonable for your usecase as how fast you write or how much you write depends on the actual software running. I did, however, prefer this approach to just picking a random workload profile to use across all the three machine sizes.
  • At really small sizes (1c/2gb is very, very small), it looks like OpenEBS actually does better than normal hostPath volumes on the many-small-writes test. This is likely due to some more efficient batching at the service level.
  • Throughput generally drops by about half with openebs versus a regular hostPath
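
To put a number on "about half", the means from the tables above can be compared directly. A tiny sketch (the helper name pct_of is hypothetical):

```shell
# Hypothetical helper: print the second value as a percentage of the first,
# to one decimal place.
pct_of() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * b / a }'
}

# One big write, 1CPU/2GB: openebs-jiva mean relative to the hostPath mean.
pct_of 201.54 104.27
# 51.7
```

The same call with the many-smaller-writes means (for example `pct_of 31.28 47.57`) shows the opposite direction, with openebs-jiva coming out ahead of hostPath.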

iozone-based testing

iozone-based tests produce their own output format, which I had to parse to get the numbers out (so I could stuff them into GNU datamash). Here’s an example of the end of one of the earlier tests I ran:

... lots more text ...
Run began: Sun Apr 14 15:26:48 2019

File size set to 1048576 kB
Auto Mode
Command line used: iozone -s 1g -a
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                     random    random     bkwd    record    stride
     kB  reclen    write  rewrite    read    reread    read     write     read   rewrite      read   fwrite frewrite    fread  freread
1048576       4  1329738  2813520  4233952  3946288  3077393  2076758  3464453   3490923   3256488  2634143  2692730  3543277  3499406
1048576       8  1561101  3373710  5766887  5155135  4481634  2815542  4800863   2426742   4613565  3273976  3307341  5053257  5056133
1048576      16  1678212  3809775  6104643  5288415  3495546  3314228  5758966   5703528   5542858  3721749  3719832  5968188  5816561
1048576      32  1757391  4112077  6687042  5796988  3899517  3788562  6580170   6359981   6120172  4091842  4011308  6292985  6274204
1048576      64  1788938  4209306  6705475  5928278  6584258  4051337  6597592   6875774   6345739  4148240  4148091  6636559  6761432
1048576     128  1805999  4165196  6155308  5838684  6396760  4115678  6482856   6781762   6326242  4183743  4154816  6460363  6386013
1048576     256  1807961  4135854  6117575  5545584  6019692  4059085  6415899   6427817   6258356  4215802  4237648  6055181  6130700
1048576     512  1783413  4081699  6191590  5281271  6005005  4120950  6207530   5946739   6154627  4174613  4215818  6100164  6036605
1048576    1024  1782428  4168907  6318471  5503034  6424230  4246604  3874917   6156721   6201657  4208766  3658356  6180271  6202059
1048576    2048  1816527  3876123  6212677  5528421  6232398  4229103  6337071   6150264   6398444  3843415  4020535  6434042  6306276
1048576    4096  1486736  3909344  6186504  5507659  5287112  3457475  6028728   5573350   5965411  3987299  3960445  5936328  3704585
1048576    8192  1600195  3202268  4887669  4442795  3305338  2691645  4917695   3482010   4871250  3126979  3274171  4805396  4746020
1048576   16384  1629628  3118255  4773326  4235028  4689994  2852128  4680765   3217477   4653983  2748085  3168286  4757711  4734897

iozone test complete.

The first two columns are the file size in kB and the record (block) size respectively; the rest of the columns are various iozone-specific metrics, expressed in kB/s. To roll up the iozone results, I’ll be averaging the measurements across block sizes. This may or may not be relevant to your usecase, but it should give an overall idea when comparing hostPath to OpenEBS performance at least.

Of course, trying to actually get the right information out of these files was somewhat of an undertaking (really just more unix-fu practice). First I started with trying to get the table of numbers out:

$ cat 1c-2gb/iozone/hostpath/1555323673.iozone.output | tail -n -15 | head -n 13
1048576       4   203992   227340  4135205  3849086  2963134   216528  3273281   2407060   2937104   217431   222678  3623451  3601148
1048576       8   214913   232188  4847535  4831389  4156547   218169  4583161   4464746   4239310   224655   225858  5003922  4923260
1048576      16   211075   225661  5412561  5381974  5146864   227503  5406201   4966963   4935592   229125   230148  5872763  5784507
1048576      32   216727   231412  5928078  5705259  5902049   223881  5524255   4156677   5458490   227918   228691  6836464  6671391
1048576      64   211837   232842  6174743  5847938  6524364   228451  5573498   4162948   5951124   228776   226410  7006061  7053477
1048576     128   220116   231463  5885652  5613692  6070534   231023  5672696   5841911   5551667   229371   230316  6022041  6545990
1048576     256   210254   230382  5545031  5303657  5695934   231303  5082626   5099634   5612911   226474   230220  5871579  5815184
1048576     512   222408   230690  5636752  5320454  5722638   229017  5523651   5124502   5480890   230370   228140  6318970  6409186
1048576    1024   214652   229361  5860063  5364408  5936640   224247  5570956   4046775   5580435   227235   227044  6516697  6395671
1048576    2048   216388   232562  4782295  5232544  5944440   230731  5440448   4988874   5560673   227781   230524  6427779  6276908
1048576    4096   212442   228526  4988852  5343960  5839793   229451  5995116   4532952   5420146   229946   228397  6264115  6253169
1048576    8192   213020   220035  4447040  4400494  4845009   224134  3032650   2994194   4454517   219029   228215  4740654  4904905
1048576   16384   219433   224104  4256744  4114369  4482835   226294  4157746   2561086   4631088   226311   227024  4803810  4760265
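
The tail/head arithmetic depends on exactly how many lines follow the table. A slightly more defensive sketch (not what I ran) is to keep only lines that look like result rows. The function name iozone_rows is hypothetical, and the check for 15 numeric-leading fields assumes every row has all 13 metric columns, as in the output shown here.

```shell
# Hypothetical filter: keep only iozone result rows, i.e. lines with 15
# fields whose first two fields (file size and record size) are integers.
# Reads the named files, or stdin when no file is given.
iozone_rows() {
  awk 'NF == 15 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' "$@"
}
```

Used as `iozone_rows 1c-2gb/iozone/hostpath/1555323673.iozone.output`, it should print the same rows as the tail/head pipeline.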

Then to summarize those numbers column-wise (so across reclen, AKA block size), we can feed this data into datamash (-W treats runs of whitespace as field separators, and we get min/max/mean/median for the 3rd column of values):

$ cat 1c-2gb/iozone/hostpath/1555323673.iozone.output | tail -n -15 | head -n 13 | datamash -W min 3 max 3 mean 3 median 3
203992  222408  214404.38461538 214652

And just to confirm that I’m not using datamash wrong, the minimum write column value is clearly 203992, the max value is clearly 222408, and the mean is ~214404. All that’s needed is to repeat this for the other columns and files (with a tiny bit more bash magic to combine the file excerpts):

$ for f in 1c-2gb/iozone/hostpath/*.iozone.output; do cat $f | tail -n -15 | head -n 13; done > /tmp/1c-2gb-combined.txt
# lots of output in that /tmp file, 130 lines for 10 test runs with 13 lines each

And feed the output of that into datamash:

$ for f in 1c-2gb/iozone/hostpath/*.iozone.output; do cat $f | tail -n -15 | head -n 13; done | datamash -W mean 3

I’ve chosen here to take averages of the numbers because I think it gives a decent sense of each test type that iozone runs, but I’m sure I’m wrong (and can’t wait to see the emails telling me just how much :). The measurements below are in kB/s.
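
Repeating the datamash call once per column gets tedious, so here is a sketch of computing every column’s mean in one pass with awk instead. This is an alternative to the commands above, with a hypothetical function name (iozone_column_means) and means rounded to one decimal place.

```shell
# Hypothetical helper: read iozone result rows on stdin and print the mean of
# every metric column (column 3 onwards), tab-separated, in column order.
iozone_column_means() {
  awk '
    NF >= 3 {
      rows++
      if (NF > cols) cols = NF
      for (i = 3; i <= NF; i++) sum[i] += $i
    }
    END {
      if (rows == 0) exit 1
      for (i = 3; i <= cols; i++)
        printf "%.1f%s", sum[i] / rows, (i < cols ? "\t" : "\n")
    }'
}
```

Feeding it the combined rows (for example `iozone_column_means < /tmp/1c-2gb-combined.txt`) gives one line per configuration, in the same column order as the iozone header.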

1CPU/2GB

iozone, 1CPU/2GB (kB/s):

Metric          hostPath   openebs-jiva
write           214045.8   1112719.3
rewrite         228291.5   119869.7
read            5119468.5  5522036.8
reread          4865668.9  5048935
random read     5097556.4  5324095.5
random write    223181.9   118863.5
bkwd read       4835166.3  5345608.9
record rewrite  4099065.3  4496415.1
stride read     4813667.8  5188153.3
fwrite          223755     112645.2
frewrite        225049.4   120023.6
fread           5484053.3  5477231.1
freread         5488995.7  5482449.5

2CPU/4GB

iozone, 2CPU/4GB (kB/s):

Metric          hostPath   openebs-jiva
write           223080.3   109808.2
rewrite         229733     110455.8
read            5486789.8  5639117.1
reread          5100119    5198092.4
random read     5260126.5  5467786.2
random write    226252.5   110561.6
bkwd read       5292155.5  5494932.7
record rewrite  4674197.7  4625088.5
stride read     5060957.2  5282346
fwrite          228253.3   111316
frewrite        229182.1   110935
fread           5690462.3  5662448.7
freread         5701447.4  5650025

4CPU/8GB

iozone, 4CPU/8GB (kB/s):

Metric          hostPath   openebs-jiva
write           238619.8   113276.4
rewrite         243632     113386
read            5655604.9  5773971.9
reread          5190182.3  5297572.3
random read     5460749.8  5507035.9
random write    233658.5   108030.1
bkwd read       5446474.5  5543742.1
record rewrite  4832785.1  4927512.1
stride read     5389630.8  5463138.2
fwrite          237220.2   113765.4
frewrite        237409.5   113682.7
fread           5734645.4  5783143.5
freread         5826861.1  5788586.3

Notes

  • Running iozone took much longer than the dd based tests, as it runs through lots more steps (and lots more output)
  • Processing iozone output was pretty painful (though to be fair I should have scripted it)
  • The results are similar to the dd based tests – write, fwrite, frewrite are all about 50% with openebs-jiva versus hostPath usage

sysbench-based testing

Sysbench results are yet another format, which looks something like this:

sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 8MiB each
1GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync () after each write operation.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!


File operations:
reads/s:                      51.14
writes/s:                     34.16
fsyncs/s:                     34.16

Throughput:
read, MiB/s:                  0.80
written, MiB/s:               0.53

General statistics:
total time:                          10.0094s
total number of events:              854

Latency (ms):
min:                                    0.00
avg:                                   11.72
max:                                   93.90
95th percentile:                       36.89
sum:                                10006.59

Threads fairness:
events (avg/stddev):           854.0000/0.00
execution time (avg/stddev):   10.0066/0.00

Since I’m using the pre-aggregated latency measurements that sysbench reports (min, avg, max, 95th percentile), I will be more explicit about which tests were which.
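
Pulling those pre-aggregated numbers out of each output file can be sketched with a few lines of awk. This is an illustration rather than the exact commands I used, and the function name sysbench_latency is hypothetical. It relies only on the "Latency (ms):" section layout shown in the sample output.

```shell
# Hypothetical helper: print the min, avg, max and 95th percentile latency
# (tab-separated, in ms) from one sysbench fileio output on stdin or in a file.
sysbench_latency() {
  awk '
    /^Latency \(ms\):/        { in_latency = 1; next }
    in_latency && NF == 0     { in_latency = 0 }   # blank line ends the section
    in_latency && $1 == "min:" { min = $2 }
    in_latency && $1 == "avg:" { avg = $2 }
    in_latency && $1 == "max:" { max = $2 }
    in_latency && $1 == "95th" { p95 = $3 }        # "95th percentile: <value>"
    END { printf "%s\t%s\t%s\t%s\n", min, avg, max, p95 }
  ' "$@"
}
```

Matching on the first field rather than on column position means the indentation of the latency lines does not matter.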

1CPU/2GB

Latency by test mode (ms):

Test      Resources  Method        Mode                          min (ms)  avg (ms)  max (ms)  95th pct (ms)
sysbench  1CPU/2GB   hostPath      Sequential Read (seqrd)       0.00      0.00      100.16    0.00
sysbench  1CPU/2GB   openebs-jiva  Sequential Read (seqrd)       0.00      0.00      4.10      0.00
sysbench  1CPU/2GB   hostPath      Sequential Write (seqwr)      21.98     26.43     65.16     38.94
sysbench  1CPU/2GB   openebs-jiva  Sequential Write (seqwr)      9.58      12.23     105.48    22.28
sysbench  1CPU/2GB   hostPath      Sequential ReWrite (seqrewr)  22.20     27.13     67.80     41.10
sysbench  1CPU/2GB   openebs-jiva  Sequential ReWrite (seqrewr)  1.92      12.08     107.15    22.28
sysbench  1CPU/2GB   hostPath      Random Read (rndrd)           0.00      0.00      100.14    0.00
sysbench  1CPU/2GB   openebs-jiva  Random Read (rndrd)           0.00      0.00      100.16    0.00
sysbench  1CPU/2GB   hostPath      Random Write (rndwr)          22.09     27.64     68.94     43.39
sysbench  1CPU/2GB   openebs-jiva  Random Write (rndwr)          1.75      12.03     58.33     22.28
sysbench  1CPU/2GB   hostPath      Random Read/Write (rndrw)     0.00      10.86     68.93     34.33
sysbench  1CPU/2GB   openebs-jiva  Random Read/Write (rndrw)     0.00      4.88      83.76     11.24

2CPU/4GB

Latency by test mode (ms):

Test      Resources  Method        Mode                          min (ms)  avg (ms)  max (ms)  95th pct (ms)
sysbench  2CPU/4GB   hostPath      Sequential Read (seqrd)       0.00      0.00      100.12    0.00
sysbench  2CPU/4GB   openebs-jiva  Sequential Read (seqrd)       0.00      0.00      4.14      0.00
sysbench  2CPU/4GB   hostPath      Sequential Write (seqwr)      22.18     27.35     81.72     40.37
sysbench  2CPU/4GB   openebs-jiva  Sequential Write (seqwr)      6.52      12.14     53.16     22.28
sysbench  2CPU/4GB   hostPath      Sequential ReWrite (seqrewr)  22.03     26.00     59.44     37.56
sysbench  2CPU/4GB   openebs-jiva  Sequential ReWrite (seqrewr)  8.51      11.97     58.01     22.69
sysbench  2CPU/4GB   hostPath      Random Read (rndrd)           0.00      0.00      96.08     0.00
sysbench  2CPU/4GB   openebs-jiva  Random Read (rndrd)           0.00      0.00      4.12      0.00
sysbench  2CPU/4GB   hostPath      Random Write (rndwr)          22.63     51.70     97.32     68.05
sysbench  2CPU/4GB   openebs-jiva  Random Write (rndwr)          3.24      13.82     112.46    33.72
sysbench  2CPU/4GB   hostPath      Random Read/Write (rndrw)     0.00      12.23     126.44    42.61
sysbench  2CPU/4GB   openebs-jiva  Random Read/Write (rndrw)     0.00      5.30      96.86     21.89

4CPU/8GB

Latency by test mode (ms):

Test      Resources  Method        Mode                          min (ms)  avg (ms)  max (ms)  95th pct (ms)
sysbench  4CPU/8GB   hostPath      Sequential Read (seqrd)       0.00      0.00      47.14     0.01
sysbench  4CPU/8GB   openebs-jiva  Sequential Read (seqrd)       0.00      0.01      75.85     0.01
sysbench  4CPU/8GB   hostPath      Sequential Write (seqwr)      22.13     30.91     126.20    53.85
sysbench  4CPU/8GB   openebs-jiva  Sequential Write (seqwr)      8.83      12.51     124.40    22.69
sysbench  4CPU/8GB   hostPath      Sequential ReWrite (seqrewr)  21.98     35.52     76.10     59.99
sysbench  4CPU/8GB   openebs-jiva  Sequential ReWrite (seqrewr)  4.61      12.18     64.30     22.69
sysbench  4CPU/8GB   hostPath      Random Read (rndrd)           0.00      0.00      4.06      0.01
sysbench  4CPU/8GB   openebs-jiva  Random Read (rndrd)           0.00      0.00      4.08      0.01
sysbench  4CPU/8GB   hostPath      Random Write (rndwr)          22.66     55.85     128.62    78.60
sysbench  4CPU/8GB   openebs-jiva  Random Write (rndwr)          4.38      16.45     133.18    45.79
sysbench  4CPU/8GB   hostPath      Random Read/Write (rndrw)     0.00      18.19     127.80    64.47
sysbench  4CPU/8GB   openebs-jiva  Random Read/Write (rndrw)     0.00      6.39      174.27    23.10

These numbers are kind of all over the place (in some places hostPath’s latency is better, and in others openebs-jiva does better), but a few things seem to stand out:

  • hostPath has a bad (but weirdly consistent) max latency for sequential reads (100ms versus jiva’s 4ms). This might have a very direct cause (possibly misconfiguration on my part or some specific test settings that are triggering this behavior).
  • openebs-jiva seems to be ~2-3x faster at sequential writes & rewrites
  • openebs-jiva seems to be ~2x faster at random writes and at the combined random read/write test (rndrw)
  • random reads seem to have the same perf characteristics as sequential reads which is very suspicious
  • With much bigger total file size, openebs-jiva seems to have an increase in max latency for reads but performs similarly to hostPath.
  • latency was relatively unaffected by resources (which is what you’d expect if it’s possible to saturate the resource w/ only 1 core – more cores shouldn’t yield more requests)

Blind spots/Room for improvement

These tests weren’t particularly rigorous, and I think it’s worth noting some areas that could be improved. If you notice any glaring errors in methodology please feel free to email me and I can make them more explicit in this section (or if severe enough, warn people of the flaws up top).

The biggest drawback of these experiments was not automating more of the results processing. While it wouldn’t have taken super long to write some scripts to parse the output and generate some structured output (likely JSON), I didn’t take the time to do that, and opted for bash magic instead, relying heavily on datamash and unix shell-fu.
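
As a rough idea of what such a script could look like for the dd results, here is a sketch that turns dd’s summary line into one JSON object per run. The function name dd_to_json and the JSON field names are hypothetical, and it assumes the GNU dd summary format implied by the awk field numbers used earlier (throughput reported in MB/s as the last two fields).

```shell
# Hypothetical helper: emit one JSON object per dd summary line, e.g. a line
# shaped like "<bytes> bytes (...) copied, <seconds> s, <rate> MB/s".
# Lines that do not end in MB/s (such as the "records in/out" lines) are skipped.
dd_to_json() {
  awk '$NF == "MB/s" {
    printf "{\"bytes\": %s, \"seconds\": %s, \"mb_per_s\": %s}\n", $1, $(NF - 3), $(NF - 1)
  }' "$@"
}
```

Running it over a results directory (for example `dd_to_json /var/storage-testing/1c-2gb/dd/hostpath/*throughput*`) would produce JSON lines that are easier to feed into other tooling than the raw output.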

One thing I didn’t properly explore was the failure modes of OpenEBS. There are differences between the Jiva and cStor storage engines, as well as in how live systems shift when nodes go down. I don’t remember where I read it (some random blog post?), but I’ve heard some bad things about some failure modes of OpenEBS; for now all I can find to back this up is:

Another thing that is conspicuously missing is a proper test of how the system behaves with distant servers – it would be much more interesting to test on a bigger cluster and simulate a failure of the replica closest to the actual compute that’s doing the file IO. How does the system perform if the OpenEBS data-plane components it’s talking to are across a continent or an ocean? This is something that likely wouldn’t really happen in practice unless something was seriously wrong, but it would be nice to know.

Originally I meant to include some pgbench runs as well, but considering how long it took to write this post, I decided to skip them for now – I think the tests do enough to describe file I/O throughput and latency, so we’ve got a good idea how a more realistic workload like Postgres would run. In the future, maybe I’ll do another separate measurement with pgbench (likely when version 0.8.1 of OpenEBS actually lands).

Wrapup

It was pretty interesting to dip into the world of hard-drive testing and to get a chance to compare OpenEBS to the easiest (but most unsafe/fragile) way of provisioning volumes on k8s. Glad to see how OpenEBS was able to perform – it only lagged behind in large file writes, and actually did slightly better than hostPath for lots of small writes, which is great. I only tested the Jiva engine, and I expect much greater things from cStor eventually.

Looks like OpenEBS is still my go-to for easy-to-deploy dynamic local-disk provisioning and management (and probably will be for some time).
