K8s storage provider benchmarks round 2, part 4


tl;dr - I explain the YAML and Makefile scripts that power the fio and pgbench (oltpbench) tests I’m going to run.

UPDATE (04/10/2021)

Turns out I was mistaken -- OpenEBS Mayastor doesn't support single-node disk-level failure domains. It's very well described on their website in the FAQ, but I somehow missed and/or forgot that, so the tests for Mayastor will only represent a JBOD setup (no replication).
On a different but related note, cStor supports cross-disk replication (mirroring on the Pool object), but it does not replicate at the pool level, so all written data will be RAID1. Essentially, there will be no JBOD results for cStor, just as there are no RAID1 results for Mayastor.
Similarly, for OpenEBS Jiva there are no RAID cases, since I have one storage pool per disk. This means that there will only be JBOD results for Jiva. It's looking like I'll have to do the cluster testing sooner rather than later.

NOTE: This is a multi-part blog post!

  1. Part 1 - Intro & Cloud server wrangling
  2. Part 2 - Installing storage plugins
  3. Part 3 - Installing more storage plugins
  4. Part 4 - Configuring the tests (you are here)
  5. Part 5 - The results

Context

In part 3 we worked through installing the various storage plugins and now we’ve got a fully repeatable process for provisioning the machine, installing prerequisites, Kubernetes, and a given storage plugin. A Makefile at the top of the repo orchestrates everything and runs ansible and kubectl (and some other tools) as appropriate to get it all done. Now let’s design some tests.

Cluster operators will probably gain the most insight (if any) here, though sysadmins (and maybe some ops-curious devs) might like to find more benchmarking tools that are usable in any context. Note that you’ll need to know about Kubernetes primitives like Jobs, PersistentVolumes, and PersistentVolumeClaims; they won’t be explained here, I’ll just be using them and pasting some YAML.

The axes needed for testing

From all the work that’s been done up until now, we have a few different axes to test over (some of which were discovered as we configured), all conveniently exposed as ENV variables:

  • STORAGE_PLUGIN (the storage plugin – openebs-mayastor, rook-ceph-lvm, etc)
  • TEST (which test to run – fio, etc)
  • REPLICATION_STRATEGY (how the storage is configured – raid1, jbod)

While working on this I discovered a new one – fio can be convinced to do writes using O_DIRECT, AKA “Direct IO” (ScyllaDB has a good summary on this). One of the big differences between Postgres and MySQL is that MySQL works with Direct IO while Postgres relies on/expects the Linux page cache. Though things work, this leads to quirks in Postgres, for example the fsync “surprise” from back in 2018.

How should we template?

I have a lot of options for templating along these axes with Make, among them ksonnet, Helm, yq, envsubst, and kustomize.

Well first of all, ksonnet is evidently deprecated (back in 2019?) – the authors no longer work on it. It sucks for all the people who believed it was the one true solution or liked it a lot, but also great that the authors knew when to step away and encourage people to move to other tools that had “won” in some way or another.

Anyway, I won’t even pretend that I’m not biased against Helm & yq and all those other options – I’m a firm believer in keeping it as simple as possible and kustomize fits that bill. It’s annoying that it doesn’t support environment-based or variable templating, but the more I get used to it, the more I like it. I use kustomize in my production environments along with git-crypt and it works wonderfully – I keep my secrets right in the repo (encrypted) and symlink to the folder or just include them from outside (since kustomize supports that now), and I have folders for staging, production, and other environments.

Here’s what a kustomize-driven folder structure looks like for me for fio:

$ tree tests
tests
└── fio
    ├── base
    │   ├── fio.job.yaml
    │   └── fio.pvc.yaml
    ├── kustomization.yaml
    ├── Makefile
    └── overlays
        ├── .... other STORAGE_PLUGINs ...linstor
        └── rook-ceph-lvm
            ├── jbod
            │   ├── fio.pvc.yaml
            │   └── kustomization.yaml
            └── raid1
                ├── fio.pvc.yaml
                └── kustomization.yaml

It’s a bit long-winded/repetitive and the paradigm is stretching a bit, but it is straightforward. The real tree is actually one level deeper once you take the FIO_DIRECT_WRITE axis into account. That said, people should be able to quickly understand that overlays/rook-ceph-lvm/jbod/direct-write-off is the code that powers the Rook tests on a JBOD (no RAID) PVC, with fio direct writes turned off.
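
To make the layout concrete, here is a rough Python sketch (not something from the repo) of how the axes map onto overlay directories. The function names and the lists of axis values are assumptions made for illustration; only the overlays/<plugin>/<storage class>/direct-write-<on|off> pattern comes from the structure described above.

```python
from itertools import product
from typing import List

# Axis values are illustrative; the real repo may cover more plugins.
STORAGE_PLUGINS = ["rook-ceph-lvm", "linstor-drbd9"]
STORAGE_CLASSES = ["jbod", "raid1"]
DIRECT_WRITE = ["off", "on"]


def overlay_path(plugin: str, storage_class: str, direct_write: str) -> str:
    """Return the overlay directory for one combination of the axes."""
    if direct_write not in DIRECT_WRITE:
        raise ValueError(f"direct_write must be 'on' or 'off', got {direct_write!r}")
    return f"overlays/{plugin}/{storage_class}/direct-write-{direct_write}"


def all_overlays() -> List[str]:
    """List every overlay directory the (assumed) test matrix needs."""
    combos = product(STORAGE_PLUGINS, STORAGE_CLASSES, DIRECT_WRITE)
    return [overlay_path(p, c, d) for p, c, d in combos]


if __name__ == "__main__":
    for path in all_overlays():
        print(path)
```

Every extra axis value multiplies the number of overlay folders, which is where the repetitiveness of this approach comes from.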

SIDETRACK: kustomize’s official documentation could be better

I don’t know why but I was a bit irked by kustomize’s documentation:

  • Weird navigation where you click on something and you’re almost always met with a listing of things you might want to know about
  • They never show you what files are supposed to look like, just talk about the terms
  • Why is there this weird split between the kubectl kustomize bit and the “core” docs? I’ve come to the kustomize site, it’s not important how I’ve run it, it’s the same thing. kubectl is a sub-page of your site, not the other way around. IMO this is a direct result of projects being too closely coupled – the thinking is too kubectl-centric/defers to Kubernetes too much. Think of a tool like Helm – would they ever set their website up like this?

All that said, I’m going to use kustomize instead of envsubst, because it’s the choice of the community – fewer people will have problems understanding how the repo works.

Retrieving results

Since we’re running a Job (and waiting for it to finish), I think I can get away with just pulling out the contents of the logs like so:

collect-results:
	@echo "Writing job log output to [$(JOB_LOG_FILE_PATH)]..."
	$(KUBECTL) logs job/fio > $(JOB_LOG_FILE_PATH)

Normally I’d do some more scripting to extract the values I really want into some JSON or something, but I’ll leave that for another time; this blog post series is too long already.

Test setup

leeliu/dbench (fio-based)

This test setup is wonderfully simple – all I need is a Job and a PersistentVolumeClaim.

YAML

Here are the base resources:

base/fio.pvc.yaml:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio
spec:
  storageClassName: unknown
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi

base/fio.job.yaml:

# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
  name: fio
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dbench
        image: longhornio/dbench:latest
        imagePullPolicy: Always
        # privileged mode is needed to invalidate the fs cache
        securityContext:
          privileged: true
        env:
          - name: FIO_SIZE
            value: 8G
          - name: DBENCH_MOUNTPOINT
            value: /data
          - name: FIO_DIRECT
            value: "0"
          # - name: DBENCH_QUICK
          #   value: "yes"
          # - name: FIO_OFFSET_INCREMENT
          #   value: 256M
        volumeMounts:
          - name: disk
            mountPath: /data
      volumes:
        - name: disk
          persistentVolumeClaim:
            claimName: fio

Here’s what base/kustomization.yaml looks like:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - fio.pvc.yaml
  - fio.job.yaml

commonLabels:
  app: fio

And I won’t show all of them but here’s an example of the files in an overlay:

overlays/linstor-drbd9/jbod/direct-write-off/fio.job.yaml

# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
  name: fio
spec:
  template:
    spec:
      containers:
      - name: dbench
        env:
          - name: FIO_DIRECT
            value: "0"

overlays/linstor-drbd9/jbod/direct-write-off/fio.pvc.yaml

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio
spec:
  storageClassName: jbod

overlays/linstor-drbd9/jbod/direct-write-off/kustomization.yaml

bases:
  - ../../../../base

patchesStrategicMerge:
  - fio.pvc.yaml
  - fio.job.yaml

Makefile

And the Makefile to make it all go:

.PHONY: all run \
        pvc-yaml pvc \
        job job-wait \
        collect-results

KUBIE ?= kubie
KUBECTL_BIN ?= kubectl
KUBECTL ?= $(KUBECTL_BIN) --kubeconfig=$(KUBECONFIG_PATH)
KUSTOMIZE ?= $(KUBECTL) kustomize

EXPECTED_ADMIN_CONFIG_PATH ?= $(shell realpath ../../../ansible/output/**/var/lib/k0s/pki/admin.conf)
KUBECONFIG_PATH ?= $(EXPECTED_ADMIN_CONFIG_PATH)

RESULTS_DIR_PATH ?= $(shell realpath ../../../results)
JOB_LOG_FILE_NAME ?= $(STORAGE_PLUGIN)-$(TEST)-$(STORAGE_CLASS)-direct-write-$(FIO_DIRECT_WRITE).log
JOB_LOG_FILE_PATH ?= $(RESULTS_DIR_PATH)/$(JOB_LOG_FILE_NAME)

OVERLAY_PATH ?= overlays/$(STORAGE_PLUGIN)/$(STORAGE_CLASS)/direct-write-$(FIO_DIRECT_WRITE)

STORAGE_PLUGIN ?= rook-ceph-lvm
STORAGE_CLASS ?= jbod
TEST ?= fio

FIO_DIRECT_WRITE ?= off

ifeq ("on","$(FIO_DIRECT_WRITE)")
    FIO_DIRECT = 1
else
    FIO_DIRECT = 0
endif

all: job job-wait collect-results

# NOTE: recipe lines must be indented with a tab, not spaces
job:
	$(KUBECTL) apply -k $(OVERLAY_PATH)

job-uninstall:
	$(KUBECTL) delete -k $(OVERLAY_PATH)

job-wait:
	@echo "Waiting for job to finish (timeout 30m)..."
	$(KUBECTL) wait job/fio --for=condition=complete --timeout=30m

# JOB_LOG_FILE_PATH already includes RESULTS_DIR_PATH
collect-results:
	@echo "Writing job/fio log output to [$(JOB_LOG_FILE_PATH)]..."
	$(KUBECTL) logs job/fio > $(JOB_LOG_FILE_PATH)
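
For readers less used to Make’s variable expansion, the following Python sketch mirrors how the variables in the Makefile resolve with their defaults. It is only an illustration: the function names are hypothetical, and the defaults are copied from the `?=` assignments.

```python
def fio_direct(direct_write: str) -> str:
    """Mirror the ifeq block: "on" maps to "1", anything else to "0"."""
    return "1" if direct_write == "on" else "0"


def job_log_file_path(
    results_dir: str,
    storage_plugin: str = "rook-ceph-lvm",
    storage_class: str = "jbod",
    test: str = "fio",
    direct_write: str = "off",
) -> str:
    """Build the log path the way JOB_LOG_FILE_NAME and JOB_LOG_FILE_PATH do."""
    name = f"{storage_plugin}-{test}-{storage_class}-direct-write-{direct_write}.log"
    return f"{results_dir}/{name}"


if __name__ == "__main__":
    # "/tmp/results" is a placeholder; the Makefile resolves the real directory with realpath.
    print(job_log_file_path("/tmp/results"))
    print(fio_direct("on"))
```

Because the file name encodes every axis, results from different runs land in separate files instead of overwriting each other.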

Log result example

Here’s an example of the kind of output we get from fio:

...... lots of output ......

All tests complete.

==================
= Dbench Summary =
==================
Random Read/Write IOPS: 17.5k/474k. BW: 6020MiB/s / 3862MiB/s
Average Latency (usec) Read/Write: 233.07/
Sequential Read/Write: 9826MiB/s / 3513MiB/s
Mixed Random Read/Write IOPS: 16.8k/5567
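
If you do want the numbers as JSON, a minimal Python sketch along these lines could pull them out of the summary. It is not part of the repo, the function and key names are made up for illustration, and it only handles the value formats visible in the sample above (plain numbers, a `k` suffix, and `MiB/s`); other units would need extra handling. A missing value, like the write latency above, is kept as `None` rather than guessed.

```python
import json
from typing import Dict, Optional, Tuple

SAMPLE = """\
Random Read/Write IOPS: 17.5k/474k. BW: 6020MiB/s / 3862MiB/s
Average Latency (usec) Read/Write: 233.07/
Sequential Read/Write: 9826MiB/s / 3513MiB/s
Mixed Random Read/Write IOPS: 16.8k/5567
"""

# Line prefix -> key in the parsed result. The "Mixed Random" line does not
# start with "Random", so startswith() keeps the two apart.
PREFIXES = {
    "Random Read/Write IOPS:": "random_iops",
    "Average Latency (usec) Read/Write:": "latency_usec",
    "Sequential Read/Write:": "sequential_bw_mib_s",
    "Mixed Random Read/Write IOPS:": "mixed_iops",
}


def to_number(text: str) -> Optional[float]:
    """Turn '17.5k', '5567' or '6020MiB/s' into a float; '' becomes None."""
    text = text.strip()
    if text.endswith("MiB/s"):
        text = text[: -len("MiB/s")]
    if not text:
        return None
    if text.endswith("k"):
        return round(float(text[:-1]) * 1000, 3)
    return float(text)


def split_pair(text: str) -> Tuple[str, str]:
    """Split 'read/write'; bandwidth pairs use ' / ' because 'MiB/s' contains '/'."""
    separator = " / " if " / " in text else "/"
    read, _, write = text.partition(separator)
    return read, write


def parse_dbench_summary(log: str) -> Dict[str, Dict[str, Optional[float]]]:
    result: Dict[str, Dict[str, Optional[float]]] = {}
    for line in log.splitlines():
        line = line.strip()
        for prefix, key in PREFIXES.items():
            if not line.startswith(prefix):
                continue
            # The random IOPS line carries bandwidth after ". BW:"; split it off.
            rest, _, bandwidth = line[len(prefix):].partition(". BW:")
            read, write = split_pair(rest)
            result[key] = {"read": to_number(read), "write": to_number(write)}
            if bandwidth:
                bw_read, bw_write = split_pair(bandwidth)
                result["random_bw_mib_s"] = {
                    "read": to_number(bw_read),
                    "write": to_number(bw_write),
                }
    return result


if __name__ == "__main__":
    print(json.dumps(parse_dbench_summary(SAMPLE), indent=2))
```

Feeding it a saved log file instead of `SAMPLE` works the same way, since lines that do not match a known prefix are ignored.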

pgbench

pgbench is a little more complicated than simple fio since it will actually do database queries against a postgres database. To orchestrate this test we’ll run a Job which will run the pgbench binary against an existing Pod which is already running postgres. So roughly the flow will be:

  • Create an appropriate PVC
  • Create a postgres Pod using the PVC
  • Run the (appropriately configured) pgbench Job
  • Save/Report the results
  • Remove the pod and the PVC

YAML

Here’s the base YAML to make this work:

kubernetes/tests/pgbench/base/kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - postgres.pvc.yaml
  - postgres.svc.yaml
  - postgres.pod.yaml
  - pgbench.job.yaml
  - pgbench.serviceaccount.yaml
  - pgbench.rbac.yaml

commonLabels:
  app: pgbench

Well, that’s more files than you would think! I needed the ServiceAccount to get some permissions that let the Job wait for the postgres pod to be up. Rather than trying to explain it, I’ll let the code show it.

kubernetes/tests/pgbench/base/pgbench.job.yaml:

# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pgbench
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 600
  template:
    spec:
      serviceAccountName: pgbench
      restartPolicy: Never
      initContainers:
        - name: wait-for-pg
          image: bitnami/kubectl
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -exc
            - |
              echo -e "[info] waiting for ${POD_NAME} pod in namespace [${POD_NAMESPACE}] to exist ..."
              n=0
              until kubectl get pod ${POD_NAME} -n ${POD_NAMESPACE} ; do
                  echo -n "[info] waiting 10 seconds before trying again..."
                  sleep 10
              done

              echo -e "[info] taking brief wait because sometimes kubectl wait fails for no reason during pod state transition"
              sleep 10

              echo -e "[info] waiting for ${POD_NAME} pod in namespace [${POD_NAMESPACE}] to be ready..."
              kubectl wait pod ${POD_NAME} -n ${POD_NAMESPACE} --for=condition=ready || true              
          env:
            - name: POD_NAME
              value: postgres
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

        # Initialize pg bench database
        - name: pgbench-initialize
          image: postgres:13.2-alpine
          imagePullPolicy: IfNotPresent
          command:
            - pgbench
            - --initialize
            - --foreign-keys
            - --scale=100
          env:
            - name: PGHOST
              value: postgres
            - name: PGUSER
              value: postgres

      containers:
      - name: pgbench
        image: postgres:13.2-alpine
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 4
            memory: 8Gi
          limits:
            cpu: 4
            memory: 8Gi
        command:
          - pgbench
          - --jobs=4 # TODO: if resources.limits.cpu is a whole number (not millicores), pass it through ENV and use it directly
        env:
          - name: PGHOST
            value: postgres
          - name: PGUSER
            value: postgres

There’s some complexity for you! One of the funnest parts was finding out that kubectl wait just fails sometimes for apparently no reason.

Makefile

The Makefile for this was pretty straightforward, so I’ll paste it here:

kubernetes/tests/pgbench/Makefile:

all: clean resources job-wait collect-results resources-uninstall

# NOTE: this include has to go *AFTER* EXPECTED_ADMIN_CONFIG_PATH has been set
include ./../../../kubernetes/kubie.mk

clean:
	@echo -e "\n[info] *** clearing out resources that may or may not exist ***"
	$(KUBECTL) delete -k $(OVERLAY_PATH) || true

#############
# Resources #
#############

resources:
	$(KUBECTL) apply -k $(OVERLAY_PATH)

resources-uninstall:
	$(KUBECTL) delete -k $(OVERLAY_PATH)

job-wait:
	@echo "Waiting for job to finish (timeout 30m)..."
	$(KUBECTL) wait job/pgbench --for=condition=complete --timeout=30m

collect-results:
	@echo "Writing job/pgbench log output to [$(JOB_LOG_FILE_PATH)]..."
	$(KUBECTL) logs job/pgbench > $(JOB_LOG_FILE_PATH)

Pretty easy, and it all runs reliably (after some tweaking and testing from me of course).

The results

Here’s the totality of what pgbench produces:

starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 2.234 ms
tps = 447.663503 (including connections establishing)
tps = 499.280487 (excluding connections establishing)

Nice and easy to read!
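
Since the output is this regular, picking the headline numbers out of a saved log takes only a few lines. The Python sketch below is an illustration rather than part of the repo: the function and key names are made up, and the patterns match the postgres:13.2 output shown above, so other pgbench versions may word these lines differently.

```python
import re
from typing import Dict

SAMPLE = """\
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 2.234 ms
tps = 447.663503 (including connections establishing)
tps = 499.280487 (excluding connections establishing)
"""

# Result key -> pattern with one capture group holding the number.
PATTERNS = {
    "scaling_factor": r"^scaling factor: (\d+)",
    "clients": r"^number of clients: (\d+)",
    "threads": r"^number of threads: (\d+)",
    "latency_avg_ms": r"^latency average = ([\d.]+) ms",
    "tps_including_connections": r"^tps = ([\d.]+) \(including connections establishing\)",
    "tps_excluding_connections": r"^tps = ([\d.]+) \(excluding connections establishing\)",
}


def parse_pgbench(log: str) -> Dict[str, float]:
    """Extract the numeric fields found in the log; missing fields are skipped."""
    result: Dict[str, float] = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, log, flags=re.MULTILINE)
        if match:
            result[key] = float(match.group(1))
    return result


if __name__ == "__main__":
    print(parse_pgbench(SAMPLE))
```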

oltpbench (quickly abandoned)

oltpbench is a project with an impressive amount of benchmarking tools and a common way to configure and run them. Unfortunately though, after building the image I ran into an error, and I have absolutely zero desire to debug Java to get to the bottom of it. If the error were reasonable, sure, but it seems to be coming from a file I didn’t even submit, a file that represents (from what I can tell) static information. It must work great for someone out there but that person isn’t me.

Here’s the StringIndexOutOfBoundsException (ugh) stack trace if you’re into that sort of thing:

Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.commons.configuration.ConfigurationUtils.toURL(ConfigurationUtils.java:739)
        at org.apache.commons.configuration.ConfigurationUtils.locate(ConfigurationUtils.java:518)
        at org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:213)
        at org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:197)
        at org.apache.commons.configuration.AbstractHierarchicalFileConfiguration.load(AbstractHierarchicalFileConfiguration.java:164)
        at org.apache.commons.configuration.AbstractHierarchicalFileConfiguration.<init>(AbstractHierarchicalFileConfiguration.java:91)
        at org.apache.commons.configuration.XMLConfiguration.<init>(XMLConfiguration.java:243)
        at com.oltpbenchmark.DBWorkload.main(DBWorkload.java:87)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
        at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3751)
        at java.base/java.lang.String.substring(String.java:1907)
        at org.apache.commons.lang.SystemUtils.getJavaVersionAsFloat(SystemUtils.java:1133)
        at org.apache.commons.lang.SystemUtils.<clinit>(SystemUtils.java:818)
        ... 8 more

If it wasn’t bad enough that I had to configure things with XML (instead of fully ENV variables), the thought that I would have to debug their Java code was it for me. I did give it a tiny look and of course, the error is coming from a file I didn’t even add:

        // create the command line parser
        CommandLineParser parser = new PosixParser();
        XMLConfiguration pluginConfig=null;
        try {
            pluginConfig = new XMLConfiguration("config/plugin.xml"); // <---------- error is here
        } catch (ConfigurationException e1) {
            LOG.info("Plugin configuration file config/plugin.xml is missing");
            e1.printStackTrace();
        }

That was enough for me to nope out, so… I’m dropping oltpbench from this round.

BONUS: Why not kastenhq/kubestr?

I actually didn’t know about kastenhq/kubestr until relatively recently – I found out about it thanks to The Kubernetes Podcast. It only runs fio, which isn’t that hard, so I think I’ll skip it for now; I’m happy with my fio and pgbench pods.

Wrapup

OK this was a pretty quick post – since I want to make a post with only the results available (that’s the highest value post in the bunch!), I’ll leave it here!

UPDATE: Turns out Mayastor doesn’t support same-node disk-level replication

Well this was a bit of a surprise – I realized when the raid1 tests for Mayastor were failing that actually Mayastor doesn’t support (or plan to support) same-node disk failure domains. Looks like the Mayastor tests will have to be JBOD only – it sure would have been nice to include this, so that even single-node cluster operators can get some assurance, but I guess the move here is to just make the MayastorPool object with the 2 disks in it to begin with…? Nope, Mayastor doesn’t do RAID (which is probably a good idea on their part, want to keep it light). So the move here if you want redundancy on the lower disk is to add your software RAID via LVM or ZFS and use that.

All in all not a terrible trade-off – Mayastor doesn’t support compression or dedup, and it doesn’t support snapshots or clones in the GA. Have to say I’m not as disappointed as I was with other tech being tested for not being able to get same-node disk failure domains – they’ve made sure to clearly spell it out, without 5 pages of “buy our consulting” to wade through first. A nice pointed FAQ goes a long way. At this point I have to wonder if I’m just biased for OpenEBS – they have had the easiest-to-get-started robust solution for a long time, they use Rust and realize its value (also says something about the team’s technical chops), and I’ve even had some email exchanges with their CEO, I just like their vibe.

Anyway, lucky for you dear reader, the benchmark is coded up so you can reproduce it yourself; you do not have to trust me not to be biased.

cStor doesn’t do same-node disk failure domains

So cStor doesn’t do it either, which is reasonable I guess? But what it does do is allow you to create what I assume maps to a ZFS mirrored pool, and use that. So in this case what I’ve defined as “JBOD” (a single disk with no replication) is really single-node RAID1, but underneath. Since this is the case (and how I have it coded right now) what I’ll do is rename the jbod storageclass to raid1 and when I test with bigger clusters in the future I will make a raid1-ha that represents multi-node with RAID1 on the disks underneath (in its various forms – whether done by LVM/ZFS or the orchestrator).

Jiva doesn’t really do RAID1

Jiva is excited to take a space (folder) on a disk, so in a somewhat different way it actually doesn’t do RAID1, but only because what I’ve given it is a StoragePool for each disk. This is what I’m running right now (this very blog post, at time of writing) because it was the easiest way to get up and running without disturbing the enforced software RAID. What I’ve done here too is to simply delete the raid1 StorageClass since what I’m really testing with the current code is “JBOD”.
