tl;dr - I explain the YAML and Makefile scripts that power the fio, pgbench, and oltpbench tests I’m going to run.
Turns out I was mistaken -- OpenEBS Mayastor doesn't support single-node disk-level failure domains. It's very well described on their website in the FAQ, but I somehow missed and/or forgot that, so the tests for Mayastor will only represent a JBOD setup (no replication).
On a different but related note, cStor supports cross-disk replication (mirroring on the Pool object), but it does not replicate at the pool level, so all written data will be RAID1. So essentially, there will be no JBOD results for cStor, just as there are no RAID1 tests for Mayastor.
Similarly, for OpenEBS Jiva there are no RAID cases, since I have one storage pool per disk. This means that there will only be JBOD for Jiva. It's looking like I'll have to do the cluster testing sooner rather than later.
NOTE: This is a multi-part blog post!
In part 3 we worked through installing the various storage plugins, and now we’ve got a fully repeatable process for provisioning the machine, installing pre-requisites, Kubernetes, and a given storage plugin. A Makefile at the top of the repo orchestrates everything and runs ansible and kubectl (and some other tools) as appropriate to get it all done. Now let’s design some tests.
Cluster operators will probably gain the most insight (if any) here, though sysadmins (and maybe some ops-curious devs) might like to find more benchmarking tools that are usable in any context. Note that you’ll have to know about Kubernetes primitives like Jobs, PersistentVolumes, and PersistentVolumeClaims as they won’t be explained here; I’ll just be using them and pasting some YAML.
From all the work that’s been done up until now, we have a few different axes to test over (some of which were discovered as we configured), all conveniently exposed as ENV variables:

- STORAGE_PLUGIN (the storage plugin – openebs-mayastor, rook-ceph-lvm, etc)
- TEST (which test to run – fio, etc)
- REPLICATION_STRATEGY (how the storage is configured – raid1, jbod)

While working on this I discovered a new one – fio can be convinced to do writes using O_DIRECT, AKA “Direct IO” (ScyllaDB has a good summary on this). One of the big differences between postgres and mysql is that MySQL works with Direct IO while postgres relies on/expects the Linux page cache. Though things work, this leads to quirks in Postgres, for example the fsync “surprise” from back in 2018.
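For reference, the fio knob in question is the direct option; a minimal job file with direct IO turned on looks roughly like this (a sketch for illustration – not one of the actual jobs dbench runs, and the sizes here are made up):

```ini
; hypothetical fio job file -- values are illustrative, not dbench's
[randwrite-direct]
rw=randwrite   ; random writes
bs=4k          ; 4KiB block size
size=256M      ; total IO to issue
direct=1       ; open with O_DIRECT, bypassing the Linux page cache
```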
I have a lot of options on how to do the templating along these axes with Make:

- envsubst + kubectl (MakeInfra pattern v1)
- kustomize (readers might recognize this as MakeInfra pattern v2)
- yq, ksonnet, jsonnet, etc

Well first of all, ksonnet is evidently deprecated (back in 2019?) – the authors no longer work on it. It sucks for all the people who believed it was the one true solution or liked it a lot, but it’s also great that the authors knew when to step away and encourage people to move to other tools that had “won” in some way or another.
Anyway, I won’t even pretend that I’m not biased against Helm & yq and all those other options – I’m a firm believer in keeping it as simple as possible and kustomize fits that bill. It’s annoying that it doesn’t support environment-based or variable templating, but the more I get used to it, the more I like it. I use kustomize in my production environments along with git-crypt and it works wonderfully – I keep my secrets right in the repo (encrypted) and symlink to the folder or just include them from outside (since kustomize supports that now), and I have folders for staging, production, and other environments.
Here’s what a kustomize-driven folder structure looks like for me for fio:
$ tree tests
tests
└── fio
├── base
│ ├── fio.job.yaml
│ └── fio.pvc.yaml
├── kustomization.yaml
├── Makefile
└── overlays
├── .... other STORAGE_PLUGINs ....
├── linstor
└── rook-ceph-lvm
├── jbod
│ ├── fio.pvc.yaml
│ └── kustomization.yaml
└── raid1
├── fio.pvc.yaml
└── kustomization.yaml
It’s a bit long-winded/repetitive and the paradigm is stretching a bit, but it is straightforward. It’s actually a bit longer if you take into account the FIO_DIRECT_IO axis. That said, people should be able to quickly understand that overlays/rook-ceph-lvm/jbod/direct-write-off is the code that powers the Rook tests on a JBOD (no RAID) PVC, with fio direct writes turned off.
kustomize’s official documentation could be better

I don’t know why but I was a bit irked by kustomize’s documentation: why the split between the kubectl kustomize bit and the “core” docs? I’ve come to the kustomize site; it’s not important how I’ve run it, it’s the same thing. kubectl is a sub-page of your site, not the other way around. IMO this is a direct result of the projects being too closely coupled – the thinking is too kubectl-centric/defers to Kubernetes too much. Think of a tool like Helm – would they ever set their website up like this?

All that said, I’m going to use kustomize instead of envsubst, because it’s the choice of the community – fewer people will have problems understanding how the repo works.
Since we’re running a Job (and waiting for it to finish), I think I can get away with just pulling out the contents of the logs like so:
collect-results:
@echo "Writing job log output to [$(JOB_LOG_FILE_PATH)]..."
$(KUBECTL) logs job/fio > $(JOB_LOG_FILE_PATH)
Normally I’d do some more scripting to extract the values I really want into some JSON or something, but I’ll leave that for another time, this blog post series is too long already.
leeliu/dbench (fio-based)

This test setup is wonderfully simple – all I need is a Job and a PersistentVolumeClaim.
Here are the base resources:
base/fio.pvc.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: fio
spec:
storageClassName: unknown
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 8Gi
base/fio.job.yaml:
# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
name: fio
spec:
backoffLimit: 4
template:
spec:
restartPolicy: Never
containers:
- name: dbench
image: longhornio/dbench:latest
imagePullPolicy: Always
# privilege needed to invalidate the fs cache
securityContext:
privileged: true
env:
- name: FIO_SIZE
value: 8G
- name: DBENCH_MOUNTPOINT
value: /data
- name: FIO_DIRECT
value: "0"
# - name: DBENCH_QUICK
# value: "yes"
# - name: FIO_OFFSET_INCREMENT
# value: 256M
volumeMounts:
- name: disk
mountPath: /data
volumes:
- name: disk
persistentVolumeClaim:
claimName: fio
Here’s what base/kustomization.yaml looks like:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- fio.pvc.yaml
- fio.job.yaml
commonLabels:
app: fio
And I won’t show all of them but here’s an example of the files in an overlay:
overlays/linstor-drbd9/jbod/direct-write-off/fio.job.yaml:
# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
name: fio
spec:
template:
spec:
containers:
- name: dbench
env:
- name: FIO_DIRECT
value: "0"
overlays/linstor-drbd9/jbod/direct-write-off/fio.pvc.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: fio
spec:
storageClassName: jbod
overlays/linstor-drbd9/jbod/direct-write-off/kustomization.yaml:
bases:
- ../../../../base
patchesStrategicMerge:
- fio.pvc.yaml
- fio.job.yaml
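To make the strategic-merge mechanics concrete, running kustomize build against this overlay should render the PVC as roughly the base document with the patch’s storageClassName merged in (a sketch of the expected output, not captured from a real run):

```yaml
# approximate rendered PVC for overlays/linstor-drbd9/jbod/direct-write-off
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio
  labels:
    app: fio    # applied by the base's commonLabels
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: jbod    # merged in from the overlay patch
```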
And the Makefile to make it all go:
.PHONY: all run \
pvc-yaml pvc \
job job-wait \
collect-results
KUBIE ?= kubie
KUBECTL_BIN ?= kubectl
KUBECTL ?= $(KUBECTL_BIN) --kubeconfig=$(KUBECONFIG_PATH)
KUSTOMIZE ?= $(KUBECTL) kustomize
EXPECTED_ADMIN_CONFIG_PATH ?= $(shell realpath ../../../ansible/output/**/var/lib/k0s/pki/admin.conf)
KUBECONFIG_PATH ?= $(EXPECTED_ADMIN_CONFIG_PATH)
RESULTS_DIR_PATH ?= $(shell realpath ../../../results)
JOB_LOG_FILE_NAME ?= $(STORAGE_PLUGIN)-$(TEST)-$(STORAGE_CLASS)-direct-write-$(FIO_DIRECT_WRITE).log
JOB_LOG_FILE_PATH ?= $(RESULTS_DIR_PATH)/$(JOB_LOG_FILE_NAME)
OVERLAY_PATH ?= overlays/$(STORAGE_PLUGIN)/$(STORAGE_CLASS)/direct-write-$(FIO_DIRECT_WRITE)
STORAGE_PLUGIN ?= rook-ceph-lvm
STORAGE_CLASS ?= jbod
TEST ?= fio
FIO_DIRECT_WRITE ?= off
ifeq ("on","$(FIO_DIRECT_WRITE)")
FIO_DIRECT = 1
else
FIO_DIRECT = 0
endif
all: job job-wait collect-results
job:
$(KUBECTL) apply -k $(OVERLAY_PATH)
job-uninstall:
$(KUBECTL) delete -k $(OVERLAY_PATH)
job-wait:
@echo "Waiting for job to finish (timeout 30m)..."
$(KUBECTL) wait job/fio --for=condition=complete --timeout=30m
collect-results:
@echo "Writing job/fio log output to [$(JOB_LOG_FILE_PATH)]..."
$(KUBECTL) logs job/fio > $(JOB_LOG_FILE_PATH)
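The ifeq near the top is just mapping the human-friendly FIO_DIRECT_WRITE=on/off axis onto the numeric FIO_DIRECT value the dbench image expects; a sketch of the equivalent logic in plain shell:

```shell
#!/bin/sh
# shell equivalent of the Makefile's ifeq: "on" -> 1, anything else -> 0
FIO_DIRECT_WRITE="${FIO_DIRECT_WRITE:-off}"
if [ "$FIO_DIRECT_WRITE" = "on" ]; then
  FIO_DIRECT=1
else
  FIO_DIRECT=0
fi
echo "FIO_DIRECT=$FIO_DIRECT"
```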
Here are some examples of each of the kinds of output we get. From fio:
...... lots of output ......
All tests complete.
==================
= Dbench Summary =
==================
Random Read/Write IOPS: 17.5k/474k. BW: 6020MiB/s / 3862MiB/s
Average Latency (usec) Read/Write: 233.07/
Sequential Read/Write: 9826MiB/s / 3513MiB/s
Mixed Random Read/Write IOPS: 16.8k/5567
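As mentioned earlier, I’m not extracting these values into JSON yet, but here’s a sketch of the kind of scripting I have in mind (awk against the summary format above; the field positions are assumptions based on this sample, not a guarantee about dbench’s output):

```shell
#!/bin/sh
# inline the sample summary lines so the sketch is self-contained
cat > /tmp/dbench-summary.log <<'EOF'
Random Read/Write IOPS: 17.5k/474k. BW: 6020MiB/s / 3862MiB/s
Sequential Read/Write: 9826MiB/s / 3513MiB/s
EOF

# pull out the random read/write IOPS pair (stripping the trailing period)
rand_iops="$(awk '/Random Read\/Write IOPS/ { sub(/\.$/, "", $4); print $4 }' /tmp/dbench-summary.log)"
echo "random read/write iops: $rand_iops"
```

For the sample above this prints random read/write iops: 17.5k/474k.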
pgbench

pgbench is a little more complicated than simple fio since it will actually do database queries against a postgres database. To orchestrate this test we’ll run a Job which will run the pgbench binary against an existing Pod which is already running postgres. So roughly the flow will be:

- start a postgres Pod using the PVC
- run the pgbench Job
Here’s the base YAML to make this work:
kubernetes/tests/pgbench/base/kustomization.yaml:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- postgres.pvc.yaml
- postgres.svc.yaml
- postgres.pod.yaml
- pgbench.job.yaml
- pgbench.serviceaccount.yaml
- pgbench.rbac.yaml
commonLabels:
app: pgbench
Well that’s more files than you might think! I needed the ServiceAccount to give the Job the permissions to wait on the postgres pod to be up. Rather than trying to explain it, the code should show it.
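The contents of pgbench.rbac.yaml aren’t reproduced in this post, but for the Job’s kubectl get/kubectl wait calls to work, the Role bound to the pgbench ServiceAccount needs roughly read access to pods – something like this sketch (the names and exact verbs here are my assumptions, not necessarily what the repo uses):

```yaml
# hypothetical Role/RoleBinding sketch for the pgbench ServiceAccount
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pgbench
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pgbench
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pgbench
subjects:
  - kind: ServiceAccount
    name: pgbench
```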
kubernetes/tests/pgbench/base/pgbench.job.yaml:
---
apiVersion: batch/v1
kind: Job
metadata:
name: pgbench
spec:
backoffLimit: 0
activeDeadlineSeconds: 600
template:
spec:
serviceAccount: pgbench
restartPolicy: Never
initContainers:
- name: wait-for-pg
image: bitnami/kubectl
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -exc
- |
echo -e "[info] waiting for ${POD_NAME} pod in namespace [${POD_NAMESPACE}] to exist ..."
n=0
until kubectl get pod ${POD_NAME} -n ${POD_NAMESPACE} ; do
echo -n "[info] waiting 10 seconds before trying again..."
sleep 10
done
echo -e "[info] taking brief wait because sometimes kubectl wait fails for no reason during pod state transition"
sleep 10
echo -e "[info] waiting for ${POD_NAME} pod in namespace [${POD_NAMESPACE}] to be ready..."
kubectl wait pod ${POD_NAME} -n ${POD_NAMESPACE} --for=condition=ready || true
env:
- name: POD_NAME
value: postgres
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
# Initialize pg bench database
- name: pgbench-initialize
image: postgres:13.2-alpine
imagePullPolicy: IfNotPresent
command:
- pgbench
- --initialize
- --foreign-keys
- --scale=100
env:
- name: PGHOST
value: postgres
- name: PGUSER
value: postgres
containers:
- name: pgbench
image: postgres:13.2-alpine
imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 4
memory: 8Gi
limits:
cpu: 4
memory: 8Gi
command:
- pgbench
- --jobs=4 # TODO: if resources.requests.limit.cpu is a non-millicore number we pass that through ENV and use it directly
env:
- name: PGHOST
value: postgres
- name: PGUSER
value: postgres
There’s some complexity for you! One of the funnest parts was finding out that kubectl wait just fails sometimes for apparently no reason.
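If I were hardening this further, a small retry wrapper is the usual fix for that kind of flakiness – a sketch in generic POSIX shell (the kubectl usage in the comment is illustrative, not run here):

```shell
#!/bin/sh
# generic retry helper: run COMMAND up to N times, pausing between attempts
retry() {
  attempts="$1"; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    sleep 1
  done
}

# against kubectl this would look like (illustrative):
#   retry 5 kubectl wait pod "$POD_NAME" -n "$POD_NAMESPACE" --for=condition=ready
retry 3 true && echo "ok"
```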
The Makefile for this was pretty straightforward so I’ll paste it here.
kubernetes/tests/pgbench/Makefile:
all: clean resources job-wait collect-results resources-uninstall
# NOTE: this include has to go *AFTER* EXPECTED_ADMIN_CONFIG_PATH has been set
include ./../../../kubernetes/kubie.mk
clean:
@echo -e "\n[info] *** clearing out resources that may or may not exist ***"
$(KUBECTL) delete -k $(OVERLAY_PATH) || true
#############
# Resources #
#############
resources:
$(KUBECTL) apply -k $(OVERLAY_PATH)
resources-uninstall:
$(KUBECTL) delete -k $(OVERLAY_PATH)
job-wait:
@echo "Waiting for job to finish (timeout 30m)..."
$(KUBECTL) wait job/pgbench --for=condition=complete --timeout=30m
collect-results:
@echo "Writing job/pgbench log output to [$(JOB_LOG_FILE_PATH)]..."
$(KUBECTL) logs job/pgbench > $(JOB_LOG_FILE_PATH)
Pretty easy, and it all runs reliably (after some tweaking and testing from me of course).
Here’s the totality of what pgbench produces:
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 2.234 ms
tps = 447.663503 (including connections establishing)
tps = 499.280487 (excluding connections establishing)
Nice and easy to read!
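Pulling the headline number out of that is similarly easy – a sketch (the sample output above is inlined so it’s self-contained; the awk field position is an assumption based on that format):

```shell
#!/bin/sh
# inline the sample pgbench output so the sketch is self-contained
cat > /tmp/pgbench.log <<'EOF'
latency average = 2.234 ms
tps = 447.663503 (including connections establishing)
tps = 499.280487 (excluding connections establishing)
EOF

# grab the tps figure that excludes connection establishment time
tps="$(awk '/excluding/ { print $3 }' /tmp/pgbench.log)"
echo "tps (excluding connections): $tps"
```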
oltpbench (quickly abandoned)

oltpbench is a project with an impressive amount of benchmarking tools and a common way to configure and run them. Unfortunately, after building the image I ran into an error, and I have absolutely zero desire to debug Java to get to the bottom of it. If the error was reasonable, sure – but it seems to be coming from a file I didn’t even submit, a file that represents (from what I can tell) static information. It must work great for someone out there, but that person isn’t me.
Here’s the StringIndexOutOfBoundsException (ugh) stack trace if you’re into that sort of thing:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.commons.configuration.ConfigurationUtils.toURL(ConfigurationUtils.java:739)
at org.apache.commons.configuration.ConfigurationUtils.locate(ConfigurationUtils.java:518)
at org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:213)
at org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:197)
at org.apache.commons.configuration.AbstractHierarchicalFileConfiguration.load(AbstractHierarchicalFileConfiguration.java:164)
at org.apache.commons.configuration.AbstractHierarchicalFileConfiguration.<init>(AbstractHierarchicalFileConfiguration.java:91)
at org.apache.commons.configuration.XMLConfiguration.<init>(XMLConfiguration.java:243)
at com.oltpbenchmark.DBWorkload.main(DBWorkload.java:87)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3751)
at java.base/java.lang.String.substring(String.java:1907)
at org.apache.commons.lang.SystemUtils.getJavaVersionAsFloat(SystemUtils.java:1133)
at org.apache.commons.lang.SystemUtils.<clinit>(SystemUtils.java:818)
... 8 more
If it wasn’t bad enough that I had to configure things with XML (instead of ENV variables), the thought that I would have to debug their Java code was it for me. I did give it a tiny look and of course, the error is coming from a file I didn’t even add (judging by the stack trace, the old commons-lang version in play assumes pre-Java-9 version strings like 1.8, so parsing a modern two-character java.version like “11” blows up):
// create the command line parser
CommandLineParser parser = new PosixParser();
XMLConfiguration pluginConfig=null;
try {
pluginConfig = new XMLConfiguration("config/plugin.xml"); // <---------- error is here
} catch (ConfigurationException e1) {
LOG.info("Plugin configuration file config/plugin.xml is missing");
e1.printStackTrace();
}
That was enough for me to nope out, so… I’m dropping oltpbench from this round.
kastenhq/kubestr?

I actually didn’t know about kastenhq/kubestr until relatively recently – I found out about it thanks to The Kubernetes Podcast. It only runs fio, which isn’t that hard, so I think I’ll skip it for now; I’m happy with my fio, pgbench, and oltpbench pods.
OK this was a pretty quick post – since I want to make a post with only the results available (that’s the highest value post in the bunch!), I’ll leave it here!
Well this was a bit of a surprise – I realized when the raid1 tests for Mayastor were failing that Mayastor doesn’t actually support (or plan to support) same-node disk failure domains. Looks like the Mayastor tests will have to be JBOD only – it sure would have been nice to include this, so that even single-node cluster operators can get some assurance, but I guess the move here is to just make the MayastorPool object with the 2 disks in it to begin with…? Nope, Mayastor doesn’t do RAID (which is probably a good idea on their part – they want to keep it light). So the move here, if you want redundancy on the lower disks, is to add your software RAID via LVM or ZFS and use that.

All in all not a terrible trade-off – Mayastor doesn’t support compression or dedup, and it doesn’t support snapshots or clones in the GA. I have to say I’m not as disappointed as I was with other tech being tested for not being able to get same-node disk failure domains – they’ve made sure to clearly spell it out, without 5 pages of “buy our consulting” to wade through first. A nice pointed FAQ goes a long way. At this point I have to wonder if I’m just biased for OpenEBS – they’ve had the easiest-to-get-started robust solution for a long time, they use Rust and realize its value (which also says something about the team’s technical chops), and I’ve even had some email exchanges with their CEO; I just like their vibe.
Anyway, lucky for you dear reader, the benchmark is coded up so you can reproduce it yourself, you do not have to trust me not to be biased.
So cStor doesn’t do it either, which is reasonable I guess? But what it does do is allow you to create what I assume maps to a ZFS mirrored pool, and use that. So in this case what I’ve defined as “JBOD” (a single disk with no replication) is really single-node RAID1 underneath. Since this is the case (and how I have it coded right now), what I’ll do is rename the jbod storage class to raid1, and when I test with bigger clusters in the future I will make a raid1-ha that represents multi-node with RAID1 on the disks underneath (in its various forms – whether done by LVM/ZFS or the orchestrator).
Jiva is excited to take a space (folder) on a disk, so in a somewhat different way it actually doesn’t do RAID1, but only because what I’ve given it is a StoragePool for each disk. This is what I’m running right now (this very blog post, at time of writing) because it was the easiest way to get up and running without disturbing the enforced software RAID. What I’ve done here too is to simply delete the raid1 StorageClass, since what I’m really testing with the current code is “JBOD”.