
Starting 2022 with a Bang: Ceph on ZFS

OpenZFS Logo + Ceph Logo

tl;dr - Ceph (Bluestore) (via Rook) on top of ZFS (ZFS on Linux) (via OpenEBS ZFS LocalPV) on top of Kubernetes. It's as wasteful as it sounds – 200TPS on pgbench compared to ~1700TPS with lightly tuned ZFS and stock Postgres. The setup is at least usable and can get up to 1000TPS (2-replica Ceph block pool) with synchronous_commit=off and some other less risky optimizations. Check out the numbers at the end in context, and since this post is very Kubernetes heavy, if you don't run Kubernetes maybe check out the guide /u/darkz0r2 was kind enough to share.

UPDATE (2022/06/12)

After some feedback from a reader I've added a section on the commands I used for pgbench! Note that all tests were run on Hetzner, and this post isn't really meant for reproducibility (otherwise I would have included a repo), but I mostly wanted to share at least some of the results I saw!

Disclaimer: you almost certainly shouldn’t be doing this.

Yeah, so basically, don't do this. There are lots of reasons not to mix Ceph and ZFS – there are large overlaps in the featuresets of both these technologies. They're built for different things, but the combination of the two can lead to severe performance degradation (if you thought write amplification on one of those solutions was bad…).

As with most bad ideas, I set out to drum up some interest on Reddit, and the responses were intriguing – some advice, many well-wishers, and many faces pre-emptively in palms.

Why you shouldn’t do this

Let’s go into more specifics on why one shouldn’t do what I’m about to do.

Bluestore and ZFS are basically substitute goods in the economic sense for most practical uses. They overlap so much in functionality that it doesn’t make sense to use them together:

  • checksumming
  • compression
  • snapshotting
  • raid0/1/5 setups
  • advanced caching

So what does each of these systems distinctly bring to my use case?

  • ZFS: Copy on Write (“CoW”)
  • ZFS: local-only performance compared to High availability solutions
  • Ceph: High availability via multi-node synchronous writes
  • Ceph: Advanced features: Cross-site/region mirroring, CephFS, Ceph Object Gateways

The most important place they don't overlap is high availability – Ceph is the obvious choice there, as it writes synchronously across nodes for both high durability and high availability.

Why I’m going to do it anyway

I've been on a bit of a F/OSS distributed/networked storage system kick lately. Right now I'm running Longhorn on top of a pre-provisioned, hard-coded per-node ZFS ZVOL. This means the DaemonSets that are part of the Longhorn application use a pre-existing ZVOL, mounted at a specific point on disk (ex. /var/data/longhorn), which Longhorn interacts with as if it were a normal drive, but which is actually a synthetic volume powered by ZFS (mediated by OpenEBS ZFS LocalPV).
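For context, pre-provisioning a ZVOL like that looks roughly like the following – a minimal sketch, where the pool name (tank), the size, and the mountpoint are stand-ins for whatever your setup actually uses:

# Create a sparse ZVOL on the existing pool, format it, and mount it where Longhorn expects a disk
zfs create -s -V 200G tank/longhorn
mkfs.ext4 /dev/zvol/tank/longhorn
mkdir -p /var/data/longhorn
mount /dev/zvol/tank/longhorn /var/data/longhorn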

Longhorn has generally been great to set up, comes with great utilities, and is run by a great set of maintainers, but it's undeniable that Ceph is currently the industry leader. CERN trusts Ceph, cloud providers like DigitalOcean and Hetzner run their object and block storage with Ceph, etc. I'm a bit more interested in the object storage capabilities of Ceph (and of course the general storage is great too), so I need to see this for myself and see how Ceph compares to Longhorn, both running on top of ZFS.

Basically I'm chasing a world where I can get the features of ZFS and the HA of Ceph without a complete collapse in performance, and I want to see what that world is like. As noted previously, the problem is that Ceph's awesome new storage engine (Bluestore) overlaps quite heavily with ZFS's features, so I'm already doing something that's somewhat verboten, and it will be up to me to configure the systems accordingly where possible to make this bad decision a little easier to swallow.

So basically I plan to:

  • Use ZFS for higher performance/data safety workloads, usually with replication at the application level (ex. Postgres)
  • Use Ceph when I want to power less demanding workloads with high durability/availability that should easily survive the death of a single node
  • Use Ceph’s advanced features (ex. CephFS, Ceph Object Gateway)

Before we start: what about Filestore?

Filestore is Ceph's older storage backend; it actually runs much more easily on ZFS and clashes much less with it feature-wise. This came up in the Reddit thread, but I think it's better that I avoid building on Filestore – I don't know how much longer Filestore is going to be around, and while I considered running it as part of the testing, it just isn't worth it. Bluestore seems to be strictly better than Filestore and is almost certainly the way forward.

The basic plan

The early plan was to do the following:

  • (optional) Pre-provision ZVOLs on ZFS
  • Expose ZVOLs to Rook (auto provisioned by OpenEBS ZFS LocalPV)
  • Run Rook using its OSD-on-PVC support

I don't think I have to pre-provision the ZVOLs (like with Longhorn) because Rook supports running on Block PVCs if they're present, but the setup above is pretty obvious – we're gonna make some PVCs, and then get Rook to use them instead of raw disks.

The write path for RAID1 (mirroring) with 2 copies should look something like this (I’ve fudged lots of corners, so don’t quote me!):

graph TB;
  A(Postgres) -->| fsync | B(PVC)
  B -->| primary | C(Ceph OSD A)
  C -->| replication | D(Ceph OSD B)
  D -->| replication | C(Ceph OSD A)
  C -->| fsync | E(ZFS ZVOL)
  E -->| write | F(ZFS Dataset)
  F -->| write | G(ZFS ZPool)
  G -->| write | H(ZFS VDEV)
  H -->| write | I(SSD)

That's a lot of indirection/abstraction layers to shuffle through. Remember when we talked about why this is a bad idea? If you didn't believe me before, I hope you do now.

Addressing feature overlap between ZFS and Ceph

We’ve covered the feature overlap between ZFS and Ceph, but how would we actually address it? Well surely one of the sides can just… be configured to do less work, if it’s due to be redone later?

Disabling some functionality – Ceph or ZFS?

The easiest solution is to just disable the overlapping parts where they can be disabled, at the higher abstraction level. Both Ceph and ZFS are configurable, so we have a different conundrum – which one do we disable things on?

  • Hard requirement: ZFS is RAID1-ing both my drives underneath
  • Soft opinion: ZFS should probably do the checksumming
  • Soft opinion: ZFS should probably do the compression
  • Soft opinion: ZFS should probably do the caching (it’s somewhat difficult/confusing to control ARC)

I haven't mentioned it much until now, but this can actually go both ways – Ceph can run on ZFS and ZFS could run on Ceph. In my case Ceph on ZFS makes more sense because I want ZFS to be the abstraction right above my disks, but I'm open to passionate diatribes about the benefits of having Ceph be the piece of storage infrastructure with access to the metal. I believe that having ZFS down there gives me flexibility – I can use ZFS for direct local manipulation that doesn't require remote writes, and as far as I remember there's not really a reasonable way to force Ceph to keep writes local (pools with no replicas trigger a warning, so it's arguable that's not really what Ceph is for).

At this point, since ZFS sits below Ceph, Ceph is probably the easiest side to disable overlapping features on. Disabling features on Ceph basically means forcing it to blindly trust the underlying storage (in this case, ZFS).

Finding the existing literature

Turns out there's a guide on this! The OpenZFS documentation has a small section dedicated to ZFS & Ceph, which I came across thanks to a comment on a GitHub issue. As you might expect, there are many more warnings in there about just why doing what I'm about to do is a bad idea:

Use a separate journal device. Do not collocate CEPH journal on ZFS dataset if at all possible, this will quickly lead to terrible fragmentation, not to mention terrible performance upfront even before fragmentation (CEPH journal does a dsync for every write).

Use a SLOG device, even with a separate CEPH journal device. For some workloads, skipping SLOG and setting logbias=throughput may be acceptable.

Use a high-quality SLOG/CEPH journal device. A consumer based SSD, or even NVMe WILL NOT DO (Samsung 830, 840, 850, etc) for a variety of reasons. CEPH will kill them quickly, on-top of the performance being quite low in this use. Generally recommended devices are [Intel DC S3610, S3700, S3710, P3600, P3700], or [Samsung SM853, SM863], or better.

If using a high quality SSD or NVMe device (as mentioned above), you CAN share SLOG and CEPH Journal to good results on single device. A ratio of 4 HDDs to 1 SSD (Intel DC S3710 200GB), with each SSD partitioned (remember to align!) to 4x10GB (for ZIL/SLOG) + 4x20GB (for CEPH journal) has been reported to work well.

I'm running afoul of just about every single one of these warnings. I don't have any of the attached SSDs partitioned for the ZIL/SLOG or Ceph journal, they're all essentially consumer grade, and I'm colocating the Ceph journal and the ZFS dataset. Since this is a proof of concept, I can live with that.
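For reference, a minimal sketch of what following that advice might look like – assuming a hypothetical dedicated, enterprise-grade SSD at /dev/nvme1n1 and a pool named tank (sizes and device names are stand-ins):

# Carve out partitions for the ZIL/SLOG and the Ceph journal on the dedicated SSD
sgdisk --new=1:0:+10G /dev/nvme1n1   # partition 1: ZIL/SLOG
sgdisk --new=2:0:+20G /dev/nvme1n1   # partition 2: Ceph journal
# Attach the first partition to the ZFS pool as a separate log (SLOG) device
zpool add tank log /dev/nvme1n1p1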

Slightly more relevant to me, /u/darkz0r2 was kind enough to share a script on Reddit showing how he sets up Ceph on ZFS. This is a great guide for those out there who want to try a setup like this but do it properly. Feel free to ignore the Gluster stuff at the bottom!

Getting it done

Since I’m doing this on Kubernetes (the quick and dirty way) there are only a few files that are really interesting/different from the typical Ceph setup as provisioned/managed by Rook:

📜️ ceph-on-zfs.cephcluster.yaml (click to expand)
#
# See:
# - https://github.com/rook/rook/blob/release-1.8/deploy/examples/cluster-on-pvc.yaml
# - https://github.com/rook/rook/blob/master/design/ceph/ceph-config-updates.md
#
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ceph-on-zfs
spec:
  dataDirHostPath: /var/lib/rook

  mon:
    # Set the number of mons to be started. Must be an odd number, and is generally recommended to be 3.
    count: 3

    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false

    # A volume claim template can be specified in which case new monitors (and
    # monitors created during fail over) will construct a PVC based on the
    # template for the monitor's primary storage. Changes to the template do not
    # affect existing monitors. Log data is stored on the HostPath under
    # dataDirHostPath. If no storage requirement is specified, a default storage
    # size appropriate for monitor data will be used.
    volumeClaimTemplate:
      spec:
        storageClassName: localpv-zfs-ceph
        resources:
          requests:
            storage: 10Gi

  cephVersion:
    image: quay.io/ceph/ceph:v16.2.7
    allowUnsupported: false

  skipUpgradeChecks: false

  continueUpgradeAfterChecksEvenIfNotHealthy: false

  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true

  dashboard:
    enabled: true
    # ssl: true

  crashCollector:
    disable: false

  resources:
    prepareosd:
      limits:
        cpu: "200m"
        memory: "200Mi"
      requests:
        cpu: "200m"
        memory: "200Mi"

  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api

  storage:
    # when onlyApplyOSDPlacement is false, will merge both placement.All() and storageClassDeviceSets.Placement.
    onlyApplyOSDPlacement: false

    storageClassDeviceSets:
      - name: ceph-on-zfs
        # The number of OSDs to create from this device set
        count: 3

        # IMPORTANT: If volumes specified by the storageClassName are not portable across nodes
        # this needs to be set to false. For example, if using the local storage provisioner
        # this should be false.
        portable: false

        # Certain storage class in the Cloud are slow
        # Rook can configure the OSD running on PVC to accommodate that by tuning some of the Ceph internal
        # Currently, "gp2" has been identified as such
        tuneDeviceClass: false # local ZFS is pretty fast...

        # Certain storage class in the Cloud are fast
        # Rook can configure the OSD running on PVC to accommodate that by tuning some of the Ceph internal
        # Currently, "managed-premium" has been identified as such
        tuneFastDeviceClass: true

        # whether to encrypt the deviceSet or not
        encrypted: false

        # Since the OSDs could end up on any node, an effort needs to be made to spread the OSDs
        # across nodes as much as possible. Unfortunately the pod anti-affinity breaks down
        # as soon as you have more than one OSD per node. The topology spread constraints will
        # give us an even spread on K8s 1.18 or newer.
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd

        # (still under storage class deviceset)
        preparePlacement:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd-prepare
                  topologyKey: kubernetes.io/hostname
          topologySpreadConstraints:
            - maxSkew: 1
              # IMPORTANT: If you don't have zone labels, change this to another key such as kubernetes.io/hostname
              #topologyKey: topology.kubernetes.io/zone
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd-prepare

        # (still under storage class deviceset)
        # These are the OSD daemon limits. For OSD prepare limits, see the separate section below for "prepareosd" resources
        resources:
          limits:
            cpu: "500m"
            memory: "4Gi"
          requests:
            cpu: "500m"
            memory: "4Gi"

        volumeClaimTemplates:
          - metadata:
              name: ceph-osd-data
              # if you are looking at giving your OSD a different CRUSH device class than the one detected by Ceph
              # annotations:
              #   crushDeviceClass: hybrid
            spec:
              # IMPORTANT: Change the storage class depending on your environment
              storageClassName: localpv-zfs-ceph
              resources:
                requests:
                  storage: 25Gi
              volumeMode: Block
              accessModes:
                - ReadWriteOnce

Note that the code above assumes Rook v1.8. Another file I spent a lot of time editing/tweaking was the StorageClass that backs those PVCs:

📜️ localpv-zfs-ceph StorageClass (click to expand)
#
# This StorageClass aims to be optimal for Ceph-on-ZFS usage
#
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: localpv-zfs-ceph
provisioner: zfs.csi.openebs.io
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer # required for Ceph mon startup!
parameters:
  volblocksize: "4k" # underlying hard drive
  recordsize: "128k"
  ## Compression implicitly is 'zstd' due to older version of OpenEBS ZFS LocalPV which doesn't support specifying 'zstd'
  # compression: "lz4"
  dedup: "off"
  atime: "off"
  relatime: "on"
  xattr: "sa"
  # Ceph prefers xfs, so that's what we use here – even though it means an extra layer (a ZVOL formatted with xfs) on top of ZFS
  csi.storage.k8s.io/fstype: "xfs"
  poolname: "tank"
allowedTopologies:
  - matchLabelExpressions:
      - key: storage.zfs
        values:
          - "true"

As far as hiccups go, here's what I ran into:

  • You need Block-mode (not Filesystem mode) PVCs, and xfs is preferred by Ceph so that’s what we use (even though it’s ZFS underneath at the end of the day).
  • WaitForFirstConsumer was required for the PVCs (see the StorageClass above)
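Not shown above (because it's essentially the stock Rook example) is the 2-replica block pool and RBD StorageClass that the pgbench runs later in this post consume. A rough sketch of what that looks like – the names (replicapool, ceph-block-2rep) are hypothetical, and the CSI secret names assume a default Rook v1.8 install:

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 2 # RAID1-style: two copies, spread across hosts
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-2rep
provisioner: rook-ceph.rbd.csi.ceph.com # <operator namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete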

What it looks like when you’re done

There were a few hiccups, but filling out the resources and messing with the configuration was mostly mechanical for me at this point (I had a good place to start, as this isn't my first time running Rook on Kubernetes). Once you've got everything to a steady state, the rook-ceph namespace (or another namespace if you changed it) looks something like this:

$ k get pods
NAME                                                            READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-6mslw                                          3/3     Running     0          10h
csi-cephfsplugin-7gv45                                          3/3     Running     0          10h
csi-cephfsplugin-8npp2                                          3/3     Running     0          10h
csi-cephfsplugin-klmps                                          3/3     Running     0          10h
csi-cephfsplugin-provisioner-f6f4cc497-76d5z                    6/6     Running     0          10h
csi-cephfsplugin-provisioner-f6f4cc497-89ccq                    6/6     Running     3          10h
csi-rbdplugin-5rkmf                                             3/3     Running     0          10h
csi-rbdplugin-82sgc                                             3/3     Running     0          10h
csi-rbdplugin-fvmjq                                             3/3     Running     0          10h
csi-rbdplugin-provisioner-88c8769c7-7pzh8                       7/7     Running     0          10h
csi-rbdplugin-provisioner-88c8769c7-bwt65                       7/7     Running     4          10h
csi-rbdplugin-vnh9s                                             3/3     Running     0          10h
rook-ceph-crashcollector-node-2.eu-central-1-54b59f4f94-r9xhh   1/1     Running     0          7m22s
rook-ceph-crashcollector-node-5.eu-central-1-865fffcb9c-kchkl   1/1     Running     0          7m17s
rook-ceph-crashcollector-node-6.eu-central-1-7987549b5d-55dd2   1/1     Running     0          7m21s
rook-ceph-mgr-a-679d945bff-w2jtv                                1/1     Running     0          8m3s
rook-ceph-mon-a-7d98f6c475-l246s                                1/1     Running     0          8m35s
rook-ceph-mon-b-68bccf9b9b-xg7tz                                1/1     Running     0          8m36s
rook-ceph-mon-c-59cb5c777c-jdksn                                1/1     Running     0          8m34s
rook-ceph-operator-7bc4985c8d-56z87                             1/1     Running     0          10m
rook-ceph-osd-0-56f4b7b866-hxz2t                                1/1     Running     0          7m22s
rook-ceph-osd-1-75f5cbfbcd-jhqk4                                1/1     Running     0          7m21s
rook-ceph-osd-2-7675f76dcb-j7d5t                                1/1     Running     0          7m17s
rook-ceph-osd-prepare-7250dbb9c4227d092934944df92922c6-rsrf6    0/1     Completed   0          7m42s
rook-ceph-osd-prepare-8430ee747b12c036d92b437f2a7cd8dd-ktbf5    0/1     Completed   0          7m41s
rook-ceph-osd-prepare-e72a7e5dd9c98f7594cf0ed017f13e4a-pjsnt    0/1     Completed   0          7m42s

And if you start the toolbox:

$ make toolbox-bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- /bin/bash
[rook@rook-ceph-tools-7bbd566fd9-jp4dn /]$ ceph status
  cluster:
    id:     225675a7-bfed-4640-b581-edb7dfcbd5ca
    health: HEALTH_WARN
            1 pool(s) have no replicas configured

  services:
    mon: 3 daemons, quorum a,b,c (age 16m)
    mgr: a(active, since 15m)
    osd: 3 osds: 3 up (since 15m), 3 in (since 16m)

  data:
    pools:   4 pools, 97 pgs
    objects: 3 objects, 57 B
    usage:   17 MiB used, 75 GiB / 75 GiB avail
    pgs:     97 active+clean

That warning is about the one pool I have configured with zero replication, called rook-jbod. Removing it is just a matter of deleting the pool (sketched below, assuming it was declared as a Rook CephBlockPool custom resource):
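kubectl -n rook-ceph delete cephblockpool rook-jbod

And afterwards: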

$ make toolbox-bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- /bin/bash
[rook@rook-ceph-tools-7bbd566fd9-jp4dn /]$ ceph status
  cluster:
    id:     225675a7-bfed-4640-b581-edb7dfcbd5ca
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 20m)
    mgr: a(active, since 19m)
    osd: 3 osds: 3 up (since 19m), 3 in (since 19m)

  data:
    pools:   3 pools, 65 pgs
    objects: 2 objects, 38 B
    usage:   17 MiB used, 75 GiB / 75 GiB avail
    pgs:     65 active+clean

And if you check the dashboard:

Ceph dashboard healthy cluster

Let’s check if the checksumming is off…

Ceph dashboard showing configuration where checksum is on

Uh oh, looks like the configuration didn't take quite as we wanted it to. But let's get back to this later! I'm excited to see how pgbench will run.

pgbench configuration

Here’s the configuration I used for pgbench, excuse that it’s wrapped up in a Kubernetes Job resource:

📜️ pgbench Job definition (click to expand)
# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pgbench
  namespace: default
spec:
  backoffLimit: 0
  # activeDeadlineSeconds: 3600
  template:
    spec:
      restartPolicy: Never
      automountServiceAccountToken: false
      initContainers:
        # Wait for postgres to show up
        - name: wait
          image: busybox:latest
          imagePullPolicy: IfNotPresent
          command: ['sh', '-c', 'until nc -vz ${POD_NAME}.${POD_NAMESPACE} 5432; do echo "Waiting for postgres..."; sleep 3; done;']
          env:
            - name: POD_NAME
              value: postgres
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

        # Initialize pg bench database
        - name: init
          image: postgres:14.1-alpine
          imagePullPolicy: IfNotPresent
          command:
            - pgbench
            - --initialize
            - --foreign-keys
            - --scale=1000
          env:
            - name: PGHOST
              value: postgres
            - name: PGUSER
              value: postgres
            - name: PGPASSWORD
              value: postgres

      containers:
        - name: pgbench
          image: postgres:14.1-alpine
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 4
              memory: 8Gi
            limits:
              cpu: 4
              memory: 8Gi
          command:
            - pgbench
            - --report-latencies
            - --jobs=4
            - --client=40
            - --time=300
          env:
            - name: PGHOST
              value: postgres
            - name: PGUSER
              value: postgres
            - name: PGPASSWORD
              value: postgres
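
For readers not running Kubernetes: the Job above boils down to roughly these two pgbench invocations, assuming the usual PGHOST/PGUSER/PGPASSWORD environment variables point at your Postgres instance:

# Initialize the pgbench database (scale factor 1000, with foreign keys)
pgbench --initialize --foreign-keys --scale=1000
# Run the benchmark: 4 threads, 40 clients, 300 seconds, reporting per-statement latencies
pgbench --report-latencies --jobs=4 --client=40 --time=300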

First pgbench run

I set up some k8s resources to run pgbench using a PVC supplied by Ceph (via Rook). The StorageClass used is backed by a RAID1-style (2-copy) Ceph block pool – like the one sketched earlier – so let's see just how fast it runs! The test was easy to watch from the dashboard:

Ceph dashboard performance metrics during first pgbench run

While watching the dashboard move around I saw as "high" as 803 IOPS (7xx reads, <100 writes). Not great (you can't even provision an RDS cluster with fewer than 1000 provisioned IOPS), but with the inherent waste of this setup I'm glad to see it hit at least that. We also haven't yet looked into removing the double checksumming or otherwise optimizing, so it's not a bad place to start. Of course, we're very far away from (way, way below) the actual capabilities of the underlying hardware.

Here's the end of the output for the initialization of pgbench:

...
99800000 of 100000000 tuples (99%) done (elapsed 458.84 s, remaining 0.92 s)
99900000 of 100000000 tuples (99%) done (elapsed 459.22 s, remaining 0.46 s)
100000000 of 100000000 tuples (100%) done (elapsed 459.47 s, remaining 0.00 s)
vacuuming...
creating primary keys...
creating foreign keys...
done in 892.50 s (drop tables 0.00 s, create tables 0.04 s, client-side generate 464.44 s, vacuum 209.43 s, primary keys 180.31 s, foreign keys 38.27 s).

And the output once pgbench actually ran:

📜️ PGBench output (click to expand)
$ k logs -f job/pgbench
pgbench (14.1)
starting vacuum...end.
pgbench: error: client 10 script 0 aborted in command 5 query 0: ERROR:  could not read block 96283 in file "base/13756/16404": read only 4224 of 8192 bytes
pgbench: error: client 34 script 0 aborted in command 5 query 0: ERROR:  could not read block 261441 in file "base/13756/16404.1": Bad address
pgbench: error: client 1 script 0 aborted in command 5 query 0: ERROR:  could not read block 66980 in file "base/13756/16404": Bad address
pgbench: error: client 33 script 0 aborted in command 5 query 0: ERROR:  could not read block 103761 in file "base/13756/16404": Bad address
pgbench: pgbench: error: error: client 9 script 0 aborted in command 5 query 0: ERROR:  could not read block 272240 in file "base/13756/16404.2": Bad address
client 19 script 0 aborted in command 5 query 0: ERROR:  could not read block 146609 in file "base/13756/16404.1": Bad address
pgbench: error: client 37 script 0 aborted in command 5 query 0: ERROR:  could not read block 242121 in file "base/13756/16404.1": Bad address
pgbench: error: pgbench: error: client 15 script 0 aborted in command 5 query 0: ERROR:  could not read block 69293 in file "base/13756/16404": Bad address
client 3 script 0 aborted in command 5 query 0: ERROR:  could not read block 64169 in file "base/13756/16404": Bad address
pgbench: error: client 36 script 0 aborted in command 5 query 0: ERROR:  could not read block 88786 in file "base/13756/16404": Bad address
pgbench: error: client 16 script 0 aborted in command 5 query 0: ERROR:  could not read block 254039 in file "base/13756/16404.1": Bad address
pgbench: error: pgbench: error: client 39 script 0 aborted in command 5 query 0: ERROR:  could not read block 139592 in file "base/13756/16404.1": Bad address
client 4 script 0 aborted in command 5 query 0: ERROR:  could not read block 69032 in file "base/13756/16404": Bad address
pgbench: error: client 17 script 0 aborted in command 5 query 0: ERROR:  could not read block 46228 in file "base/13756/16404": Bad address
pgbench: error: client 29 script 0 aborted in command 5 query 0: ERROR:  could not read block 138865 in file "base/13756/16404.1": Bad address
pgbench: error: client 5 script 0 aborted in command 5 query 0: ERROR:  could not read block 157791 in file "base/13756/16404.1": Bad address
pgbench: error: client 18 script 0 aborted in command 5 query 0: ERROR:  could not read block 39901 in file "base/13756/16404": Bad address
pgbench: error: client 6 script 0 aborted in command 5 query 0: ERROR:  could not read block 112582 in file "base/13756/16404": Bad address
pgbench: error: client 7 script 0 aborted in command 5 query 0: ERROR:  could not read block 62002 in file "base/13756/16404": Bad address
pgbench: error: client 8 script 0 aborted in command 5 query 0: ERROR:  could not read block 227032 in file "base/13756/16404.1": Bad address
pgbench: error: client 2 script 0 aborted in command 5 query 0: ERROR:  could not read block 123751 in file "base/13756/16404": Bad address
pgbench: error: client 27 script 0 aborted in command 5 query 0: ERROR:  could not read block 202213 in file "base/13756/16404.1": Bad address
pgbench: error: client 22 script 0 aborted in command 5 query 0: ERROR:  could not read block 59944 in file "base/13756/16404": Bad address
pgbench: error: client 38 script 0 aborted in command 5 query 0: ERROR:  could not read block 12051 in file "base/13756/16404": Bad address
pgbench: error: client 20 script 0 aborted in command 5 query 0: ERROR:  could not read block 59586 in file "base/13756/16404": Bad address
pgbench: pgbench:pgbench: error:  client 32 script 0 aborted in command 5 query 0: ERROR:  could not read block 141604 in file "base/13756/16404.1": Bad address
error: error: client 0 script 0 aborted in command 5 query 0: ERROR:  could not read block 224083 in file "base/13756/16404.1": Bad address
client 12 script 0 aborted in command 5 query 0: ERROR:  could not read block 227347 in file "base/13756/16404.1": Bad address
pgbench: error: client 23 script 0 aborted in command 5 query 0: ERROR:  could not read block 98306 in file "base/13756/16404": Bad address
pgbench: error: pgbench: error: client 30 script 0 aborted in command 5 query 0: ERROR:  could not read block 159243 in file "base/13756/16404.1": Bad address
client 11 script 0 aborted in command 5 query 0: ERROR:  could not read block 21271 in file "base/13756/16404": Bad address
pgbench: error: client 24 script 0 aborted in command 5 query 0: ERROR:  could not read block 255084 in file "base/13756/16404.1": Bad address
pgbench: error: client 31 script 0 aborted in command 5 query 0: ERROR:  could not read block 166539 in file "base/13756/16404.1": Bad address
pgbench: error: client 13 script 0 aborted in command 5 query 0: ERROR:  could not read block 145356 in file "base/13756/16404.1": Bad address
pgbench: error: client 25 script 0 aborted in command 5 query 0: ERROR:  could not read block 25645 in file "base/13756/16404": Bad address
pgbench: error: client 26 script 0 aborted in command 5 query 0: ERROR:  could not read block 141343 in file "base/13756/16404.1": Bad address
pgbench: error: client 35 script 0 aborted in command 5 query 0: ERROR:  could not read block 157813 in file "base/13756/16404.1": Bad address
pgbench: error: client 14 script 0 aborted in command 5 query 0: ERROR:  could not read block 189722 in file "base/13756/16404.1": Bad address
pgbench: error: client 21 script 0 aborted in command 5 query 0: ERROR:  could not read block 217253 in file "base/13756/16404.1": Bad address
pgbench: error: client 28 script 0 aborted in command 5 query 0: ERROR:  could not read block 180584 in file "base/13756/16404.1": Bad address
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 1231
pgbench: fatal: Run was aborted; the above results are incomplete.
latency average = 36.432 ms
initial connection time = 89.884 ms
tps = 1097.924913 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.660  BEGIN;
         1.334  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.873  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.968  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.506  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.955  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        29.517  END;

Uh oh, disaster has struck – we're getting some data corruption. This actually came up in the past, in the post I did on everything I've seen for optimizing Postgres on ZFS, but the resolutions that applied then are still in place!

Debug: Data corruption rears its head again

Are the mitigations from last time still in place at the server-level?

First, let's check the drive-level volatile write cache setting on my servers:

# nvme get-feature -f 6 /dev/nvme0n1
get-feature:0x6 (Volatile Write Cache), Current value:0x000001

Yeah, the lower-level volatile write cache is definitely not off at the disk level (a current value of 0x1 means the volatile write cache is enabled).

Does Postgres need special configuration? (yes)

Well, the only thing I can think of that could possibly be an issue is the detected default for wal_sync_method. With recent kernels it resolves to fdatasync, but I've had problems with that in the past (which were resolved by forcing wal_sync_method=fsync), so I think I'll give that a shot and try again…
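Concretely, the change is just this (a sketch – apply it however you normally manage your Postgres configuration, then reload/restart):

# postgresql.conf
wal_sync_method = fsync   # instead of the detected default (fdatasync on recent Linux kernels)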

Second time around I’m getting some nice IOPS numbers:

Ceph dashboard performance metrics 1000 IOPS after WAL sync method change

Looks like it’s taking much longer to init:

...
100000000 of 100000000 tuples (100%) done (elapsed 991.46 s, remaining 0.00 s)
vacuuming...
creating primary keys...
creating foreign keys...
done in 1382.26 s (drop tables 0.00 s, create tables 0.12 s, client-side generate 997.58 s, vacuum 161.32 s, primary keys 183.83 s, foreign keys 39.41 s).

And for the actual output of the run…

📜️ PGBench output (click to expand)
[frodo|default] mrman 14:08:32 [ceph-on-zfs] $ k logs pgbench-hxt4f
pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 71424
latency average = 168.066 ms
initial connection time = 118.385 ms
tps = 238.002372 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.805  BEGIN;
        28.501  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         1.131  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
        27.944  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
        15.730  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
        14.683  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        79.308  END;

Good news: no corruption! Bad news: absolutely terrible TPS. As you might expect, synchronous writes mean quite a bit more latency – the average being ~168ms, which is pretty high considering these servers are "right next to each other".

We've a long way to go as far as optimization goes (we've done none so far), and maybe it just doesn't make much sense to do this in this way. It's arguable that a database is precisely not the right thing to run on this kind of setup (considering I don't have access to fiber and a big SAN), but I think the data still carries some meaning. At the very least we know just how badly the naive approach will do.

Optimizations

Applying ZFS optimizations

OK, so ~200TPS is pretty bad. Underneath, we know the writes are going to a ZVOL with all the goodies of ZFS transparently available, so maybe we can re-apply some of the ZFS-related optimizations from the earlier post about Postgres (a config sketch follows the list). In particular:

  • synchronous_commit=off
  • full_page_writes=off
  • wal_init_zero=off
  • wal_recycle=off
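
A minimal sketch of applying those settings via postgresql.conf (or however you manage your Postgres config) – see the discussion further below before copying the riskier ones:

# ZFS-aware Postgres settings (synchronous_commit=off has real tradeoffs – see below)
synchronous_commit = off   # don't wait for WAL flush on commit (crash = small window of lost transactions)
full_page_writes = off     # safe on ZFS: copy-on-write means no torn pages
wal_init_zero = off        # zero-filling new WAL segments buys nothing on a CoW filesystem
wal_recycle = off          # recycling WAL segments buys nothing on a CoW filesystem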

pgbench re-run

With the ZFS optimizations applied:

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 418864
latency average = 28.642 ms
initial connection time = 98.567 ms
tps = 1396.556798 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.624  BEGIN;
        22.312  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.871  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         2.243  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.891  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.997  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.689  END;

Awesome, we're back up to ~1400 TPS!

And once again just to make sure:

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 453943
latency average = 26.437 ms
initial connection time = 103.577 ms
tps = 1513.007884 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.622  BEGIN;
        19.970  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.881  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         2.254  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.986  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.015  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.687  END;

OK, about 1500 TPS this time, and I still haven't done any Ceph-side optimization. Considering Longhorn was doing ~1000TPS while consuming quite a bit of CPU, this is great news. In fact, this news is so good it makes me question whether I'm being too optimistic. Let's go over the reasoning (again) for why the options I've picked aren't completely impractical for a "real" Postgres workload:

  • synchronous_commit=off
    • The database will never be in an inconsistent state, and we know that we have durability at the filesystem level every ~5 seconds thanks to ZFS flushes (txg syncs).
    • What this setting does mean is that data can be reported as committed but lost if the server crashes. That's pretty bad for a database, but isn't unheard of depending on the workload.
      • By default transactions will be treated as non-critical, and synchronous_commit can be re-enabled for individual critical transactions when needed (see the snippet after this list).
  • full_page_writes=off
    • This is fine, because ZFS will never create a torn page – see CoW.
  • wal_init_zero=off / wal_recycle=off
    • Both of these skip WAL-file housekeeping that buys nothing on a copy-on-write filesystem like ZFS.
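
For the per-transaction escape hatch mentioned above, here's a sketch of what that looks like in SQL (the table is hypothetical; synchronous_commit can be set per session or per transaction):

-- With synchronous_commit = off globally, opt back in for a critical transaction:
BEGIN;
SET LOCAL synchronous_commit = on;  -- applies only to this transaction
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;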

It would be nice to know what kind of performance we get with synchronous_commit enabled by default (as it would be if I were using this setup for NimbusWS, for example), so let's see:

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 134514
latency average = 89.218 ms
initial connection time = 101.130 ms
tps = 448.338449 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.718  BEGIN;
         1.177  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.910  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         1.077  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         2.516  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.935  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        81.897  END;

Yup, that's not great – ~400TPS (which at least is better than ~200!). It's pitiful performance given that we're running on two SSDs at the end of the day, but at least we know just where we stand with the naive approach.

Disabling checksums

NOTE Don’t do this – as mqudsi noted on Reddit, if your server lacks ECC RAM, the Ceph <-> ZFS layer is susceptible to corruption and should be left protected via the checksums at the Ceph level.

Turning off checksums at the Ceph level is both dangerous and doesn't appreciably change performance.

One thing we noted earlier as broken is that checksums are indeed still happening. Since we're running on ZFS, we know that the bottom layer will not lose data, so it's more than likely safe to turn off the "double work" Bluestore is doing with its own checksumming.

Turning it off manually is pretty easy via the Ceph dashboard, so I'll do that first, and going forward bake it in via the ceph-override-config (a sketch of which is below).
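A hedged sketch of what baking that in might look like via Rook's override ConfigMap (rook-config-override is the Rook convention; again, don't do this unless you've weighed the ECC caveat in the NOTE above):

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    # Trust the checksumming ZFS is already doing underneath (risky – see the NOTE above)
    bluestore_csum_type = none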

pgbench re-run

With the checksums disabled and the ZFS optimizations re-applied (including synchronous_commit=off):

kubectl logs job/pgbench
pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 464944
latency average = 25.812 ms
initial connection time = 94.121 ms
tps = 1549.647392 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.608  BEGIN;
        19.765  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.865  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         2.007  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.897  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.976  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.674  END;

OK, not much of a change there, so it looks like it's not worth worrying too much about whether checksums are disabled.

Results in context

Here’s a table of some best-of-two-runs pgbench results across various storage paradigms that I have set up right now (which are accessible in tandem):

| Setup | TPS | Avg Latency (ms) |
|-------|-----|------------------|
| ZFS (recordsize=8k, compression=zstd, pg shared_buffers=8GB) | 4048 | 9 |
| ZFS (recordsize=8k, compression=zstd) | 3178 | 12 |
| mdadm | 2348 | 17 |
| Ceph-on-ZFS RAID1 (recordsize=128k, pg synchronous_commit=off) | 1513 | 26 |
| Longhorn (wal_sync_method=fsync) | 737 | 54 |
| Ceph-on-ZFS RAID1 (synchronous_commit=on, other ZFS optimizations) | 448 | 89 |
| Ceph-on-ZFS RAID1 (default) | 238 | 168 |

NOTE All postgres instances used for the numbers above had 4vCores and 16GB of RAM.

As always, take your prerequisite grain of salt! The benchmarks cover really, really disparate systems, and I use them more to tell me how far a completely naive implementation gets me compared to what I know is possible with "plain" ZFS underneath and more production-appropriate settings. Even the production-appropriate settings, which are making about 4k TPS, could be improved upon. The expectation was always that Ceph on ZFS would perform badly compared to "plain" ZFS, but I'm glad to find out just how badly, and how workable it is.

There are a lot more angles worth checking out here, but I don't think I've ever seen a table with these data points on the internet before, so I'll leave it here for today – I do need to get back to work building Nimbus.

If you're interested in seeing some setups that are interesting but aren't written up here, please let me know or send me your own experiments! If you want me to do the experiments, get in touch and we'll figure something out (I'd probably do it if you only donated the material cost, as long as you are flexible on time!).

Places to optimize in the future

Here are at least some of the places where more optimization is needed:

  • Actually tune Ceph (settings, throw more compute and memory at OSDs, etc)
  • Tune XFS
    • I got some great feedback on Reddit on optimizing XFS and making sure the mkfs options match the underlying storage properly – similar in spirit to partitioning for the ZIL/SLOG and Ceph journal, but a separate piece of tuning.
  • Find more places to eliminate redundant work
  • Better drive partitioning for ZIL/SLOG and Ceph journal
    • I basically didn't do this right at all (see the previous sections where I note all the instructions I ignored). Much more care needs to go into prepping the underlying machine – it's not quite as easy as just throwing some PersistentVolumeClaims at the problem; more partitioning needed to happen.
  • Huawei recommendations for Ceph usage
    • Interestingly enough they’ve put out some large guides on tuning Ceph, with lots of information on tuning the system as well as Ceph itself.
  • btrfs?
    • If you look around the internet it looks like btrfs works great with Ceph, but this seems to be from the Filestore days… see the notes section of the OS recommendations.
    • Bluestore is recommended over Filestore, so maybe this should be ignored.

If you (yes you!) have some ideas on some improvements or noticed anything I did obviously and blatantly wrong, please reach out and let me know!

Wrapup

Thanks for coming along for the ride – looks like Ceph on ZFS is just about as bad as you might expect simply hearing about it. I was surprised to find that turning off the checksumming on the Bluestore side didn’t really make a difference.

That said, Ceph on ZFS is actually not half bad from a pure capability standpoint, especially when compared to Longhorn. Even in a massively unoptimized, naive case, being able to drive 1000TPS (admittedly for a synthetic pgbench workload, under a very questionable optimization like synchronous_commit=off) leaves me a lot of space for workloads that are nowhere near as demanding. The CPU requirements are also much easier to tune than Longhorn's, and the synchronous multi-machine durability that Ceph provides is obviously a draw.

Of course, with Ceph on ZFS I’ve gained access to what I really wanted – the advanced features of Ceph for when I don’t have a performance-sensitive workload – CephFS and Object Storage will be great to use around the cluster for little projects.

I’ll be tweaking and optimizing the setup as time goes and will do another writeup when I find time to really dig in and do some more tweaking – there’s a lot more to do.