K8s storage provider benchmarks round 2, part 3

tl;dr - I install even more providers, this time OpenEBS cStor, OpenEBS Jiva, OpenEBS LocalPV hostPath, OpenEBS LocalPV ZFS, and LINSTOR via kvaps/kube-linstor. I skipped OpenEBS LocalPV because it threw a panic linked to a supposed kernel issue complaining about old kernels. Ain’t nobody got time for that. The GitLab Repo is finally public though, so you could skip this entire article and go there.

UPDATE (04/10/2021)

I had some more issues with linstor than I thought -- I've updated the linstor section to reflect it, but basically along with some install ordering issues I forgot to enable the drbd (the 'd' was missing) kernel module somehow.

NOTE: This is a multi-part blog post!

  1. Part 1 - Intro & cloud server wrangling
  2. Part 2 - Installing storage plugins
  3. Part 3 - Installing more storage plugins (you are here)
  4. Part 4 - Configuring the tests
  5. Part 5 - The results

Context

In part 2 we worked through installing some of the various storage plugins and got a repeatable process – mostly kubectl apply -f and a little Makefile glue – for installing them (with some bits hardcoded to the server). Now it’s time to install even more storage plugins.

Hopefully this post won’t be too long – surely you’ve had enough exposition at this point – let’s get right into it.

OpenEBS cStor - STORAGE_PROVIDER=openebs-cstor

OpenEBS cStor is based on userspace ZFS, see the openebs/cstor repo which uses libcstor, a storage engine built to work on uZFS. In the past it was a bit iffy and performed slightly slower than the Jiva engine but I’m looking forward to seeing how it performs these days.

Pros/Cons

Some pros on using cStor:

  • uZFS based (so ZFS based)
  • Lots of supported functionality
  • Actively worked on by OpenEBS (I think it’s their favorite)

And some reasons that I’ve found to not use cStor:

  • Slower than Jiva in previous testing
  • Volumes must be thin-provisioned, so over-provisioning is possible (see known limitations section)
  • Snapshots are somewhat delayed (possibly minutes or more)

I haven’t actually run cStor in my production cluster up until now – I’ve used Jiva mainly and cStor very briefly during some testing.

Deploying Control/Data plane

The general OpenEBS installation documentation is actually the same for the early bits of most of the products. cStor has some extra steps (it requires the creation of an extra pool object) but I’ll mostly go through the general steps here. Some of the early setup steps are already done by way of ansible – verifying that iscsi is installed, iscsid is running, etc.

Since I don’t use helm (still haven’t found time to evaluate Helm 3), I’m going to be digging through and installing the components by hand. We’ll be using the kubectl installation method. Here’s the full list of resources you end up with at the end of the day:

$ tree .
.
├── Makefile
├── maya-apiserver.deployment.yaml
├── maya-apiserver.svc.yaml
├── openebs-admission-server.deployment.yaml
├── openebs-localpv-provisioner.deployment.yaml
├── openebs-maya-operator.rbac.yaml
├── openebs-maya-operator.serviceaccount.yaml
├── openebs-ndm-config.configmap.yaml
├── openebs-ndm.ds.yaml
├── openebs-ndm-operator.deployment.yaml
├── openebs.ns.yaml
├── openebs-provisioner.deployment.yaml
└── openebs-snapshot-operator.deployment.yaml

0 directories, 25 files

ndm is the Node Disk Manager, i.e. the thing that discovers and keeps track of the actual physical hardware on every node.

Disabling analytics

Well one thing that really stood out while I was working on splitting these out of the huge openebs-operator.yaml YAML file they want you to download and apply was the opt-out nature of the analytics. Wasn’t too happy to find this:

        # OPENEBS_IO_ENABLE_ANALYTICS if set to true sends anonymous usage
        # events to Google Analytics
        - name: OPENEBS_IO_ENABLE_ANALYTICS
          value: "true"

It’s also in the LocalPV provisioner:

        - name: OPENEBS_IO_ENABLE_ANALYTICS
          value: "false"

A similar comment wasn’t there but I’m not taking any chances. Even if anonymized, I don’t want to send usage information to Google Analytics. OpenEBS should at least set aside some engineering/ops resources to collect their own analytics themselves; privacy-conscious devs/sysadmins are going to turn this off right away just based on where it is going.

Disabling the default storage classes

Another setting I wanted to change was removing the default storage classes:

        # If OPENEBS_IO_CREATE_DEFAULT_STORAGE_CONFIG is false then OpenEBS default
        # storageclass and storagepool will not be created.
        - name: OPENEBS_IO_CREATE_DEFAULT_STORAGE_CONFIG
          value: "false"

I want to manage my storage classes with pre-determined names and replication levels so no need to make any default ones. I’ll need to do a bit more reading into the cStor sparse pools (AKA zfs sparse datasets, since remember cStor is based on uZFS), but this is stuff I should know anyway if I’m going to administer them.

Checking the installation

OK, so making all this stuff is pretty easy – there is some documentation on verifying our installation so let’s do that as well. Looks like verification is just ensuring that the pods are all running well:

$ k get pods
NAME                                         READY   STATUS    RESTARTS   AGE
maya-apiserver-7f969b8db4-cb9tc              1/1     Running   2          44m
openebs-admission-server-78458d9ff6-mx7bn    1/1     Running   0          41m
openebs-localpv-provisioner-d7464d5b-dgfsw   1/1     Running   0          41m
openebs-ndm-jkftf                            1/1     Running   0          44m
openebs-ndm-operator-67876b4dc4-n94x6        1/1     Running   0          44m
openebs-provisioner-c666d6b4-djnpj           1/1     Running   0          41m
openebs-snapshot-operator-749db7b5f5-vq6ff   2/2     Running   0          12m

OK, pretty straightforward! The documentation says to check for StorageClasses next, but since I am going to set up the custom storage classes myself, what I’m going to check instead is that we have 2 BlockDevice custom resources. Remember that I have basically two pieces of storage I want to make available:

  • /dev/nvme0n1p5 - A ~396GB partition on my main drive (which has the OS installed)
  • /dev/nvme1n1 - An empty (formerly software RAIDed) 512GB disk

$ k get blockdevice
NAME                                           NODENAME        SIZE           CLAIMSTATE   STATUS   AGE
blockdevice-8608c47fdaab0450c9d449213c46d7de   all-in-one-01   512110190592   Unclaimed    Active   45m

Ah-ha! Looks like I have only one block device, the whole disk (you can tell by the SIZE). It looks like I might have to tell OpenEBS about the partition manually, but first let’s see what changes have been made to the disks:

root@all-in-one-01 ~ # lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0  477G  0 disk
nvme0n1     259:1    0  477G  0 disk
├─nvme0n1p1 259:2    0   16G  0 part [SWAP]
├─nvme0n1p2 259:3    0    1G  0 part /boot
├─nvme0n1p3 259:4    0   64G  0 part /
├─nvme0n1p4 259:5    0    1K  0 part
└─nvme0n1p5 259:6    0  396G  0 part

So nothing has been changed it looks like – the disks haven’t been partitioned, or labeled, or anything. It looks like OpenEBS doesn’t do anything with the disks until (see that CLAIMSTATE) you create a BlockDeviceClaim object, probably. That’s certainly a nice feature to have! According to the documentation though, NDM should be able to manage partitions:

Currently, NDM out of the box manages disks, partitions, lvm, crypt and other dm devices. If the user need to have blockdevice for other device types like md array or any other unsupported device types, the blockdevice resource can be manually created using the following steps:

So why don’t I see nvme0n1p5? There’s actually a frequently asked question section on NDM with an example of this case! Excellent work by OpenEBS, having this answer ready to go:

By default NDM excludes partitions mounted at /, /boot and /etc/hosts (which is same as the partition at which kubernetes / docker filesystem exists) and the parent disks of those partitions. In the above example /dev/sdb is excluded because of root partitions on that disk. /dev/sda4 contains the docker filesystem, and hence /dev/sda is also excluded.
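
To make the rule in that FAQ answer concrete, here is a minimal sketch of the exclusion logic applied to the lsblk output above. The excluded_disks and visible_devices helpers and the tuple layout are hypothetical illustrations of mine, not NDM's actual code, and the device list is copied by hand from my node.

```python
# Hypothetical sketch of the os-disk-exclude-filter rule, not NDM's real implementation.
# Each entry is (device name, parent disk or None, mount point or None), copied from lsblk.
DEVICES = [
    ("nvme1n1", None, None),
    ("nvme0n1", None, None),
    ("nvme0n1p1", "nvme0n1", "[SWAP]"),
    ("nvme0n1p2", "nvme0n1", "/boot"),
    ("nvme0n1p3", "nvme0n1", "/"),
    ("nvme0n1p4", "nvme0n1", None),
    ("nvme0n1p5", "nvme0n1", None),
]

# Mount points that mark a disk as an "OS disk" per the quoted FAQ answer.
OS_MOUNTPOINTS = {"/", "/boot", "/etc/hosts"}


def excluded_disks(devices, os_mountpoints=OS_MOUNTPOINTS):
    """Return the disks that hold at least one OS-mounted partition."""
    excluded = set()
    for name, parent, mountpoint in devices:
        if mountpoint in os_mountpoints:
            # A mounted whole disk excludes itself, a partition excludes its parent.
            excluded.add(parent or name)
    return excluded


def visible_devices(devices, os_mountpoints=OS_MOUNTPOINTS):
    """Return devices that survive the filter: neither an excluded disk nor on one."""
    excluded = excluded_disks(devices, os_mountpoints)
    return [
        name
        for name, parent, _ in devices
        if name not in excluded and parent not in excluded
    ]


print(visible_devices(DEVICES))
```

Under that rule only nvme1n1 survives, which matches the single BlockDevice that NDM reported, and nvme0n1p5 is dropped along with its parent disk.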

So the existence of the / and /boot partitions caused the disk itself to be excluded. Unfortunately I think the fix for me isn’t as easy as removing the /etc/hosts entry from the os-disk-exclude-filter in the openebs-ndm-config ConfigMap – I want to use a partition on a disk that should be ignored. This means I’ll have to add my own manually-created BlockDevice:

---
apiVersion: openebs.io/v1alpha1
kind: BlockDevice
metadata:
  name: first-disk-partition
  namespace: openebs
  labels:
    kubernetes.io/hostname: all-in-one-01 # TODO: clustered setup needs to have this be different
    ndm.io/managed: "false" # for manual blockdevice creation put false
    ndm.io/blockdevice-type: blockdevice
status:
  claimState: Unclaimed
  state: Active
spec:
  nodeAttributes:
    nodeName: all-in-one-01 # TODO: clustered setup needs to have this be different
  capacity:
    logicalSectorSize: 512
    storage: 425133957120 # TODO: get from `blockdev --getsize64 <device path>`
  details:
    # TODO: obtain this information automatically from `udevadm info`
    deviceType: partition # like disk, partition, lvm, crypt, md
    firmwareRevision: "EXF7201Q"
    model: "SAMSUNG MZVLB512HBJQ-00000"
    serial: "SAMSUNG MZVLB512HBJQ-00000_S4GENX0N425033"
    # compliance: <compliance of disk> #like "SPC-4" # normally get this from smartctl but sometimes it's not there.
    vendor: SAMSUNG
  devlinks:
    - kind: by-path
      path: /dev/disk/by-id/nvme-SAMSUNG_MZVLB512HBJQ-00000_S4GENX0N425033-part5
    - kind: by-path
      path: /dev/disk/by-id/nvme-eui.0025388401b90b26-part5
    - kind: by-path
      path: /dev/disk/by-partuuid/a1b5d104-05
    - kind: by-path
      path: /dev/disk/by-path/pci-0000:01:00.0-nvme-1-part5
    - kind: by-path
      path: /dev/disk/by-uuid/6d734e22-1d33-4d50-8e7e-cf079255f634
  path: /dev/nvme0n1p5 # like /dev/md0
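
Two of the TODOs in that manifest could be closed with a small script. The following is a sketch, assuming the KEY=VALUE output format of `udevadm info -q property`, where DEVLINKS is a space-separated list. The function names are mine, and deriving `kind` from the directory under /dev/disk/ is my own assumption rather than what the manifest above does (it uses by-path everywhere). The sample text is trimmed by hand from the devlinks in the manifest, not captured from a real run.

```python
def parse_udev_properties(text):
    """Parse `udevadm info -q property` style KEY=VALUE lines into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def devlinks_from_properties(props):
    """Turn the space-separated DEVLINKS property into BlockDevice devlink entries."""
    links = []
    for path in sorted(props.get("DEVLINKS", "").split()):
        parts = path.split("/")
        # /dev/disk/by-id/foo -> ["", "dev", "disk", "by-id", "foo"]
        # Assumption: use the directory name as the devlink kind.
        kind = parts[3] if len(parts) > 4 and parts[2] == "disk" else "unknown"
        links.append({"kind": kind, "path": path})
    return links


# Hand-written sample, not real udevadm output.
SAMPLE = """\
DEVNAME=/dev/nvme0n1p5
DEVTYPE=partition
DEVLINKS=/dev/disk/by-partuuid/a1b5d104-05 /dev/disk/by-path/pci-0000:01:00.0-nvme-1-part5
"""

props = parse_udev_properties(SAMPLE)
for link in devlinks_from_properties(props):
    print(link["kind"], link["path"])
```

In a real run the text would come from `udevadm info -q property -n /dev/nvme0n1p5`, and the `storage` value from `blockdev --getsize64`, as the TODO comments say.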

Since I create this right away when the cluster is starting up, I have to do a little bit of work to make sure that I don’t try to make this BlockDevice before its CRD exists:

blockdevice:
    @echo "[info] waiting until blockdevice CRD is installed..."
    @until $(KUBECTL) -n openebs get blockdevice; \
            do echo "trying again in 20 seconds (ctrl+c to cancel)"; \
            sleep 20; \
    done
    $(KUBECTL) apply -f first-disk-partition.blockdevice.yaml

Alternatively, I could have just made the CRDs myself to ensure they exist but I’m OK with this code since it runs all the way at the end and is not too terrible. At some point I’m going to have to fix all those TODOs so that I can more easily adapt this script to a cluster-driven or at least node-name/hard drive type agnostic setup… That’s work for another day. With this done, we now have two BlockDevices just like we want:

$ k get blockdevice
NAME                                           NODENAME        SIZE           CLAIMSTATE   STATUS   AGE
blockdevice-8608c47fdaab0450c9d449213c46d7de   all-in-one-01   512110190592   Unclaimed    Active   26m
first-disk-partition                           all-in-one-01   425133957120   Unclaimed    Active   24s

Setting up the cStor StoragePool and StorageClasses

OK, now that we’ve got our control/data plane and our BlockDevices registered, let’s make a StoragePool out of them! Earlier we turned off the automatic pool creation so we’re going to need to create them manually. Since this is the cStor section, I’ll outline only the cStor-relevant resources here. This is a good place to take a gander at the cStor storage pools documentation.

First up is the disk based StoragePool which spans over both BlockDevices, where all the storage will come from:

disk-pool.storagepool.yaml:

---
apiVersion: openebs.io/v1alpha1
kind: StoragePoolClaim
metadata:
  name: cstor-disk-pool
  namespace: openebs
  annotations:
    cas.openebs.io/config: |
      - name: PoolResourceRequests
        value: |-
            memory: 1Gi
      - name: PoolResourceLimits
        value: |-
            memory: 4Gi
      - name: AuxResourceRequests
        value: |-
            memory: 0.5Gi
            cpu: 100m
            ephemeral-storage: 50Mi
      - name: AuxResourceLimits
        value: |-
            memory: 0.5Gi
            cpu: 100m
spec:
  name: cstor-disk-pool
  type: disk
  poolSpec:
    poolType: mirrored
  blockDevices:
    blockDeviceList:
    - first-disk-partition
    - __PARTITION_NAME__

So everything looks good, except there’s an issue – I can’t make the pool without looking up the block device’s name! Looks like I should just make the second disk’s BlockDevice manually as well, so I can give it a name. And here’s that second disk config:

---
apiVersion: openebs.io/v1alpha1
kind: BlockDevice
metadata:
  name: second-disk
  namespace: openebs
  labels:
    kubernetes.io/hostname: all-in-one-01 # TODO: clustered setup needs to have this be different
    ndm.io/managed: "false" # for manual blockdevice creation put false
    ndm.io/blockdevice-type: blockdevice
status:
  claimState: Unclaimed
  state: Active
spec:
  nodeAttributes:
    nodeName: all-in-one-01 # TODO: clustered setup needs to have this be different
  path: /dev/nvme1n1
  capacity:
    logicalSectorSize: 512
    storage: 512110190592 # TODO: get from `blockdev --getsize64 <device path>`
  details:
    # TODO: obtain this information automatically from `udevadm info`
    deviceType: disk # like disk, partition, lvm, crypt, md
    firmwareRevision: "EXF7201Q"
    model: "SAMSUNG MZVLB512HBJQ-00000"
    serial: "SAMSUNG MZVLB512HBJQ-00000_S4GENX0N425033"
    # compliance: <compliance of disk> #like "SPC-4" # normally get this from smartctl but sometimes it's not there.
    vendor: SAMSUNG
  devlinks: # udevadm info -q property -n <device path>
    - kind: by-path
      path: /dev/disk/by-id/nvme-SAMSUNG_MZVLB512HBJQ-00000_S4GENX0N425034
    - kind: by-path
      path: /dev/disk/by-path/pci-0000:07:00.0-nvme-1
    - kind: by-path
      path: /dev/disk/by-id/nvme-eui.0025388401b90b27

OK, now let’s make that StoragePoolClaim and see if it comes up:

$ k get storagepoolclaim
NAME              AGE
cstor-disk-pool   4m8s
$ k get storagepool
No resources found
$ k get bd
NAME                   NODENAME        SIZE           CLAIMSTATE   STATUS   AGE
first-disk-partition   all-in-one-01   425133957120   Unclaimed    Active   31m
second-disk            all-in-one-01   512110190592   Unclaimed    Active   8m29s

Uh oh, something’s gone wrong – the claim got created but the pool did not, and the BlockDevices are still unclaimed!

Before I get to debugging this though, I want to point out something – I’ve chosen to make a disk based pool, but at this point I think I might actually like to use a sparse pool instead – OpenEBS cStor “sparse” pools are not the same as ZFS sparse pools! See their warning:

Note: Starting with 0.9, cStor Sparse pool and its Storage Class are not created by default. If you need to enable the cStor Sparse pool for development or test environments, you should have the above Default Storage Configuration enabled as well as cStor sparse pool enabled using the instructions mentioned here.

So in production, it looks like you’re going to want to be using disk based pools. Coming back to the StoragePoolClaim problem, k describe spc returns no events, so there’s nothing to analyze there. Oh but wait, I’ve actually checked the wrong place – there is a CStorPool object I should be checking instead:

$ k get csp
NAME                   ALLOCATED   FREE   CAPACITY   STATUS               READONLY   TYPE       AGE
cstor-disk-pool-0omx                                 PoolCreationFailed   false      mirrored   5m16s

Well there’s some nice feedback – storagepool supposedly includes cstorstoragepool but I guess if the creation failed then it can’t! Let’s see what k describe csp cstor-disk-pool-0omx says:

Events:
  Type     Reason      Age                    From       Message
  ----     ------      ----                   ----       -------
  Normal   Synced      6m54s                  CStorPool  Received Resource create event
  Normal   Synced      6m54s (x2 over 6m54s)  CStorPool  Received Resource modify event
  Warning  FailCreate  24s (x14 over 6m54s)   CStorPool  Pool creation failed zpool create command failed error: invalid vdev specification
use '-f' to override the following errors:
mirror contains devices of different sizes
: exit status 1

Well, we’re working with ZFS all right! The disks in the pool have to be similarly sized! One thing that’s been cut out of Part 1 is all the experimentation I did to make a zpool just to realize that Ceph on top of ZFS was silly. Back then I actually devised a fairly easy way to create similarly sized disks – copying the partition table from one disk to the other:

root@all-in-one-01 ~ # lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0  477G  0 disk
nvme0n1     259:1    0  477G  0 disk
├─nvme0n1p1 259:2    0   16G  0 part [SWAP]
├─nvme0n1p2 259:3    0    1G  0 part /boot
├─nvme0n1p3 259:4    0   64G  0 part /
├─nvme0n1p4 259:5    0    1K  0 part
└─nvme0n1p5 259:6    0  396G  0 part

root@all-in-one-01 ~ # sgdisk -R /dev/nvme1n1 /dev/nvme0n1

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory.
***************************************************************

The operation has completed successfully.

root@all-in-one-01 ~ # lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0  477G  0 disk
├─nvme1n1p1 259:11   0   16G  0 part
├─nvme1n1p2 259:12   0    1G  0 part
├─nvme1n1p3 259:13   0   64G  0 part
└─nvme1n1p5 259:14   0  396G  0 part
nvme0n1     259:1    0  477G  0 disk
├─nvme0n1p1 259:2    0   16G  0 part [SWAP]
├─nvme0n1p2 259:3    0    1G  0 part /boot
├─nvme0n1p3 259:4    0   64G  0 part /
├─nvme0n1p4 259:5    0    1K  0 part
└─nvme0n1p5 259:6    0  396G  0 part

Though terribly inefficient, I can use the same partition off of both disks and they are guaranteed to have the same size, so ZFS will be happy! Good thing OpenEBS hasn’t done anything with my disks yet since it’s back to the drawing board – I’ll need to update the ansible scripts to copy the partition and change the BlockDevices I’m creating. For now, since I’ve done it manually there’s no harm in a little check after deleting the StoragePoolClaim and re-doing my BlockDevices.
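
A cheap pre-flight check would have caught that zpool create failure before the StoragePoolClaim was ever applied. This is a sketch only: mirror_size_problems is a hypothetical helper of mine, and the byte counts are the two SIZE values from the k get bd output above.

```python
def mirror_size_problems(block_devices):
    """Return a list of complaints if the devices in a mirror differ in size.

    `block_devices` maps a BlockDevice name to its capacity in bytes.
    """
    if len(block_devices) < 2:
        return ["a mirror needs at least two devices"]
    smallest = min(block_devices.values())
    problems = []
    for name, size in sorted(block_devices.items()):
        if size != smallest:
            extra_gib = (size - smallest) / 2**30
            problems.append(
                f"{name} is {extra_gib:.1f} GiB larger than the smallest device"
            )
    return problems


# The two BlockDevices as originally registered: a partition and a whole disk.
original = {"first-disk-partition": 425133957120, "second-disk": 512110190592}
print(mirror_size_problems(original))
```

An empty list means the mirror members match; once both BlockDevices point at the identically sized p5 partitions the check passes.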

Along with making sure to right-size the disks there’s one more problem – if you have disks that you want to manage manually, you must exclude them by path. If you don’t, NDM will compete with you, and be unable to use the disks because they could be used by other pools (your StoragePoolClaim will never get made). It’s documented in the “NDM related” section of the troubleshooting docs. You need to modify the path-filter configuration in the openebs-ndm-config ConfigMap like so:

      - key: path-filter
        name: path filter
        state: true
        include: ""
        exclude: "/dev/loop,/dev/fd0,/dev/sr0,/dev/ram,/dev/dm-,/dev/md,/dev/rbd,/dev/zd,/dev/nvme0n1p5,/dev/nvme1n1"
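
One thing worth keeping in mind about that exclude list: entries like /dev/loop and /dev/dm- only make sense if they are matched as partial paths rather than exact device names. The sketch below models that, assuming a plain substring match. I have not verified that rule against the NDM source, so treat it as an assumption, and is_excluded is a hypothetical helper.

```python
# The exclude value from the path-filter above, split across two lines for readability.
EXCLUDE = (
    "/dev/loop,/dev/fd0,/dev/sr0,/dev/ram,/dev/dm-,/dev/md,"
    "/dev/rbd,/dev/zd,/dev/nvme0n1p5,/dev/nvme1n1"
)


def is_excluded(device_path, exclude=EXCLUDE):
    """Assumed rule: a device is excluded if any entry is a substring of its path."""
    entries = [entry for entry in exclude.split(",") if entry]
    return any(entry in device_path for entry in entries)


for path in ("/dev/loop3", "/dev/nvme0n1p5", "/dev/nvme1n1p5", "/dev/nvme0n1p3"):
    print(path, is_excluded(path))
```

If the assumption holds, the single /dev/nvme1n1 entry also covers its partitions such as /dev/nvme1n1p5, so there is no need to list them separately.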

Assuming at this point you see exactly the list of BlockDevices you want (make sure there aren’t two with identical size), you shouldn’t run into any problems. After rightsizing the partitions and recreating everything, I can try again at creating the StoragePoolClaim, and I can see the CStorPool I want:

$ k get csp
NAME                   ALLOCATED   FREE   CAPACITY   STATUS    READONLY   TYPE       AGE
cstor-disk-pool-bpe9   140K        394G   394G       Healthy   false      mirrored   33m

That capacity is what we’d expect – inefficient, but it reflects the total mirrored capacity. Finally, we can make some StorageClasses to represent some use cases – I’ll show the replicated StorageClass below:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: openebs.io/provisioner-iscsi
metadata:
  name: openebs-cstor-replicated
  annotations:
    openebs.io/cas-type: cstor
    cas.openebs.io/config: |
      - name: StoragePoolClaim
        value: cstor-disk-pool
      - name: ReplicaCount
        value: "2" # TODO: clustered setup we could have a multi-node pool with >2 disks

Once all the storage classes are created let’s make sure they show up again:

$ k get sc
NAME                        PROVISIONER                                                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
openebs-cstor-replicated    openebs.io/provisioner-iscsi                               Delete          Immediate           false                  83s
openebs-cstor-single        openebs.io/provisioner-iscsi                               Delete          Immediate           false                  85s
openebs-snapshot-promoter   volumesnapshot.external-storage.k8s.io/snapshot-promoter   Delete          Immediate           false                  3h33m

Trying it out: basic persistence test

OK, now that we’ve got cStor all set up, let’s make the usual test PVC + Pod combinations. Here’s an example for the non-replicated (“single”) StorageClass:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-single
  namespace: default
spec:
  storageClassName: openebs-cstor-single
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-single
  namespace: default
spec:
  containers:
    - name: alpine
      image: alpine
      command: ["ash", "-c", "while true; do sleep infinity; done"]
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 0.5
          memory: "512Mi"
        limits:
          cpu: 0.5
          memory: "512Mi"
      volumeMounts:
        - mountPath: /var/data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-single

The PVC, resulting PV and Pod get created nice and easy after the work we’ve done:

$ k get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
test-single   Bound    pvc-d92c89e0-899b-4038-9073-ffc1e31dc5c9   1Gi        RWO            openebs-cstor-single   84s
$ k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS           REASON   AGE
pvc-d92c89e0-899b-4038-9073-ffc1e31dc5c9   1Gi        RWO            Delete           Bound    default/test-single   openebs-cstor-single            85s
$ k get pod
NAME          READY   STATUS    RESTARTS   AGE
test-single   1/1     Running   0          85s

And here’s the output of the usual basic persistence test.

$ k exec -it test-single -n default -- /bin/ash
/ # echo "this is a test file" > /var/data/test-file.txt
/ #

$ k delete pod test-single -n default
pod "test-single" deleted

$ k get pv -n default
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS           REASON   AGE
pvc-d92c89e0-899b-4038-9073-ffc1e31dc5c9   1Gi        RWO            Delete           Bound    default/test-single   openebs-cstor-single            3m9s

$ make test-single
make[1]: Entering directory '/home/mrman/code/foss/k8s-storage-provider-benchmarks/kubernetes/openebs/cstor'
kubectl --kubeconfig=/home/mrman/code/foss/k8s-storage-provider-benchmarks/ansible/output/all-in-one-01.k8s.storage-benchmarks.experiments.vadosware.io/var/lib/k0s/pki/admin.conf apply -f test-single.pvc.yaml
persistentvolumeclaim/test-single unchanged
kubectl --kubeconfig=/home/mrman/code/foss/k8s-storage-provider-benchmarks/ansible/output/all-in-one-01.k8s.storage-benchmarks.experiments.vadosware.io/var/lib/k0s/pki/admin.conf apply -f test-single.pod.yaml
pod/test-single created
make[1]: Leaving directory '/home/mrman/code/foss/k8s-storage-provider-benchmarks/kubernetes/openebs/cstor'

$ k exec -it test-single -n default -- /bin/ash
/ # cat /var/data/test-file.txt
this is a test file

Perfect – the PVC is definitely holding data between pod restarts. To keep this post from getting too long, this will be the only time I’ll actually print the output of the basic persistence test. From now on, I’ll just refer to having run it.

OpenEBS Jiva - STORAGE_PROVIDER=openebs-jiva

If I remember correctly Jiva is the oldest (?) implementation that OpenEBS has worked on. It’s based on Longhorn though they differ in some places (see the note about modified architectural changes). In general, I think this is probably the easiest possible way to use storage on Kubernetes today – having writes go straight to other pods and then down to sparse files as a backend and shipping all the data around with iSCSI is pretty brilliant. Jiva is what I have deployed to run this blog right now and it’s been purring along for a couple years at this point (so much so that I can’t upgrade to the newest version easily anymore!).

Pros/Cons

Here are some Pros for Jiva as far as I’m concerned:

  • Easy to install and understand
  • Well supported by OpenEBS (also indirectly by Rancher)
  • Synchronous replication means very easy, obvious failover model (if a storage replica goes down, all the others are available for hotswapping)

Here are some points against Jiva:

Jiva is the incumbent so most of this testing is really for me to find something that I am fine using instead of OpenEBS Jiva.

I had to update my ansible automation to partition & mount drives for Jiva to use… It’s a bit weird that it can’t use raw disks (feels like it could handle the partitioning and filesystem creation). Right f

Setting up the StoragePool and StorageClasses

NOTE: Since the control/data plane setup for Jiva is identical for the most part to cStor I’ve excluded it (and the code is reused anyway)

The StoragePool for Jiva is really easy to set up:

---
apiVersion: openebs.io/v1alpha1
kind: StoragePool
metadata:
  name: second-disk
  namespace: openebs
  type: hostdir
spec:
  path: "/second-disk"

One thing I realized while setting this up is that Jiva seems to only be able to use one disk for StoragePools! There was an issue about it a long time ago, it turns out. I’m somewhat surprised because in my own head I don’t remember it working like this, but now that I think about it maybe I picked this in the past due to the fact that I left the software RAID in place, so it was actually fine? That does leave me with some questions though, like how storage pools on other nodes are picked – do all the mount points have to be named the same? I filed a ticket since I think this is something that should be mentioned in the documentation.

In my case where there is a part of a disk and a whole ‘nother disk to attach, maybe it makes the most sense to combine the first and second disks into an LVM logical volume or maybe use software RAID (mdadm) or something. Looks like ZFS is off the table though. It looks like btrfs does support extent mapping so maybe that’s the way to go? I’m going to leave that for another day though.

Making the StorageClass was similarly easy:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: openebs.io/provisioner-iscsi
metadata:
  name: openebs-jiva-d2-replicated
  annotations:
    openebs.io/cas-type: jiva
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "2"
      - name: StoragePool
        value: second-disk

It’s a bit annoying that I have to have 2 sets of storage classes, one for each disk – openebs-jiva-d1-[single|replicated] and openebs-jiva-d2-[single|replicated] – but I guess things could be worse.
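
Since the four classes only differ in pool name and replica count, they could be generated rather than written by hand. A sketch, with the assumptions flagged: the first-disk pool name and a ReplicaCount of "1" for the single classes are my placeholders for the files not shown here, and the output is JSON because kubectl apply -f accepts it and the Python standard library has no YAML writer.

```python
import json

POOLS = {"d1": "first-disk", "d2": "second-disk"}  # "first-disk" is an assumed name
REPLICAS = {"single": "1", "replicated": "2"}  # "1" for single is assumed


def jiva_storage_class(disk, mode):
    """Build one Jiva StorageClass manifest as a plain dict."""
    config = (
        "- name: ReplicaCount\n"
        f'  value: "{REPLICAS[mode]}"\n'
        "- name: StoragePool\n"
        f"  value: {POOLS[disk]}\n"
    )
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "provisioner": "openebs.io/provisioner-iscsi",
        "metadata": {
            "name": f"openebs-jiva-{disk}-{mode}",
            "annotations": {
                "openebs.io/cas-type": "jiva",
                "cas.openebs.io/config": config,
            },
        },
    }


# One manifest per disk/mode combination, wrapped in a List for a single apply.
manifests = [jiva_storage_class(d, m) for d in sorted(POOLS) for m in sorted(REPLICAS)]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The printed document could be piped straight into kubectl apply -f -, which keeps the naming scheme in one place if a third disk ever shows up.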

With the StoragePool and the StorageClasses set up, we’re free to make our PersistentVolumeClaims, PersistentVolumes and Pods. I won’t include the basic persistence test here but it works quite easily.

OpenEBS LocalPV hostpath - STORAGE_PROVIDER=openebs-localpv-hostpath

We’ll never know how many people rely on hostPath volumes in production, but theoretically they should be avoided in production. If you’re going to use them anyway, it’s nice to at least be able to dynamically provision them.

Pros/Cons

I won’t spend too much time rehashing all of the pros/cons of hostPath volumes here but in general:

Pros:

  • Simple to understand
  • No setup required (usually)
  • Dynamic provisioning
  • Management is somewhat possible with OpenEBS there to intercept the lifecycle (I don’t think this leverage is utilized much though)

Cons:

  • Monitoring volume usage is hard
  • Constraining volume size is hard
  • Can’t change the filesystem, usually matches whatever is on the node

Setting up StoragePool and StorageClasses

NOTE: Since the control/data plane setup for LocalPV hostPath is identical for the most part to cStor I’ve excluded it (and the code is reused anyway)

As usual OpenEBS does have a good documentation page on this, so give that a read for a full guide. There’s actually no setup outside of the common OpenEBS setup since the openebs-localpv-provisioner Deployment is included. The “volumes” (folders) will be created under /var/openebs/local on the node that the PVCs are provisioned on. We do still need to make a StorageClass, so here’s what that looks like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: openebs.io/local
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
metadata:
  name: local-hostpath
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: hostpath

With this StorageClass in place we can go make the PVC and Pod like we usually do (which I won’t go into again) and we’re off to the races, nice and simple. This will very likely serve as the baseline against which to test performance. I wonder if there’s a point in even making “real” hostPath pods to test with when we get to it.

OpenEBS LocalPV device - STORAGE_PROVIDER=openebs-localpv-device

LocalPV is very similar to hostPath but better – it uses locally provisioned bind-mounted loopback devices which essentially interpret files on the filesystem as disks. It’s a clever idea and one that I had myself very early on – seemed like an obvious easy-to-make local provisioner.

Pros/Cons

Some pros of using LocalPV devices:

  • Volume usage can be controlled
  • Volume size can be controlled
  • Different “disk” files can use different filesystems, with different configurations
  • Can sometimes perform better than raw disk due to the smaller disk sizes

There is one potentially large issue with LocalPV though:

Performance looks to be halved for disks that sync on every write, and there is a whole class of programs that try to sync very often (databases), so that’s worth watching out for.
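
To be clear about what “sync on every write” means, here is a small sketch contrasting buffered writes with writes that call fsync after every record, which is roughly the pattern databases use for durability. It only illustrates the access pattern and is not the benchmark used later in this series; the function name, record size and counts are arbitrary choices of mine.

```python
import os
import tempfile
import time


def write_records(path, count, record=b"x" * 4096, sync_every_write=False):
    """Write `count` records to `path`, optionally fsync-ing after each one.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(count):
            handle.write(record)
            if sync_every_write:
                handle.flush()  # push Python's buffer to the OS
                os.fsync(handle.fileno())  # ask the OS to push it to the device
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    buffered = write_records(os.path.join(tmp, "buffered.bin"), 200)
    synced = write_records(os.path.join(tmp, "synced.bin"), 200, sync_every_write=True)
    print(f"buffered: {buffered:.4f}s, fsync per write: {synced:.4f}s")
```

Pointing the temporary directory at a mounted volume from each StorageClass gives a rough feel for how much the sync path costs on that provider.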

Setting up the StoragePool and StorageClasses

NOTE: Since the control/data plane setup for LocalPV device is identical for the most part to cStor I’ve excluded it (and the code is reused anyway)

Same as with hostPath – there’s great documentation, and there’s not much setup to do outside of making the StorageClass. There are two choices of underlying filesystem to pick from though, ext4 and xfs – I wonder whether it can use any installed filesystem, btrfs included. Here’s what the StorageClass looks like:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: openebs.io/local
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
metadata:
  name: openebs-localpv-device
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: device

I ran into some issues while trying to reuse a cluster where openebs-localpv-hostpath was installed though:

2f-9c67-a23e4c87726a
2021-04-07T10:04:19.272Z        ERROR   app/provisioner_blockdevice.go:54               {"eventcode": "local.pv.provision.failure", "msg": "Failed to provision Local PV", "rname": "pvc-b4e9930d-7c4f-482f-9c67-a23e4c87726a", "reason": "Block device initialization failed", "storagetype": "device"}
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).ProvisionBlockDevice
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner_blockdevice.go:54
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Provision
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:131
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).provisionClaimOperation
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1280
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaim
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1019
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaimHandler
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:988
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem.func1
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:895
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:917
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).runClaimWorker
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:869
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88

Weirdly enough, I’m not sure these two actually work well/coexist together – you’d think they did, but it’s a bit weird that I’m getting an error at all. There’s at least one block device (who knows why there aren’t two):

$ k get bd
NAME          NODENAME        SIZE           CLAIMSTATE   STATUS   AGE
second-disk   all-in-one-01   512110190592   Claimed      Active   7m43s

After a good ol’ hard refresh things… got worse:

$ k logs -f openebs-localpv-provisioner-d7464d5b-dqw8x
I0407 11:24:47.433831       1 start.go:70] Starting Provisioner...
I0407 11:24:47.445732       1 start.go:132] Leader election enabled for localpv-provisioner
I0407 11:24:47.446084       1 leaderelection.go:242] attempting to acquire leader lease  openebs/openebs.io-local...
I0407 11:24:47.450595       1 leaderelection.go:252] successfully acquired lease openebs/openebs.io-local
I0407 11:24:47.450667       1 controller.go:780] Starting provisioner controller openebs.io/local_openebs-localpv-provisioner-d7464d5b-dqw8x_5b405baf-42da-4cbe-9249-6fdd835d80e1!
I0407 11:24:47.450692       1 event.go:281] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"openebs", Name:"openebs.io-local", UID:"137cd575-e424-4750-8da0-ccb8464c8087", APIVersion:"v1", ResourceVersion:"1148", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openebs-localpv-provisioner-d7464d5b-dqw8x_5b405baf-42da-4cbe-9249-6fdd835d80e1 became leader
I0407 11:24:47.550786       1 controller.go:829] Started provisioner controller openebs.io/local_openebs-localpv-provisioner-d7464d5b-dqw8x_5b405baf-42da-4cbe-9249-6fdd835d80e1!

---- After a device PV gets made ----

I0407 11:28:03.568607       1 controller.go:1211] provision "default/test-single" class "openebs-localpv-device": started
I0407 11:28:03.574190       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-single", UID:"600fb03a-ecc0-44a3-ba48-6c8058a096c1", APIVersion:"v1", ResourceVersion:"1796", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-single"
I0407 11:28:03.580632       1 helper_blockdevice.go:175] Getting Block Device Path from BDC bdc-pvc-600fb03a-ecc0-44a3-ba48-6c8058a096c1
E0407 11:28:08.589110       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 88 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1667180, 0xc0006a6380)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1667180, 0xc0006a6380)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).getBlockDevicePath(0xc00032c000, 0xc0002ff540, 0x0, 0x7, 0xc00012f488, 0x1743d01, 0x8, 0xc00028d838)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/helper_blockdevice.go:212 +0x751
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).ProvisionBlockDevice(0xc00032c000, 0xc00044ef00, 0xc0000445d0, 0x28, 0xc000144fc0, 0xc000324000, 0xc0002cc280, 0x6, 0x174dd96, 0x10)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner_blockdevice.go:51 +0x2b1
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Provision(0xc00032c000, 0xc00044ef00, 0xc0000445d0, 0x28, 0xc000144fc0, 0xc000324000, 0xc, 0xc0002ce140, 0x4b)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:131 +0x610
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).provisionClaimOperation(0xc00011c6c0, 0xc000144fc0, 0x252ee00, 0x0, 0x0, 0xc0007385a0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1280 +0x1594
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaim(0xc00011c6c0, 0x1725580, 0xc000144fc0, 0xc0000444b0, 0x24)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1019 +0xd1
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaimHandler(0xc00011c6c0, 0xc0000444b0, 0x24, 0x413c33, 0xc000861cf8)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:988 +0xb3
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem.func1(0xc00011c6c0, 0x14eafc0, 0xc000714020, 0x0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:895 +0xe0
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem(0xc00011c6c0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:917 +0x53
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).runClaimWorker(0xc00011c6c0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:869 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00044b650)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00044b650, 0x3b9aca00, 0x0, 0x1, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00044b650, 0x3b9aca00, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).Run.func1
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:825 +0x42f
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 88 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1667180, 0xc0006a6380)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).getBlockDevicePath(0xc00032c000, 0xc0002ff540, 0x0, 0x7, 0xc00012f488, 0x1743d01, 0x8, 0xc00028d838)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/helper_blockdevice.go:212 +0x751
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).ProvisionBlockDevice(0xc00032c000, 0xc00044ef00, 0xc0000445d0, 0x28, 0xc000144fc0, 0xc000324000, 0xc0002cc280, 0x6, 0x174dd96, 0x10)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner_blockdevice.go:51 +0x2b1
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Provision(0xc00032c000, 0xc00044ef00, 0xc0000445d0, 0x28, 0xc000144fc0, 0xc000324000, 0xc, 0xc0002ce140, 0x4b)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:131 +0x610
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).provisionClaimOperation(0xc00011c6c0, 0xc000144fc0, 0x252ee00, 0x0, 0x0, 0xc0007385a0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1280 +0x1594
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaim(0xc00011c6c0, 0x1725580, 0xc000144fc0, 0xc0000444b0, 0x24)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:1019 +0xd1
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).syncClaimHandler(0xc00011c6c0, 0xc0000444b0, 0x24, 0x413c33, 0xc000861cf8)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:988 +0xb3
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem.func1(0xc00011c6c0, 0x14eafc0, 0xc000714020, 0x0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:895 +0xe0
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).processNextClaimWorkItem(0xc00011c6c0, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:917 +0x53
sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).runClaimWorker(0xc00011c6c0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:869 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00044b650)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00044b650, 0x3b9aca00, 0x0, 0x1, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00044b650, 0x3b9aca00, 0x0)
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/sig-storage-lib-external-provisioner/controller.(*ProvisionController).Run.func1
        /go/src/github.com/openebs/dynamic-localpv-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:825 +0x42f
runtime: note: your Linux kernel may be buggy
runtime: note: see https://golang.org/wiki/LinuxKernelSignalVectorBug
runtime: note: mlock workaround for kernel bug failed with errno 12

So at this point, I’m not even going to try to look into this: I’m running a very new kernel and I’m unlikely to use LocalPV device seriously in production. In fact, I’m going to take out LocalPV device altogether.

OpenEBS LocalPV ZFS - STORAGE_PROVIDER=openebs-localpv-zfs

This is a particularly appealing choice for running databases on, since ZFS is pretty awesome for rock solid data reliability and protection. The idea of being able to just set up ZFS (a 2-drive mirror) to replace the software RAID that is normally installed on Hetzner, and using OpenEBS to provision out of the zpools for local-only storage, is awesome. The only thing missing is a way to replicate writes across nodes for high availability. Of course, if you run software with application-level replication features (ex. postgres w/ logical replication, streaming replication, etc), you can mostly just take backups (from the “master” node) and be OK. Anyway, I think this option flies under the radar for many people but is actually really compelling.

Pros/Cons

This is an incomplete list of pros to ZFS:

  • Copy-on-Write
  • Checksumming + Self healing
  • Recording of error rates for drives
  • Compression
  • Instant backups and snapshots
  • Encryption
  • Flexibility in defining data sets, redundancy scheme(s)

And here are some cons, some influenced by my particular platform/settings:

  • Wastes ~100GB because disks have to be the same size in the mirrored layout
  • Likely to be slower than raw disk
  • More knobs to tune
  • No QoS support (in ZFS on Linux anyway)

ZFS is a pretty rock solid file system and it’s exciting to get a chance to use it easily in clusters equipped with the right operator.

Deploying Control/Data plane

The early “common” code for setting up OpenEBS actually doesn’t apply at all to the LocalPVs powered by ZFS. I had to make some changes to the ansible code to make sure that a zpool was created though:

    - name: Create a ZFS mirrored pool from first and second partition
      tags: [ "drive-partition-prep" ]
      # when: storage_plugin in target_plugins and nvme_disk_0.stat.exists and nvme_disk_1.stat.exists and nvme_disk_0_partition_5.stat.exists and nvme_disk_1_partition_5.stat.exists
      command: |
        zpool create tank mirror /dev/nvme0n1p5 /dev/nvme1n1p5
      vars:
        target_plugins:
          - openebs-localpv-zfs
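
One thing to watch with a bare zpool create is that it fails if the pool already exists, so re-running the playbook gets noisy. A minimal sketch of an idempotent variant, assuming the same pool name (tank) and partitions as above; the register name zpool_tank is just something I picked for illustration:

```yaml
- name: Check whether the tank zpool already exists
  # zpool list exits non-zero when the named pool is missing
  ansible.builtin.command: zpool list -H -o name tank
  register: zpool_tank  # hypothetical variable name
  changed_when: false
  failed_when: false

- name: Create the ZFS mirrored pool only when it is missing
  ansible.builtin.command: zpool create tank mirror /dev/nvme0n1p5 /dev/nvme1n1p5
  when: zpool_tank.rc != 0
```

The first task never reports a change or a failure; it only records the exit code so that the second task can be skipped on reruns.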

As far as the Kubernetes resources go, I did take some liberties with the resources before applying them. One change I made was moving the resources to the openebs namespace. There’s a warning on the repo that goes like this:

You have access to install RBAC components into kube-system namespace. The OpenEBS ZFS driver components are installed in kube-system namespace to allow them to be flagged as system critical components.

Well luckily for me this restriction was flagged as a bug in 2018, and the restriction was lifted by v1.17. The documentation on guaranteed scheduling has also been updated, so I just went around setting priorityClassName on the structurally important bits. Here’s the full list of resources I needed:

$ tree .
.
├── Makefile
├── openebs-localpv-zfs-default-ext4.storageclass.yaml
├── openebs-localpv-zfs-default.storageclass.yaml
├── openebs-zfs-bin.configmap.yaml
├── openebs-zfs-controller-sa.rbac.yaml
├── openebs-zfs-controller-sa.serviceaccount.yaml
├── openebs-zfs-controller.statefulset.yaml
├── openebs-zfs-csi-driver.csidriver.yaml
├── openebs-zfs-node.ds.yaml
├── openebs-zfs-node-sa.rbac.yaml
├── openebs-zfs-node-sa.serviceaccount.yaml
├── volumesnapshotclass.crd.yaml
├── volumesnapshotcontents.crd.yaml
├── volumesnapshot.crd.yaml
├── zfsbackup.crd.yaml
├── zfsrestore.crd.yaml
├── zfssnapshot.crd.yaml
└── zfsvolume.crd.yaml

0 directories, 18 files

So ~17 distinct resources required, not a huge amount yet not a small amount either. If we don’t count the CRDs and StorageClasses then it’s even fewer, which is nice. Structurally the important components are the ZFS controller (openebs-zfs-controller.statefulset.yaml), ZFS per-node DaemonSet (openebs-zfs-node.ds.yaml) and the CSI driver (openebs-zfs-csi-driver.csidriver.yaml).
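
To make the priorityClassName bit concrete, here is roughly what it looks like on the two structurally important workloads once they live in the openebs namespace. This is a fragment and not a complete manifest (selectors, containers and so on are left out), the resource names are guessed from the file names above, and using the built-in system-node-critical / system-cluster-critical classes is my assumption about a sensible choice rather than something the upstream manifests are guaranteed to do:

```yaml
---
# Fragment of openebs-zfs-node.ds.yaml (everything else omitted)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: openebs-zfs-node
  namespace: openebs
spec:
  template:
    spec:
      # built-in class intended for per-node critical pods
      priorityClassName: system-node-critical
---
# Fragment of openebs-zfs-controller.statefulset.yaml (everything else omitted)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: openebs-zfs-controller
  namespace: openebs
spec:
  template:
    spec:
      # built-in class intended for cluster-wide critical pods
      priorityClassName: system-cluster-critical
```

If you would rather not lean on the built-in classes, a custom PriorityClass with a high value does the same job and keeps the choice explicit.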

Setting up the StorageClasses

So the storage class looks like this:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-localpv-zfs-default-ext4
provisioner: zfs.csi.openebs.io
parameters:
  recordsize: "16k"
  compression: "lz4"
  atime: "off"
  dedup: "off"
  fstype: "ext4"
  logbias: "throughput"
  poolname: "tank"
  xattr: "sa"

All the ZFS settings that I mentioned in part 2 that might be good are in there, but there’s one interesting thing – the fstype is specifiable! This particular storage class will make ext4 the filesystem, but the default (represented by openebs-localpv-zfs-default) is actually to give out ZFS-formatted PersistentVolumes.

With the StorageClasses made, we’re free to make our PersistentVolumeClaims, PersistentVolumes and Pods. Remember, since the replication is happening at the ZFS level (the mirrored tank pool), I don’t have to make any *-replicated.*.yaml pods/PVCs – every volume lives on the mirror and gets disk-level durability/availability (though if one disk goes down, we’re in a super dangerous but working position). ZFS also doesn’t do synchronous replication across nodes (and it’s up to you to set up asynchronous replication via zfs send/zfs recv), so there’s no node-level high availability built in.
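
For reference, a minimal sketch of the kind of PVC + Pod pair I mean. The test-single name, the default namespace and the /var/data mount path match what shows up in the output in this section; the image tag, container name and 1Gi size are assumptions for illustration:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-single
  namespace: default
spec:
  storageClassName: openebs-localpv-zfs-default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi  # assumed size
---
apiVersion: v1
kind: Pod
metadata:
  name: test-single
  namespace: default
spec:
  containers:
    - name: alpine  # assumed container name
      image: alpine:3.13  # assumed image tag
      # keep the container alive so we can exec in and poke at the volume
      command: ["/bin/ash", "-c", "while true; do sleep 3600; done"]
      volumeMounts:
        - name: data
          mountPath: /var/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-single
```

Swapping storageClassName to openebs-localpv-zfs-default-ext4 is all it should take to get an ext4-formatted volume instead.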

One thing I ran into was the need for topologies to be specified. When I first created a Pod + PVC I saw the following events:

  Type     Reason                Age                  From                                                                              Message
  ----     ------                ----                 ----                                                                              -------
  Normal   Provisioning          37s (x9 over 4m50s)  zfs.csi.openebs.io_openebs-zfs-controller-0_0c369586-c312-4551-a297-88fe247b0c79  External provisioner is provisioning volume for claim "default/test-single"
  Warning  ProvisioningFailed    37s (x9 over 4m50s)  zfs.csi.openebs.io_openebs-zfs-controller-0_0c369586-c312-4551-a297-88fe247b0c79  failed to provision volume with StorageClass "openebs-localpv-zfs-default": error generating accessibility requirements: no available topology found
  Normal   ExternalProvisioning  5s (x21 over 4m50s)  persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "zfs.csi.openebs.io" or manually created by system administrator

So the PVC couldn’t be created because there was no available topology. Looking at the docs it didn’t seem like the topology options were required, but they are:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-localpv-zfs-default
provisioner: zfs.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
parameters:
  recordsize: "16k"
  compression: "lz4"
  atime: "off"
  dedup: "off"
  fstype: "zfs"
  logbias: "throughput"
  poolname: "tank"
  xattr: "sa"
allowedTopologies:
  - matchLabelExpressions:
      - key: zfs-support
        values:
          - "yes"

I guess this makes sense since not every node will necessarily have ZFS tools installed, so I’d rather do this than disable --strict-topology on the controller. Of course we’ll have to label the node:

$ k label node all-in-one-01 zfs-support=yes
node/all-in-one-01 labeled
$ k get nodes --show-labels
NAME            STATUS   ROLES    AGE   VERSION        LABELS
all-in-one-01   Ready    <none>   54m   v1.20.5-k0s1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=all-in-one-01,kubernetes.io/os=linux,zfs-support=yes

OK great, except this didn’t work! The error is different now, but there’s still something wrong:

Events:
  Type     Reason                Age               From                                                                              Message
  ----     ------                ----              ----                                                                              -------
  Normal   WaitForFirstConsumer  15s               persistentvolume-controller                                                       waiting for first consumer to be created before binding
  Normal   Provisioning          6s (x4 over 13s)  zfs.csi.openebs.io_openebs-zfs-controller-0_0c369586-c312-4551-a297-88fe247b0c79  External provisioner is provisioning volume for claim "default/test-single"
  Warning  ProvisioningFailed    6s (x4 over 13s)  zfs.csi.openebs.io_openebs-zfs-controller-0_0c369586-c312-4551-a297-88fe247b0c79  failed to provision volume with StorageClass "openebs-localpv-zfs-default": error generating accessibility requirements: no topology key found on CSINode all-in-one-01
  Normal   ExternalProvisioning  5s (x3 over 13s)  persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "zfs.csi.openebs.io" or manually created by system administrator

Has the CSINode not properly taken in the Node labels? It looks like editing the node after the fact (maybe without a restart) wasn’t going to update the CSINode properly, so I went ahead and did a full reset… which didn’t work. Looks like it’s time for more digging!

The first thing that sticks out is that the CSINode that matches my node (all-in-one-01) has no drivers:

$ k get csinode
NAME            DRIVERS   AGE
all-in-one-01   0         9h

That’s a bit bizarre! Luckily for me there’s an issue filed that shows what’s wrong – it’s our old friend /var/lib/kubelet! I added another comment to the issue in k0s and called it a day. I also added a small documentation PR for openebs/zfs-localpv. If the k0s/k3s crew don’t fix this early they’re going to be pelted by mistaken reports of issues down the road – every important cluster/node-level utility that tries to access /var/lib/kubelet is going to have this issue.
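
A rough sketch of what the registration fix looks like in the node DaemonSet. This is a fragment, and it leans on two assumptions: that k0s keeps its kubelet directory at /var/lib/k0s/kubelet (it is the default as far as I know, but check your node), and that the container, env and volume names below match the upstream zfs-localpv manifest (they are from memory and may differ between versions):

```yaml
# Fragment of openebs-zfs-node.ds.yaml (only the kubelet-path-dependent bits)
spec:
  template:
    spec:
      containers:
        - name: csi-node-driver-registrar
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
          env:
            - name: ADDRESS
              value: /plugin/csi.sock
            - name: DRIVER_REG_SOCK_PATH
              # path as seen by the kubelet on the host, not by the container
              value: /var/lib/k0s/kubelet/plugins/zfs-localpv/csi.sock
      volumes:
        - name: registration-dir
          hostPath:
            path: /var/lib/k0s/kubelet/plugins_registry/
            type: DirectoryOrCreate
        - name: plugin-dir
          hostPath:
            path: /var/lib/k0s/kubelet/plugins/zfs-localpv/
            type: DirectoryOrCreate
        - name: pods-mount-dir
          hostPath:
            path: /var/lib/k0s/kubelet/
            type: Directory
```

Any volumeMounts in the driver container that point at the kubelet directory need the same treatment, since the mount paths inside the container have to line up with what the kubelet hands to the driver.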

With the registration fix in place, we’ve got the CSI driver registered:

$ k get csinode
NAME            DRIVERS   AGE
all-in-one-01   1         10h

And the pod + PVC combination is working properly:

$ k get pod
NAME          READY   STATUS    RESTARTS   AGE
test-single   1/1     Running   0          8s
$ k exec -it test-single -n default -- /bin/ash
/ # echo "hello ZFS!" > /var/data/hello-zfs.txt
/ # ls /var/data/
hello-zfs.txt

OK since everything is working now, rather than rehashing the Pod and PVC specs I’ll include the output of some zfs subcommands to show what’s happened after I spun up a Pod with the right PersistentVolumeClaim:

root@all-in-one-01 ~ # zfs list
NAME                                            USED  AVAIL     REFER  MOUNTPOINT
tank                                            153K   382G       24K  /tank
tank/pvc-14327c05-a1b6-4359-8a62-32ebb6db80e2    24K  1024M       24K  legacy

Oh look there’s the PVC we made as a dataset! Awesome. Now we’ve got all the power of ZFS. This might be the best bang-for-buck storage plugin there is – you could provision and manage ZFS pools with this quite efficiently. Some automation around zfs send/zfs recv, or running sanoid as a DaemonSet could be really awesome.
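
Since the snapshot CRDs are part of the resource list above, here is a minimal sketch of taking a snapshot of that PVC through Kubernetes. The class and snapshot names are made up for illustration, and I'm assuming the snapshot.storage.k8s.io/v1 API is served on the cluster (older setups may only have v1beta1) and that a snapshot controller is running:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: openebs-localpv-zfs-snapclass  # hypothetical name
driver: zfs.csi.openebs.io
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-single-snap  # hypothetical name
  namespace: default
spec:
  volumeSnapshotClassName: openebs-localpv-zfs-snapclass
  source:
    persistentVolumeClaimName: test-single
```

If it works, the snapshot should show up on the node next to the dataset (zfs list -t snapshot), which is also where anything built around zfs send/zfs recv would pick it up.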

LINSTOR (drbd) - STORAGE_PROVIDER=linstor

LINSTOR is a somewhat new (to me) entry into the field, though the drbd technology underneath it is venerable: it was merged into the mainline Linux kernel in December 2009 (for version 2.6.33). The way it works looks to be relatively simple, and it’s free and open source software. The corporate apparatus around it is a bit stodgy (take a look at their website and you’ll see) but luckily we don’t have to buy an enterprise license to take it for a spin and see how it performs against the rest. After looking for some videos on LINSTOR I came across Case-Study: 600 Nodes on DRBD + LINSTOR for Kubernetes, OpenNebula and Proxmox which I watched – pretty exciting stuff. It certainly scales so that’s nice. There’s also a quick video on the HA abilities of drbd. It looks like one of the features they were working on in that 600 node talk, backups to S3, got implemented. If LINSTOR storage performance turns out to be good I’m going to be pretty excited to use it, or at least have it installed alongside the other options.

Along with that, it looks like another project I like a lot, KubeVirt, works with DRBD block devices, including live migration.

Pros/Cons

Pros of LINSTOR:

  • Proven & established technology underneath (drbd)
  • Commercial support available
  • Seems to be very highly scalable, with support for geographical scaling
  • Backup to S3 looks to have been implemented

Cons of LINSTOR:

  • Not a huge following in the k8s community yet

One thing about LINSTOR was that I was quite confused about the various projects that existed which were trying to integrate it with Kubernetes. There were two main ones I found:

  • kvaps/kube-linstor
  • piraeusdatastore/piraeus-operator

I was a bit confused on how they fit together, so I filed an issue @ kvaps/kube-linstor (you might recognize/remember, kvaps gave that talk on the 600 node LINSTOR cluster), and kvaps himself was nice enough to respond and clarify things for me. Reproduced below:

Actually kube-linstor (this project) started earlier than the official piraeus-operator. Both projects have the same goals to containerize linstor-server and other components and provide ready solution for running them in Kubernetes. But they are doing that different ways, eg:

  • kube-linstor - is just standard Helm-chart, it consists of static YAML-manifests, very simple, but it has no some functionality inherent in piraeus-operator, eg. autoconfiguration of storage-pools (work in progress) and auto-injecting DRBD kernel-module.
  • piraeus-operator - implements the operator pattern and allows to bootstrap LINSTOR cluster and configure it using Kubernetes Custom Resources.

That made things very clear, so thanks to kvaps for that! I think I’m going to go with kvaps/kube-linstor since it looks like it is more straight forward/simple and might be a little easier for me to get started with, both projects look great though.

One major benefit of using kvaps/kube-linstor is that it optionally includes TLS setup for the control plane – I’m not currently using any automatic mutual TLS solution (Cilium, Istio, Calico, etc) so this is nice for me. Unfortunately, I’m not using Helm, and writing the code to generate the TLS certs, store them, and use them in the scripts is not a complication I want, so I’m going to forgo that bit. In production what I’ll do is turn on the wireguard integration and/or use the automatic mutual TLS features of something like linkerd or Cilium, which are mostly set-and-forget.

One major benefit of using piraeusdatastore/piraeus-operator is that it has some facilities for automatic drive detection (and manual detection if you want). piraeus-operator is a bit more automated and “kubernetes native” – almost too much, starting the controller is a CRD and not a DaemonSet – so it’s also worth giving a shot… Hard to decide between these two but since I want to maximize the amount of LINSTOR I learn while doing this (using this as a sort of intro to the technology in general), I’ll go with the slightly-more-manual kvaps/kube-linstor.

Anyway, here’s the full listing of files required (remember this is using the slightly more manual kvaps/kube-linstor):

$ tree .
.
├── controller-client.configmap.yaml
├── controller-config.secret.yaml
├── controller.deployment.yaml
├── controller.rbac.yaml
├── controller.serviceaccount.yaml
├── controller.svc.yaml
├── csi-controller.deployment.yaml
├── csi-controller.rbac.yaml
├── csi-controller.serviceaccount.yaml
├── csi-node.ds.yaml
├── csi-node.rbac.yaml
├── csi-node.serviceaccount.yaml
├── linstor-control-plane.psp.yaml
├── linstor.ns.yaml
├── Makefile
├── satellite.configmap.yaml
├── satellite.ds.yaml
├── satellite.rbac.yaml
└── satellite.serviceaccount.yaml

0 directories, 19 files

So quite a few files/things to keep track of, but the big structural elements are easy to make out (I’ve taken tons of liberties with naming of files and resources).

Preparing the node for LINSTOR

There are a few things I need to do at the node level to prepare nodes to run LINSTOR. In the previous post I went over how I installed as much LINSTOR-related software from apt as I could; hopefully this won’t come back to bite me like pre-installing ceph things did. It shouldn’t, since most documentation calls for prepping the given node(s) with the appropriate kernel modules and things. Anyway, here’s the LINSTOR-specific Ansible code:

    - name: Install ZFS
      when: storage_plugin in target_plugins
      ansible.builtin.apt:
        name: zfsutils-linux
        update_cache: yes
        state: present
      vars:
        target_plugins:
          - openebs-localpv-zfs
          - linstor-rbd9

    - name: Install LVM
      when: storage_plugin in target_plugins
      block:
        - name: Install lvm2
          ansible.builtin.apt:
            name: lvm2
            update_cache: yes
            state: present
        - name: Ensure rbd kernel module is installed
          community.general.modprobe:
            name: rbd
            state: present
      vars:
        target_plugins:
          - rook-ceph-lvm
          - linstor-rbd9

    - name: Add drbd9 apt repositories
      when: storage_plugin in target_plugins
      ansible.builtin.apt_repository:
        repo: ppa:linbit/linbit-drbd9-stack
        state: present
      vars:
        target_plugins:
          - linstor-rbd9

    - name: Install LINSTOR components
      when: storage_plugin in target_plugins
      block:
        - name: Install drbd packages
          ansible.builtin.apt:
            name:
              - drbd-dkms
              - drbd-utils
            update_cache: yes
            state: present
        - name: Install linstor components
          ansible.builtin.apt:
            name:
              - linstor-controller
              - linstor-satellite
              - linstor-client
            update_cache: yes
            state: present
        # LINSTOR needs drbd here (not Ceph's rbd, which the LVM block above loads for rook)
        - name: Ensure drbd kernel module is loaded
          community.general.modprobe:
            name: drbd
            state: present
      vars:
        target_plugins:
          - linstor-drbd9

I’ve installed both LVM and ZFS to make it possible to go either way on LINSTOR’s underlying storage. As far as the disk layout goes, I just made sure to leave the one spare disk partition and the second disk as empty as possible, and deferred to the LINSTOR documentation for the rest of the configuration.

Configuring Storage Pools for LINSTOR

LINSTOR’s user guide is the place to go for information on provisioning disks for LINSTOR, so I went there and read up a bit. The section on storage pools makes it pretty easy, and with the linstor CLI tool already installed (thanks apt!) I only have to run about one command outside of the safety of the LINSTOR documentation on-ramp, which is making the LVM Volume Group (which I messed with back in part 2). LINSTOR can use either LVM or ZFS, so I figured I would go with LVM since we are already testing ZFS-on-bare-metal with OpenEBS’s LocalPV ZFS.

LINSTOR expects referenced LVM VolumeGroups to already be present, so we’re going to have to do a bit of setup ourselves. I wasn’t 100% sure on the differences and tradeoffs between thick and thin LVM pools, so I needed to spend some time reading up on them – I found a few good resources:

Reading these I think I’m going with thin provisioning. Another option that we can set on LINSTOR LVM pools is the LVM RAID level. I’m pretty happy with RAID1 (mirroring) across the disks, so I’ll go with that. Here’s roughly what this looks like in ansible YAML:

    - name: Create an LVM thin provisioned pool for LINSTOR
      when: storage_plugin in target_plugins
      tags: [ "drive-partition-prep" ]
      block:
        - name: Create Volume Group with both drives
          # NOTE: LINSTOR requires/expects similar LVM VG/ZPool naming to share across nodes
          command: |
            vgcreate vg_nvme /dev/nvme0n1p5 /dev/nvme1n1
        - name: Initialize LINSTOR storage-pool for disk one
          shell: |
            linstor storage-pool create lvmthin linstor-{{ k8s_node_name }} pool_nvme vg_nvme
      vars:
        target_plugins:
          - linstor-drbd9

While I was figuring this out I referenced the LINSTOR documentation and referred to another good guide out there from 2019.

Setting up the LINSTOR control plane (pre-installing bites me again)

So a huge deep breath, and it’s time to apply -f all the files. I immediately found some issues with my YAML, which I went and fixed, but then I was met with this nasty error:

$ k logs satellite-6q76d
LINSTOR, Module Satellite
Version:            1.11.1 (fe95a94d86c66c6c9846a3cf579a1a776f95d3f4)
Build time:         2021-02-11T14:40:43+00:00
Java Version:       11
Java VM:            Debian, Version 11.0.9.1+1-post-Debian-1deb10u2
Operating system:   Linux, Version 5.4.0-67-generic
Environment:        amd64, 1 processors, 15528 MiB memory reserved for allocations


System components initialization in progress

07:25:09.212 [main] INFO  LINSTOR/Satellite - SYSTEM - ErrorReporter DB version 1 found.
07:25:09.213 [main] INFO  LINSTOR/Satellite - SYSTEM - Log directory set to: '/logs'
07:25:09.230 [main] WARN  io.sentry.dsn.Dsn - *** Couldn't find a suitable DSN, Sentry operations will do nothing! See documentation: https://docs.sentry.io/clients/java/ ***
07:25:09.234 [Main] INFO  LINSTOR/Satellite - SYSTEM - Loading API classes started.
07:25:09.366 [Main] INFO  LINSTOR/Satellite - SYSTEM - API classes loading finished: 132ms
07:25:09.366 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection started.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/linstor-server/lib/guice-4.2.3.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
07:25:09.800 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection finished: 434ms
07:25:10.098 [Main] INFO  LINSTOR/Satellite - SYSTEM - Initializing main network communications service
07:25:10.098 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
07:25:10.099 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'FileEventService' of type FileEventService
07:25:10.099 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'SnapshotShippingService' of type SnapshotShippingService
07:25:10.099 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DeviceManager' of type DeviceManager
07:25:10.105 [Main] WARN  LINSTOR/Satellite - SYSTEM - NetComService: Connector NetComService: Binding the socket to the IPv6 anylocal address failed, attempting fallback to IPv4
07:25:10.105 [Main] ERROR LINSTOR/Satellite - SYSTEM - NetComService: Connector NetComService: Attempt to fallback to IPv4 failed
07:25:10.123 [Main] ERROR LINSTOR/Satellite - SYSTEM - Initialization of the com.linbit.linstor.netcom.TcpConnectorService service instance 'NetComService' failed. [Report number 606EAFD4-928FB-000000]

07:25:10.128 [Main] ERROR LINSTOR/Satellite - SYSTEM - Initialisation of SatelliteNetComServices failed. [Report number 606EAFD4-928FB-000001]

07:25:10.129 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutdown in progress
07:25:10.129 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutting down service instance 'DeviceManager' of type DeviceManager
07:25:10.129 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Waiting for service instance 'DeviceManager' to complete shutdown
07:25:10.130 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutting down service instance 'SnapshotShippingService' of type SnapshotShippingService
07:25:10.130 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Waiting for service instance 'SnapshotShippingService' to complete shutdown
07:25:10.130 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutting down service instance 'FileEventService' of type FileEventService
07:25:10.130 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Waiting for service instance 'FileEventService' to complete shutdown
07:25:10.131 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService
07:25:10.131 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown
07:25:10.131 [Thread-2] INFO  LINSTOR/Satellite - SYSTEM - Shutdown complete

At this point I’d be suspicious if anything worked on the first try, so I’m mostly heartened that the logging is good and orderly for LINSTOR! But it looks like LINSTOR might be written in Java..? Oh no, say it ain’t so, I didn’t even notice it on the LINBIT/linstor-server GitHub project. Oh well, I won’t let my bias against Java spoil this free high-quality software I’m about to partake of!

The error at hand is that the NetComService has failed – it could not bind to the IPv6 port (there shouldn’t be a problem with that) and so it tried IPv4 but failed there too (again, weird). This is probably a PSP binding problem (which happens through RBAC)… And after fixing those, nothing changed (though the Role was missing a namespace). You know what this is – this is another case of pre-installing biting me in the ass. Let’s see if there’s something listening on port 3366 already on the machine:

root@all-in-one-01 ~ # lsof -i :3366
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    25432 root  132u  IPv6 119091      0t0  TCP *:3366 (LISTEN)

Of course – I made the same mistake as with ceph – installing the package from apt is not a good idea, because it will start a LINSTOR instance for you. I removed the installation of linstor-controller and linstor-satellite from the ansible code and did a from-zero re-install of the machine.
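
A quick pre-flight check can catch this kind of conflict before any pods are deployed. Below is a minimal Python sketch (not from the repo; the port_in_use helper is a hypothetical name used for illustration) that tries to bind the satellite port and reports whether something on the host already holds it:

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if something on this machine already holds host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # bind() fails (EADDRINUSE) when another process is already
            # listening, which is what the containerized satellite ran into.
            sock.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == "__main__":
    # 3366 is the satellite port seen in the lsof output above
    print("3366 already taken:", port_in_use(3366))
```

Running this on the node before applying the manifests should print True if an apt-installed satellite is already listening.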

I had one more issue – be careful of ReplicaSets getting stuck with 0 pods but still being present – if you look at the controller pod logs, it’s possible that the “leader” doesn’t actually exist:

$ k get pods
NAME                          READY   STATUS             RESTARTS   AGE
controller-58dcfb884d-2jpg7   1/1     Running            0          19s
controller-58dcfb884d-c8v9d   1/1     Running            0          19s
csi-node-jmrcd                2/3     CrashLoopBackOff   5          3m9s
satellite-kkgmh               1/1     Running            0          5h47m
$ k logs -f controller-58dcfb884d-c8v9d
time="2021-04-08T14:07:17Z" level=info msg="running k8s-await-election" version=v0.2.2
I0408 14:07:17.385975       1 leaderelection.go:242] attempting to acquire leader lease  linstor/controller...
time="2021-04-08T14:07:17Z" level=info msg="long live our new leader: 'controller-587cb449d7-vvjp2'!"

This is probably an unlikely occurrence (I probably had just the right failures in just the right order), but worth knowing about. After clearing that up, I was able to get everything running:

$ k get pods
NAME                          READY   STATUS    RESTARTS   AGE
controller-58dcfb884d-2jpg7   1/1     Running   0          11h
controller-58dcfb884d-c8v9d   1/1     Running   0          11h
csi-node-c6n5q                3/3     Running   1          24s
satellite-dz66f               1/1     Running   0          2m51s

OK awesome, now that the control plane is up, I’ve realized that I’m missing something… I need to define storage pools that LINSTOR will use!

SIDETRACK: thoughts on choosing a DB for LINSTOR

LINSTOR can use a variety of databases:

I don’t want to introduce too much non-essential complexity into deploying LINSTOR though, so I’m going to go with ETCD. Looking at the configuration it looks like when etcd is used for the DB, kube-linstor will actually use the available ETCD instance that Kubernetes is running on, and you can supply a prefix to “isolate” your writes from everything else. I’m going to use an etcd prefix of "linstor" (configured in controller-config.secret.yaml).

As I haven’t seen any complaints about the DB being used just yet in the logs, I’m assuming I’m good on this front!

Setting up StoragePools for LINSTOR

One thing that kvaps/kube-linstor doesn’t do (or seem to do, as far as I can tell) is create the storage pools you’ll actually be using. The piraeus-operator does do this but I didn’t choose that so… Theoretically LINSTOR is capable of creating the storage pools automatically/dealing with disks, but in my case I want to follow the LINSTOR documentation where they discuss one pool per disk for disk-level replication. They mention it but don’t really describe it – I can’t tell if I should be making two LVM volume groups (with one volume each) or exposing one volume group which is mirrored (and will waste some of the space, but theoretically in a larger cluster the space wouldn’t get wasted cross-machine). The best way to make sure I get the layout I want is to configure it manually.

Luckily for me there’s a PR on adding storage pools via init on csi-node which lays it all out very well for me. While I don’t need the helm machinations, I’ve adapted that code to work for my setup as an init container:

      # Configure storage pools
      # (see https://github.com/kvaps/kube-linstor/pull/31)
       - name: add-storage-pools
         image: ghcr.io/kvaps/linstor-satellite:v1.11.1-1
         imagePullPolicy: IfNotPresent
         command:
           - /bin/sh
           - -exc
           # TODO: Maybe get this JSON by mounting a ConfigMap? Automatic discovery would be much better
           # TODO: clustered deployments will have a problem with this.
           # NOTE: `sh -c` only executes the first string after the flags (a second one would just become $0),
           # so both drives are handled in a single script
           - |
             curl -s -f http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME} || exit 1
             # First drive
             curl -s -f \
             -H "Content-Type: application/json" \
             -d "{\"storage_pool_name\":\"linstor-${NODE_NAME}-0\",\"provider_kind\":\"LVM_THIN\",\"props\":{\"StorDriver/LvmVg\":\"vg_linstor_0\",\"StorDriver/ThinPool\":\"lv_thin_linstor_0\"}}" \
             http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME}/storage-pools || true
             # Second drive
             curl -s -f \
             -H "Content-Type: application/json" \
             -d "{\"storage_pool_name\":\"linstor-${NODE_NAME}-1\",\"provider_kind\":\"LVM_THIN\",\"props\":{\"StorDriver/LvmVg\":\"vg_linstor_1\",\"StorDriver/ThinPool\":\"lv_thin_linstor_1\"}}" \
             http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME}/storage-pools || true
         env:
           - name: CONTROLLER_PORT
             value: "3370"
           - name: CONTROLLER_HOST
             value: "controller"
           - name: CONTROLLER_NAMESPACE
             value: linstor
           - name: NODE_NAME
             valueFrom:
               fieldRef:
                 fieldPath: spec.nodeName

A bit manual, but in a world where I’m dealing with mostly homogeneous nodes I think I’m OK. With the consistent naming I should also be able to get re-use across all the first drives and all the second drives. It’s not clear to me whether LINSTOR picks up on the logical volume name or the volume group name for sharing; if it only looked at the LV name I could probably name them all lv_nvme and get the whole cluster fully pooled, but right now all the first disks and all the second disks would be pooled separately, as far as I can tell. I’m only running on one node here so it doesn’t really matter, but it will later.
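
The hand-escaped JSON in those curl calls is easy to get wrong, so here is a small Python sketch of how the same request bodies could be generated instead. This is only an illustration: storage_pool_payload is a hypothetical helper, and the field names and naming scheme are copied from the init container above rather than taken from any LINSTOR client library.

```python
import json


def storage_pool_payload(node_name: str, index: int) -> str:
    """Build the JSON body for POST /v1/nodes/<node>/storage-pools.

    Mirrors the naming used in the init container: one LVM_THIN pool per
    drive, backed by vg_linstor_<index>/lv_thin_linstor_<index>.
    """
    body = {
        "storage_pool_name": f"linstor-{node_name}-{index}",
        "provider_kind": "LVM_THIN",
        "props": {
            "StorDriver/LvmVg": f"vg_linstor_{index}",
            "StorDriver/ThinPool": f"lv_thin_linstor_{index}",
        },
    }
    return json.dumps(body)


if __name__ == "__main__":
    # One payload per drive on the single test node
    for index in (0, 1):
        print(storage_pool_payload("all-in-one-01", index))
```

Letting json.dumps do the quoting means adding a third drive is a one-character change rather than another line of backslashes.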

I did have to do some adjustment to the ansible side to pre-provision the LVM pieces though, so I’ll share that as well:

    - name: Create an LVM thin provisioned pool for LINSTOR
      when: storage_plugin in target_plugins and nvme_disk_0_partition_5.stat.exists and nvme_disk_1.stat.exists
      tags: [ "drive-partition-prep" ]
      block:
        # NOTE: LINSTOR requires/expects similar LVM VG/ZPool naming to share across nodes
        # NOTE: To get per-device isolation, we need to create a storage pool per backend device, not sure if this includes VGs and thin LVs
        # (see: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-a_storage_pool_per_backend_device)
        - name: Create Volume Group for disk one partition 5
          community.general.lvg:
            vg: vg_linstor_0
            pvs: /dev/nvme0n1p5
            pvresize: yes # maximum available size
        - name: Create Volume Group for disk 2
          community.general.lvg:
            vg: vg_linstor_1
            pvs: /dev/nvme1n1
            pvresize: yes # maximum available size
        - name: Create thinpool for disk one partition 5
          community.general.lvol:
            vg: vg_linstor_0
            thinpool: lv_thin_linstor_0
            size: 100%FREE
        - name: Create thinpool for disk two
          community.general.lvol:
            vg: vg_linstor_1
            thinpool: lv_thin_linstor_1
            size: 100%FREE
      vars:
        target_plugins:
          - linstor-drbd9

Setting up the StorageClasses

OK so now that the control and data planes are set up we can make the storage classes:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-single
provisioner: linstor.csi.linbit.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: zfs-support
        values:
          - "yes"
parameters:
  # # CSI related parameters
  # csi.storage.k8s.io/fstype: xfs

  # LINSTOR parameters
  placementCount: "1" # aka `autoPlace`, replica count
  # resourceGroup: "full-example"
  # storagePool: "my-storage-pool"
  # disklessStoragePool: "DfltDisklessStorPool"
  # layerList: "drbd,storage"
  # placementPolicy: "AutoPlace"
  # allowRemoteVolumeAccess: "true"
  # encryption: "true"
  # nodeList: "diskful-a,diskful-b"
  # clientList: "diskless-a,diskless-b"
  # replicasOnSame: "zone=a"
  # replicasOnDifferent: "rack"
  # disklessOnRemaining: "false"
  # doNotPlaceWithRegex: "tainted.*"
  # fsOpts: "nodiscard"
  # mountOpts: "noatime"
  # postMountXfsOpts: "extsize 2m"

  # # DRBD parameters
  # DrbdOptions/*: <x>

Lots of options I’m not going to touch there, and I wonder how it would be to jam all the ZFS options into mountOpts, but for now I’m happy it’s finished with nothing else going wrong!

I’ll spare you the PVC + Pod YAML and persistence test, but I’ve developed some intuition for testing whether a CSI storage plugin is working and that’s worth sharing:

$ k get csidriver
NAME                     ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
linstor.csi.linbit.com   true             true             Persistent   55m
$ k get csinode
NAME            DRIVERS   AGE
all-in-one-01   1         75m
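
The same sanity check can be scripted. The sketch below is an assumption-laden illustration, not something from the repo: it shells out to kubectl (assumed to be on the PATH and pointed at the cluster), and nodes_missing_driver is a hypothetical helper that walks the spec.drivers list of each CSINode object.

```python
import json
import subprocess


def nodes_missing_driver(csinode_list: dict, driver: str) -> list:
    """Return the names of nodes whose CSINode object does not list `driver`."""
    missing = []
    for item in csinode_list.get("items", []):
        # spec.drivers can be absent or null when no driver has registered yet
        drivers = item.get("spec", {}).get("drivers") or []
        if driver not in [d.get("name") for d in drivers]:
            missing.append(item["metadata"]["name"])
    return missing


def fetch_csinodes() -> dict:
    """Fetch all CSINode objects as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "csinode", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling nodes_missing_driver(fetch_csinodes(), "linstor.csi.linbit.com") should return an empty list once the csi-node pods have registered on every node.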

Naming for the different storage classes

So up until now I’ve been using test-single and test-replicated as my storage class names but I’m going to change them – I couldn’t think of the right naming for non replicated disks, but that’s just because I’m not a graybeard sysadmin! In RAID terminology the closest thing to the right term is “RAID0”, but there’s also the term JBOD. Since RAID0 seems to almost always mean striping I can’t use that, but jbod works for the non-replicated case. So basically, test-single.storageclass.yaml -> jbod.storageclass.yaml and test-replicated.storageclass.yaml everywhere. Not a huge thing but thought it was worth describing.

Unfortunately not everything matches up nicely. For example, in LocalPV ZFS the disk below is mirrored so it doesn’t make sense to make a jbod StorageClass, but what can you do – I think I’ve yak shaved enough.

Wrapup

OpenEBS continued its tradition of being easy to set up and LINSTOR wasn’t far behind! I have to admit I’m still a little sore from the difficulties of the Ceph setup but it was refreshing that these were so much easier to install. Most people just never look at the resources and kubectl apply -f the large file, but that’s crazy to me, because when something goes wrong you’d have absolutely no idea which pieces were which or how big the problem domain is.

Anyway, hopefully now everyone has some reference material on the installation of these tools. Finally we can get on to actually running the tests – I’m pretty sure simply running some fio pods (and even booting up postgres/etc pods for testing) will be much easier than this setup was. It’s the whole point of using Kubernetes – set this stuff up once, and the abstractions above (PersistentVolumeClaims, PersistentVolumes, Pods, Deployments) are super easy to use.

UPDATE: LINSTOR was broken, kernel modules weren’t restart-proof

So it looks like I let some broken code slip through – the LINSTOR code is broken/flaky. It turns out the order in which you start the components is really important (piraeus-operator may have been better at this…) – controller comes first, then satellite (which registers the node), then the csi-node (which registers storage pools) and the rest of it. After a hard refresh it was very finicky. You have to be sure you can get at least this:

$ k exec -it deploy/controller -- /bin/bash
root@controller-869dcf7955-6xtgb:/# linstor node list
╭─────────────────────────────────────────────────────────────────╮
┊ Node          ┊ NodeType  ┊ Addresses                  ┊ State  ┊
╞═════════════════════════════════════════════════════════════════╡
┊ all-in-one-01 ┊ SATELLITE ┊ xx.xxx.xx.xxx:3366 (PLAIN) ┊ Online ┊
╰─────────────────────────────────────────────────────────────────╯

Once you have the node registered (which means satellite is fine), the next thing to check is whether you have the storage pools (generally the state of the csi-node deployment would tell you this):

root@controller-869dcf7955-6xtgb:/# linstor storage-pool list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node          ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ all-in-one-01 ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Nope, we are definitely missing a pool made of disks in there. If you look at the output of the add-storage-pools init container there might be some hints:

[Default|linstor] mrman 17:50:01 [linstor] $ k logs ds/csi-node -c add-storage-pools
+ curl -s -f http://controller.linstor:3370/v1/nodes/all-in-one-01
+ exit 1
{"name":"all-in-one-01","type":"SATELLITE","props":{"CurStltConnName":"default","NodeUname":"all-in-one-01"},"net_interfaces":[{"name":"default","address":"XXX.XXX.XXX.XXX","satellite_port":3366,"satellite_encryption_type":"PLAIN","is_active":true,"uuid":"2ef1e6f7-c2df-46c2-a346-6b7443ed3e43"}],"connection_status":"ONLINE","uuid":"b0b7ff34-4232-465c-b97a-6df0f8320cbe","storage_providers":["DISKLESS","LVM","LVM_THIN","FILE","FILE_THIN","OPENFLEX_TARGET"],"resource_layers":["LUKS","NVME","WRITECACHE","CACHE","OPENFLEX","STORAGE"],"unsupported_providers":{"SPDK":["IO exception occured when running 'rpc.py get_spdk_version': Cannot run program \"rpc.py\": error=2, No such file or directory"],"ZFS_THIN":["'cat /sys/module/zfs/version' returned with exit code 1"],"ZFS":["'cat /sys/module/zfs/version' returned with exit code 1"]},"unsupported_layers":{"DRBD":["DRBD version has to be >= 9. Current DRBD version: 0.0.0"]}}

And if we clean that up…

{
  "name": "all-in-one-01",
  "type": "SATELLITE",
  "props": {
    "CurStltConnName": "default",
    "NodeUname": "all-in-one-01"
  },
  "net_interfaces": [
    {
      "name": "default",
      "address": "XXX.XXX.XXX.XXX",
      "satellite_port": 3366,
      "satellite_encryption_type": "PLAIN",
      "is_active": true,
      "uuid": "2ef1e6f7-c2df-46c2-a346-6b7443ed3e43"
    }
  ],
  "connection_status": "ONLINE",
  "uuid": "b0b7ff34-4232-465c-b97a-6df0f8320cbe",
  "storage_providers": [
    "DISKLESS",
    "LVM",
    "LVM_THIN",
    "FILE",
    "FILE_THIN",
    "OPENFLEX_TARGET"
  ],
  "resource_layers": [
    "LUKS",
    "NVME",
    "WRITECACHE",
    "CACHE",
    "OPENFLEX",
    "STORAGE"
  ],
  "unsupported_providers": {
    "SPDK": [
      "IO exception occured when running 'rpc.py get_spdk_version': Cannot run program \"rpc.py\": error=2, No such file or directory"
    ],
    "ZFS_THIN": [
      "'cat /sys/module/zfs/version' returned with exit code 1"
    ],
    "ZFS": [
      "'cat /sys/module/zfs/version' returned with exit code 1"
    ]
  },
  "unsupported_layers": {
    "DRBD": [
      "DRBD version has to be >= 9. Current DRBD version: 0.0.0"
    ]
  }
}

Well a few unexpected errors there, but it looks like I can ignore the unsupported_providers section – zfs isn’t installed so of course it couldn’t get the version. One section that I probably can’t ignore is the unsupported_layers section – drbd is definitely supposed to be on the machine – Current drbd version should not be 0.0.0 (if that’s even a valid version). Why would it be unable to find DRBD?
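
Reading that JSON by eye gets old quickly, so here is a small Python sketch for separating the noise from the real problem. The helper names and the choice of "required" features (DRBD and LVM_THIN, which is what this setup uses) are my assumptions; the field names come straight from the node response above.

```python
def unsupported_features(node: dict) -> dict:
    """Collect the reasons a satellite gives for unsupported layers and providers."""
    problems = {}
    for section in ("unsupported_layers", "unsupported_providers"):
        for name, reasons in node.get(section, {}).items():
            problems[f"{section}/{name}"] = list(reasons)
    return problems


def missing_required(node: dict,
                     required_layers=("DRBD",),
                     required_providers=("LVM_THIN",)) -> list:
    """Return required layers/providers that the node does not advertise."""
    layers = set(node.get("resource_layers", []))
    providers = set(node.get("storage_providers", []))
    missing = [name for name in required_layers if name not in layers]
    missing += [name for name in required_providers if name not in providers]
    return missing
```

Fed the response above, missing_required would flag only DRBD, since LVM_THIN is already listed under storage_providers; the SPDK and ZFS entries show up in unsupported_features but are not needed for this setup.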

I definitely installed drbd… Right? Well, I found an issue similar to mine and of course I skipped to the bit where they check the hostname (because the issue filer did have the drbd9 kernel module installed), and sure enough my hostname does match my LINSTOR node name (the node registered)… But actually, if I run lsmod | grep -i drbd9 I get:

$ lsmod | grep -i drbd9

Welp, looks like I installed but forgot to enable (via modprobe) the drbd kernel module (the module itself is just called drbd, so grepping for drbd would have been the better check):

modified   ansible/storage-plugin-setup.yml
@@ -132,7 +132,7 @@
             state: present
         - name: Ensure rbd kernel module is installed
           community.general.modprobe:
-            name: rbd
+            name: drbd
             state: present
       vars:
         target_plugins:
           - linstor-drbd9

Off by a single ’d’ (rbd was required for some other things so there’s another legitimate modprobe elsewhere). Another thing – to make sure this kernel module gets loaded on every boot, I’m going to need to make sure there’s something in /etc/modules-load.d (files in /etc/modprobe.d only set module options, they don’t load anything):

        - name: Ensure kernel module comes up with next restart
          ansible.builtin.copy:
            dest: "/etc/modules-load.d/drbd.conf"
            content: |
              drbd

I fixed rbd and nvme-tcp as well. After that I got a slightly different error (always a good sign):

"unsupported_layers":{
    "DRBD": [
        "DRBD version has to be >= 9. Current DRBD version: 8.4.11"
        ]
    }
}

Well that’s not good – why don’t I have v9? Turns out it wasn’t enough just to install drbd-dkms and add the ppa:linbit/linbit-drbd9-stack repository – I found an excellent guide. You also need to make sure that the kernel headers are installed so that DKMS can actually build the module – after installing linux-generic I saw this output (important bit is towards the end):

root@all-in-one-01 ~ # apt install linux-generic
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  linux-headers-5.4.0-70 linux-headers-5.4.0-70-generic linux-headers-generic
The following NEW packages will be installed:
  linux-generic linux-headers-5.4.0-70 linux-headers-5.4.0-70-generic linux-headers-generic
0 upgraded, 4 newly installed, 0 to remove and 0 not upgraded.
Need to get 12.4 MB of archives.
After this operation, 85.9 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://de.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-5.4.0-70 all 5.4.0-70.78 [11.0 MB]
Get:2 http://de.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-5.4.0-70-generic amd64 5.4.0-70.78 [1,400 kB]
Get:3 http://de.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-generic amd64 5.4.0.70.73 [2,428 B]
Get:4 http://de.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-generic amd64 5.4.0.70.73 [1,896 B]
Fetched 12.4 MB in 0s (27.9 MB/s)
Selecting previously unselected package linux-headers-5.4.0-70.
(Reading database ... 45993 files and directories currently installed.)
Preparing to unpack .../linux-headers-5.4.0-70_5.4.0-70.78_all.deb ...
Unpacking linux-headers-5.4.0-70 (5.4.0-70.78) ...
Selecting previously unselected package linux-headers-5.4.0-70-generic.
Preparing to unpack .../linux-headers-5.4.0-70-generic_5.4.0-70.78_amd64.deb ...
Unpacking linux-headers-5.4.0-70-generic (5.4.0-70.78) ...
Selecting previously unselected package linux-headers-generic.
Preparing to unpack .../linux-headers-generic_5.4.0.70.73_amd64.deb ...
Unpacking linux-headers-generic (5.4.0.70.73) ...
Selecting previously unselected package linux-generic.
Preparing to unpack .../linux-generic_5.4.0.70.73_amd64.deb ...
Unpacking linux-generic (5.4.0.70.73) ...
Setting up linux-headers-5.4.0-70 (5.4.0-70.78) ...
Setting up linux-headers-5.4.0-70-generic (5.4.0-70.78) ...
/etc/kernel/header_postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.4.0-70-generic

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
make -j12 KERNELRELEASE=5.4.0-70-generic -C src/drbd KDIR=/lib/modules/5.4.0-70-generic/build.....
cleaning build area...

DKMS: build completed.

drbd.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-70-generic/updates/dkms/

drbd_transport_tcp.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-70-generic/updates/dkms/

depmod...

DKMS: install completed.
   ...done.
Setting up linux-headers-generic (5.4.0.70.73) ...
Setting up linux-generic (5.4.0.70.73) ...

So I guess it just… didn’t do anything before? Do I have v9 now? What version was I pulling before? I’m restarting to be sure, but considering that sometimes it doesn’t build properly I may have to do a build from scratch. Luckily for me, I did not have to build it from scratch, and after a reboot the DRBD-specific errors are gone from the init container. The other errors still show up though:

"unsupported_providers": {
    "SPDK": [
        "IO exception occured when running 'rpc.py get_spdk_version': Cannot run program \"rpc.py\": error=2, No such file or directory"
    ],
    "ZFS_THIN": [
        "'cat /sys/module/zfs/version' returned with exit code 1"
    ],
    "ZFS": [
        "'cat /sys/module/zfs/version' returned with exit code 1"
    ]
}

I guess I’ll fix these by just installing spdk and zfs. ZFS is easy to fix (apt install zfsutils-linux). SPDK is very bleeding edge tech… I’m not sure I want to try and install that just yet… Also the error it’s giving me has almost nothing to do with spdk itself but some script called rpc.py that tries to get the SPDK version? Yikes. As you might expect, after installing ZFS I’m left with only the SPDK error stopping the init container from making progress:

"unsupported_providers": {
    "SPDK": [
        "IO exception occured when running 'rpc.py get_spdk_version': Cannot run program \"rpc.py\": error=2, No such file or directory"
    ]
}

I could just modify the script to skip this check (having any unsupported provider seems to make the curl fail), but I think I’ll install SPDK – I do have NVMe drives and there might be some cool features unlocked by having SPDK available. Jumping down the rabbit hole, I ended up installing spdk (I skipped out on dpdk, as there were some issues similar to this person’s). I won’t get into it here but you can find out all about it in the ansible code if you’re really interested. Of course, no good deed goes unpunished:

"unsupported_providers": {
    "SPDK": [
        "'rpc.py get_spdk_version' returned with exit code 1"
    ]
}

If I exec into the container and try to run it myself I get the following:

[Default|linstor] mrman 20:57:09 [linstor] $ k exec -it ds/satellite --  /bin/bash
root@all-in-one-01:/# rpc.py get_spdk_version
Traceback (most recent call last):
  File "/usr/local/sbin/rpc.py", line 3, in <module>
    from rpc.client import print_dict, print_json, JSONRPCException
  File "/usr/local/sbin/rpc.py", line 3, in <module>
    from rpc.client import print_dict, print_json, JSONRPCException
ModuleNotFoundError: No module named 'rpc.client'; 'rpc' is not a package

Welp, I’m done trying to get this working now, and I’ve gained a little more hate for interpreted languages. I bet this stupid rpc script doesn’t even do much, but it was too much for linstor to re-implement in Java so now they’re trying to call Python from a Java application. I’m just going to ignore the check in the storage pool creation init container:

       - name: add-storage-pools
         image: ghcr.io/kvaps/linstor-satellite:v1.11.1-1
         imagePullPolicy: IfNotPresent
         command:
           - /bin/sh
           - -exc
           # TODO: Maybe get this JSON by mounting a ConfigMap? Automatic discovery would be much better
           # TODO: clustered deployments will have a problem with this.
           # First drive
           - |
             curl -s -f http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME} || echo "ERROR: failed to retrieve node [${NODE_NAME}] from controller... proceeding anyway..." && true
             curl -s -f \
             -H "Content-Type: application/json" \
             -d "{\"storage_pool_name\":\"linstor-${NODE_NAME}-0\",\"provider_kind\":\"LVM_THIN\",\"props\":{\"StorDriver/LvmVg\":\"vg_linstor_0\",\"StorDriver/ThinPool\":\"lv_thin_linstor_0\"}}" \
             http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME}/storage-pools || true
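
The hand-escaped JSON above only covers the first drive. If you wanted to register more than one pool per node, a small helper could generate the payload instead. This is only a rough sketch, not what is in the repo: the function names are made up for illustration, and it assumes every drive follows the same vg_linstor_N / lv_thin_linstor_N naming as the first one.

```shell
#!/bin/sh
# Hypothetical helper: prints the JSON body for one LVM_THIN storage pool.
# The field names and the vg_linstor_N / lv_thin_linstor_N naming are copied
# from the init container above; applying them to other drives is an assumption.
storage_pool_payload() {
    pool_node="$1"   # node name, e.g. all-in-one-01
    pool_idx="$2"    # drive index: 0, 1, ...
    printf '{"storage_pool_name":"linstor-%s-%s","provider_kind":"LVM_THIN","props":{"StorDriver/LvmVg":"vg_linstor_%s","StorDriver/ThinPool":"lv_thin_linstor_%s"}}\n' \
        "$pool_node" "$pool_idx" "$pool_idx" "$pool_idx"
}

# Hypothetical wrapper: POSTs one payload per drive index given as an argument.
# Relies on the same NODE_NAME / CONTROLLER_* variables as the init container
# and, like the original, carries on if a request fails.
register_storage_pools() {
    for drive in "$@"; do
        curl -s -f \
            -H "Content-Type: application/json" \
            -d "$(storage_pool_payload "${NODE_NAME}" "${drive}")" \
            "http://${CONTROLLER_HOST}.${CONTROLLER_NAMESPACE}:${CONTROLLER_PORT}/v1/nodes/${NODE_NAME}/storage-pools" || true
    done
}
```

Inside the init container this would be called as `register_storage_pools 0 1` for a node with two drives.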

Not ideal, but working:

$ k get pods --watch
NAME                          READY   STATUS            RESTARTS   AGE
controller-869dcf7955-6xtgb   1/1     Running           6          3h39m
csi-node-f4hsc                0/3     PodInitializing   0          3s
satellite-h7d7p               1/1     Running           0          18m
csi-node-f4hsc                3/3     Running           0          3s

And what does the cluster think?

$ k exec -it deploy/controller -- /bin/bash
root@controller-869dcf7955-6xtgb:/# linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool             ┊ Node          ┊ Driver   ┊ PoolName                       ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool    ┊ all-in-one-01 ┊ DISKLESS ┊                                ┊              ┊               ┊ False        ┊ Ok    ┊
┊ linstor-all-in-one-01-0 ┊ all-in-one-01 ┊ LVM_THIN ┊ vg_linstor_0/lv_thin_linstor_0 ┊        0 KiB ┊         0 KiB ┊ True         ┊ Error ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ERROR:
Description:
    Node: 'all-in-one-01', storage pool: 'linstor-all-in-one-01-0' - Failed to query free space from storage pool
Cause:
    Unable to parse free thin sizes

Well, we almost did it! The information looks almost right but the storage pool free space query is failing… I wonder why. Well, at least someone else has run into this before on piraeus-server, so I’m not the only one. Also, it’s becoming clearer and clearer that the lack of self-healing is a problem – the order in which these things start is really important, such that if the csi-node starts before the satellite, it does not register the storage pool, and I essentially have to restart the rollout. I might as well fix this, otherwise things will be inconsistent, so I took some time and added some initContainers that wait for various bits to start.

It looks like the LVM issues were solved by a restart of the node (which probably will be OK with a hard reset) and I had to add some code to make sure things started up in the right order – for example in satellite I now wait for controller with an init container:

      initContainers:
        ## Wait for controller to be ready -- it must be before satellite can register with it
        - name: wait-for-controller
          image: bitnami/kubectl
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -exc
            - |
              n=0
              until [ $n -ge 30 ]; do
                  REPLICA_COUNT=$(kubectl get deploy/${CONTROLLER_DEPLOYMENT_NAME} -n ${CONTROLLER_NAMESPACE} -o template --template='{{ .status.availableReplicas }}')
                  if [ "${REPLICA_COUNT}" -gt "0" ] ; then
                      echo "[info] found ${REPLICA_COUNT} available replicas."
                      break
                  fi
                  # count the attempt, otherwise the loop never reaches its 30-try limit
                  n=$((n+1))
                  echo -n "[info] waiting 10 seconds before trying again..."
                  sleep 10
              done
              # fail the init container if the controller never became available
              [ $n -lt 30 ]
          env:
            - name: CONTROLLER_DEPLOYMENT_NAME
              value: "controller"
            - name: CONTROLLER_NAMESPACE
              value: linstor
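
Since the same wait-then-give-up pattern is needed in several init containers, it could be pulled out into a generic helper. The following is a minimal sketch under that assumption: `retry` and `controller_ready` are names invented here, and the readiness check uses kubectl's jsonpath output rather than the Go template from the container above.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
    retry_max="$1"; retry_delay="$2"; shift 2
    retry_n=0
    while [ "$retry_n" -lt "$retry_max" ]; do
        if "$@"; then
            return 0
        fi
        retry_n=$((retry_n + 1))
        echo "[info] attempt ${retry_n}/${retry_max} failed" >&2
        # no point sleeping after the final attempt
        if [ "$retry_n" -lt "$retry_max" ]; then
            sleep "$retry_delay"
        fi
    done
    return 1
}

# Hypothetical check: succeeds once the controller deployment reports at least
# one available replica. jsonpath prints an empty string while the field is
# unset, hence the :-0 default.
controller_ready() {
    ready_count=$(kubectl get "deploy/${CONTROLLER_DEPLOYMENT_NAME}" \
        -n "${CONTROLLER_NAMESPACE}" \
        -o jsonpath='{.status.availableReplicas}')
    [ "${ready_count:-0}" -gt 0 ]
}
```

An init container would then run `retry 30 10 controller_ready || exit 1`, which fails the pod (and lets Kubernetes restart it) rather than letting the satellite start without a controller.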

So LINSTOR’s been a much bigger pain than I expected, but at least it’s now reporting the right status:

[Default|linstor] mrman 23:34:30 [linstor] $ k exec -it deploy/controller -- linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool             ┊ Node          ┊ Driver   ┊ PoolName                       ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool    ┊ all-in-one-01 ┊ DISKLESS ┊                                ┊              ┊               ┊ False        ┊ Ok    ┊
┊ linstor-all-in-one-01-0 ┊ all-in-one-01 ┊ LVM_THIN ┊ vg_linstor_0/lv_thin_linstor_0 ┊   395.74 GiB ┊    395.74 GiB ┊ True         ┊ Ok    ┊
┊ linstor-all-in-one-01-1 ┊ all-in-one-01 ┊ LVM_THIN ┊ vg_linstor_1/lv_thin_linstor_1 ┊   476.70 GiB ┊    476.70 GiB ┊ True         ┊ Ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

After this I ran into tons of issues (I even filed an issue on piraeusdatastore/piraeus-operator since evidently that’s where all the linstor server issues should go) and solved them one by one (see the GitLab repo; unfortunately a bunch of this work happened on the add-tests branch). Ultimately the last issue came down to a misconfigured port on the csi-controller:

modified   kubernetes/linstor/csi-controller.deployment.yaml
@@ -137,7 +137,7 @@ spec:
             port: 9808
         env:
         - name: LS_CONTROLLERS
-          value: "http://controller:3366"
+          value: "http://controller:3370"
         volumeMounts:
         - name: socket-dir
           mountPath: /var/lib/csi/sockets/pluginproxy/

The controller listens for REST API access on port 3370, not 3366.
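
A quick way to catch this kind of mistake earlier is to probe the URL before handing it to the CSI pods. Here is a small sketch, assuming curl is available in whatever pod you run it from; `linstor_api_reachable` is a hypothetical name, and it simply reuses the /v1/nodes path that the storage pool init container already talks to.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if a LINSTOR controller answers REST calls at
# the given base URL (the same value you would put in LS_CONTROLLERS).
linstor_api_reachable() {
    base_url="${1%/}"   # drop a trailing slash, if any
    # -f makes curl return non-zero on HTTP errors; --max-time bounds the wait
    curl -s -f -o /dev/null --max-time 5 "${base_url}/v1/nodes"
}
```

Running `linstor_api_reachable "http://controller:3370" || echo "REST API not reachable"` from a pod in the linstor namespace should fail fast when the port is wrong.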

Along with that I also ran into some placement issues (LINSTOR expects more than one node, which is reasonable), and once those were sorted out the PVCs were working again. It turns out that to force LINSTOR to put replicas on the same node you need to use replicasOnSame and set some Aux properties:

$ k exec -it deploy/controller -- linstor node set-property all-in-one-01 --aux node all-in-one-01
SUCCESS:
    Successfully set property key(s): Aux/node
WARNING:
    The property 'Aux/node' has no effect since the node 'all-in-one-01' does not support DRBD 9
SUCCESS:
Description:
    Node 'all-in-one-01' modified.
Details:
    Node 'all-in-one-01' UUID is: 44ff6a28-5499-4457-a5b0-87e6b8e55899
SUCCESS:
    (all-in-one-01) Node changes applied.

Once you’ve done that, you have to restrict the StorageClass to target the node. Oh, one more thing… I did have to install DRBD9 manually (huge thanks to the guide @ nethence.com).
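
For the StorageClass restriction, something along these lines is roughly what is needed. Treat it as a sketch rather than the manifest from the repo: the helper name is invented, the parameter names (placementCount, storagePool, replicasOnSame) are the linstor-csi ones as I understand them, and depending on the linstor-csi version the replicasOnSame key may need to be written as Aux/node instead of node.

```shell
#!/bin/sh
# Hypothetical helper: prints a StorageClass pinned to one node and one pool.
# Assumption: parameter names and value formats match the linstor-csi version
# you deploy; check its documentation before applying this.
linstor_storage_class() {
    sc_name="$1"; sc_node="$2"; sc_pool="$3"
    cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${sc_name}
provisioner: linstor.csi.linbit.com
parameters:
  placementCount: "1"
  storagePool: "${sc_pool}"
  replicasOnSame: "node=${sc_node}"
EOF
}
```

Something like `linstor_storage_class linstor-single all-in-one-01 linstor-all-in-one-01-0 | kubectl apply -f -` would then create it (the StorageClass name here is just an example).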

Even more painful, painful experimentation later, turns out you need to make Resource Groups for this stuff to work:

$ k exec -it deploy/controller -- linstor resource-group create --storage-pool=linstor-all-in-one-01-0,linstor-all-in-one-01-1 --place-count=1 jbod
SUCCESS:
Description:
    New resource group 'jbod' created.
Details:
    Resource group 'jbod' UUID is: c56afd58-479d-4e26-945d-fdf5b1f27f2a

This led to success though, finally. It’s taken ~1-2 solid days to get to this point:

$ k exec -it deploy/controller -- linstor resource-group list
╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceGroup ┊ SelectFilter                                                    ┊ VlmNrs ┊ Description ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltRscGrp    ┊ PlaceCount: 2                                                   ┊        ┊             ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ jbod          ┊ PlaceCount: 1                                                   ┊        ┊             ┊
┊               ┊ StoragePool(s): linstor-all-in-one-01-1                         ┊        ┊             ┊
┊               ┊ LayerStack: ['DRBD', 'STORAGE']                                 ┊        ┊             ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ raid1         ┊ PlaceCount: 2                                                   ┊        ┊             ┊
┊               ┊ StoragePool(s): linstor-all-in-one-01-0,linstor-all-in-one-01-1 ┊        ┊             ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Unfortunately, it looks like raid1 just doesn’t work across disks on the same node, despite how good this looks – I spent hours (honestly the better part of a day) trying to get LINSTOR to believe that replicas could be placed across disks and it just didn’t work. I tried just about every permutation of placementPolicy, placementCount, nodeList, clientList that I could think of, to no avail. So at least for the single node I’m just not going to attempt to test RAID1 with LINSTOR. Since single-replica (single placeCount) was working just fine, in production it’s probably a better idea to just use LVM to RAID1 the disks underneath and let LINSTOR provision from that.

This is definitely more work than I expected to do for LINSTOR, and I’m not comfortable with it because Ceph is the far more widely chosen and trusted solution in the industry – at this point I think I’ve spent more time (with less intermediate progress) figuring out how LINSTOR is supposed to work on k8s than Rook/Ceph. Anyway, hopefully this update helps anyone who was wondering why the LINSTOR stuff didn’t quite work.