Kicking the Tires on OpenEBS for Cluster Storage


tl;dr - outline of some approaches I’ve taken to storage on my small k8s cluster, why I can’t just use Rook (which is/was primarily Ceph underneath), and setup & evaluation of OpenEBS. OpenEBS is working great and is wonderfully simple – I’m probably going to be using it for everything from now on.

Discovering Rook (and by extension Ceph, which was Rook’s first underlying system) was a huge moment for me in figuring out how to do interesting things with Kubernetes. My “cluster” is super small (only the one node!), but I always wanted to get away from the hackiness of hostPath volumes and use something a little more dynamic.

Using Rook with Ceph underneath meant that you needed to hand over an entire disk for Ceph to manage. I re-read both the Ceph and Rook documentation countless times because both seem to suggest you can just run “in a folder”, but I’m convinced that what they mean is a folder on a dedicated disk. There’s also the fundamental problem of constraining folder sizes on Linux to consider. Either way, giving an entire disk to Ceph to manage is doable – you can just let Ansible do the heavy lifting of wiping the second drive and reformatting it. In my case things are a little more difficult because Hetzner’s dedicated servers come with software-RAID1’d disks (RAID1 = multiple copies of the same data). This meant I had to spend some time learning about software RAID on Linux in general and learning to disassemble it.
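
For the curious, the disassembly dance went roughly like this – a sketch, assuming the array is /dev/md0 and the member to free up is /dev/sdb1 (device names on your box will differ):

# inspect the current arrays
cat /proc/mdstat
mdadm --detail /dev/md0

# mark the second disk's member as failed, then pull it from the array
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# wipe the RAID superblock so the disk looks clean to its next owner (e.g. Ceph)
mdadm --zero-superblock /dev/sdb1

# optionally shrink the array down to a single active device
mdadm --grow /dev/md0 --raid-devices=1 --force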

Relatively recently this setup bit me (I’ll get into the how/why later) and I reverted to the indignity that is managing hostPath volumes again (it’s not that bad, I just make sure to keep all data in a central location like /var/data, with a folder per project). Recovering the data is also really easy if you follow this path, because you can just SSH onto the machine (and if it doesn’t boot, go to recovery mode) and rsync the whole folder out (don’t forget rsync’s compression options!).
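
Something like the following is all the recovery takes – a sketch, with a placeholder host and paths:

# -a preserves permissions/ownership/timestamps, -z compresses over the wire
rsync -az --progress root@my-server:/var/data/ ./data-backup/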

You might be wondering why I’d go for a solution like Rook/Ceph when I could just use any of the other awesome volume types that Kubernetes supports. Well, there are a few reasons:

  • I don’t use a cloud provider (“baremetal” cluster)
  • I prefer (basically only use) F/OSS solutions
  • local volumes are awesome but don’t support dynamic provisioning as of now (see the sketch below for what static provisioning looks like)
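
For context on that last point, here’s roughly what statically provisioning a local volume entails – a hedged example with made-up names and paths. You get to write one of these per volume, by hand, pinned to a specific node:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/data/example
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - my-node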

And of course, tools like Rook/Ceph are what I want to get used to going forward because they do things like handle replication of data under the covers for you (so essentially RAIDx), and I’m preparing for the day I join the normal case of the 99% of k8s operators who run more than one node and data starts flying everywhere. I have a weakness for good yet general solutions, so I’d rather run Rook/Ceph on one node and figure out how to run it on more later, than commit to a one-node solution that isn’t really manageable in a multi-node environment (given that I’ll be multi-node sooner rather than later).

I only recently heard of OpenEBS through a random comment by u/dirtypete1981 on reddit – I had no idea it existed. The idea of Container Attached Storage is interesting, although “CAS” is a terribly overloaded term in computer science already. After reading up on the concepts (and watching the FOSDEM 2018 talk by Jeffrey Molanus), I was interested in trying it out – while I don’t know that a case can be made for the CAS approach being faster than traditional approaches, the flexibility is self-apparent.

The CAS approach is kind of like Ceph turned inside out – the OSD/Mons and other internal stuff are exposed as part of your infrastructure instead of behind the Ceph curtain. Could OpenEBS be my solution to small-scale but general storage-for-my-workloads problems? (Spoilers: the answer is yes, which is why this blog post exists).

How I got here: borking my Rook setup

tl;dr - While prepping for a Kubernetes 1.12 upgrade, I carelessly updated the OS (and grub2) as well.

I don’t know why I keep doing this to myself (it’s not the first time), but in the middle of getting ready to migrate to Kubernetes 1.12, I did an apt-get update && apt-get upgrade. Nothing better than adding one big upgrade while you perform another. While apt was doing its thing I noticed that grub2-install was trying to configure itself and asking me for input. I picked some settings that I thought were correct (and that the config tool LGTM’d), but it turns out that grub/grub2 basically doesn’t support proper installation on LVM/software RAID. That, or I’m just not smart enough to get it to work and need more gray hairs in my beard; either way my setup was borked – say goodbye to that sweet Kubernetes 1.12 upgrade.

Cue hours of downtime: I spent lots of time running around the internet frantically searching terms like “grub”, “raid1”, “mdadm”, and “grub-install”, trying to figure out how I could get grub to realize where it should be booting from.

I will spoil it for you now though: in the end I had to get my data off and rebuild the server completely. The silver lining is that my Ansible infrastructure code (you can find an unmaintained snapshot of it on GitLab) was able to get from a fresh Ubuntu 18.04 install to k8s 1.12 (might as well do the upgrade if I’m remaking the cluster) very quickly with no manual intervention. I did choose to remove the code that disassembled software RAID (you can approximate it by scaling your RAID array down to 1 disk, then doing stuff with the second one, without going into Hetzner rescue mode) – I’m going to leave the drives RAID1’d on Hetzner boxes from now on.

Well, let’s pretend it was going to work – here are some helpful resources I found along the way:

The first link helped me mount disks properly in Hetzner’s rescue mode after observing that the machine wouldn’t boot. After requesting a live connection to my machine I saw that it was stuck at the Hetzner PXE boot screen (I really wish Hetzner let you bring your own PXE boot setup), trying the local disk but never succeeding. My first instinct, of course, was to try and get my data off this possibly borked server.
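
From rescue mode, getting at the data looked roughly like this – a sketch, with array/partition names that almost certainly differ per machine:

# re-assemble the software RAID arrays so the filesystems are reachable
mdadm --assemble --scan
cat /proc/mdstat

# mount the root array and poke around
mount /dev/md2 /mnt
ls /mnt/var/data

# rsync the important bits somewhere safe
rsync -az --progress /mnt/var/data/ root@some-other-box:/backups/data/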

After searching and trying many things – checking the drives for errors, wiping and rebuilding the RAID partitions, and messing with grub configuration – I gave up. As I said before, in the end I didn’t win this particular battle, but the silver lining is that I got to test my infrastructure code, and it hadn’t rotted much at all.

Now that I’m not trying to undo the RAID to give a full disk to Ceph anymore, I started to wonder what my other options were. While I was knee-deep in incomplete help threads and beginner-level instructions, I realized that another way I could have solved this problem was to create a loop-based virtual disk and give that to Ceph. So here are the options I know of:

  • hostPath/local volumes
  • loop-based virtual disk
  • OpenEBS(?)

By the title/flow of this article you probably know which one I’m going to investigate, but I do want to note that the virtual disk solution actually seems really promising for dynamic provisioning at a per-node level, because it seems nestable. Instead of trying to deal with size-constraining folders on disk, why not make one “big” virtual disk (let’s say 500GB), mount that, then partition it into smaller virtual disks? I get the feeling I could hack together an operator to do this and provide the ever-elusive “dynamic hostPath/local volumes” very, very easily. Eventually I’ll find time to explore that idea, but that time isn’t this time.
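
To make the idea concrete, here’s roughly what one of those size-constrained virtual disks looks like to create – a sketch with made-up paths and sizes:

# create a 500GB sparse file to act as the "big" disk
truncate -s 500G /var/disks/big.img

# attach it to a loop device, format, and mount it
losetup --find --show /var/disks/big.img    # prints the device, e.g. /dev/loop0
mkfs.ext4 /dev/loop0
mount /dev/loop0 /mnt/virtual-disks

# smaller per-volume disks could then be nested inside the same way
truncate -s 5G /mnt/virtual-disks/pv-0001.img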

Before choosing to go with OpenEBS I took a step back to evaluate why I want to solve this problem at all. At the end of the day I want:

  • Dynamic PV provisioning (a way to use the second disk on my dedicated server without sitting there and slicing up partitions/etc)
  • Consistency/Ease of application deployment (I can just use PersistentVolumes and PersistentVolumeClaims, no managing hostPaths)
  • Replication & Durability (less useful in my current case of one drive on one machine)

Evaluating OpenEBS

Step 0: RTFM

One of the first things I looked at was OpenEBS’s list of features, and they’re pretty great. Some highlights:

  • Synchronous replication - Similar to how Ceph does it to ensure durability of writes
  • Snapshots - one of the biggest questions I rarely see answered. Getting a PVC up and running is fine, but what happens when the application goes down or I need to migrate to another node?

Snapshots are a huge differentiator if true. Rook is still working on them according to their roadmap (scheduled for v0.9). I was also really impressed by the architecture docs – they’re pretty concise and informative. Reading through the docs, it’s looking like OpenEBS is going to offer me a way to have dynamically allocated drives & PV/PVCs without the static provisioning you’d need for local volumes.

NOTE I just realized that Rook 0.9 is out now, so they should have snapshots.
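
I haven’t exercised snapshots as part of this post, but based on the external-storage snapshot CRDs the operator installs (you’ll see volumesnapshot.external-storage.k8s.io in the RBAC rules below), requesting one presumably looks something like this, for some existing PVC (names here are made up):

---
apiVersion: volumesnapshot.external-storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-snapshot
  namespace: default
spec:
  persistentVolumeClaimName: my-pvc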

Obviously, there’s a lot of tech to read up on here if you’re new to the space. At this point I basically know enough about Kubernetes + Rook + Ceph to be dangerous, after reading documentation and setting up/fiddling. Here’s a loose list of things you may want to read up on/know about:

Skimming these resources is obviously enough – it would take months/years to become an actual expert, never mind the actual in-the-trenches experience. Importantly, we need to keep the user-level goal in mind, which I can try to encapsulate with this statement:

When I start a pod, if there is space either on a local disk or some network-attached storage I’ve purchased, I want a PVC to be automatically created for it, and I want data to be automatically replicated as the pod makes use of the filesystem

The idea is simple of course, but the devil is in the details, and there are tradeoffs to be made all over. Storage systems can be good for some use cases but bad for others – one system might be great for storing & replicating pictures, but bad for the kind of writes to a Write-Ahead Log that Postgres (or your favorite database) might perform. I didn’t choose GlusterFS when I was first looking into distributed storage mostly because of reports (that never seemed to get rebutted) that it was less than ideal for running databases on. What I’m looking for is a solution with decent general-case performance. I’m not Google or a tech giant; I don’t run applications that write thousands of times a second, but I do want to enable easy operations.

OK, enough exposition, let’s get to installing OpenEBS.

Step 1: Installing OpenEBS

The OpenEBS documentation has a section on installing OpenEBS, as you’d expect, which we’re going to follow. We’ll use the default Jiva store, which seems to work with a local folder on the machine by default. I’m basing this understanding on the following quote:

OpenEBS can be used to create Storage Pool on a host disk or an externally mounted disk. This Storage Pool can be used to create Jiva volume which can be utilized to run applications. By default, Jiva volume will be deployed on host path. If you are using an external disk, see storage pool for more details about creating a storage pool with an external disk.

Hopefully they don’t mean this in the same way Rook/Ceph did, and I can just give OpenEBS a folder on-disk (which, again, is actually 2 disks software-RAIDed together) and OpenEBS will manage sizing and dynamic provisioning of data and expose it via iSCSI. Which brings me to one of the hard requirements of OpenEBS – you need open-iscsi installed, as noted in the prerequisites:

root@Ubuntu-1810-cosmic-64-minimal ~ # sudo apt-get install open-iscsi
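
It’s worth making sure the initiator daemon is actually up before going further – something like this, assuming systemd:

systemctl enable --now iscsid
systemctl status iscsid
iscsiadm --version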

As for the Kubernetes parts, they recommend that you install with Helm or by running kubectl on a monolithic YAML file like this:

kubectl apply -f https://openebs.github.io/charts/openebs-operator-0.8.0.yaml

As usual, I don’t ever do that, but instead pull down and split up the monolithic YAML file to get an idea of what’s running.
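
The splitting can be as unsophisticated as this (a sketch using csplit – the resulting xx00/xx01/… chunks then get renamed by hand):

curl -LO https://openebs.github.io/charts/openebs-operator-0.8.0.yaml
# split on the '---' YAML document separators
csplit -z openebs-operator-0.8.0.yaml '/^---$/' '{*}'

Here’s what it looks like for me (I use the makeinfra pattern):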

infra/kubernetes/cluster/storage/openebs/openebs.ns.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: openebs

infra/kubernetes/cluster/storage/openebs/openebs.serviceaccount.yaml:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openebs-maya-operator
  namespace: openebs

infra/kubernetes/cluster/storage/openebs/openebs.configmap.yaml:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: openebs-ndm-config
  namespace: openebs
data:
  # udev-probe is the default or primary probe which should be enabled to run ndm
  # filterconfigs contains configs of filters - in the form of include
  # and exclude comma separated strings
  node-disk-manager.config: |
    probeconfigs:
      - key: udev-probe
        name: udev probe
        state: true
      - key: smart-probe
        name: smart probe
        state: true
    filterconfigs:
      - key: os-disk-exclude-filter
        name: os disk exclude filter
        state: true
        exclude: "/,/etc/hosts,/boot"
      - key: vendor-filter
        name: vendor filter
        state: true
        include: ""
        exclude: "CLOUDBYT,OpenEBS"
      - key: path-filter
        name: path filter
        state: true
        include: ""
        exclude: "loop,/dev/fd0,/dev/sr0,/dev/ram,/dev/dm-,/dev/md"    

infra/kubernetes/cluster/storage/openebs/openebs.rbac.yaml:

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: openebs-maya-operator
rules:
- apiGroups: ["*"]
  resources: ["nodes", "nodes/proxy"]
  verbs: ["*"]
- apiGroups: ["*"]
  resources: ["namespaces", "services", "pods", "deployments", "events", "endpoints", "configmaps", "jobs"]
  verbs: ["*"]
- apiGroups: ["*"]
  resources: ["storageclasses", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["*"]
- apiGroups: ["volumesnapshot.external-storage.k8s.io"]
  resources: ["volumesnapshots", "volumesnapshotdatas"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: [ "get", "list", "create", "update", "delete"]
- apiGroups: ["*"]
  resources: [ "disks"]
  verbs: ["*" ]
- apiGroups: ["*"]
  resources: [ "storagepoolclaims", "storagepools"]
  verbs: ["*" ]
- apiGroups: ["*"]
  resources: [ "castemplates", "runtasks"]
  verbs: ["*" ]
- apiGroups: ["*"]
  resources: [ "cstorpools", "cstorvolumereplicas", "cstorvolumes"]
  verbs: ["*" ]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: openebs-maya-operator
  namespace: openebs
subjects:
- kind: ServiceAccount
  name: openebs-maya-operator
  namespace: openebs
- kind: User
  name: system:serviceaccount:default:default
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: openebs-maya-operator
  apiGroup: rbac.authorization.k8s.io

infra/kubernetes/cluster/storage/openebs/openebs-api-server.deployment.yaml:

---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: maya-apiserver
  namespace: openebs
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: maya-apiserver
    spec:
      serviceAccountName: openebs-maya-operator
      containers:
      - name: maya-apiserver
        imagePullPolicy: IfNotPresent
        image: quay.io/openebs/m-apiserver:0.8.0
        ports:
        - containerPort: 5656
        env:
        # OPENEBS_IO_KUBE_CONFIG enables maya api service to connect to K8s
        # based on this config. This is ignored if empty.
        # This is supported for maya api server version 0.5.2 onwards
        #- name: OPENEBS_IO_KUBE_CONFIG
        #  value: "/home/ubuntu/.kube/config"
        # OPENEBS_IO_K8S_MASTER enables maya api service to connect to K8s
        # based on this address. This is ignored if empty.
        # This is supported for maya api server version 0.5.2 onwards
        #- name: OPENEBS_IO_K8S_MASTER
        #  value: "http://172.28.128.3:8080"
        # OPENEBS_IO_INSTALL_DEFAULT_CSTOR_SPARSE_POOL decides whether default cstor sparse pool should be
        # configured as a part of openebs installation.
        # If "true" a default cstor sparse pool will be configured, if "false" it will not be configured.
        - name: OPENEBS_IO_INSTALL_DEFAULT_CSTOR_SPARSE_POOL
          value: "true"
        # OPENEBS_NAMESPACE provides the namespace of this deployment as an
        # environment variable
        - name: OPENEBS_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        # OPENEBS_SERVICE_ACCOUNT provides the service account of this pod as
        # environment variable
        - name: OPENEBS_SERVICE_ACCOUNT
          valueFrom:
            fieldRef:
              fieldPath: spec.serviceAccountName
        # OPENEBS_MAYA_POD_NAME provides the name of this pod as
        # environment variable
        - name: OPENEBS_MAYA_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: OPENEBS_IO_JIVA_CONTROLLER_IMAGE
          value: "quay.io/openebs/jiva:0.8.0"
        - name: OPENEBS_IO_JIVA_REPLICA_IMAGE
          value: "quay.io/openebs/jiva:0.8.0"
        - name: OPENEBS_IO_JIVA_REPLICA_COUNT
          value: "3"
        - name: OPENEBS_IO_CSTOR_TARGET_IMAGE
          value: "quay.io/openebs/cstor-istgt:0.8.0"
        - name: OPENEBS_IO_CSTOR_POOL_IMAGE
          value: "quay.io/openebs/cstor-pool:0.8.0"
        - name: OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE
          value: "quay.io/openebs/cstor-pool-mgmt:0.8.0"
        - name: OPENEBS_IO_CSTOR_VOLUME_MGMT_IMAGE
          value: "quay.io/openebs/cstor-volume-mgmt:0.8.0"
        - name: OPENEBS_IO_VOLUME_MONITOR_IMAGE
          value: "quay.io/openebs/m-exporter:0.8.0"
        # OPENEBS_IO_ENABLE_ANALYTICS if set to true sends anonymous usage
        # events to Google Analytics
        - name: OPENEBS_IO_ENABLE_ANALYTICS
          value: "false"
        # OPENEBS_IO_ANALYTICS_PING_INTERVAL can be used to specify the duration (in hours)
        # for periodic ping events sent to Google Analytics. Default is 24 hours.
        #- name: OPENEBS_IO_ANALYTICS_PING_INTERVAL
        #  value: "24h"
        livenessProbe:
          exec:
            command:
            - /usr/local/bin/mayactl
            - version
          initialDelaySeconds: 30
          periodSeconds: 60
        readinessProbe:
          exec:
            command:
            - /usr/local/bin/mayactl
            - version
          initialDelaySeconds: 30
          periodSeconds: 60

infra/kubernetes/cluster/storage/openebs/openebs-provisioner.deployment.yaml:

---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: openebs-provisioner
  namespace: openebs
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: openebs-provisioner
    spec:
      serviceAccountName: openebs-maya-operator
      containers:
      - name: openebs-provisioner
        imagePullPolicy: IfNotPresent
        image: quay.io/openebs/openebs-k8s-provisioner:0.8.0
        env:
        # OPENEBS_IO_K8S_MASTER enables openebs provisioner to connect to K8s
        # based on this address. This is ignored if empty.
        # This is supported for openebs provisioner version 0.5.2 onwards
        #- name: OPENEBS_IO_K8S_MASTER
        #  value: "http://10.128.0.12:8080"
        # OPENEBS_IO_KUBE_CONFIG enables openebs provisioner to connect to K8s
        # based on this config. This is ignored if empty.
        # This is supported for openebs provisioner version 0.5.2 onwards
        #- name: OPENEBS_IO_KUBE_CONFIG
        #  value: "/home/ubuntu/.kube/config"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: OPENEBS_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        # OPENEBS_MAYA_SERVICE_NAME provides the maya-apiserver K8s service name,
        # that provisioner should forward the volume create/delete requests.
        # If not present, "maya-apiserver-service" will be used for lookup.
        # This is supported for openebs provisioner version 0.5.3-RC1 onwards
        #- name: OPENEBS_MAYA_SERVICE_NAME
        #  value: "maya-apiserver-apiservice"
        livenessProbe:
          exec:
            command:
            - pgrep
            - ".*openebs"
          initialDelaySeconds: 30
          periodSeconds: 60

infra/kubernetes/cluster/storage/openebs/openebs-snapshot-operator.deployment.yaml:

---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: openebs-snapshot-operator
  namespace: openebs
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        name: openebs-snapshot-operator
    spec:
      serviceAccountName: openebs-maya-operator
      containers:
        - name: snapshot-controller
          image: quay.io/openebs/snapshot-controller:0.8.0
          imagePullPolicy: IfNotPresent
          env:
          - name: OPENEBS_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          livenessProbe:
            exec:
              command:
              - pgrep
              - ".*controller"
            initialDelaySeconds: 30
            periodSeconds: 60
        # OPENEBS_MAYA_SERVICE_NAME provides the maya-apiserver K8s service name,
        # that snapshot controller should forward the snapshot create/delete requests.
        # If not present, "maya-apiserver-service" will be used for lookup.
        # This is supported for openebs provisioner version 0.5.3-RC1 onwards
        #- name: OPENEBS_MAYA_SERVICE_NAME
        #  value: "maya-apiserver-apiservice"
        - name: snapshot-provisioner
          image: quay.io/openebs/snapshot-provisioner:0.8.0
          imagePullPolicy: IfNotPresent
          env:
          - name: OPENEBS_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        # OPENEBS_MAYA_SERVICE_NAME provides the maya-apiserver K8s service name,
        # that snapshot provisioner  should forward the clone create/delete requests.
        # If not present, "maya-apiserver-service" will be used for lookup.
        # This is supported for openebs provisioner version 0.5.3-RC1 onwards
        #- name: OPENEBS_MAYA_SERVICE_NAME
        #  value: "maya-apiserver-apiservice"
          livenessProbe:
            exec:
              command:
              - pgrep
              - ".*provisioner"
            initialDelaySeconds: 30
            periodSeconds: 60

infra/kubernetes/cluster/storage/openebs/openebs-disk-manager.ds.yaml:

---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: openebs-ndm
  namespace: openebs
spec:
  template:
    metadata:
      labels:
        name: openebs-ndm
    spec:
      # By default the node-disk-manager will be run on all kubernetes nodes
      # If you would like to limit this to only some nodes, say the nodes
      # that have storage attached, you could label those node and use
      # nodeSelector.
      #
      # e.g. label the storage nodes with - "openebs.io/nodegroup"="storage-node"
      # kubectl label node <node-name> "openebs.io/nodegroup"="storage-node"
      #nodeSelector:
      #  "openebs.io/nodegroup": "storage-node"
      serviceAccountName: openebs-maya-operator
      hostNetwork: true
      containers:
      - name: node-disk-manager
        command:
        - /usr/sbin/ndm
        - start
        image: quay.io/openebs/node-disk-manager-amd64:v0.2.0
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        volumeMounts:
        - name: config
          mountPath: /host/node-disk-manager.config
          subPath: node-disk-manager.config
          readOnly: true
        - name: udev
          mountPath: /run/udev
        - name: procmount
          mountPath: /host/mounts
        - name: sparsepath
          mountPath: /var/openebs/sparse
        env:
        # pass hostname as env variable using downward API to the NDM container
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # specify the directory where the sparse files need to be created.
        # if not specified, then sparse files will not be created.
        - name: SPARSE_FILE_DIR
          value: "/var/openebs/sparse"
        # Size(bytes) of the sparse file to be created.
        - name: SPARSE_FILE_SIZE
          value: "10737418240"
        # Specify the number of sparse files to be created
        - name: SPARSE_FILE_COUNT
          value: "1"
        livenessProbe:
          exec:
            command:
            - pgrep
            - ".*ndm"
          initialDelaySeconds: 30
          periodSeconds: 60
      volumes:
      - name: config
        configMap:
          name: openebs-ndm-config
      - name: udev
        hostPath:
          path: /run/udev
          type: Directory
      # mount /proc/1/mounts (mount file of process 1 of host) inside container
      # to read which partition is mounted on / path
      - name: procmount
        hostPath:
          path: /proc/1/mounts
      - name: sparsepath
        hostPath:
          path: /var/openebs/sparse

infra/kubernetes/cluster/storage/openebs/openebs.svc.yaml:

---
apiVersion: v1
kind: Service
metadata:
  name: maya-apiserver-service
  namespace: openebs
spec:
  ports:
  - name: api
    port: 5656
    protocol: TCP
    targetPort: 5656
  selector:
    name: maya-apiserver
  sessionAffinity: None

And a very basic Makefile to tie it all together:

infra/kubernetes/cluster/storage/openebs/Makefile:

.PHONY: install uninstall

KUBECTL := kubectl

install: namespace serviceaccount rbac configmap api-server provisioner snapshot-operator node-disk-manager svc

namespace:
    $(KUBECTL) apply -f openebs.ns.yaml

serviceaccount:
    $(KUBECTL) apply -f openebs.serviceaccount.yaml

configmap:
    $(KUBECTL) apply -f openebs.configmap.yaml

rbac:
    $(KUBECTL) apply -f openebs.rbac.yaml

svc:
    $(KUBECTL) apply -f openebs.svc.yaml

api-server:
    $(KUBECTL) apply -f openebs-api-server.deployment.yaml

provisioner:
    $(KUBECTL) apply -f openebs-provisioner.deployment.yaml

snapshot-operator:
    $(KUBECTL) apply -f openebs-snapshot-operator.deployment.yaml

node-disk-manager:
    $(KUBECTL) apply -f openebs-disk-manager.ds.yaml

uninstall:
    $(KUBECTL) delete -f openebs.svc.yaml
    $(KUBECTL) delete -f openebs-disk-manager.ds.yaml
    $(KUBECTL) delete -f openebs-snapshot-operator.deployment.yaml
    $(KUBECTL) delete -f openebs-provisioner.deployment.yaml
    $(KUBECTL) delete -f openebs-api-server.deployment.yaml
    $(KUBECTL) delete -f openebs.configmap.yaml
    $(KUBECTL) delete -f openebs.rbac.yaml
    $(KUBECTL) delete -f openebs.serviceaccount.yaml
    $(KUBECTL) delete -f openebs.ns.yaml

OK, now that it’s installed let’s check if everything looks good:

$ make
kubectl apply -f openebs.ns.yaml
namespace/openebs created
kubectl apply -f openebs.serviceaccount.yaml
serviceaccount/openebs-maya-operator created
kubectl apply -f openebs.rbac.yaml
clusterrole.rbac.authorization.k8s.io/openebs-maya-operator created
clusterrolebinding.rbac.authorization.k8s.io/openebs-maya-operator created
kubectl apply -f openebs.configmap.yaml
configmap/openebs-ndm-config created
kubectl apply -f openebs-api-server.deployment.yaml
deployment.apps/maya-apiserver created
kubectl apply -f openebs-provisioner.deployment.yaml
deployment.apps/openebs-provisioner created
kubectl apply -f openebs-snapshot-operator.deployment.yaml
deployment.apps/openebs-snapshot-operator created
kubectl apply -f openebs-disk-manager.ds.yaml
daemonset.extensions/openebs-ndm created
kubectl apply -f openebs.svc.yaml
service/maya-apiserver-service created
$ # ... wait some time ...
$ kubectl get all -n openebs
NAME                                             READY   STATUS    RESTARTS   AGE
pod/cstor-sparse-pool-o9mk-7b585d7b8d-bgc4q      2/2     Running   0          2m33s
pod/maya-apiserver-78c59c89c-5h674               1/1     Running   0          3m16s
pod/openebs-ndm-29d9g                            1/1     Running   0          2m50s
pod/openebs-provisioner-77dd68645b-tv98t         1/1     Running   5          3m14s
pod/openebs-snapshot-operator-85dd4d7c94-hbbd8   2/2     Running   0          3m12s

NAME                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/maya-apiserver-service   ClusterIP   10.110.168.61   <none>        5656/TCP   34s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/openebs-ndm   1         1         1       1            1           <none>          2m51s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cstor-sparse-pool-o9mk      1/1     1            1           2m34s
deployment.apps/maya-apiserver              1/1     1            1           3m18s
deployment.apps/openebs-provisioner         1/1     1            1           3m15s
deployment.apps/openebs-snapshot-operator   1/1     1            1           3m13s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/cstor-sparse-pool-o9mk-7b585d7b8d      1         1         1       2m34s
replicaset.apps/maya-apiserver-78c59c89c               1         1         1       3m17s
replicaset.apps/openebs-provisioner-77dd68645b         1         1         1       3m15s
replicaset.apps/openebs-snapshot-operator-85dd4d7c94   1         1         1       3m13s

Well, that certainly looks good to me – no errors, and the node management DaemonSet is running without issue. Let’s test it out.

Step 2: Testing it out with a simple Pod + PVC

Now that we theoretically have the system in a working state, let’s make a Pod with a PersistentVolumeClaim to validate. I want to note here that StatefulSets and PersistentVolumeClaims are separate concepts. I often see people mention them as if the only way to use a PersistentVolumeClaim is from a StatefulSet – but this has more to do with how the other options (i.e. a Deployment) work. It’s perfectly possible to have a Deployment use a PVC, but you can’t have more than one replica, because the second instance would try to mount the same PV. StatefulSets offer more, like consistent/distinct startup semantics and naming, and that’s what makes them well suited for less flexible stateful workloads.

The Makefile is a little disingenuous because of how the operator works: a bunch of Custom Resource Definitions (CRDs) also got installed, as well as things like StorageClasses. Since we’ll need to know the StorageClass to be able to make our PersistentVolumeClaim, let’s list them:

$ kubectl get sc
NAME                        PROVISIONER                                                AGE
openebs-cstor-sparse        openebs.io/provisioner-iscsi                               124m
openebs-jiva-default        openebs.io/provisioner-iscsi                               125m
openebs-snapshot-promoter   volumesnapshot.external-storage.k8s.io/snapshot-promoter   124m
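
If you’re curious what else came along for the ride, the CRDs are easy to surface too (output elided):

kubectl get crd | grep -i openebs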

Let’s use the openebs-jiva-default – now we can write our resource definitions for our PersistentVolumeClaim and Pod:

openebs-test.allinone.yaml:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-data
  namespace: default
  labels:
    app: pvc-test
spec:
  storageClassName: openebs-jiva-default
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test
  namespace: default
  labels:
    app: pvc-test
spec:
  containers:
    - name: pvc-test
      image: alpine
      command: ["ash", "-c", "while true; do sleep 60s; done"]
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 0.25
          memory: "256Mi"
        limits:
          cpu: 0.50
          memory: "512Mi"
      volumeMounts:
        - mountPath: /var/data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-test-data

Shortly after kubectl apply -fing that file:

$ kubectl get pods
NAME                                                             READY   STATUS              RESTARTS   AGE
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84   2/2     Running             0          25s
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-2fr6d    0/1     Pending             0          25s
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-j7swl    1/1     Running             0          25s
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-sndh2    0/1     Pending             0          25s
pvc-test                                                         0/1     ContainerCreating   0          12s

OK, so here we see the CAS concept taking off – there are a bunch of Pods being started that manage the data being shuffled around – if you look closely you can see the -ctrl- and -rep- in the pod names. I assume that’s 3 data-holding replicas + 1 controller for the one PVC. I did nothing to tell OpenEBS I only have one node, so it’s running in the usual HA pattern.

After waiting a bit for some of the Pending containers to come out of pending and for the pvc-test pod to get created, I realized there was something wrong. A quick kubectl describe pod pvc-test reveals the problem:

Events:
Type     Reason                  Age                  From                                    Message
----     ------                  ----                 ----                                    -------
Normal   Scheduled               3m3s                 default-scheduler                       Successfully assigned default/pvc-test to ubuntu-1810-cosmic-64-minimal
Normal   SuccessfulAttachVolume  3m3s                 attachdetach-controller                 AttachVolume.Attach succeeded for volume "pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e"
Warning  FailedMount             63s (x8 over 2m42s)  kubelet, ubuntu-1810-cosmic-64-minimal  MountVolume.WaitForAttach failed for volume "pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e" : failed to get any path for iscsi disk, last err seen:
iscsi: failed to sendtargets to portal 10.108.192.121:3260 output: iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: No portals found
, err exit status 21
Warning  FailedMount  60s  kubelet, ubuntu-1810-cosmic-64-minimal  Unable to mount volumes for pod "pvc-test_default(6ead8128-10b4-11e9-9cf0-8c89a517d15e)": timeout expired waiting for volumes to attach or mount for pod "default"/"pvc-test". list of unmounted volumes=[data]. list of unattached volumes=[data default-token-lsfvf]

Well, this is par for the course – things very rarely work the first time – so let’s get in and solve the issues. Before we go on though, let’s check what those other pods are doing:

Events:
Type     Reason            Age                     From               Message
----     ------            ----                    ----               -------
Warning  FailedScheduling  3m16s (x37 over 8m19s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.

OK, so the pod couldn’t be scheduled because its anti-affinity requirements couldn’t be met. This actually isn’t a problem, but expected behavior – OpenEBS runs 3 replicas for fault tolerance, and I’m flying in the face of that by only using one node. I love tools that I can predict/reason/guess about armed with only documentation knowledge – in this case it was just a guess, but this is a great sign. Rather than re-configure OpenEBS to make fewer replicas right now, I’m going to just ignore the 2 pending containers and focus on the connection issues.

DEBUG: Connection refused to 10.108.192.121

Since we’re having connectivity issues, let’s make sure I don’t have any NetworkPolicy set that’s preventing the communication (I use and love kube-router in my cluster):

$ kubectl get networkpolicy
No resources found.

OK, all’s clear on that front, let’s figure out what is behind 10.108.192.121 that my pod is trying to talk to:

$ kubectl get pods -o=wide
NAME                                                             READY   STATUS              RESTARTS   AGE   IP             NODE                            NOMINATED NODE   READINESS GATES
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84   2/2     Running             0          13m   10.244.0.137   ubuntu-1810-cosmic-64-minimal   <none>           <none>
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-2fr6d    0/1     Pending             0          13m   <none>         <none>                          <none>           <none>
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-j7swl    1/1     Running             0          13m   10.244.0.136   ubuntu-1810-cosmic-64-minimal   <none>           <none>
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-rep-6ff7fd654d-sndh2    0/1     Pending             0          13m   <none>         <none>                          <none>           <none>
pvc-test                                                         0/1     ContainerCreating   0          12m   <none>         ubuntu-1810-cosmic-64-minimal   <none>           <none>

The wider output (-o=wide) shows us the IPs of the running pods (again, the Pending pods are OK, since we’re in a very not-HA situation) – and that the IP we’re trying to connect to isn’t any one of these pods. But if you stop and think about it, of course it isn’t one of these pods – Pod IPs can shift, and if you want a reliable pointer to another pod what you need is a Service. Let’s check the IPs of our services:

$ kubectl get svc -o=wide
NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
kubernetes                                          ClusterIP   10.96.0.1        <none>        443/TCP                      26d   <none>
pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-svc   ClusterIP   10.108.192.121   <none>        3260/TCP,9501/TCP,9500/TCP   15m   openebs.io/controller=jiva-controller,openebs.io/persistent-volume=pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e

BINGO! As you might expect with the CAS model, we have a Service exposing the hard-drive interface that our Pod will use, to make it accessible. Now we need to figure out why our Pod can’t seem to talk to this service. Let’s dig deeper into the service and make sure it has Endpoints attached:

$ kubectl describe svc pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-svc
Name:              pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-svc
Namespace:         default
Labels:            openebs.io/cas-template-name=jiva-volume-create-default-0.8.0
                   openebs.io/cas-type=jiva
                   openebs.io/controller-service=jiva-controller-svc
                   openebs.io/persistent-volume=pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e
                   openebs.io/persistent-volume-claim=pvc-test-data
                   openebs.io/storage-engine-type=jiva
                   openebs.io/version=0.8.0
                   pvc=pvc-test-data
Annotations:       <none>
Selector:          openebs.io/controller=jiva-controller,openebs.io/persistent-volume=pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e
Type:              ClusterIP
IP:                10.108.192.121
Port:              iscsi  3260/TCP
TargetPort:        3260/TCP
Endpoints:         10.244.0.137:3260
Port:              api  9501/TCP
TargetPort:        9501/TCP
Endpoints:         10.244.0.137:9501
Port:              exporter  9500/TCP
TargetPort:        9500/TCP
Endpoints:         10.244.0.137:9500
Session Affinity:  None
Events:            <none>

All of this looks fine and dandy to me – in particular, there are endpoints for the pods that did start up. Everything looks fine as far as Kubernetes concepts go, so let’s look back at the error message for some hints:

Warning  FailedMount             63s (x8 over 2m42s)  kubelet, ubuntu-1810-cosmic-64-minimal  MountVolume.WaitForAttach failed for volume "pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e" : failed to get any path for iscsi disk, last err seen:
iscsi: failed to sendtargets to portal 10.108.192.121:3260 output: iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: cannot make connection to 10.108.192.121: Connection refused
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: No portals found
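
Before blaming OpenEBS itself, the failing call can be reproduced by hand from the node – this is essentially the discovery request the kubelet is making (same portal as in the error):

iscsiadm -m discovery -t sendtargets -p 10.108.192.121:3260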

So it looks like the iSCSI subsystem tried to connect to the Kubernetes Service @ 10.108.192.121:3260 (which routes to the Endpoint for the -ctrl- pod, 10.244.0.137). Let’s see what’s happening in the Pod with that IP address – we see it’s Running, but how are things going?

$ kubectl logs pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84
Error from server (BadRequest): a container name must be specified for pod pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84, choose one of: [pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-con maya-volume-exporter]

OK, so I need to pick one of the internal containers, how about pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-con:

$ kubectl logs pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84 -c pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-con
time="2019-01-05T06:37:57Z" level=info msg="REPLICATION_FACTOR: 3"
time="2019-01-05T06:37:57Z" level=info msg="Starting controller with frontendIP: , and clusterIP: 10.108.192.121"
time="2019-01-05T06:37:57Z" level=info msg="resetting controller"
time="2019-01-05T06:37:57Z" level=info msg="Listening on :9501"
time="2019-01-05T06:38:11Z" level=info msg="List Replicas"
time="2019-01-05T06:38:11Z" level=info msg="List Replicas"
time="2019-01-05T06:38:11Z" level=info msg="Register Replica for address 10.244.0.136"
time="2019-01-05T06:38:11Z" level=info msg="Register Replica, Address: 10.244.0.136 Uptime: 15.399307176s State: closed Type: Backend RevisionCount: 0"
time="2019-01-05T06:38:11Z" level=warning msg="No of yet to be registered replicas are less than 3 , No of registered replicas: 1"
10.244.0.136 - - [05/Jan/2019:06:38:11 +0000] "POST /v1/register HTTP/1.1" 200 0
time="2019-01-05T06:38:16Z" level=info msg="Register Replica for address 10.244.0.136"
time="2019-01-05T06:38:16Z" level=info msg="Register Replica, Address: 10.244.0.136 Uptime: 20.396328606s State: closed Type: Backend RevisionCount: 0"
10.244.0.136 - - [05/Jan/2019:06:38:16 +0000] "POST /v1/register HTTP/1.1" 200 0
time="2019-01-05T06:38:16Z" level=warning msg="No of yet to be registered replicas are less than 3 , No of registered replicas: 1"
time="2019-01-05T06:38:21Z" level=info msg="Register Replica for address 10.244.0.136"
<the last ~3 lines loop forever>

OK, this was actually the hypothesis I was starting to form in my head – in particular, the fact that I haven’t told OpenEBS how many replicas it’d actually be able to create (my fault, since I only have one node, not 3) might be causing some issues. It’s only a warning, but the repeating nature suggests that registration never completes because of this mismatch. Since this isn’t quite a smoking gun, let’s check the other container’s logs:

$ kubectl logs pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-ctrl-76548fb456-dnw84 -c maya-volume-exporter
I0105 06:38:01.357964       1 command.go:97] Starting maya-exporter ...
I0105 06:38:01.358045       1 logs.go:43] Initialising maya-exporter for the jiva
I0105 06:38:01.358175       1 exporter.go:39] Starting http server....

Well, absolutely no visible problems there… so let’s go ahead and reduce the replication factor that OpenEBS is using and see if that fixes things. It took a little digging after re-reading the docs on deploying Jiva, but the StorageClass we’re using for the PersistentVolumeClaim is where we can make this change. Let’s make a new one based on the existing default:

$ kubectl get sc openebs-jiva-default -o=yaml > openebs-jiva-non-ha.storageclass.yaml
$ emacs -nw openebs-jiva-non-ha.storageclass.yaml
.... make edits ...

And here’s what I ended up with:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-jiva-non-ha
  annotations:
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "1"
      - name: StoragePool
        value: default
      #- name: TargetResourceLimits
      #  value: |-
      #      memory: 1Gi
      #      cpu: 100m
      #- name: AuxResourceLimits
      #  value: |-
      #      memory: 0.5Gi
      #      cpu: 50m
      #- name: ReplicaResourceLimits
      #  value: |-
      #      memory: 2Gi      
    openebs.io/cas-type: jiva
provisioner: openebs.io/provisioner-iscsi
reclaimPolicy: Delete
volumeBindingMode: Immediate

Those limits definitely seem like a good idea, but I’m ignoring them for now (the default does the same). After kubectl applying this StorageClass and updating our PVC to use the changed storageClassName (a one-line change, shown below), we can delete everything (kubectl delete -f openebs-test.allinone.yaml), update our Makefile, and re-make everything.
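
The PVC edit in openebs-test.allinone.yaml is just the class name:

spec:
  storageClassName: openebs-jiva-non-ha   # was: openebs-jiva-default

After we do: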

$ kubectl apply -f openebs-test.allinone.yaml
persistentvolumeclaim/pvc-test-data created
pod/pvc-test created
... after waiting a few seconds ...
$ kubectl get pods
NAME                                                             READY   STATUS              RESTARTS   AGE
pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e-ctrl-5b5d84cd8f-v5zcn   2/2     Running             0          37s
pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e-rep-84897dfc97-t59bb    1/1     Running             0          37s
pvc-test                                                         0/1     ContainerCreating   0          37s
sjr-pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-tcj7-6d484          0/1     Completed           0          5m18s

Great – no more pending -rep- pods, and there’s one sjr-pvc pod that I’ve never seen before, which appears to be left over from cleaning up the old PVC. More important is making sure pvc-test makes it out of the ContainerCreating state; let’s inspect it:

Events:
Type     Reason                  Age                    From                                    Message
----     ------                  ----                   ----                                    -------
Warning  FailedScheduling        2m14s (x3 over 2m14s)  default-scheduler                       pod has unbound immediate PersistentVolumeClaims
Normal   Scheduled               2m14s                  default-scheduler                       Successfully assigned default/pvc-test to ubuntu-1810-cosmic-64-minimal
Normal   SuccessfulAttachVolume  2m14s                  attachdetach-controller                 AttachVolume.Attach succeeded for volume "pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e"
Warning  FailedCreatePodSandBox  8s (x9 over 116s)      kubelet, ubuntu-1810-cosmic-64-minimal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:265: starting container process caused "process_linux.go:348: container init caused \"read init-p: connection reset by peer\"": unknown

Well, good news and bad news: the volume was able to attach, but it looks like containerd is having some issues… which may have nothing to do with OpenEBS. Let’s take a detour.

Bonus Round: Impromptu debugging of PodSandBox creation issues

Checking containerd’s systemd status says it’s running fine, so let’s try and start a pod without a PVC:

Normal   Scheduled               16s               default-scheduler                       Successfully assigned default/no-pvc-test to ubuntu-1810-cosmic-64-minimal
Warning  FailedCreatePodSandBox  3s (x2 over 15s)  kubelet, ubuntu-1810-cosmic-64-minimal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:265: starting container process caused "process_linux.go:348: container init caused \"read init-p: connection reset by peer\"": unknown

Alright, it looks like something is just wrong with containerd, which is good news because it means OpenEBS is ostensibly working, but bad news because it’s a bit of a chink in the armor. Not being able to create new Pods would definitely not be ideal if I were running a more serious production environment. To avoid a full machine reboot, let’s take a look at the kubelet logs:
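
(Assuming kubelet is running as a systemd unit, journalctl is the way in:)

journalctl -u kubelet --no-pager -n 200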

Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: E0105 08:28:09.535560    1586 kuberuntime_sandbox.go:65] CreatePodSandbox for pod "no-pvc-test_default(3f8d2a77-10bb-11e9-9cf0-8c89a517d15e)" failed: rpc error: code = Unknown desc = failed to start sandbox contai
Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: E0105 08:28:09.535587    1586 kuberuntime_manager.go:662] createPodSandbox for pod "no-pvc-test_default(3f8d2a77-10bb-11e9-9cf0-8c89a517d15e)" failed: rpc error: code = Unknown desc = failed to start sandbox conta
Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: E0105 08:28:09.535659    1586 pod_workers.go:190] Error syncing pod 3f8d2a77-10bb-11e9-9cf0-8c89a517d15e ("no-pvc-test_default(3f8d2a77-10bb-11e9-9cf0-8c89a517d15e)"), skipping: failed to "CreatePodSandbox" for "n
Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: W0105 08:28:09.708588    1586 manager.go:1195] Failed to process watch event {EventType:0 Name:/kubepods/burstable/pod13bea637-10ba-11e9-9cf0-8c89a517d15e/83cce6ed723480f83227706c155fc6f6ead206c4587b64e4c5084416bb
Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: W0105 08:28:09.709232    1586 container.go:409] Failed to create summary reader for "/kubepods/burstable/pod3f8d2a77-10bb-11e9-9cf0-8c89a517d15e/55d52719eb084edcbb77f64167d9de7cce6e25f54990364d5fe7e1c8819d437d": n
Jan 05 07:28:09 Ubuntu-1810-cosmic-64-minimal kubelet[1586]: E0105 08:28:09.753328    1586 dns.go:132] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 213.133.98.98 213.133.99.99 213.133.100.100

This is only partial, but you can see things aren’t going well. Unfortunately, neither kubelet nor containerd got back into a good state after restarting, so I’m just going to restart the box :(. I believe this has happened before and the quickest fix was just to restart everything – and this time I will absolutely not try to upgrade the entire system.

Well, I did all that only to realize that the issue was more nuanced – the resources I set in the pod specification were bad. With some binary-search-comment-and-uncomment, I realized my memory specification was wrong; here’s the no-pvc-test Pod after the fix:

---
apiVersion: v1
kind: Pod
metadata:
  name: no-pvc-test
  namespace: default
  labels:
    app: no-pvc-test
spec:
  containers:
    - name: no-pvc-test
      image: alpine
      command: ["ash", "-c", "while true; do sleep 60s; done"]
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 0.25
          memory: "512Mi"
        limits:
          cpu: 0.50
          memory: "512Mi"

Whoops! Looks like a classic case of user error.

Finally putting it all together

I went back and fixed the other resources and everything worked out just fine; all the pods are running (with the PVC):

$ kubectl get pods
NAME                                                             READY   STATUS      RESTARTS   AGE
pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e-ctrl-5b5d84cd8f-v5zcn   2/2     Running     2          38m
pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e-rep-84897dfc97-t59bb    1/1     Running     1          38m
pvc-test                                                         1/1     Running     0          114s
sjr-pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e-tcj7-6d484          0/1     Completed   0          43m
sjr-pvc-af0c599e-10b9-11e9-9cf0-8c89a517d15e-hdu8-vjf6v          0/1     Completed   0          39m

Let’s kubectl exec our way in and try to write some data:

$ kubectl exec -it pvc-test ash
/ # ls /var/data
lost+found
/ # echo "HELLO WORLD" > /var/data/hello-world.txt
/ # ls /var/data
hello-world.txt  lost+found

Now, let’s delete only the pod (careful – don’t delete the PVC; we have the reclaimPolicy set to Delete, though it probably wouldn’t get reclaimed fast enough to matter). After we delete the pod, we should be able to restart it and it will pick up the same volume:

$ kubectl delete pod pvc-test
pod "pvc-test" deleted
$ kubectl apply -f openebs-test.allinone.yaml
persistentvolumeclaim/pvc-test-data unchanged
pod/pvc-test created
$ kubectl exec -it pvc-test ash
/ # ls /var/data
hello-world.txt  lost+found
/ # cat /var/data/hello-world.txt
HELLO WORLD

We did it! We’ve got awesome persistent volumes working with OpenEBS and have a great non-HA (but could easily go HA) setup. We’re standing on the shoulders of many giants, and things definitely look pretty good from up here!

If we look at the on-disk representation, we can check out the files in /var/openebs:

root@Ubuntu-1810-cosmic-64-minimal ~ # tree /var/openebs/
/var/openebs/
├── pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e
│   ├── revision.counter
│   ├── volume-head-000.img
│   ├── volume-head-000.img.meta
│   └── volume.meta
├── pvc-5e74411c-10b4-11e9-9cf0-8c89a517d15e
│   └── scrubbed.txt
├── pvc-af0c599e-10b9-11e9-9cf0-8c89a517d15e
│   └── scrubbed.txt
├── shared-cstor-sparse-pool
│   ├── cstor-sparse-pool.cache
│   ├── uzfs.sock
│   └── zrepl.lock
└── sparse
    └── 0-ndm-sparse.img

5 directories, 10 files

Looks like OpenEBS is basically doing that “loop-based disk image maintenance” idea I had, or something similar (and I’m sure way more robust) – this might just be the best solution I’ve come across so far for storage with Kubernetes. Let’s check out what some of these files are:

/var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e: directory
/var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e/volume-head-000.img: Linux rev 1.0 ext4 filesystem data, UUID=4820dfb0-7574-47b4-91b0-39a31580fbf2 (needs journal recovery) (extents) (64bit) (large files) (huge files)
/var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e/volume-head-000.img.meta: ASCII text
/var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e/volume.meta: ASCII text
/var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e/revision.counter: ASCII text, with no line terminators

Pretty awesome, straightforward, and predictable stuff!
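
And since a Jiva replica’s data is apparently just an ext4 filesystem inside a sparse file, in a pinch you could presumably loop-mount it straight off the disk for recovery – a sketch (read-only, with noload to skip the journal replay that file said was pending):

mkdir -p /mnt/recover
mount -o loop,ro,noload /var/openebs/pvc-1367d8a1-10ba-11e9-9cf0-8c89a517d15e/volume-head-000.img /mnt/recover
ls /mnt/recover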

Wrapup

Thus concludes our whirlwind tour of setting up OpenEBS. As you can see, the work was pretty light on our side – things just worked – and that’s thanks to a lot of hard work from the team behind OpenEBS and committers to the project (and all the other giants we’re standing on).

Going forward it looks like I’m going to be using OpenEBS over Rook for my bare metal clusters (on Hetzner at least) – it was/is a blast trying to keep up with this area and seeing how it evolves over time.