Automating k0s cluster backup with Ansible and SystemD

tl;dr - If you didn’t know how to set up a SystemD timer and service to back up your k0s cluster, you no longer have an excuse. Scroll down for the code.

Recently, while going through my workloads and making sure everything’s backed up to external storage (I’m using Backblaze B2 and it’s been great so far), I came across the problem of how I should back up the cluster itself. Originally I thought I might go with Velero but the following stopped me:

  • In most cases I don’t want to take just a filesystem/block-level PVC backup
    • Ex. Postgres and even SQLite need some warning if the block is going to get copied out from under. WAL mode for SQLite makes things better but generally there
    • As far as copying out stuff like User Generated Content (UGC) I’m fine with cp at time of backup (this stuff should be in S3 to start with but bear with me)
  • I actually do want a backup of the etcd state, not just the cluster resources (I have all the resources in my IaC repos anyway)

While there are lots of options in the Kubernetes distribution space, I use and prefer k0s: it makes the right amount of decisions and chooses pretty neutral options, which leaves me lots of latitude. Along with making the right decisions and having good defaults it also layers on some nice functionality via the k0sctl command line tool, one of those being cluster backup.

Well if you’re going to back up your cluster, you probably want to back it up continuously right? Well it took me longer than I thought it would so I figured it was worth writing about, and that brings us to this post you’re looking at right now. Let’s get into it.

Set up your Object Storage (object versioning, least-privilege key)

I found that making an object storage bucket with a write-only key for backups was pretty reasonable: even if the key were to get exposed, people would only be able to write to (and possibly fill up) my storage. That’s not great, but it’s better than all my backups getting deleted. Unfortunately it looks like Backblaze doesn’t quite stop a write-only key from overwriting existing files, but with object versioning set up an overwrite just adds a new version, so I’m covered there too.

Start with a hastily whipped together script

Not too hard to whip together a script that runs k0s backup:

#!/bin/bash
# Stop on the first failed command, unset variable, or failed pipeline stage
set -euo pipefail

echo -e "\n[info] Ensuring backup save path [${K0S_BACKUP_SAVE_PATH}] exists..."
mkdir -p "${K0S_BACKUP_SAVE_PATH}"

echo -e "\n[info] Running k0s backup..."
/usr/bin/k0s backup --save-path="${K0S_BACKUP_SAVE_PATH}"

echo -e "\n[info] Retrieving most recent backup..."
# -rt sorts oldest first, so the newest entry is the last line
MOST_RECENT_BACKUP=$(ls -Art "${K0S_BACKUP_SAVE_PATH}" | tail -n1)

if [ -z "${MOST_RECENT_BACKUP}" ]; then
    echo -e "\n[error] Failed to find a most recent backup in [${K0S_BACKUP_SAVE_PATH}]"
    exit 1
fi

echo -e "\n[info] Most recent backup is [${MOST_RECENT_BACKUP}]"

echo "[info] Sending data to Backblaze account with ID [${B2_ACCOUNT_ID}]..."
# Quote everything so paths or keys with unusual characters survive intact
rclone copy \
       --b2-account "${B2_ACCOUNT_ID}" \
       --b2-key "${B2_KEY}" \
       "${K0S_BACKUP_SAVE_PATH}/${MOST_RECENT_BACKUP}" \
       ":b2:k8s-${CLUSTER_NAME}-backups/cluster/$(date +%F)"

Easy peasy! Of course you’ll need rclone installed on the base system so make sure to add that to your automation.
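
One fragile spot in a script like this is picking the newest archive by parsing ls output. Here’s a minimal sketch of an alternative, assuming bash and that the backups are regular, non-hidden files sitting directly in the save path. The newest_file helper is a made-up name for illustration, not something k0s or rclone provides:

```shell
#!/bin/bash
# Hypothetical helper: print the name of the most recently modified regular
# file in a directory, without parsing `ls` output.
newest_file() {
    local dir=$1 newest="" candidate
    for candidate in "$dir"/*; do
        # Skip directories and the literal pattern left behind by an empty dir
        [ -f "$candidate" ] || continue
        # -nt is true when the left file has a newer modification time
        if [ -z "$newest" ] || [ "$candidate" -nt "$newest" ]; then
            newest=$candidate
        fi
    done
    # Fail if nothing was found so the caller can bail out
    [ -n "$newest" ] || return 1
    basename "$newest"
}
```

In the backup script this would slot in as MOST_RECENT_BACKUP=$(newest_file "${K0S_BACKUP_SAVE_PATH}"). Unlike ls -A, the glob ignores dotfiles, which seems fine for backup archives but is worth knowing.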

Wrap that script in a systemd Service

I use Ansible so my templates are Jinja2 templates, but you should recognize the normal Unit-file-isms:

# k0s-backups.service.j2

[Unit]
Description=Saves k0s cluster backup

[Service]
Environment=K0S_BACKUP_SAVE_PATH=/tmp/k0s-backups
Environment=B2_ACCOUNT_ID={{ b2_account_id }}
Environment=B2_KEY={{ b2_key }}
Environment=CLUSTER_NAME={{ k0s_cluster_name }}
ExecStart=/etc/k0s-node-backup.bash

[Install]
WantedBy=multi-user.target

As you might imagine, you need to make a few variables available via Ansible here: b2_account_id, b2_key, and k0s_cluster_name. Once you have those in, they’ll be picked up by that script when the service actually runs. Well when does the service run? I’m glad you asked!

Template out a SystemD timer

[Unit]
Description=Backs up the k0s cluster every 12 hours
Requires=k0s-backups.service

[Timer]
Unit=k0s-backups.service
OnUnitInactiveSec=12h
RandomizedDelaySec=5m

[Install]
WantedBy=timers.target

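And with that, you’ve got a timer which triggers a service. One thing to keep in mind: OnUnitInactiveSec counts from the last time the service finished, so the service has to have run at least once before the timer has anything to count from (the Requires= line above, plus starting the service once from Ansible, should take care of that). Not much template happening here but let’s not split hairs: is a template with no replacements a template?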

Write yourself an Ansible task to put the pieces in place

Writing all these files doesn’t amount to much if they never make it onto the machine they’re supposed to run on, so here’s a bit of Ansible to wrap it together:

#
# Playbook for installing a backup timer on a controller node
#
---
- name: Install backup timer for cluster backups
  hosts: "{{ ansible_limit | default(omit) }}"
  remote_user: root
  vars:
    k8s_node_name: "{{ lookup('env', 'NODE_NAME') }}"
    b2_account_id: "{{ lookup('env', 'B2_ACCOUNT_ID') }}"
    b2_key: "{{ lookup('env', 'B2_KEY') }}"
    k0s_cluster_name: "{{ lookup('env', 'CLUSTER') }}"
  tasks:
    - name: Install rclone
      become: yes
      ansible.builtin.apt:
        name: "{{ packages }}"
        update_cache: yes
        state: present
      vars:
        packages:
        - rclone

    - name: Add the k0s-backups script
      ansible.builtin.template:
        src: ../../templates/k0s-node-backup.bash.j2
        dest: /etc/k0s-node-backup.bash
        owner: root
        group: root
        mode: 0700

    - name: Add the k0s-backups service
      ansible.builtin.template:
        src: ../../templates/k0s-backups.service.j2
        dest: /etc/systemd/system/k0s-backups.service
        owner: root
        group: root
        mode: 0644

    - name: Start & Enable k0s-backups.service
      ansible.builtin.systemd:
        name: k0s-backups.service
        state: started
        enabled: yes
        daemon_reload: yes

    - name: Add the k0s-backups timer
      ansible.builtin.template:
        src: ../../templates/k0s-backups.timer.j2
        dest: /etc/systemd/system/k0s-backups.timer
        owner: root
        group: root
        mode: 0644

    - name: Start & Enable k0s-backups.timer
      ansible.builtin.systemd:
        name: k0s-backups.timer
        state: started
        enabled: yes
        daemon_reload: yes

Note there are many ways to get these variables in – I’ve chosen ENV as it fits a bit better with my Makefile-driven orchestration, but any of the other ansible approved (or unapproved) methods would work.
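
One gotcha with the ENV approach: Ansible’s env lookup hands back an empty string for a variable that isn’t set, so a forgotten export can quietly template an empty key into the unit file. A small preflight check is one way to guard against that. This is only a sketch, and require_env is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical preflight: report every required environment variable that is
# missing or empty, and fail if there was at least one.
require_env() {
    local name missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is also all that the
        # Ansible env lookup can see
        if [ -z "$(printenv "$name")" ]; then
            echo "[error] ${name} is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Something like require_env NODE_NAME B2_ACCOUNT_ID B2_KEY CLUSTER in front of the ansible-playbook invocation (in a Makefile target, for example) would stop the run before anything gets templated.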

Get on the machine and verify everything is working as you expect

Automation is dandy and Ansible is a very reliable tool but at this point you’re probably going to want to at least check that your service works and your timer is running (ex. systemctl list-timers).
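
The checks I’d reach for can be bundled into a little helper. This is a sketch rather than anything official: check_k0s_backup_units is a hypothetical name, and the default unit names are simply the ones used in this post:

```shell
#!/bin/bash
# Hypothetical helper: show the state of the backup timer and service,
# followed by the most recent log lines from the service.
check_k0s_backup_units() {
    local timer=${1:-k0s-backups.timer}
    local service=${2:-k0s-backups.service}
    # When the timer last fired and when it is due next
    systemctl list-timers --all --no-pager "$timer"
    # status exits non-zero for an inactive unit, which is normal between runs
    systemctl status --no-pager "$service" || true
    # Output of the backup script itself (the [info] and [error] lines)
    journalctl --no-pager -n 50 -u "$service"
}
```

Run check_k0s_backup_units as root on the controller node. If the journal shows the rclone step finishing without errors, the next place to look is the bucket itself.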

Once you’ve verified the backup did indeed make it to your object storage, you’re probably going to want to test your backup as well. k0s makes it really easy here – almost as easy as getting the backup:

$ k0s restore <path to compressed backup file>
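
Before going as far as a full restore, a cheap sanity check is to make sure the archive you pulled back down can actually be read. The sketch below assumes the backup is a gzip-compressed tarball (which is what I’d expect from k0s backup, but treat that as an assumption), and backup_archive_readable is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is a gzip-compressed tar
# archive that can be listed from start to finish.
backup_archive_readable() {
    local archive=$1
    # Must exist and be non-empty
    [ -s "$archive" ] || return 1
    # Listing forces tar to decompress and walk the whole archive
    tar -tzf "$archive" >/dev/null 2>&1
}
```

This only proves the file isn’t truncated or corrupted in transit. It says nothing about whether the contents restore cleanly, so it complements k0s restore rather than replacing it.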

Why not use a DaemonSet?

One alternative to a systemd service unit and timer is to run a privileged DaemonSet with some shared namespaces to perform the needed steps, but I shied away from that a little bit since I don’t want the backup-taking mechanism to be implemented via the thing I’m taking a backup of.

BONUS: Backing up a SQLite DB with a CronJob

Well while I’m here I might as well show a basic CronJob for taking a backup of a SQLite database. This is probably the simplest implementation (outside of just PVC snapshotting) of a SQLite backup, but it does get a little hairy.

First start with a ServiceAccount that you’ll be doing the backup-taking with:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: your-app-backups
  namespace: your-app

Now we’ll go ahead and add some RBAC to allow the account to do what we need in the given namespace:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: your-app-backups
  namespace: your-app
rules:
  - apiGroups:
      - ""
      - apps
      - extensions
      - autoscaling
    resources:
      - deployments
    verbs:
      - get
      - list

  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list

  - apiGroups:
      - ""
    resources:
      - "pods/exec"
    verbs:
      - create

Of course we need to bind this Role to the ServiceAccount we created earlier:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: your-app-backups
  namespace: your-app
roleRef:
  kind: Role
  name: your-app-backups
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: your-app-backups

And now we can finally get to the meat of the work, the CronJob. This script is a bit wasteful (I install rclone every time and kubectl as well), but it gets the job done:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: your-app-backups
  namespace: your-app
spec:
  schedule: "0 0,12 * * *" # who doesn't love/hate cron syntax?
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: your-app-backups # You're going to want
          containers:
            - name: job
              image: alpine:3.14.2
              imagePullPolicy: IfNotPresent
              command:
                - /bin/ash
                - -c
                - |
                  set -e # abort (and let the Job retry) if any step fails
                  echo -e "[info] Installing rclone"
                  apk add rclone curl

                  echo -e "[info] Installing kubectl"
                  curl -LO https://dl.k8s.io/release/v1.21.4/bin/linux/amd64/kubectl
                  chmod +x kubectl
                  mv kubectl /usr/bin/kubectl

                  export BACKUP_FILE_NAME=backup-`date +%F@%H_%M_%S-%Z`
                  export BACKUP_FILE_PATH=/tmp/${BACKUP_FILE_NAME}
                  echo -e "[info] BACKUP_FILE_NAME=${BACKUP_FILE_NAME}"
                  echo -e "[info] BACKUP_FILE_PATH=${BACKUP_FILE_PATH}"

                  echo "[info] installing sqlite3..."
                  kubectl exec deploy/your-app -n ${NAMESPACE} -- apk add sqlite

                  echo "[info] taking backup..."
                  kubectl exec deploy/your-app -n ${NAMESPACE} -- sqlite3 ${YOUR_APP_SQLITE_DB_PATH} ".backup '${BACKUP_FILE_PATH}'"
                  echo -e "[info] backup taken, @ [${BACKUP_FILE_PATH}] inside your-app pod"

                  echo "[info] copying out backup from container..."
                  export YOUR_APP_CONTAINER_NAME=$(kubectl get pods -n ${NAMESPACE} -l app=your-app --template '{{range .items}}{{.metadata.name}}{{end}}')
                  kubectl cp ${YOUR_APP_CONTAINER_NAME}:${BACKUP_FILE_PATH} ${BACKUP_FILE_PATH} -n ${NAMESPACE}

                  export BACKUP_SIZE=$(du -hs ${BACKUP_FILE_PATH})
                  echo -e "[info] Backup size: [${BACKUP_SIZE}]"

                  echo -e "[info] Zipping backup..."
                  gzip ${BACKUP_FILE_PATH}

                  echo "[info] saving backup to Backblaze under account [${B2_ACCOUNT_ID}]..."
                  rclone copy \
                    --b2-account $B2_ACCOUNT_ID \
                    --b2-key $B2_KEY \
                    ${BACKUP_FILE_PATH}.gz \
                    :b2:$BUCKET/$NAMESPACE/$RESOURCE_TYPE/$RESOURCE_NAME/`date +%F`                  

              env:
                # Info required for backup
                - name: YOUR_APP_SQLITE_DB_PATH
                  value: /var/data/your-app/db.sqlite
                - name: NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
                # S3 folder info
                - name: BUCKET
                  value: your-app-backup-bucket
                - name: RESOURCE_TYPE
                  value: deployment
                - name: RESOURCE_NAME
                  value: your-app-sqlite
                # Rclone (S3 info)
                - name: B2_ACCOUNT_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-secrets
                      key: B2_ACCOUNT_ID.secret
                - name: B2_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-secrets
                      key: B2_KEY.secret

And we’re done! Easy peasy SQLite backups that are done the “right” way (using .backup), though a PVC snapshot would have probably been enough (assuming you have SQLite WAL mode enabled).

As always make sure to test your backups!
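
For SQLite, one quick test is to download a backup, gunzip it, and ask SQLite whether the file is consistent. A minimal sketch, assuming sqlite3 is installed wherever you run it (the job container above doesn’t have it) and using a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if SQLite reports the database file as
# internally consistent.
sqlite_backup_ok() {
    local db=$1 result
    # Must exist and be non-empty
    [ -s "$db" ] || return 1
    # integrity_check prints the single line "ok" when it finds no problems
    result=$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>/dev/null) || return 1
    [ "$result" = "ok" ]
}
```

A passing check means the file is a structurally sound database. Whether it holds the rows you expect is still worth a manual query or two.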

BONUS: Backing up a Postgres DB with a CronJob

And while we’re here, a similarly amateur (but functional) backup of a Postgres database. This is pretty basic (and wasteful like the previous one) so of course you’ll need some consideration before taking it into your production environment:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: your-app-backups
spec:
  schedule: "0 0,12 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: job
              image: postgres:13.1-alpine
              imagePullPolicy: IfNotPresent
              command:
                - /bin/ash
                - -xc
                - |
                  set -e # abort (and let the Job retry) if any step fails
                  apk add rclone

                  export BACKUP_FILE_NAME=backup-`date +%F@%H_%M_%S-%Z`
                  # ${VAR} here, not $(VAR): the shell would try to run $(VAR) as a command
                  echo "[info] BACKUP_FILE_NAME=${BACKUP_FILE_NAME}"

                  echo "[info] taking backup..."
                  # Host, database and password come from the PG* variables in env below
                  pg_dump \
                    --username=yourdbuser \
                    --clean \
                    --create \
                    --no-owner \
                    --format=custom \
                    --file=/tmp/$BACKUP_FILE_NAME
                  echo -e "[info] backup taken, @ [/tmp/${BACKUP_FILE_NAME}]"

                  echo "[info] starting rclone..."
                  rclone copy \
                    --b2-account $B2_ACCOUNT_ID \
                    --b2-key $B2_KEY \
                    /tmp/$BACKUP_FILE_NAME \
                    :b2:$BUCKET/$NAMESPACE/$RESOURCE_TYPE/$RESOURCE_NAME/`date +%F`

              env:
                - name: RESOURCE_TYPE
                  value: deployment
                - name: RESOURCE_NAME
                  value: your-app-pg
                - name: BUCKET
                  value: your-app-backups-bucket
                - name: NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
                # Connection settings that pg_dump picks up from the environment
                - name: PGHOST
                  value: your-app-pg.vadosware-blog.svc.cluster.local
                - name: PGUSER
                  value: yourappuser
                - name: PGDATABASE
                  value: yourappdb
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: your-app-secrets
                      key: DB_PASSWORD.secret
                # Rclone
                - name: B2_ACCOUNT_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-secrets
                      key: B2_ACCOUNT_ID.secret
                - name: B2_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-secrets
                      key: B2_KEY.secret

And as always, test your backup! It’s not hard to spin up a Postgres instance and do a quick restore and ensure your tables and data are still present.
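
Short of a full restore, pg_restore can at least confirm that a custom-format dump is readable by listing its table of contents, and that doesn’t need a running server. A minimal sketch, with pg_backup_readable being a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if pg_restore can read the archive's
# table of contents. No Postgres server is needed for this.
pg_backup_readable() {
    local archive=$1
    # Must exist and be non-empty
    [ -s "$archive" ] || return 1
    # --list prints the table of contents instead of restoring anything
    pg_restore --list "$archive" >/dev/null 2>&1
}
```

This is a smoke test only. Restoring into a scratch database and poking at the tables is still the check that counts.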

Wrapup

Well this was a pretty quick writeup but hopefully it gets someone out there off on the right foot. It’s easy to do this stuff in theory but sitting down to write it always takes a surprisingly long time (for me at least).
