
Upgrading a k0s cluster in-place from single-master to HA


tl;dr - stream-of-consciousness notes as I stumbled through upgrading a k0s cluster from single-node to HA control plane. You probably shouldn’t attempt to do this though, just make a new HA-from-day-one cluster and move your workloads over.

Recently I went through the somewhat traumatic experience of upgrading a k0s Kubernetes cluster from a single master (I was running 1 controller+worker node alongside other worker-only nodes) to a group of HA controller-only nodes with the same set of workers. Looking around the internet I couldn’t find much on how to do the conversion (the k0s documentation is excellent, but doesn’t quite cover how to convert), so I had to piece it together.

The better approach would probably have been to build a new cluster and move workloads over one by one (it’s even possible to put some networking in place to make the switchover seamless). In retrospect, that’s what I should have done. But you’re not here for the clean yak shaves.

I left myself some stream-of-consciousness notes so I’m sharing them here for anyone that might come after and attempt to do this, but I do want to note that you should almost certainly just make a new cluster.

The current setup and migration plan

Before undergoing this change the cluster was 3-4 beefy machines running k0s:

controller and worker node with 2 nodes

Where I want to get to is the following:

loadbalancer, 3 controller nodes and worker node architecture

Obviously the issue here is how to get from one node acting as both worker and controller to a completely different set of nodes serving as the controllers.

Well it’s simple (in theory):

  • Go up to 4 controller nodes (a 4 node ETCD cluster)
  • Remove the original controller node, leaving a 3 node HA control plane

RTFM

The k0s docs are pretty good, and there is a great page on control plane HA which is worth reading:

k0s high availability diagram showing three controllers and three workers

Before you start: take a backup with k0s

Before starting it was important (for me) to take a backup with k0s since it was almost guaranteed something would go wrong eventually (things did go wrong).

Even if the control plane goes down, if your DNS is targeting worker nodes, apps should not go down immediately, giving you time to correct issues or revert (as long as you have a backup).
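
For reference, here’s a minimal sketch of taking that backup, assuming a k0s version that ships the backup subcommand (the exact flag and archive name may differ slightly by version):

$ mkdir -p /root/k0s-backups
$ k0s backup --save-path /root/k0s-backups
$ ls /root/k0s-backups   # should contain a k0s_backup_*.tar.gz-style archive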

Get all your workloads off the current master (controller+worker role node)

Since I’m running workloads on the main node, those workloads were at risk (the master node is likely to go down/have issues) – redeploying workloads where necessary and kubectl draining the master node should prevent unnecessary downtime.

Another thing that’s worth taking a direct backup of is etcd – it’s included in the k0s backup, but knowing how to take a backup with etcdctl and how the data is laid out on disk is good to know.

With a backup in place we can keep calm and restore from backup if/when necessary.

To make sure all workloads stay safe from control plane failures (and from restarts of the main node) as we do this migration, you can evict all workloads from the node and mark it unschedulable.
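
A minimal sketch of that, assuming node-1.k8s is the controller+worker node as it appears in later examples (drain flags per kubectl 1.21+):

$ kubectl cordon node-1.k8s
$ kubectl drain node-1.k8s --ignore-daemonsets --delete-emptydir-data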

Set up the load balancer

This is also a great time to set up the load balancer and connect it to the new controller node we’re about to stand up.

A load balancer is required (as of now) for the HA control plane used by k0s.

Set up the new controller node

The k0s documentation on setting up nodes is worth reading and should be followed – in our case we’re going from a --single setup to a HA setup so some consideration is required.

First we’ll want to create a new controller node, and since I’m using hcloud I have to do all the usual node hardeining tasks – installing ufw, fail2ban, ensuring only required ports are open, etc.

Hybrid single-master/HA configuration for k0s

On your existing controller, take a backup of your k0s config file (usually @ /etc/k0s/k0s.yaml) for safekeeping.

We’re going to modify this file, but it will be somewhat in-between the single-master and HA settings.

At this point I did not insert the HA load balancer as the external address in the k0s config, so the config is essentially a mishmash of the single-master and HA settings. Do not put the load balancer address in just yet.

  api:
    address: XXX.XXX.XXX.XXX # IP address of the controller+worker node
    externalAddress: node-1.k8s.cluster.domain.tld # DNS name of the controller + worker node
    extraArgs:
      http2-max-streams-per-connection: "1000"
    k0sApiPort: 9443
    port: 6443
    sans:
    - XXX.XXX.XXX.XXX
    - YYY.YYY.YYY.YYY # another node
    - node-1.k8s.cluster.domain.tld
    - 127.0.0.1

Note that we’re using the previous master’s IP & DNS name as the address/externalAddress so we’re not following the map for k0s HA just yet.

The SANs include the new node and the old node and whatever else was there previously.

One thing we will point at the new load balancer (which currently has just one node behind it) is the ETCD peer address. The original master, which is running etcd, will connect to the load balancer and from there reach the etcd member on the one new controller node.

The following ports should be open on the new controller and load balancer:

  22, // SSH (for k0sctl)
  6443, // k8s
  8132, // konnectivity
  9443, // k0s api (for cluster join)
  2380, // etcd (controller <-> controller)
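
A rough sketch of opening those with ufw on the new controller (the load balancer hostname below is a placeholder – use your own, and double check your provider-level firewall too):

$ for p in 22 6443 8132 9443 2380; do ufw allow "${p}/tcp"; done
$ ufw status
# sanity check that the etcd peer port is reachable through the load balancer
$ nc -vz lb.k8s.cluster.domain.tld 2380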

Add your new controller to k0sctl.yaml

Now we want to update the k0sctl cluster configuration (k0sctl.yaml). We’re not going to add the load balancer in here just yet either.

At this point, only one entry should have controller+worker as the role (the rest would be worker roles).

---
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: frodo
spec:
  k0s:
    version: 1.21.1+k0s.0
  # NOTE: with only one node, draining has to be skipped w/ --no-drain
  hosts:
    - role: controller+worker
      ssh:
        address: <the current main node you already have>
        user: root
        port: 22
        keyPath: ~/.ssh/path/to/your/id_rsa

    # New controller only node
    - role: controller
      ssh:
        address: <address of the new first HA node>
        user: root
        port: 22
        keyPath: ~/.ssh/path/to/your/id_rsa

    - role: worker
      ssh:
        address: <a worker>
        user: root
        port: 22
        keyPath: ~/.ssh/path/to/your/id_rsa

    # ... more worker nodes down here ....

Start the k0s join API on port 9443 by removing --single from the k0scontroller systemd unit configuration

k0scontroller is the process that runs on controller nodes (or controller+worker nodes), and it’s controlled by systemd, so its configuration is specified in a unit file.

On your current master (the node with role controller+worker), --single was specified, which prevents new nodes from joining. The reason nodes can’t join a cluster with a --single controller is that the k0s join API is not running on port 9443 like it usually is in an HA setup.

Removing the --single flag from your k0scontroller systemd service configuration (see /etc/systemd/system) will enable k0s to start running the join API.
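
Roughly, something like this on the current master (back up the unit file first; the sed is a blunt instrument, you can also just edit the ExecStart line by hand):

$ cp /etc/systemd/system/k0scontroller.service /root/k0scontroller.service.bak
$ sed -i 's/ --single//' /etc/systemd/system/k0scontroller.service
$ systemctl daemon-reload
$ systemctl restart k0scontroller
$ ss -tlnp | grep 9443   # the join API should now be listening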

Initialize and attach the new controller node

We’ve got teh cluster ready to accept a new node, but we haven’t actually joined the new node yet!

You’ll need a controller role join token, every time, for every new controller. So on the current master, run the following:

$ k0s token create --role=controller --expiry=48h > /tmp/controller-join.token

Copy the file generated @ /tmp/controller-join.token to the new controller node that has been set up.
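
For example (the address is a placeholder):

$ scp /tmp/controller-join.token root@<address of the new controller>:/tmp/controller-join.token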

You can attach the new node manually (assuming you copied the controller join token to /tmp/controller-join.token) by running the following:

$ k0s install controller --token-file /tmp/controller-join.token -c /etc/k0s/k0s.yaml
$ k0s start

On the original master, you should see a log like this coming out of k0scontroller:

INFO[2021-12-06 13:37:31] etcd API, adding new member: https://NNN.NNN.NNN.NNN:2380

Check for success: Watch etcd members update

If you see the etcd members update then you know that you’re on the right path (assuming XXX.XXX.XXX.XXX is the original controller+worker and YYY.YYY.YYY.YYY is the new load balancer address):

root@node-1 ~ # export ETCDCTL_API=3
root@node-1 ~ # etcdctl --endpoints=https://127.0.0.1:2379 --key=/var/lib/k0s/pki/etcd/server.key --cert=/var/lib/k0s/pki/etcd/server.crt --insecure-skip-tls-verify member list
4419ae2057d1f048, started, ctrl-0.k8s, https://YYY.YYY.YYY.YYY:2380, https://127.0.0.1:2379
880b4bba1672d7b3, started, node-1.k8s, https://XXX.XXX.XXX.XXX:2380, https://127.0.0.1:2379

For the very first new controller, set the externalAddress to the old node

It’s very likely that the old master will be picked as the new leader and the other (newly added) nodes will be followers.

At first, you will want to edit /etc/k0s/k0s.yaml such that the old leader is the external address:

---
apiVersion: k0s.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s
spec:
  telemetry:
    enabled: false

  api:
    externalAddress: XXX.XXX.XXX.XXX #YYY.YYY.YYY.YYY
    sans: ["XXX.XXX.XXX.XXX","YYY.YYY.YYY.YYY"]

NOTE: XXX.XXX.XXX.XXX is the original node and YYY.YYY.YYY.YYY is the new load balancer (at this point with 1 new controller node attached).

I did something like this, then restarted k0scontroller (the systemd service that is installed by k0s) on the nodes.

For all subsequent controllers the config in k0sctl.yaml should work just fine.

Caution: ETCD cluster may be stuck at 3 nodes

At this point you might find that ETCD is stuck at 3 nodes in the member list, rather than 4 – it’s important to make sure all 4 are in.

Do NOT settle for the following:

root@ctrl-2:~# export ETCDCTL_API=3
root@ctrl-2:~# etcdctl --endpoints=https://127.0.0.1:2379 --key=/var/lib/k0s/pki/etcd/server.key --cert=/var/lib/k0s/pki/etcd/server.crt --insecure-skip-tls-verify member list
4419ae2057d1f048, started, ctrl-0.k8s, https://ZZZ.ZZZ.ZZZ.ZZZ:2380, https://127.0.0.1:2379
880b4bba1672d7b3, started, node-1.k8s, https://XXX.XXX.XXX.XXX:2380, https://127.0.0.1:2379
af37193f16df15b4, started, ctrl-2.k8s, https://AAA.AAA.AAA.AAA:2380, https://127.0.0.1:2379

Once you’ve got the 4 node setup, we can move on to removing the extraneous node.

Remove the original node from etcd

At this point we can attempt to remove the original node from the etcd cluster. This is a destructive operation, so be very careful when performing it:

root@ctrl-0:~# etcdctl --endpoints=https://127.0.0.1:2379 --key=/var/lib/k0s/pki/etcd/server.key --cert=/var/lib/k0s/pki/etcd/server.crt --insecure-skip-tls-verify member remove 880b4bba1672d7b3
Member 880b4bba1672d7b3 removed from cluster  b3e7667b2e90a09

After doing this you can check the members again.

Update k0sctl.yaml

At this point you’ll want to rewrite k0sctl.yaml with all the new controller nodes with the correct types:

BEWARE: if you only partially specify the config under k0s:config:spec…, everything you leave out will be replaced with defaults, which could result in trying to switch from calico to kuberouter (k0scontroller will fail to start over that, so you’re safe from it actually happening).

Through all this, your workloads should be fine, as they should be happily running on the other cluster worker nodes (and pointed to via DNS).

At this point, you can switch back to your regular kubectl flow, and check on the nodes:

$ kubectl get nodes
NAME         STATUS     ROLES    AGE    VERSION
node-1.k8s   NotReady   <none>   184d   v1.21.1-k0s1
node-2.k8s   NotReady   <none>   43d    v1.21.1-k0s1
node-3.k8s   NotReady   <none>   67d    v1.21.1-k0s1
node-5.k8s   NotReady   <none>   77d    v1.21.1-k0s1

It’s a bit terrifying to see so many NotReadys, but the reason is obviously – we haven’t told the nodes that they should be connecting to the new controller set just yet.

k0sctl sporadically failed during apply, but I just ignored that and kept going.

Update all worker nodes to connect to the new controllers

Since we want the worker nodes to connect to the new controllers, we need to generate new join tokens.

Two steps should set up a node (example below):

  • Generate a worker token on an online controller
  • Copy the worker token to the node
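
For example (a sketch, assuming the token file path used in the unit file shown below; adjust the expiry to taste):

# on a healthy controller
$ k0s token create --role=worker --expiry=48h > /tmp/worker-join.token
# copy it to the worker, into the path referenced by --token-file
$ scp /tmp/worker-join.token root@<worker address>:/var/lib/k0s/worker.token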

You can find where the token should go by checking the contents of /etc/systemd/system/k0sworker.service, for example:

[Unit]
Description=k0s - Zero Friction Kubernetes
Documentation=https://docs.k0sproject.io
ConditionFileIsExecutable=/usr/bin/k0s

After=network-online.target
Wants=network-online.target

[Service]
StartLimitInterval=5
StartLimitBurst=10
ExecStart=/usr/bin/k0s worker --config=/var/lib/k0s/config.yaml --labels="storage=nvme,storage.zfs=true,storage.localpv-zfs=true,storage.longhorn=true" --token-file=/var/lib/k0s/worker.token

RestartSec=120
Delegate=yes
KillMode=process
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
LimitNOFILE=999999
Restart=always

[Install]
WantedBy=multi-user.target

In the example above the token file is @ /var/lib/k0s/worker.token (take a backup!).

Update the node’s /var/lib/k0s/config.yaml or /etc/k0s/k0s.yaml (if present), with the new address/externalAddress (both should be pointing to the load balancer). Be on the lookout for anywhere the old master node IP is being used. If your external address is the IP address for the API that’s fine too

Next, update /var/lib/k0s/kubelet.conf & /var/lib/k0s/kubelet-bootstrap.conf with the new load balancer address. Beware that your old cluster certificate may stop working (for kubectl).
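
A hedged sketch of that rewrite (OLD.MASTER.ADDRESS and new.lb.address are placeholders; check each file by hand afterwards rather than trusting the sed blindly):

$ grep -l 'OLD.MASTER.ADDRESS' /var/lib/k0s/kubelet.conf /var/lib/k0s/kubelet-bootstrap.conf /var/lib/k0s/config.yaml
$ sed -i 's|https://OLD.MASTER.ADDRESS:6443|https://new.lb.address:6443|' /var/lib/k0s/kubelet.conf /var/lib/k0s/kubelet-bootstrap.conf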

Here’s an example of what the bootstrap file would look like:

apiVersion: v1
clusters:
- cluster:
    server: https://your.new.api.address:6443 # https://the.old.ip.address
    certificate-authority-data: <redacted>
  name: k0s
contexts:
- context:
    cluster: k0s
    user: kubelet-bootstrap
  name: k0s
current-context: k0s
kind: Config
preferences: {}
users:
- name: kubelet-bootstrap
  user:
    token: xxxxxx.aaaaaaaaaaaaaaaa # redacted

After making these changes, restart (stop, start) k0sworker manually on every machine, then run k0sctl apply so the workers don’t pick up the old IP addresses again.
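
Concretely, something along these lines (the k0sctl.yaml path is whatever you’ve been using):

# on each worker
$ systemctl stop k0sworker && systemctl start k0sworker
# then from your workstation
$ k0sctl apply --config k0sctl.yaml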

NOTE: during all of this your workloads should continue to run – your ingress controller(s) are sitting on the worker nodes, pointed at by DNS/load balancers, happily taking incoming traffic and handing it to pods. Nothing we’ve done so far should have upset your CNI plugin or pods directly.

Unexpected Error: bufio.Scanner token too long

When I was doing this I ran into the following error in worker logs (after k0sctl apply):

Dec 07 10:21:48 node-5.eu-central-1 k0s[3029082]: time="2021-12-07 10:21:48" level=info msg="I1207 10:21:48.163618 3030291 fs.go:131] Filesystem UUIDs: map[014715bb-a61f-469b-80b1-fa38726ea950:/dev/sde 06548ab2-5fed-4c6a-a1ee-b4c4a2cbc833:/dev/sdc 5fbac445-af53-40e4-a37f-975736a8d787:/dev/md2 64c368c1-5769-467e-8626-3881ba8cbe25:/dev/md1 7762831980165748493:/dev/nvme0n1p5 7f43769e-4a6b-4433-9884-968373e7fc50:/dev/sdd 93bf0ee8-0623-4f8b-a369-abafc312b310:/dev/sda 9b79a669-3b62-499c-a994-b068466c95af:/dev/md0 a5ca05fd-4fd8-462e-a3e8-8e3be7590bd1:/dev/sdb d6bd0261-d26e-4009-b448-acb54b02aa37:/dev/sdf f334a003-af04-4a61-b11e-dd912d979517:/dev/zd0]" component=kubelet
Dec 07 10:21:48 node-5.eu-central-1 k0s[3029082]: time="2021-12-07 10:21:48" level=error msg="Error while reading from Writer: bufio.Scanner: token too long" component=kubelet
Dec 07 10:21:48 node-5.eu-central-1 k0s[3029082]: time="2021-12-07 10:21:48" level=warning msg="signal: broken pipe" component=kubelet
Dec 07 10:21:48 node-5.eu-central-1 k0s[3029082]: time="2021-12-07 10:21:48" level=info msg="respawning in 5s" component=kubelet

The way to get past it is to restart the node(s).

I think this had to do with the block devices that a system like longhorn sets up:

root@node-5 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0     8G  0 disk  /var/lib/k0s/kubelet/pods/970041a9-24ae-4a13-9d6a-f086f8199c66/volume-subpaths/pvc-c7d9885e-4661-4897-9567-f57cdc705a74/pg/0
sdb           8:16   0     1G  0 disk  /var/lib/k0s/kubelet/pods/a6f25e10-522a-406c-9add-6cc7085e7f34/volumes/kubernetes.io~csi/pvc-75cab383-60ff-488c-b049-096f05e147e4/
sdc           8:32   0     4G  0 disk  /var/lib/k0s/kubelet/pods/22dd0618-82e0-4e23-a84f-6db1180bf7e7/volumes/kubernetes.io~csi/pvc-22586f91-8a47-4dfb-b4d6-c1ac32e6ccf6/
sdd           8:48   0    16G  0 disk  /var/lib/k0s/kubelet/pods/a6f25e10-522a-406c-9add-6cc7085e7f34/volumes/kubernetes.io~csi/pvc-b3c5677e-36f0-4987-9fdf-d305b77d1926/
sde           8:64   0     1G  0 disk  /var/lib/k0s/kubelet/pods/d2d82dc2-5bf2-413f-bd97-86b99390c553/volumes/kubernetes.io~csi/pvc-8a85d0f0-9713-45b3-b6e9-448a69a6859c/
sdf           8:80   0     1G  0 disk  /var/lib/k0s/kubelet/pods/a6f25e10-522a-406c-9add-6cc7085e7f34/volumes/kubernetes.io~csi/pvc-ed13ef28-6a27-4200-8a3d-de7530201d51/
zd0         230:0    0   150G  0 disk  /var/lib/longhorn
nvme1n1     259:0    0   477G  0 disk
├─nvme1n1p1 259:1    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2 259:2    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme1n1p3 259:3    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme1n1p4 259:4    0     1K  0 part
└─nvme1n1p5 259:5    0   316G  0 part
nvme0n1     259:6    0   477G  0 disk
├─nvme0n1p1 259:7    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:8    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:9    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:10   0     1K  0 part
└─nvme0n1p5 259:11   0   316G  0 part

After a restart lsblk is much shorter:

root@node-5 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1     259:0    0   477G  0 disk
├─nvme0n1p1 259:2    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:3    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:4    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:5    0     1K  0 part
└─nvme0n1p5 259:6    0   316G  0 part
nvme1n1     259:1    0   477G  0 disk
├─nvme1n1p1 259:7    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2 259:8    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme1n1p3 259:9    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme1n1p4 259:10   0     1K  0 part
└─nvme1n1p5 259:11   0   316G  0 part

All those disks will get recreated but it looks like kubelet has a problem if they’re present at startup.

Update the kubernetes service endpoint

The last piece is making sure the kubernetes Service in the default namespace points at the new control plane rather than the old master – see the “Fix: stale endpoint” debug notes below for the details.

And we’re done! Wasn’t that easy?

This process was a PITA. I can’t help but wonder if it’s this hard with regular kubeadm, but given that the cluster purrs along in just about every other situation, k0s is still my favorite. Most of this complexity is probably endemic, though it was a bit annoying trying to view logs, with k0scontroller/k0sworker spewing a fast-moving mixed stream of them.

ISSUE: Beware of taints

If you run k0sctl your main node will get tainted – you may need to remove the taints if something goes wrong or you want to revert:

$ k taint nodes node-1.k8s node.kubernetes.io/unreachable:NoSchedule-
node/node-1.k8s untainted
$ k taint nodes node-1.k8s node.kubernetes.io/unreachable:NoExecute-
node/node-1.k8s untainted

Note that very important workloads like Calico or Traefik will get stuck and won’t be able to restart their DaemonSet pods on the previous controller+worker node (which is still part of your cluster) until the taints are removed.

ISSUE: ETCD failing

ETCD is likely to start failing at some point, even though the certs are in place and k0s itself is running:

Dec 06 12:43:19 frodo-ctrl-0 k0s[3260]: {"level":"warn","ts":"2021-12-06T12:43:19.842Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-7611dc43-189d-4ec2-a4e9-31ec8e1304af/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Dec 06 12:43:19 frodo-ctrl-0 k0s[3260]: time="2021-12-06 12:43:19" level=error msg="health-check: Etcd might be down: context deadline exceeded"
Dec 06 12:43:20 frodo-ctrl-0 k0s[3260]: {"level":"warn","ts":"2021-12-06T12:43:20.844Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-1f3fb077-f989-4c8c-aa9a-ee9e5862aae7/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Dec 06 12:43:20 frodo-ctrl-0 k0s[3260]: time="2021-12-06 12:43:20" level=error msg="health-check: Etcd might be down: context deadline exceeded"
Dec 06 12:43:21 frodo-ctrl-0 k0s[3260]: {"level":"warn","ts":"2021-12-06T12:43:21.845Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-18684a14-e6fa-4392-8dcc-35fafe16a5bf/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Dec 06 12:43:21 frodo-ctrl-0 k0s[3260]: time="2021-12-06 12:43:21" level=error msg="health-check: Etcd might be down: context deadline exceeded"

TIP: How to actually talk to ETCD

The single controller node should be running etcd (unless you’ve picked kine, in which case you probably don’t have to worry about this at all, congratulations).

root@node-1 ~ # pgrep etcd -la
34537 /var/lib/k0s/bin/etcd --cert-file=/var/lib/k0s/pki/etcd/server.crt --peer-trusted-ca-file=/var/lib/k0s/pki/etcd/ca.crt --log-level=info --advertise-client-urls=https://127.0.0.1:2379 --peer-client-cert-auth=true --initial-advertise-peer-urls=https://XXX.XXX.XXX.XXX:2380 --trusted-ca-file=/var/lib/k0s/pki/etcd/ca.crt --key-file=/var/lib/k0s/pki/etcd/server.key --peer-cert-file=/var/lib/k0s/pki/etcd/peer.crt --name=node-1.eu-central-1 --auth-token=jwt,pub-key=/var/lib/k0s/pki/etcd/jwt.pub,priv-key=/var/lib/k0s/pki/etcd/jwt.key,sign-method=RS512,ttl=10m --listen-client-urls=https://127.0.0.1:2379 --client-cert-auth=true --enable-pprof=false --data-dir=/var/lib/k0s/etcd --listen-peer-urls=https://XXX.XXX.XXX.XXX:2380 --peer-key-file=/var/lib/k0s/pki/etcd/peer.key

You can talk to the etcd node from the single controller like so:

root@node-1 ~ # export ETCDCTL_API=3
root@node-1 ~ # etcdctl --endpoints=https://127.0.0.1:2379 --key=/var/lib/k0s/pki/etcd/server.key --cert=/var/lib/k0s/pki/etcd/server.crt --insecure-skip-tls-verify member list
880b4bba1672d7b3, started, node-1.eu-central-1, https://XXX.XXX.XXX.XXX:2380, https://127.0.0.1:2379

This will be important later when we want to remove some nodes from the etcd cluster (and in general for diagnostics).

ISSUE: konnectivity crash looping, unable to connect

Konnectivity is a massive pain in my neck, and I figured it would be. Konnectivity’s job is to make control plane -> worker comms possible, but it’s a bit more complicated of a setup than the node-to-node SSH tunnels that previous versions of kubernetes used.

Turns out it will sometimes fail to find a backend for seemingly no reason at all:

Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="Trace[952303301]: [30.003698446s] [30.003698446s] END" component=kube-apiserver
Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="I1206 13:42:30.936356    5715 trace.go:205] Trace[176610740]: \"Proxy via grpc protocol over uds\" address:10.96.71.211:443 (06-Dec-2021 13:42:00.932) (total time: 30004ms):" component=kube-apiserver
Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="Trace[176610740]: [30.004029084s] [30.004029084s] END" component=kube-apiserver
Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="I1206 13:42:30.936998    5715 trace.go:205] Trace[1123926905]: \"Proxy via grpc protocol over uds\" address:10.96.71.211:443 (06-Dec-2021 13:42:00.932) (total time: 30003ms):" component=kube-apiserver
Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="Trace[1123926905]: [30.003979291s] [30.003979291s] END" component=kube-apiserver
Dec 06 13:42:30 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:42:30" level=info msg="I1206 13:42:30.939681    5715 trace.go:205] Trace[735764735]: \"Proxy via grpc protocol over uds\" address:10.96.71.211:443 (06-Dec-2021 13:42:00.932) (total time: 30007ms):" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.752966    5715 clientconn.go:948] ClientConn switching balancer to \"pick_first\"" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.753448    5715 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="E1206 13:40:03.754883    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.752285    5715 clientconn.go:106] parsed scheme: \"\"" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.753796    5715 clientconn.go:106] scheme \"\" not registered, fallback to default scheme" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.753842    5715 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/k0s/konnectivity-server/konnectivity-server.sock  <nil> 0 <nil>}] <nil> <nil>}" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.753918    5715 clientconn.go:948] ClientConn switching balancer to \"pick_first\"" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.753945    5715 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick" component=kube-apiserver
Dec 06 13:40:03 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:03" level=info msg="I1206 13:40:03.755660    5715 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/k0s/konnectivity-server/konnectivity-server.sock  <nil> 0 <nil>}] <nil> <nil>}" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.731130    5715 clientconn.go:948] ClientConn switching balancer to \"pick_first\"" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.731161    5715 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.736462    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.738132    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.738157    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.740480    5715 client.go:111] DialResp not recognized; dropped" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.740942    5715 client.go:111] DialResp not recognized; dropped" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.741034    5715 client.go:111] DialResp not recognized; dropped" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.741788    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.742527    5768 server.go:354] \"Stream read from frontend failure\" err=\"rpc error: code = Canceled desc = context canceled\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.742581    5768 server.go:385] \"Failed to get a backend\" err=\"No backend available\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.742854    5768 server.go:354] \"Stream read from frontend failure\" err=\"rpc error: code = Canceled desc = context canceled\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.742870    5715 client.go:111] DialResp not recognized; dropped" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.743197    5768 server.go:354] \"Stream read from frontend failure\" err=\"rpc error: code = Canceled desc = context canceled\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="I1206 13:40:04.743160    5715 client.go:111] DialResp not recognized; dropped" component=kube-apiserver
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.743419    5768 server.go:354] \"Stream read from frontend failure\" err=\"rpc error: code = Canceled desc = context canceled\"" component=konnectivity
Dec 06 13:40:04 frodo-ctrl-0 k0s[5682]: time="2021-12-06 13:40:04" level=info msg="E1206 13:40:04.743453    5768 server.go:354] \"Stream read from frontend failure\" err=\"rpc error: code = Canceled desc = context canceled\"" component=konnectivity

RECOVERY: If things go bad, CAREFULLY reset (on the new node)

If things go bad, know that you can always reset on the newer controller node (NOT the original controller+worker node!):

root@ctrl-0:~# k0s reset
INFO[2021-12-06 13:01:43] Uninstalling the k0s service
INFO[2021-12-06 13:01:43] no config file given, using defaults
INFO[2021-12-06 13:01:43] deleting user: etcd
INFO[2021-12-06 13:01:43] deleting user: kube-apiserver
INFO[2021-12-06 13:01:43] deleting user: konnectivity-server
INFO[2021-12-06 13:01:43] deleting user: kube-scheduler
INFO[2021-12-06 13:01:43] deleting k0s generated data-dir (/var/lib/k0s) and run-dir (/run/k0s)
INFO[2021-12-06 13:01:43] k0s cleanup operations done. To ensure a full reset, a node reboot is recommended.
root@ctrl-0:~# rm -rf /etc/systemd/system/k0scontroller.service
root@ctrl-0:~# systemctl daemon-reload

Something like the above is what you should see when removing the new controller node from the cluster.

RECOVERY: recovering from k0s backup

If anything goes catastrophically wrong, remove all the new nodes and restore k0s from backup.

When you start the main node again you can (a sketch follows this list):

  • Use the experimental restart-as-single-node option.
  • Steal all the options from the old process’s command line (pgrep -la) and start it manually.
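
A very rough sketch of the restore path, assuming your k0s version ships the backup/restore subcommands (the backup file name is a placeholder for whatever k0s backup produced earlier):

# on the original main node, after the new controllers have been reset/removed
$ k0s stop
$ k0s restore /root/k0s-backups/k0s_backup_<timestamp>.tar.gz
$ k0s start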

Add ETCD peer URLs for the first new controller node manually

Set the ETCD peer address in /etc/k0s/k0s.yaml on the new node being added.

Run k0sctl apply

Update externalAddress on every node individually

The new controller nodes need to start using the konnectivity interface one by one and make connections to the services so that they can communicate…

See the relevant issue on k0s

DEBUG NOTES: Calico networking is down

While I was doing this myself I had two major pieces of infrastructure go awry:

  • I ran an apt-get upgrade, and due to improper DKMS installation, in-kernel ZFS regressed
  • Calico seems to have shit the bed for no apparent reason – maybe related to the storage issue? But no, it doesn’t use ZFS

For this reason 10.96.0.1 was not accessible – that address is the API server itself. It seemed like Calico wasn’t pulling together the networking quite right.

Digging out of the hole – get crictl

Learn about and download crictl from the kubernetes-sigs/cri-tools repo.

The setup should look like this:

root@node-2 ~ # cat /etc/crictl.yaml
runtime-endpoint: unix:///run/k0s/containerd.sock
image-endpoint: unix:///run/k0s/containerd.sock
timeout: 2
debug: true
pull-image-on-create: false
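
With that in place you can poke at containerd directly, e.g.:

$ crictl ps -a          # all containers, running or exited
$ crictl pods           # pod sandboxes
$ crictl logs <container id>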

It looks like Calico is having the same problem as everything else – it can’t connect to the API server:

2021-12-07 14:45:29.823 [INFO][9] startup/startup.go 458: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
2021-12-07 14:45:30.824 [INFO][9] startup/startup.go 458: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
2021-12-07 14:45:31.825 [INFO][9] startup/startup.go 458: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
2021-12-07 14:45:32.825 [INFO][9] startup/startup.go 458: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
2021-12-07 14:45:33.826 [INFO][9] startup/startup.go 458: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused

The errors I’m seeing

The errors that were pointing to an issue with Calico looked like the following:

Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.732060  363426 kuberuntime_manager.go:729] \"killPodWithSyncResult failed\" err=\"failed to \\\"KillPodSandbox\\\" for \\\"88d25b61-00cd-4940-8969-17a72f08f954\\\" with KillPodSandboxError: \\\"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\\\\\"8832012d4bbb773b4b361c6d59e1b2a75ac60e8ece9420e7b43ef33345b7c160\\\\\\\": error getting ClusterInformation: Get \\\\\\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\\\\\": dial tcp 10.96.0.1:443: connect: connection refused\\\"\"" component=kubelet
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.732119  363426 pod_workers.go:190] \"Error syncing pod, skipping\" err=\"failed to \\\"KillPodSandbox\\\" for \\\"88d25b61-00cd-4940-8969-17a72f08f954\\\" with KillPodSandboxError: \\\"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\\\\\"8832012d4bbb773b4b361c6d59e1b2a75ac60e8ece9420e7b43ef33345b7c160\\\\\\\": error getting ClusterInformation: Get \\\\\\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\\\\\": dial tcp 10.96.0.1:443: connect: connection refused\\\"\" pod=\"kubevirt/virt-handler-5grtb\" podUID=88d25b61-00cd-4940-8969-17a72f08f954" component=kubelet
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="time=\"2021-12-07T14:56:58.758461835+01:00\" level=error msg=\"StopPodSandbox for \\\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\\\" failed\" error=\"failed to destroy network for sandbox \\\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\\\": error getting ClusterInformation: Get \\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\": dial tcp 10.96.0.1:443: connect: connection refused\"" component=containerd
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.758643  363426 remote_runtime.go:144] \"StopPodSandbox from runtime service failed\" err=\"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\\\": error getting ClusterInformation: Get \\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\": dial tcp 10.96.0.1:443: connect: connection refused\" podSandboxID=\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\"" component=kubelet
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.758675  363426 kuberuntime_manager.go:958] \"Failed to stop sandbox\" podSandboxID={Type:containerd ID:745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61}" component=kubelet
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.758709  363426 kuberuntime_manager.go:729] \"killPodWithSyncResult failed\" err=\"failed to \\\"KillPodSandbox\\\" for \\\"c975301e-1261-4ad7-a607-c569dac68094\\\" with KillPodSandboxError: \\\"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\\\\\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\\\\\\\": error getting ClusterInformation: Get \\\\\\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\\\\\": dial tcp 10.96.0.1:443: connect: connection refused\\\"\"" component=kubelet
Dec 07 14:56:58 node-3.eu-central-1 k0s[363389]: time="2021-12-07 14:56:58" level=info msg="E1207 14:56:58.758735  363426 pod_workers.go:190] \"Error syncing pod, skipping\" err=\"failed to \\\"KillPodSandbox\\\" for \\\"c975301e-1261-4ad7-a607-c569dac68094\\\" with KillPodSandboxError: \\\"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\\\\\"745155de13d48c4ad58201c13fb1e8fee02fc746a94aed96abd0cd70b127be61\\\\\\\": error getting ClusterInformation: Get \\\\\\\"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\\\\\": dial tcp 10.96.0.1:443: connect: connection refused\\\"\" pod=\"monitoring/node-exporter-5f4rw\" podUID=c975301e-1261-4ad7-a607-c569dac68094" component=kubelet

Every workload that needs CNI networking is failing to start, all at the same time. Some of them are not failing, like kube-proxy – since it’s a fundamental part of the network setup and works at roughly the same level as the CNI, it isn’t dependent on the CNI running properly.

Well, what is 10.96.0.1? Of course, it’s the kubernetes API server, which normally sits at the very beginning of the service CIDR (mine is 10.96.0.0/12).

Looks like connections are failing to get to the kubernetes API server inside the cluster.

Fix: stale endpoint on one itty bitty service in default.

After HOURS of being lost, it turned out the issue was that the kubernetes Service (the in-cluster representation of the API server), which lives in the default namespace, had an old Endpoint:

$ kubectl get svc -n default
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP    185d
postgres     ClusterIP   10.103.44.224   <none>        5432/TCP   3d12h

There’s that address that none of the workloads can hit (10.96.0.1), and here’s the endpoints:

$ kubectl get endpoints -n default
NAME         ENDPOINTS              AGE
kubernetes   XXX.XXX.XXX.XXX:6443   185d
postgres     <none>                 3d12h

With the new HA setup we’re going into, this is obviously the wrong way to access the cluster – at this point we should be using the load balancer. That address shouldn’t be XXX.XXX.XXX.XXX, it should be YYY.YYY.YYY.YYY (the new load balancer).

Changing the endpoint manually via kubectl edit endpoint kubernetes -n default was enough to fix the kubernetes service. After fixing this, every workload that was stuck was able to resolve itself.
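
If you prefer a one-liner over kubectl edit, something like this should be equivalent (a sketch – YYY.YYY.YYY.YYY being the load balancer address as above; note that the API server may eventually reconcile this object itself):

$ kubectl -n default patch endpoints kubernetes --type merge \
    -p '{"subsets":[{"addresses":[{"ip":"YYY.YYY.YYY.YYY"}],"ports":[{"name":"https","port":6443,"protocol":"TCP"}]}]}'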

Scale issue: The load is too great for CPX11

It looks like you do sometimes get what you pay for – 3 “2 vCPU” nodes weren’t enough to handle even the relatively little traffic that my API server sees. I do have quite a few operators going, so maybe this is reasonable, but CPU usage was spiking to 100% and 200%.

I rescaled from CPX11 to CX21s, one by one. Before scaling the usage looked like this:

usage before scaling up from CPX11 to CX21

And after:

usage after scaling up from CPX11 to CX21

At this point I’m starting to think that I should have just gone with a single controller node after all.

Update on cost (“TCO”)

At this point the TCO for this HA setup is actually around 19.60EUR (~$24 USD as of 12/08/2021):

  • 4.90EUR Hetzner Cloud CX21 x 3
  • 4.90EUR Hetzner Cloud Load Balancer x 1

TCO for the cheapest single node controller I could get my hands on would be:

  • 23.53EUR Hetzner Robot Auction server x 1

That was the cheapest auction server I could find as of 12/08, but it does come with 8 dedicated vCores, and 32GB of RAM. Plenty powerful for a cluster of <5 worker nodes that needed to be orchestrated.

On the slightly more expensive side I could also have gone with

In retrospect I think going with a single auction-server node (and its drives) would have been the best choice… Technically the cheapest option is 3 nodes + a load balancer, but the pain of managing 3 ETCD instances, the load balancer, and the added complexity is not ideal.

Cleanup: Go through every workload and clean up/restart stuff

You’re probably going to have to go through every workload on k8s and rollout restart them.

You’re also going to have to forcibly remove terminated but persistent pods (when a pod is 0/1 containers and stuck in Terminating but is still not gone). These can hold up long-running replicasets/deployments/daemonsets.

A lot of the time what will tip you off is that the services will not have endpoints, or there will be n-1 pods for a given DS when there are n nodes.
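
The commands involved are the usual suspects (namespaces and names are placeholders):

$ kubectl -n <namespace> rollout restart deployment <name>
$ kubectl -n <namespace> rollout restart daemonset <name>
# force-delete pods stuck in Terminating (use with care)
$ kubectl -n <namespace> delete pod <pod> --grace-period=0 --force
# spot services that have lost their endpoints
$ kubectl get endpoints -A | grep '<none>'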

BONUS: k8s bug uncovered – with too many mounts, a kubelet log line fails to print

While working on this, it looks like there’s a bug with k8s. The symptoms look like the following:

Dec 08 14:31:46 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:46" level=info msg="I1208 14:31:46.992802 4160960 dynamic_cafile_content.go:167] Starting client-ca-bundle::/var/lib/k0s/pki/ca.crt" component=kubelet
Dec 08 14:31:46 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:46" level=info msg="I1208 14:31:46.994200 4160960 manager.go:165] cAdvisor running in container: \"/sys/fs/cgroup/cpu,cpuacct/system.slice/k0sworker.service\"" component=kubelet

Dec 08 14:31:51 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:51" level=info msg="I1208 14:31:51.999124 4160960 fs.go:131] Filesystem UUIDs: map[5fbac445-af53-40e4-a37f-975736a8d787:/dev/md2 64c368c1-5769-467e-8626-3881ba8cbe25:/dev/md1 7762831980165748493:/dev/nvme0n1p5 9b79a669-3b62-499c-a994-b068466c95af:/dev/md0 f334a003-af04-4a61-b11e-dd912d979517:/dev/zd0]" component=kubelet
Dec 08 14:31:52 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:52" level=error msg="Error while reading from Writer: bufio.Scanner: token too long" component=kubelet
Dec 08 14:31:52 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:52" level=warning msg="signal: broken pipe" component=kubelet
Dec 08 14:31:52 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:52" level=info msg="respawning in 5s" component=kubelet

Dec 08 14:31:57 node-5.eu-central-1 k0s[4057424]: time="2021-12-08 14:31:57" level=info msg=Restarted component=kubelet

Can you figure out what’s wrong there? How about where that line is coming from? Well I couldn’t, I had to actually clone the kubernetes code base so I could better search and find the issue.

The inner part of the line (Error while reading from writer) comes from kubernetes/vendor/github.com/sirupsen/logrus/writer.go. Luckily the code is pretty easy to read and understand (thanks Golang!).

Based on where the error is occurring, we can probably assume that the logger is choking on a line… that’s printed near “Filesystem UUIDs”… That we can find in the cadvisor codebase @ kubernetes/vendor/github.com/google/cadvisor/fs/fs.go (again, thanks to vendored deps in Golang). You can view the relevant line in fs.go online.

Two ideas immediately came to mind:

  • Can I disable cadvisor from k0s completely?
  • Based on the code in cAdvisor it looks like klog only tries to print this line if the verbosity is at least(?) level 1, so an easy way to test this is to set it to 0 and see if I can get away with it.

It looks like the quick workaround for me is setting the kubelet verbosity to 0. I filed an issue against google/cadvisor and called it a day.
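
For reference, one way to set that with k0s (a sketch – assumes your k0s version supports the --kubelet-extra-args flag; the unit path matches the one shown earlier):

# add --kubelet-extra-args="--v=0" to the ExecStart line in /etc/systemd/system/k0sworker.service, then:
$ systemctl daemon-reload
$ systemctl restart k0sworker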

Wrapup

Well, I’m not sure who will read this eventually, but if you ever think about doing the swap from 1 controller to 3 yourself, don’t do it. Just make a new cluster, your life will be easier. Maybe while you’re there make sure your workloads are easier to migrate anyway.

For those who will ignore the advice above and try to convert the existing cluster, hopefully these notes are helpful to you. My pain, your gain.