2019-12 K8s certificate expiration outage


tl;dr - I took down my k8s cluster by letting its TLS certificates expire. Regenerating certificates, deleting /var/lib/kubelet/pki/kubelet-client-current.pem, restarting the kubelet, recreating service accounts, and restarting pods/services/deployments/daemonsets was what got me back to a working system without blowing everything away.

Towards the end of 2019 I was visited by a small bit of failure adventure – resuscitating my tiny Kubernetes cluster after its TLS certificates had expired. It all started with a few error messages.

This post is written in a mix of stream-of-consciousness and retrospective (as I often do) as these notes were taken as I was going through the actual issue, but this post is being made long after. Feel free to skim more so than read (the Table of Contents above is there for you, not me), and jump around. Hopefully someone (including future me) learns from this.

Leadup to the outage

The cluster was actually running perfectly fine (the applications it hosted were up, containers were running, etc) when I realized that for some reason local kubectl commands were not properly deploying applications – I was receiving a very weird error about authentication failing. The error looked something like this:

Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
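In hindsight, the quickest way to see what was going on would have been to check the API server's serving certificate directly; a hedged sketch (the path assumes a kubeadm-provisioned control plane):

```shell
# Print the validity window of the API server's serving cert on disk.
openssl x509 -noout -dates -in /etc/kubernetes/pki/apiserver.crt
```

An already-past notAfter date here would have pointed straight at the real problem.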

This really confused me because I knew that I hadn’t changed the kubectl or k8s configuration at all recently (the latest thing I did was an upgrade to a newer k8s version with the help of kubeadm), so after some head scratching I figured this must be some sort of local problem – the kubectl on the actual server should have no trouble communicating with the API server, since “it’s right there”. Of course, when I tried it there, I was greeted with the same error.

So now I’m panicking a little bit, and I check the kubelet process from systemd’s perspective (systemctl status kubelet) and it looks to be running just fine… Maybe this is something that would be fixed with a restart? Possibly some sort of effect from the last upgrade? I went ahead and rebooted the server – Well that was a huge mistake, because as soon as I restarted the server, kubelet itself didn’t come back up. This is the zero hour of the outage.

Starting to debug the outage

As the realization that I was in the middle of an outage was dawning on me, I gathered my wits and mentally slapped myself – “Don’t panic”. I started thinking of all the things I knew to do with a misbehaving cluster. Since kubelet stopped running after the restart it’s an obvious place to start, but not the only possible option. A while back I switched from a systemd-managed control plane (kube-apiserver, controller-manager, etc. managed by systemd as individual Units) to the kubelet-managed control plane (you specify “manifests” in /etc/kubernetes/manifests that are started up by the kubelet). This setup means that if something is going wrong, it’s 99% of the time going to be visible through the kubelet first (since it’s the kubelet’s job to start everything else) – as long as systemd is doing its job (and running the kubelet unit), systemd itself isn’t to blame.

Here’s the output I got from looking at the journalctl output for kubelet:

root@Ubuntu-1810-cosmic-64-minimal ~ # journalctl -xe -u kubelet
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit kubelet.service has been scheduled, as the result for
-- the configured Restart= setting for the unit.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: Stopped kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit kubelet.service has finished shutting down.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit kubelet.service has finished starting up.
--
-- The start-up result is RESULT.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Dec 21 04:04:27 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: I1221 04:04:26.949931    3871 server.go:410

OK so now we’ve got some information (with the help of systemd and journald) to go on as to why the kubelet unit was failing. There are two concerning messages there; let’s dive into each one.

Red Herring #1: resolv-conf flag deprecation

The message regarding --resolv-conf’s deprecation was the first one in the logs, so it was the first one I tackled. Trying to track down this issue led me to a GitHub issue which was extremely helpful.

The fix was easy, except for a few things:

  • My first instinct was to check /etc/systemd/system for any systemd unit configuration overrides, as I didn’t remember specifically setting resolv-conf (the path was /etc/systemd/system/kubelet.conf.d/10-kubelet.conf for my setup)
  • The configuration was actually in /var/lib/kubelet/kubeadm-flags.env which is used by kubeadm

There was quite a bit of head scratching, tracing of ENV variables, and running systemctl show <service name> to track down where the --resolv-conf flag was being passed to kubelet from. Once I found the kubeadm-flags.env file I backed it up (a simple cp kubeadm-flags.env kubeadm-flags.env.bak) and took a look inside to see why it was being used. Well, it turns out there’s a very good reason – the known limitations of Linux’s libc. kubeadm is actually doing the right thing here; I removed the line just to make the flag go away (something I’d later go on to regret).
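For anyone tracing a similar mystery flag, the combination that finally worked for me looked roughly like this (paths are from my kubeadm setup and may differ on yours):

```shell
# Ask systemd where the kubelet unit's environment actually comes from...
systemctl show kubelet -p Environment -p EnvironmentFiles

# ...and then look for the flag in kubeadm's env file.
grep -- 'resolv-conf' /var/lib/kubelet/kubeadm-flags.env
```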

Red Herring #2: unable to find bootstrap config

It wasn’t shown above but there was actually another error – one about failing to find a “bootstrap config” (the below isn’t my log line but one from a similar github issue):

Nov 27 09:00:16 node1 kubelet[27284]: F1127 09:00:16.566510   27284 server.go:262] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory

At first I thought this was a real issue, but it didn’t make sense – k8s did not need to bootstrap; it’d been running for a long time. This error message could be safely ignored.

A whiff of the real bug: Expired Certificates

Turns out the real bug was that the certificates that protect the API server had expired. Unfortunately the relevant kubeadm diagnostic tools were reporting that they were good until 2020, which was very confusing. The only realistic fix for this is to back up and then completely remove the old certs and configuration and replace them. Some links helped convince me of that.
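For reference, the diagnostic tooling in question: on the kubeadm version I was running, the subcommand still lived under alpha (newer releases promote it to kubeadm certs check-expiration), and you can always double-check against the files on disk:

```shell
# kubeadm's own report (v1.15/v1.16-era syntax):
kubeadm alpha certs check-expiration

# Trust-but-verify against the actual files:
for c in /etc/kubernetes/pki/*.crt; do
  printf '%s: ' "$c"
  openssl x509 -noout -enddate -in "$c"
done
```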

By doing this I’d essentially performed a certificate rotation, so none of the kubeconfigs (whether on the server or on my computer) I’d been using would work anymore. I figured I’d just back up the old config and copy over the new one:

root@Ubuntu-1810-cosmic-64-minimal ~ # cp ~/.kube/config ~/.kube/config.bak
root@Ubuntu-1810-cosmic-64-minimal ~ # cp /etc/kubernetes/admin.conf ~/.kube/config

And with that, the kubeadm commands were super helpful for generating new certs for everything – at this point I thought the problem was instantly solved, because the kubelet now started properly!
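The renewal itself was roughly the following (hedged from memory; this is kubeadm v1.16-era syntax, and newer versions drop the alpha prefix):

```shell
# Regenerate every control-plane certificate from the existing CA...
kubeadm alpha certs renew all

# ...and restart the kubelet so the static-pod control plane picks them up.
systemctl restart kubelet
```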

Another bug pops up: The node name seems to have changed

Shortly after the kubelet started running again, I noticed that there were quite a few errors –

Dec 21 04:28:15 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:15.895330    1697 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Unauthorized
Dec 21 04:28:15 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:15.915759    1697 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:28:16 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:16.016031    1697 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:28:16 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:16.096650    1697 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Unauthorized

If you’re a super close reader, you might notice that there’s a difference between the node name and the actual machine name – the hostname was Ubuntu-... but the node name was ubuntu-..., and I thought this mismatch was enough to throw the kubelet off and be the reason it was failing. After changing the hostname and rebooting I quickly found that wasn’t the problem; the key was a little bit higher up in the logs:

Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: I1221 04:41:48.051840    1684 kubelet_node_status.go:72] Attempting to register node ubuntu-1810-cosmic-64-minimal
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.052907    1684 kubelet_node_status.go:94] Unable to register node "ubuntu-1810-cosmic-64-minimal" with API server: Unauthorized
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.132407    1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.232657    1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.247440    1684 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Unauthorized
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.332889    1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.433163    1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.447444    1684 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Unauthorized

Clearly the real problem is the Unable to register node "...." message – basically the problem that I’d had at the very outset of this debacle. At this point it looked like I might have to do some of the etcd fixes from the resources linked earlier, but upon closer inspection those fixes were not really for “replacing” etcd data per se, so I gave up on that plan (also, I didn’t want to upset anything unnecessarily inside the etcd cluster for fear of corrupting state).

After lots of head scratching, I realized that the certs must not have been properly fixed, and some resources helped me towards this realization.

So no matter how much kubeadm’s cert commands helped me rotate the certs, the problem was that the API server was still rejecting requests from the node (the kubelet instance) itself. Rotating the certificates on disk clearly hadn’t fixed whatever credential was actually in play (and at this point I couldn’t think of anywhere outside of /etc/kubernetes that contained secret information like this).

Chasing down the bug: What does the API server have to say?

So now that kubelet is running (and erroring), it’s also running at least some of the containers that need to run – in particular the kube-apiserver itself! I can investigate what’s happening with the API server by reading the logs of its container directly. Normally I’d do this with kubectl, but since the API server isn’t allowing access, I have to go in through the container runtime itself. I happen to run (and am very happy with) containerd, which is managed with crictl, so I needed to do a little bit more than just run docker ps:

root@ubuntu-1810-cosmic-64-minimal ~ # crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps

The output of that command (ps) is the list of containers that were running, as one might expect:

CONTAINER ID        IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID
e56889fe6d0fd       201c7a8403125       25 minutes ago      Running             kube-apiserver            1                   946fe70f25e63
5554ada7512ff       2d3813851e874       26 minutes ago      Running             kube-scheduler            7                   ca6192eccc282
10b0055143639       8328bb49b6529       26 minutes ago      Running             kube-controller-manager   7                   58cbb0d0dd566
46c794d57a713       2c4adeb21b4ff       About an hour ago   Running             etcd                      4                   d3332a08acb21

As you can see, there are some containers running – the kube-apiserver, scheduler, controller-manager and etcd are actually all running – but clearly they’re having some issues… Let’s dig into it:

# crictl --config=/root/.crictl.yaml logs e56889fe6d0fd # this is a different ID but pretend it's the same as the apiserver above
<lots of output that zoomed by super fast>

I wanted to read the output from the top, so I brought in the super useful unix utility less:

# crictl --config=/root/.crictl.yaml logs e56889fe6d0fd 2>&1 | less

(NOTE: stderr doesn’t normally go to stdout, which is what less reads, so you have to redirect it)

This led me to what seemed like the smoking gun:

E1221 05:41:36.590187       1 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/<ip of my machine>, ResourceVersion: 0, AdditionalErrorMsg:
E1221 05:41:36.625877       1 authentication.go:65] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]

And cue the confusion – the problem is that the self-signed certificate that is being used by every fucking piece of kubernetes is not being accepted by the API server??? Here’s my thought process:

  • “everything” pulls (certs/config/etc) from /etc/kubernetes
  • I can see the manifest for kube-apiserver (/etc/kubernetes/manifests) and it seems to only be looking in /etc/kubernetes
  • The kube-apiserver and kubelet have different certs?

If we’ve replaced the certificates in /etc/kubernetes, what else could there possibly be? Well, eventually I found the manual certificate renewal guide, which was a great step in the right direction. Then I inspected the kubelet config file documentation, which was a step backwards but a good resource to have for the future. I expected to find something in there pointing to a setting that explained how the kubelet was getting a different cert, but only the CAFile was specified (and it was the same one I expected).

After a ton of looking around I ran into a message that mentioned a file that was supposed to be present at kubelet startup: /var/lib/kubelet/pki/kubelet-client-current.pem. That’s when it hit me – I’d totally forgotten about the /var/lib/kubelet area for storing information. Evidently:

root@ubuntu-1810-cosmic-64-minimal ~ # ls -la /var/lib/kubelet/pki/
total 28
drwxr-xr-x 2 root root 4096 Sep 11 19:39 .
drwxr-xr-x 9 root root 4096 Dec 21 07:19 ..
-rw------- 1 root root 2794 Dec  9  2018 kubelet-client-2018-12-09-12-38-41.pem
-rw------- 1 root root 1143 Dec  9  2018 kubelet-client-2018-12-09-12-39-12.pem
-rw------- 1 root root 1143 Sep 11 19:39 kubelet-client-2019-09-11-19-39-37.pem
lrwxrwxrwx 1 root root   59 Sep 11 19:39 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2019-09-11-19-39-37.pem
-rw-r--r-- 1 root root 2307 Dec  9  2018 kubelet.crt
-rw------- 1 root root 1679 Dec  9  2018 kubelet.key

The file listing of the folder makes it really clear – look at how old kubelet-client-current.pem is!
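One command would have confirmed the suspicion immediately (hedged, but this is exactly the kind of check I wish I’d run hours earlier):

```shell
# When does the client cert the kubelet is actually presenting expire?
openssl x509 -noout -enddate \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem
```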

So the kubelet was actually using the entirely wrong cert to connect as a client to the API server, not the other way around. A very helpful kubeadm issue also drew me to the usage of the file – it looks like /etc/kubernetes/apiserver-kubelet-client is the client cert that the API server SENDS to the kubelet, and /var/lib/kubelet/pki/kubelet-client*.pem is the client cert that the kubelet sends to the API server. One more very helpful SO post also helped me realize this.

A wild solution appears!

So now that I know what the real issue is (also corroborated by a Github issue), the real solution was actually just to delete kubelet-client-current.pem and restart the kubelet. After restarting the kubelet I was greeted with this output:

Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: W1221 07:48:35.105958   16856 util_unix.go:103] Using "/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///run/containerd/containerd.sock".
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.105976   16856 remote_image.go:50] parsed scheme: ""
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.105985   16856 remote_image.go:50] scheme "" not registered, fallback to default scheme
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.106002   16856 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock 0  <nil>}] <nil>}
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.106010   16856 clientconn.go:577] ClientConn switching balancer to "pick_first"
Dec 21 07:48:45 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:45.097044   16856 transport.go:132] certificate rotation detected, shutting down client connections to start using new credentials
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: E1221 07:48:55.394963   16856 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]:         For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.395963   16856 kuberuntime_manager.go:207] Container runtime containerd initialized, version: v1.2.7, apiVersion: v1alpha2
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.396870   16856 server.go:1065] Started kubelet
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.397061   16856 server.go:145] Starting to listen on 0.0.0.0:10250

And just like that, no more authentication errors! The cluster is working now, right? NOPE. With this impromptu rotation of all the cluster secrets, there’s a small problem briefly discussed in the CloudBlue knowledgebase article – the service accounts for the cluster are borked all over; important services like kube-router and coredns cannot pull from the API because they have old secrets with old credentials.

At this point kubelet is starting up and running the containers that etcd (via the API server) tells it should be running, but a bunch of them won’t run correctly because they’re relying on old, incorrect credentials to connect to the API server.

Debugging kube-router

At first glance the pods in kube-system look to be functioning correctly:

root@ubuntu-1810-cosmic-64-minimal /var/lib/kubelet/pki # k get pods -n kube-system
NAME                                                    READY   STATUS    RESTARTS   AGE
cert-manager-7648c6f789-79wwv                           1/1     Running   0          171m
coredns-584795fc57-8t775                                0/1     Running   5          5m50s
coredns-584795fc57-9d92w                                0/1     Running   3          3m
etcd-ubuntu-1810-cosmic-64-minimal                      1/1     Running   6          49d
kube-apiserver-ubuntu-1810-cosmic-64-minimal            1/1     Running   27         29m
kube-controller-manager-ubuntu-1810-cosmic-64-minimal   1/1     Running   27         27m
kube-router-5pnrk                                       1/1     Running   7          15m
kube-scheduler-ubuntu-1810-cosmic-64-minimal            1/1     Running   27         29m

But after a few seconds I realize how wrong things are – coredns, kube-router, and kube-proxy (which shouldn’t have even been present at all) were having issues and continuously restarting. Thanks to the CloudBlue kb article I realized that I had to re-create the service accounts (and possibly do other things) to get these applications on to the new certs.

Unfortunately it wasn’t as simple as running kubectl delete sa kube-router -n kube-system, because there were more broken systems – that was enough to fix kube-router, but there’s one more big, important piece of routing in a k8s cluster – CoreDNS.

Debugging CoreDNS

As I noticed coredns was failing I looked at the logs:

$ k logs coredns-584795fc57-8t775 -n kube-system
.:53
2019-12-21T08:18:09.223Z [INFO] CoreDNS-1.3.1
2019-12-21T08:18:09.223Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-12-21T08:18:09.223Z [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2019-12-21T08:18:29.224Z [ERROR] plugin/errors: 2 7657599066091208900.4473028311675897983. HINFO: unreachable backend: read udp 10.244.0.188:39977->213.133.99.99:53: i/o timeout
2019-12-21T08:18:32.224Z [ERROR] plugin/errors: 2 7657599066091208900.4473028311675897983. HINFO: unreachable backend: read udp 10.244.0.188:49238->213.133.99.99:53: i/o timeout
E1221 08:18:34.223023       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1221 08:18:34.223023       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-584795fc57-8t775.unknownuser.log.ERROR.20191221-081834.1: no such file or directory

Well, some of those errors I have no idea about, but it looks like coredns is having some problems contacting something at 10.96.0.1 (weirdly enough, no authentication issues, just a straight timeout). I’m assuming this is the API server, but I wasn’t sure, because 10.96.0.1 is not where I expect the API server to be accessible based on how the cluster is set up (it’s not one of the cluster ranges, for services or anything)… Let’s do some digging:

$ k get pods -n kube-system -o=wide
NAME                                                    READY   STATUS             RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
cert-manager-7648c6f789-79wwv                           1/1     Running            0          174m    10.244.0.182   ubuntu-1810-cosmic-64-minimal   <none>           <none>
coredns-584795fc57-8t775                                0/1     CrashLoopBackOff   8          9m12s   10.244.0.188   ubuntu-1810-cosmic-64-minimal   <none>           <none>
coredns-584795fc57-9d92w                                0/1     Error              6          6m22s   10.244.0.190   ubuntu-1810-cosmic-64-minimal   <none>           <none>
etcd-ubuntu-1810-cosmic-64-minimal                      1/1     Running            6          49d     <server ip>    ubuntu-1810-cosmic-64-minimal   <none>           <none>
kube-apiserver-ubuntu-1810-cosmic-64-minimal            1/1     Running            27         32m     <server ip>    ubuntu-1810-cosmic-64-minimal   <none>           <none>
kube-controller-manager-ubuntu-1810-cosmic-64-minimal   1/1     Running            27         31m     <server ip>    ubuntu-1810-cosmic-64-minimal   <none>           <none>
kube-router-5pnrk                                       0/1     Error              9          19m     <server ip>    ubuntu-1810-cosmic-64-minimal   <none>           <none>
kube-scheduler-ubuntu-1810-cosmic-64-minimal            1/1     Running            27         32m     <server ip>    ubuntu-1810-cosmic-64-minimal   <none>           <none>

Well what about the services for kube-system?

$ k get svc -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   376d

Only the one, which seems right, but more importantly there’s that IP range I didn’t recognize – the DNS service is listening at 10.96.0.10. It doesn’t seem so impossible now that the API server would be running at 10.96.0.1; I wouldn’t have expected it there, but I do know that the API server generally runs at a xxx.xxx.xxx.1 IP. At this point I thought that maybe kube-proxy having been set up (even though I run kube-router) might have had something to do with this, so I figured I’d just do a restart in case any iptables shenanigans were happening and going badly.
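For what it’s worth, the mystery resolves cleanly: kubeadm’s default service CIDR is 10.96.0.0/12, and the API server’s in-cluster address is the first IP in that range, exposed as the kubernetes service in the default namespace. A quick way to confirm (which needs a working API connection – exactly what I didn’t have at the time):

```shell
# The ClusterIP shown here (10.96.0.1 on a default kubeadm install)
# is the in-cluster address of the API server.
kubectl get svc kubernetes -n default
```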

DNS was one of the new problems

Restarting didn’t solve all my problems, as usual, so it was time to dig into the DNS issues more. Remember that note earlier about systemd’s special /etc/resolv.conf handling? Welp, this might be the time to point out an error coredns was seeing a lot of:

Dec 21 08:32:59 ubuntu-1810-cosmic-64-minimal kubelet[18317]: E1221 08:32:59.027711   18317 dns.go:135] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 213.133.98.98 213.133.99.99 213.133.100.100

This seems pretty related, but I was completely stumped at this point – putting the --resolv-conf argument back (pointed at the special systemd resolv.conf) didn’t work, and nothing seemed to get through the DNS.

At this point I’d been debugging for ~4-6 hours straight (holy downtime, batman) and it was time to take a break. So I unwound with some of my favorite frustration, Counter Strike: Global Offensive. After I was sufficiently frustrated with my inability to consistently BOOM HEADSHOT, I came back to the problem and luckily found the solution really quickly – time away is often a great help to problem solving.

Fixing kube-router’s private kubeconfig

It turns out kube-router has its own kubeconfig that it uses to access the cluster, and this needed to be updated as well! coredns isn’t going to work if the routing layer (which essentially performed the kube-proxy role) can’t access the API server! Remedying the problem was really easy:

# cp /var/lib/kube-router/kubeconfig /var/lib/kube-router/kubeconfig.bak
# cp /etc/kubernetes/admin.conf /var/lib/kube-router/kubeconfig

With that done, all I needed to do was delete the old kube-router pod, and as soon as a new one spun up in its place the errors were gone – kube-router could properly route the cluster traffic, which meant that coredns could reach what it needed to (the API server) and do its job.

Debugging more infrastructure pieces (traefik, cert-manager)

After the internal services seemed to be working fine, I tried to access some of my websites (like this blog) and noticed that I was getting to a page (yay), but that page had the wrong certificate! I use Traefik as my external-to-internal Ingress Controller, and I was getting the Traefik default certificate. jetstack/cert-manager is a fantastic project and is nearly bulletproof, so I didn’t figure it was a problem with the right certs not being there, but rather a problem with the ingress controller not being able to do its job somehow (probably by not being able to talk to the API server).

Of course, a quick check of the logs for traefik confirmed this:

E1221 12:31:58.471754       1 reflector.go:205] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Service: Unauthorized
E1221 12:31:58.472603       1 reflector.go:205] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Endpoints: Unauthorized

I had to delete and re-make the service account for traefik and then delete the DS for the ingress controller itself – luckily my cluster is all checked into source control (I’m following the makeinfra pattern), so it was only a make away. After getting traefik working, I also checked cert-manager just in case and guess what I found:

E1221 12:43:42.150531       1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized
E1221 12:43:46.414256       1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized
E1221 12:43:49.415112       1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized

As you might expect, cert-manager also has to talk to the API server, and it was getting the same errors. In general what I had to do here was re-create the appropriate service accounts, then kill the related pods/daemonsets and re-apply them. The service account invalidations caused by the expiration mean that every single namespace with service accounts that might have used the k8s API is in danger.
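The cleanup loop I ended up with looked roughly like this (a hedged sketch; the service account names are from my cluster, and the re-apply step is a placeholder you’d fill with your own manifests and selectors):

```shell
# Deleting a ServiceAccount makes the controller-manager mint a fresh
# token secret, but running pods keep the stale one mounted until recreated.
for sa in kube-router coredns traefik-ingress-controller cert-manager; do
  kubectl -n kube-system delete sa "$sa"
done

# Then re-apply the manifests to recreate the accounts and bounce the pods:
#   kubectl apply -f <your-manifests>
#   kubectl -n kube-system delete pod -l <your-selector>
```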

~8-12 hours later, success

After roughly 8-12 hours of downtime, I was finally serving traffic again! This was an absolutely wild ride, but I was glad to put it behind me – for now I’ve checked my most important projects and they seem to be fine.

Retrospective: How do I prevent this from happening again?

This is one of those things that it feels like you just have to beat with discipline – not letting a cert expire requires good automated rollover reminders. kubeadm actually has automatic certificate renewal, but I don’t think it was enabled on the kubernetes version I was using previously. A systemd timer might help here, but it’d be easy to forget if I ever tear down the cluster. I guess a regular calendar reminder is good enough for now – to at least check and ensure that the renewal will happen automatically next time.
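Until then, even a dumb periodic check beats nothing; here’s a hedged sketch of what I’d drop into cron or a systemd timer (paths assume kubeadm; the 30-day threshold is arbitrary):

```shell
#!/bin/sh
# Warn when any control-plane cert expires within 30 days.
# openssl's -checkend takes seconds: 30 * 24 * 3600 = 2592000.
for c in /etc/kubernetes/pki/*.crt; do
  if ! openssl x509 -checkend 2592000 -noout -in "$c" >/dev/null; then
    echo "WARNING: $c expires within 30 days" >&2
  fi
done
```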

The ability to blow everything away and completely rebuild the system (in the immutable infrastructure sense) would have been great, but it’s really hard to take backups without access to the cluster – I’m using dynamically provisioned local volumes thanks to OpenEBS, and it’s more than a bit of a pain to start migrating the storage over.

Since newer versions of k8s don’t have this problem I don’t think I’ll have to worry about it again, but I guess I’m going to find out next year.

Aside: Resources that helped

Here is a listing of some of the resources from the article that helped me and might help you if you run into this:

Aside: Are there any truly simple pre-shared-key secure communication schemes out there?

I really, really want a way to do secure communication without dealing with the complexity of TLS. It’s such a source of complexity that I would greatly appreciate a way to secure communications using just pre-shared keys. Of course, TLS already has this via the rarely used TLS-PSK option, but it feels like there’s room for a minimal communication-encryption scheme that works given pre-shared symmetric/asymmetric keys – it just seems like no one has tried to build and standardize one. The closest possibly-trustable thing I’ve found is COSE (CBOR Object Signing and Encryption), but it doesn’t actually seem simple enough to warrant moving away from TLS.

If you know some convincing reasons why this isn’t a good idea please send me an email – would it just turn into trying to re-build TLS? Is a simple preshared key + nonce not enough? I’m thinking something simple like:

  1. Have your unicode-encoded message text bytestring
  2. Add a nonce
  3. Add a MAC
  4. Sign & Encrypt
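As a strawman only (this is me sketching, not a vetted construction – a real scheme needs separate keys per direction, guaranteed-unique nonces, and constant-time MAC comparison, which is exactly the kind of detail that balloons into TLS), the symmetric encrypt-then-MAC version of those steps with stock openssl:

```shell
# Pre-shared keys, distributed out of band (one for encryption, one for MAC).
KEY=$(openssl rand -hex 32)
MACKEY=$(openssl rand -hex 32)

# Steps 1+2: the message plus a fresh nonce.
printf 'deploy please' > msg.txt
NONCE=$(openssl rand -hex 16)

# Steps 3+4: encrypt (AES-256-CTR), then MAC nonce+ciphertext (HMAC-SHA256).
openssl enc -aes-256-ctr -K "$KEY" -iv "$NONCE" -in msg.txt -out msg.enc
{ printf '%s' "$NONCE"; cat msg.enc; } \
  | openssl dgst -sha256 -hmac "$MACKEY" > msg.mac
```

The receiver recomputes the HMAC over nonce+ciphertext, compares, and only then decrypts with the same key and nonce.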

Maybe it would be enough to just do something crazy like require a WireGuard setup with unsecured HTTP/1/2/3 for deploying that piece of software? All I want is a way for two programs to talk to each other without requiring the complexity of x509 certificates. Somehow carting around private/public keys is more palatable to me – especially in the case where I don’t need third-party attestation for possibly unknown clients.

Wrapup

Well, this was certainly quite an experience for me. While I ended up mostly learning about kubeadm’s defaults for setting up clusters and some server configuration things, this was an outage I certainly wouldn’t want to recur.

Hopefully someone out there sees this just in the nick of time and has an idea of how to recover their cluster! Or in the best case, you have easy-to-shift storage and can just completely blow away the cluster itself and avoid this entirely (though I think even that would be hard if no one could talk to your control plane because its certs were expired).

Did you find this read beneficial? Send me questions/comments/clarifications.
Want my expertise on your team/project? Send me interesting opportunities!