tl;dr - I took down my k8s cluster by letting its TLS certificates expire. Regenerating the certificates, deleting /var/lib/kubelet/pki/kubelet-client-current.pem, restarting the kubelet, recreating service accounts, and restarting pods/services/deployments/daemonsets was what got me back to a working system without blowing everything away.
Towards the end of 2019 I was visited by a small bit of failure adventure – resuscitating my tiny Kubernetes cluster after its TLS certificates had expired. A few error messages and many hours of debugging later, I got everything working again.
This post is written in a mix of stream-of-consciousness and retrospective (as my posts often are) – the notes were taken as I was going through the actual issue, but the post itself is being published long after. Feel free to skim rather than read (the Table of Contents above is there for you, not me), and jump around. Hopefully someone (including future me) learns from this.
The cluster was actually running perfectly fine (the applications it hosted were up, containers were running, etc.) when I realized that for some reason local kubectl commands were not properly deploying applications – I was receiving a very weird error about authentication failing. The error looked something like this:
Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
This really confused me because I knew that I hadn’t changed the kubectl or k8s configuration at all recently (the latest thing I did was an upgrade to a newer k8s version with the help of kubeadm), so after some head scratching I figured this must be some sort of local problem – the kubectl on the actual server should be able to easily communicate with the API server, since “it’s right there”. As one would expect, I was greeted with the same error.
So now I’m panicking a little bit, and I check the kubelet process from systemd’s perspective (systemctl status kubelet) and it looks to be running just fine… Maybe this is something that would be fixed with a restart? Possibly some sort of effect from the last upgrade? I went ahead and rebooted the server. Well, that was a huge mistake, because as soon as I restarted the server, kubelet itself didn’t come back up. This is the zero hour of the outage.
As the realization that I was in the middle of an outage was dawning on me, I gathered my wits and mentally slapped myself – “Don’t panic”. I started thinking of all the things I knew to do with a misbehaving cluster. Since kubelet stopped running after the restart, it’s an obvious place to start, but not the only possible option. A while back I switched from a systemd-managed control plane (kube-apiserver, controller-manager, etc. managed by systemd as individual units) to the kubelet-managed control plane (you specify “manifests” in /etc/kubernetes/manifests that are started up by the kubelet as static pods). This setup means that if something is going wrong, it’s 99% of the time going to be visible through the kubelet first (since it’s the kubelet’s job to start everything else); as long as systemd is doing its job (and running the kubelet unit), it’s not to blame.
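For context, on a stock kubeadm setup that manifests directory holds the static pod definitions for the entire control plane – something along these lines (a typical listing, shown for illustration):
root@Ubuntu-1810-cosmic-64-minimal ~ # ls /etc/kubernetes/manifests
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml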
Here’s the output I got from looking at the journalctl output for kubelet:
root@Ubuntu-1810-cosmic-64-minimal ~ # journalctl -xe -u kubelet
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit kubelet.service has been scheduled, as the result for
-- the configured Restart= setting for the unit.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: Stopped kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit kubelet.service has finished shutting down.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit kubelet.service has finished starting up.
--
-- The start-up result is RESULT.
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Dec 21 04:04:26 Ubuntu-1810-cosmic-64-minimal systemd [1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Dec 21 04:04:27 Ubuntu-1810-cosmic-64-minimal kubelet [3871]: I1221 04:04:26.949931 3871 server.go:410
OK, so now we’ve got some information (with the help of systemd and journald) to go on as to why the kubelet unit was failing. There are two concerning messages there, so let’s dive into each one.
The message regarding --resolv-conf’s deprecation was the first one in the logs, so it was the first one I tackled. Trying to track down this issue led me to a promising GitHub issue which was extremely helpful.
The fix was easy, except for a few places I had to check first:
- /etc/systemd/system, for any systemd unit configuration overrides, as I didn’t remember specifically setting resolv-conf (the path was /etc/systemd/system/kubelet.conf.d/10-kubelet.conf for my setup)
- /var/lib/kubelet/kubeadm-flags.env, which is used by kubeadm
There was quite a bit of head scratching, tracing of ENV variables, and running systemctl show <service name> to try to track down where the --resolv-conf flag was being passed to kubelet from. Once I found the kubeadm-flags.env file I backed it up (a simple cp kubeadm-flags.env kubeadm-flags.env.bak) and took a look inside to see why it was being used. Well, it turns out there’s a very good reason – the known limitations of linux’s libc. kubeadm is actually doing the right thing here, but I removed the line just to make the flag go away (something I’d later go on to regret).
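For reference, kubeadm-flags.env is just an environment file that the kubelet’s systemd unit sources; on a systemd-resolved machine like mine it contains something along these lines (the exact flags depend on the kubeadm version and runtime, so treat this as illustrative rather than my exact contents):
root@Ubuntu-1810-cosmic-64-minimal ~ # cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --resolv-conf=/run/systemd/resolve/resolv.conf"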
It wasn’t shown above but there was actually another error – one about failing to find a “bootstrap config” (the below isn’t my log line but one from a similar github issue):
Nov 27 09:00:16 node1 kubelet[27284]: F1127 09:00:16.566510 27284 server.go:262] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
At first I thought this was a real issue, but it didn’t make sense – k8s did not need to bootstrap, it’d been running for a long time. This error message was safely ignored.
Turns out the real bug was that the certificates that protect the API server had expired. Unfortunately the relevant kubeadm diagnostic tools were reporting that they were good until 2020, which was very confusing. The only realistic fix for this was to back up and then completely remove the old certs and configuration and replace them. Some linked resources helped convince me of that.
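(For reference: depending on the kubeadm version, the built-in check is kubeadm alpha certs check-expiration on older releases or kubeadm certs check-expiration on newer ones, and you can always double-check an individual cert with openssl directly:)
root@Ubuntu-1810-cosmic-64-minimal ~ # openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt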
Basically, by regenerating everything I’d essentially done a certificate rotation, so none of the ~/.kube/configs (whether on the server or on my own computer) I’d been using would work anymore. I figured I’d just back the old one up and copy over the new config:
root@Ubuntu-1810-cosmic-64-minimal ~ # cp ~/.kube/config ~/.kube/config.bak
root@Ubuntu-1810-cosmic-64-minimal ~ # cp /etc/kubernetes/admin.conf ~/.kube/config
And with that, the kubeadm commands were super helpful for generating new certs for everything – with this I thought the problem was instantly solved – the kubelet now started properly!
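For the record, the renewal boils down to something roughly like this on a kubeadm cluster (a sketch – the subcommand moved out of alpha in later releases, and you may need to pass the kubeadm config your cluster was built with). The kubeconfig files have to be moved aside first because the kubeconfig phase won’t overwrite existing files:
root@Ubuntu-1810-cosmic-64-minimal ~ # kubeadm alpha certs renew all
root@Ubuntu-1810-cosmic-64-minimal ~ # mkdir -p /root/k8s-conf-backup && mv /etc/kubernetes/*.conf /root/k8s-conf-backup/
root@Ubuntu-1810-cosmic-64-minimal ~ # kubeadm init phase kubeconfig all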
Shortly after the kubelet started running again, I noticed that there were quite a few errors:
Dec 21 04:28:15 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:15.895330 1697 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Unauthorized
Dec 21 04:28:15 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:15.915759 1697 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:28:16 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:16.016031 1697 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:28:16 Ubuntu-1810-cosmic-64-minimal kubelet[1697]: E1221 04:28:16.096650 1697 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Unauthorized
If you’re a super close reader, you might notice that there’s a difference between the node name and the actual machine name – the node name changed from ubuntu-... to Ubuntu-... – and I thought this was enough to throw kubelet off and be the reason it was failing. After changing the node name and rebooting, I quickly found that wasn’t the problem, but the key was a little bit higher up in the logs:
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: I1221 04:41:48.051840 1684 kubelet_node_status.go:72] Attempting to register node ubuntu-1810-cosmic-64-minimal
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.052907 1684 kubelet_node_status.go:94] Unable to register node "ubuntu-1810-cosmic-64-minimal" with API server: Unauthorized
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.132407 1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.232657 1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.247440 1684 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Unauthorized
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.332889 1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.433163 1684 kubelet.go:2267] node "ubuntu-1810-cosmic-64-minimal" not found
Dec 21 04:41:48 ubuntu-1810-cosmic-64-minimal kubelet[1684]: E1221 04:41:48.447444 1684 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Unauthorized
Clearly the real problem is the Unable to register node "...." message – basically the same problem that I’d had at the very outset of this debacle. At this point it looked like I might have to do some of the etcd fixes from the resources linked earlier, but upon closer inspection those fixes were not really for “replacing” etcd data per se, so I gave up on that plan (also, I didn’t want to upset anything unnecessarily inside the etcd cluster for fear of corrupting state).
After lots of head scratching, I realized that the certs must not have been properly fixed, and some resources helped me towards this realization.
So basically, no matter how much kubeadm’s cert commands had helped me rotate the certs, the problem was that the API server itself was not accepting requests from the node (the kubelet instance). So rotating the certificates on disk clearly did not actually reach the API server (and at this point I couldn’t think of anywhere else that contained secret information like this outside of /etc/kubernetes).
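For orientation, on a kubeadm cluster that secret material is the contents of /etc/kubernetes/pki (plus the *.conf kubeconfigs one directory up) – roughly this standard set of files:
root@ubuntu-1810-cosmic-64-minimal ~ # ls -1 /etc/kubernetes/pki
apiserver.crt
apiserver.key
apiserver-etcd-client.crt
apiserver-etcd-client.key
apiserver-kubelet-client.crt
apiserver-kubelet-client.key
ca.crt
ca.key
etcd
front-proxy-ca.crt
front-proxy-ca.key
front-proxy-client.crt
front-proxy-client.key
sa.key
sa.pub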
So now that kubelet is running (and erroring), it’s also running at least some of the containers that need to run – in particular the kube-apiserver itself! I can investigate what’s happening with the API server by attaching myself to the container and reading the logs. Normally I’d do this with kubectl, but since the API server isn’t allowing access, I have to go in through the container runtime itself. I happen to run (and am very happy with) containerd, which is managed with crictl, so I needed to do a little bit more than just run docker ps:
root@ubuntu-1810-cosmic-64-minimal ~ # crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps
The output of that command (ps) is the list of containers that were running, as one might expect:
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT POD ID
e56889fe6d0fd 201c7a8403125 25 minutes ago Running kube-apiserver 1 946fe70f25e63
5554ada7512ff 2d3813851e874 26 minutes ago Running kube-scheduler 7 ca6192eccc282
10b0055143639 8328bb49b6529 26 minutes ago Running kube-controller-manager 7 58cbb0d0dd566
46c794d57a713 2c4adeb21b4ff About an hour ago Running etcd 4 d3332a08acb21
As you can see, there are some containers running – the kube-apiserver, scheduler, controller-manager and etcd are actually all running – but clearly they’re having some issues… Let’s dig into it:
# crictl --config=/root/.crictl.yaml logs e56889fe6d0fd # this is a different ID but pretend it's the same as the apiserver above
<lots of output that zoomed by super fast>
I wanted to read the output from the top, so that’s where I brought in the super useful unix utility less:
# crictl --config=/root/.crictl.yaml logs e56889fe6d0fd 2>&1 | less
(NOTE: stderr doesn’t normally go to stdout, which is what less reads, so you have to redirect it)
This led me to what seemed like the smoking gun:
E1221 05:41:36.590187 1 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/<ip of my machine>, ResourceVersion: 0, AdditionalErrorMsg:
E1221 05:41:36.625877 1 authentication.go:65] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
And cue the confusion – the problem is that the self-signed certificate that is being used by every fucking piece of kubernetes is not being accepted by the API server??? Here’s my thought process:
- The regenerated certs all live in /etc/kubernetes
- kube-apiserver is started from a static manifest (/etc/kubernetes/manifests) and it seems to only be looking in /etc/kubernetes
- So could kube-apiserver and kubelet somehow be using different certs?
If we’ve replaced the certificates in /etc/kubernetes, what else could there possibly be? Well, eventually I found the manual certificate renewal guide, which was a great step in the right direction. Then I inspected the kubelet config file documentation, which was a step backwards but a good resource to have for the future. I expected to find something in there that might point to a setting explaining how the kubelet was getting a different cert, but only the CAFile was specified (and it was the same one I expected).
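(The setting in question – in the kubeadm-generated kubelet config, which lives at /var/lib/kubelet/config.yaml by default – just points at the cluster CA:)
root@ubuntu-1810-cosmic-64-minimal ~ # grep -A1 x509 /var/lib/kubelet/config.yaml
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt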
After a ton of looking around I ran into a message that mentioned a file that was supposed to be present at kubelet startup: /var/lib/kubelet/pki/kubelet-client-current.pem. That’s when it hit me – I had totally forgotten about the /var/lib/kubelet area for storing information. Evidently:
root@ubuntu-1810-cosmic-64-minimal ~ # ls -la /var/lib/kubelet/pki/
total 28
drwxr-xr-x 2 root root 4096 Sep 11 19:39 .
drwxr-xr-x 9 root root 4096 Dec 21 07:19 ..
-rw------- 1 root root 2794 Dec 9 2018 kubelet-client-2018-12-09-12-38-41.pem
-rw------- 1 root root 1143 Dec 9 2018 kubelet-client-2018-12-09-12-39-12.pem
-rw------- 1 root root 1143 Sep 11 19:39 kubelet-client-2019-09-11-19-39-37.pem
lrwxrwxrwx 1 root root 59 Sep 11 19:39 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2019-09-11-19-39-37.pem
-rw-r--r-- 1 root root 2307 Dec 9 2018 kubelet.crt
-rw------- 1 root root 1679 Dec 9 2018 kubelet.key
The file listing of the folder makes it really clear – look at how old kubelet-client-current.pem is!
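A quick openssl check against that file (it contains both the cert and the key, but openssl picks out the certificate block) shows exactly which CA issued it and when it expires:
root@ubuntu-1810-cosmic-64-minimal ~ # openssl x509 -noout -subject -issuer -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem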
So the kubelet was actually using the entirely wrong cert to connect as a client to the API server, not the other way around. A very helpful kubeadm issue also drew me to the usage of this file – it looks like /etc/kubernetes/pki/apiserver-kubelet-client is the client cert that the API server presents when it talks to the kubelet, and /var/lib/kubelet/pki/kubelet-client*.pem is the client cert that the kubelet presents to the API server. One more very helpful SO post also helped me realize this.
So now that I knew what the real issue was (also corroborated by a GitHub issue), the real solution was simply to delete kubelet-client-current.pem and restart the kubelet.
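Concretely, something along these lines does the trick (I’d suggest moving the file aside rather than deleting it outright, just in case):
root@ubuntu-1810-cosmic-64-minimal ~ # mv /var/lib/kubelet/pki/kubelet-client-current.pem /var/lib/kubelet/pki/kubelet-client-current.pem.bak
root@ubuntu-1810-cosmic-64-minimal ~ # systemctl restart kubelet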
After restarting the kubelet, I was greeted with this output:
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: W1221 07:48:35.105958 16856 util_unix.go:103] Using "/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///run/containerd/containerd.sock".
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.105976 16856 remote_image.go:50] parsed scheme: ""
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.105985 16856 remote_image.go:50] scheme "" not registered, fallback to default scheme
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.106002 16856 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock 0 <nil>}] <nil>}
Dec 21 07:48:35 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:35.106010 16856 clientconn.go:577] ClientConn switching balancer to "pick_first"
Dec 21 07:48:45 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:45.097044 16856 transport.go:132] certificate rotation detected, shutting down client connections to start using new credentials
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: E1221 07:48:55.394963 16856 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.395963 16856 kuberuntime_manager.go:207] Container runtime containerd initialized, version: v1.2.7, apiVersion: v1alpha2
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.396870 16856 server.go:1065] Started kubelet
Dec 21 07:48:55 ubuntu-1810-cosmic-64-minimal kubelet[16856]: I1221 07:48:55.397061 16856 server.go:145] Starting to listen on 0.0.0.0:10250
And just like that, no more authentication errors! The cluster is working now, right? NOPE. With this impromptu rotation of all the cluster secrets, there’s a small problem briefly discussed in the CloudBlue knowledgebase article – the service accounts for the cluster are borked all over the place – important services like kube-router and coredns cannot pull from the API because they have old secrets with old credentials.
At this point kubelet is starting up and running the containers that etcd (via the API server) tells it should be running, but a bunch of them won’t run correctly because they’re relying on old, incorrect credentials to connect to the API server.
kube-router
At first glance, the pods in kube-system look to be functioning correctly:
root@ubuntu-1810-cosmic-64-minimal /var/lib/kubelet/pki # k get pods -n kube-system
NAME READY STATUS RESTARTS AGE
cert-manager-7648c6f789-79wwv 1/1 Running 0 171m
coredns-584795fc57-8t775 0/1 Running 5 5m50s
coredns-584795fc57-9d92w 0/1 Running 3 3m
etcd-ubuntu-1810-cosmic-64-minimal 1/1 Running 6 49d
kube-apiserver-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 29m
kube-controller-manager-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 27m
kube-router-5pnrk 1/1 Running 7 15m
kube-scheduler-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 29m
But after a few seconds I realized how wrong things were – coredns, kube-router, and kube-proxy (which shouldn’t have even been present at all) were having issues and continuously restarting. Thanks to the CloudBlue KB article I realized that I had to re-create the service accounts (and possibly do other things) to get these applications onto the new certs.
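A more general version of that fix, which I’ve seen used for exactly this situation (a sketch – not the exact commands I ran), is to delete the stale service-account token secrets so the controller-manager re-mints them with the cluster’s current keys, and then bounce the pods that had mounted them:
$ kubectl -n kube-system get secrets | grep service-account-token | awk '{print $1}' | xargs kubectl -n kube-system delete secret
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns    # the coredns pods; label from the stock kubeadm deployment, adjust per component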
Unfortunately it wasn’t as simple as running kubectl delete sa kube-router -n kube-system, because there were more broken systems – that was enough to fix kube-router, but there’s one more pretty big, important piece of routing in a k8s cluster – CoreDNS.
As I noticed coredns was failing, I looked at the logs:
$ k logs coredns-584795fc57-8t775 -n kube-system
.:53
2019-12-21T08:18:09.223Z [INFO] CoreDNS-1.3.1
2019-12-21T08:18:09.223Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-12-21T08:18:09.223Z [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2019-12-21T08:18:29.224Z [ERROR] plugin/errors: 2 7657599066091208900.4473028311675897983. HINFO: unreachable backend: read udp 10.244.0.188:39977->213.133.99.99:53: i/o timeout
2019-12-21T08:18:32.224Z [ERROR] plugin/errors: 2 7657599066091208900.4473028311675897983. HINFO: unreachable backend: read udp 10.244.0.188:49238->213.133.99.99:53: i/o timeout
E1221 08:18:34.223023 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1221 08:18:34.223023 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-584795fc57-8t775.unknownuser.log.ERROR.20191221-081834.1: no such file or directory
Well, some of those errors I have no idea about, but it looks like coredns is having some problems contacting something at 10.96.0.1 (weirdly enough, no authentication issues, just a straight timeout). I’m assuming this is the API server but I’m not sure, because 10.96.0.1 is not where I expect the API server to be accessible based on how the cluster is set up (it’s not one of the cluster ranges, for services or anything)… Let’s do some digging:
$ k get pods -n kube-system -o=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager-7648c6f789-79wwv 1/1 Running 0 174m 10.244.0.182 ubuntu-1810-cosmic-64-minimal <none> <none>
coredns-584795fc57-8t775 0/1 CrashLoopBackOff 8 9m12s 10.244.0.188 ubuntu-1810-cosmic-64-minimal <none> <none>
coredns-584795fc57-9d92w 0/1 Error 6 6m22s 10.244.0.190 ubuntu-1810-cosmic-64-minimal <none> <none>
etcd-ubuntu-1810-cosmic-64-minimal 1/1 Running 6 49d <server ip> ubuntu-1810-cosmic-64-minimal <none> <none>
kube-apiserver-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 32m <server ip> ubuntu-1810-cosmic-64-minimal <none> <none>
kube-controller-manager-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 31m <server ip> ubuntu-1810-cosmic-64-minimal <none> <none>
kube-router-5pnrk 0/1 Error 9 19m <server ip> ubuntu-1810-cosmic-64-minimal <none> <none>
kube-scheduler-ubuntu-1810-cosmic-64-minimal 1/1 Running 27 32m <server ip> ubuntu-1810-cosmic-64-minimal <none> <none>
Well, what about the services in kube-system?
$ k get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 376d
Only the one, which seems right, but more importantly there’s that IP range I didn’t recognize – the DNS service is listening at 10.96.0.10. It doesn’t seem so impossible now that the API server would be running at 10.96.0.1; though I wouldn’t have expected it there, I do know that the API server generally runs at an xxx.xxx.xxx.1 IP. At this point I thought that maybe kube-proxy having been set up (when I actually run kube-router) might have had something to do with this, so I figured I’d just do a restart in case any iptables shenanigans were happening and going badly.
Restarting didn’t solve all my problems, as usual, so it was time to dig into the DNS issues more. Remember that note earlier about systemd’s special /etc/resolv.conf settings? Welp, this might be the time to point out an error related to coredns that was showing up a lot:
Dec 21 08:32:59 ubuntu-1810-cosmic-64-minimal kubelet[18317]: E1221 08:32:59.027711 18317 dns.go:135] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 213.133.98.98 213.133.99.99 213.133.100.100
This seems pretty related, but I was completely stumped at this point – putting the --resolv-conf argument back (pointing at the special systemd resolv.conf) didn’t work, and nothing seemed to get through the DNS.
At this point I’d been debugging for ~4-6 hours straight (holy downtime, batman) and it was time to take a break. So I unwound with some of my favorite frustration, Counter-Strike: Global Offensive. After I was sufficiently frustrated with my inability to consistently BOOM HEADSHOT, I came back to the problem and luckily found the solution really quickly – time away is often a great help to problem solving.
kube-router’s private kubeconfig
It turns out kube-router has its own kubeconfig that it uses to access the cluster, and this needed to be updated as well! coredns isn’t going to work if the routing layer (which essentially performs the kube-proxy role) can’t access the API server! It was really easy to remedy the problem:
# cp /var/lib/kube-router/kubeconfig /var/lib/kube-router/kubeconfig.bak
# cp /etc/kubernetes/admin.conf /var/lib/kube-router/kubeconfig
With that done, all I needed to do was delete the old kube-router pod, and as soon as a new one spun up in its place the errors were gone – kube-router could properly route the cluster traffic, and that meant coredns could reach what it needed to (the API server) and do its job.
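That deletion was just a kubectl delete using the pod name from the earlier listing (the DaemonSet immediately spins up a replacement):
$ kubectl -n kube-system delete pod kube-router-5pnrk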
traefik and cert-manager
After the internal services seemed to be working fine, I tried to access some of my websites (like this blog) and noticed that I was getting to a page (yay), but that page had the wrong certificate! I use Traefik as my external-to-internal Ingress Controller, and I was getting the Traefik default certificate. jetstack/cert-manager is a fantastic project and is nearly bulletproof, so I didn’t figure the problem was the right certs not being there, but rather the ingress controller not being able to do its job somehow (probably not being able to talk to the API server).
Of course, a quick check of the logs for traefik confirmed this:
E1221 12:31:58.471754 1 reflector.go:205] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Service: Unauthorized
E1221 12:31:58.472603 1 reflector.go:205] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Endpoints: Unauthorized
I had to delete and re-make the service account for traefik and then delete the DS for the ingress controller itself – luckily my cluster is all checked into source control (I’m following the makeinfra pattern), so it was only a make away.
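In practice that was roughly the following (the service account, namespace, and DaemonSet names here are illustrative – mine come from my own manifests – and the last step is just my repo’s make target that re-applies them):
$ kubectl -n kube-system delete sa traefik-ingress-controller    # name illustrative
$ kubectl -n kube-system delete ds traefik-ingress-controller    # name illustrative
$ make traefik                                                   # re-apply the manifests from the cluster repo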
After getting traefik working, I also checked cert-manager just in case, and guess what I found:
E1221 12:43:42.150531 1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized
E1221 12:43:46.414256 1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized
E1221 12:43:49.415112 1 leaderelection.go:224] error retrieving resource lock kube-system/cert-manager-controller: Unauthorized
As you might expect, cert-manager also has to talk to the API server, and it was getting the same errors. In general what I had to do here was re-create the appropriate service accounts, then kill the related pods/daemonsets and re-apply them. The service account invalidation caused by the expiration means that every single namespace with service accounts that might have used the k8s API is in danger.
After roughly ~8-12 hours of downtime, I was finally serving traffic again! This was an absolutely wild ride but I was glad to put it behind me.
This is one of those things that it feels like you just have to beat with discipline – not letting a cert expire requires good automated rollover reminders. kubeadm actually has automatic cert renewal, but I don’t think it was enabled on the kubernetes version I was using previously. A systemd timer might help here, but it’d be easy to forget about if I ever tear down the cluster. I guess a regular calendar reminder is good enough for now – to at least check and ensure that the renewal will happen automatically next time.
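If I do go the timer route, a minimal sketch would be a oneshot service plus a monthly timer (the unit names here are made up, and the check is kubeadm alpha certs check-expiration on older releases vs kubeadm certs check-expiration on newer ones):
# /etc/systemd/system/k8s-cert-check.service
[Unit]
Description=Check Kubernetes control plane certificate expiration

[Service]
Type=oneshot
ExecStart=/usr/bin/kubeadm alpha certs check-expiration

# /etc/systemd/system/k8s-cert-check.timer
[Unit]
Description=Monthly Kubernetes certificate expiration check

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
Enable it with systemctl enable --now k8s-cert-check.timer – though the results only land in the journal, which still requires remembering to look, hence the calendar reminder.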
The ability to blow everything away and completely rebuild the system (in the immutable infrastructure sense) would have been great, but it’s really hard to take backups without access to the cluster – I’m using dynamically provisioned local volumes thanks to OpenEBS, and it’s more than a bit of a pain to start migrating the storage over.
Since newer versions of k8s don’t have this problem I don’t think I’ll have to worry about it again, but I guess I’m going to find out next year.
A number of the resources linked throughout this article (GitHub issues, Stack Overflow posts, and the CloudBlue knowledgebase article) helped me through this and might help you if you run into it.
I really, really want a way to do secure communication without dealing with the complexity of TLS. It’s such a source of complexity that I would greatly appreciate a way to secure communications using just pre-shared keys. Of course, TLS already has this via the rarely used TLS-PSK option, but it seems like there’s room for a minimal communication-encryption scheme that works given pre-shared symmetric/asymmetric keys – yet no one seems to have tried to build and standardize one. The closest possibly-trustable thing I’ve found is COSE (CBOR Object Signing and Encryption), but it actually doesn’t seem simple enough to warrant moving away from TLS.
If you know some convincing reasons why this isn’t a good idea, please send me an email – would it just turn into trying to re-build TLS? Is a simple pre-shared key + nonce not enough? That’s roughly the kind of simple scheme I’m imagining.
Maybe it would be enough to just do something crazy like require a WireGuard setup with unsecured HTTP/1/2/3 for deploying that piece of software? All I want is a way for two programs to talk to each other without requiring the complexity of x509 certificates. Somehow carting around priv/pub keys is more palatable to me – especially in the case where I don’t need the 3rd-party attestation for possibly unknown clients.
Well, this was certainly quite an experience for me. While I ended up mostly learning about kubeadm’s defaults for setting up clusters and some server configuration things, this is an outage I certainly wouldn’t want to recur.
Hopefully someone out there sees this just in the nick of time and gets an idea of how to recover their cluster! Or, in the best case, you have easy-to-migrate storage and can just completely blow away the cluster itself and avoid this entirely (though I think that would be hard if no one could talk to your control plane because its certs were expired).