Stale kubeconfig breaking service-to-service communication with kube-router


UPDATE (03/23/2021)

After trying to restart my cluster, I ran into issues from not having the --kubeconfig option. kube-router couldn’t find the API server, and CoreDNS couldn’t start because it couldn’t find anything, since kube-router was down. It’s a chicken-and-egg problem, and though this PR was supposed to solve it, it’s certainly still a problem. I’m not sure if this is because I’m on older versions of kube-router (1.0.1) and Kubernetes itself (1.17), but I got some nice unscheduled downtime after a machine restart.

UPDATE (07/26/2020)

In the reddit discussion, a better solution was suggested: using a hard-coded kubeconfig that references the pod-local mounted ServiceAccount files.
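
For reference, here’s a sketch of what such a kubeconfig could look like. The file paths are the standard in-pod ServiceAccount mount locations; the cluster, user, and context names are arbitrary placeholders:

apiVersion: v1
kind: Config
clusters:
- name: in-cluster
  cluster:
    # CA bundle mounted into every pod by default
    certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server: https://kubernetes.default.svc
users:
- name: kube-router
  user:
    # token for the pod's ServiceAccount, also mounted by default
    tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
contexts:
- name: in-cluster
  context:
    cluster: in-cluster
    user: kube-router
current-context: in-cluster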

An even better solution showed up after that, when murali-reddy hopped in to note that this was actually an upstream Kubernetes issue which has since been solved; kube-router could likely be updated to rely on the pod-mounted credentials without specifying a kubeconfig at all. I removed --kubeconfig from my DaemonSet and it worked great for me, so I can wholeheartedly recommend that solution.
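
For illustration, here’s roughly what the relevant bit of the DaemonSet looks like without the flag (the image tag and flag set here are assumptions based on kube-router’s example manifests; yours may differ):

containers:
- name: kube-router
  image: docker.io/cloudnativelabs/kube-router:v1.0.1
  args:
  - --run-router=true
  - --run-firewall=true
  - --run-service-proxy=true
  # no --kubeconfig flag: kube-router falls back to the in-cluster
  # credentials mounted from the pod's ServiceAccount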

tl;dr - If you’re running kube-router (I run version 1.0.1), make sure to update the kubeconfig it uses after credential rotations; otherwise, spooky pod->service (but not pod->pod) communication issues can occur.

Recently, while working on some unrelated issues, I discovered that the kubeconfig that kube-router uses can indeed go stale. I ran into some issues with service->service communication and, after a bunch of head scratching, root-caused the issue: the fix was to simply copy the newer kubeconfig over to the right directory on the host for kube-router to pick up. This wasn’t a really satisfying fix, but it certainly was enough to get me going again. I probably won’t be using kube-router for much longer, in favor of Cilium or Calico, so for now it was good enough.
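
In my case the fix amounted to a one-liner along these lines (both paths are assumptions: the source depends on where your cluster keeps a fresh kubeconfig, and the destination should match the hostPath your kube-router DaemonSet mounts):

# copy the freshly rotated kubeconfig to the path kube-router reads
sudo cp /etc/kubernetes/admin.conf /var/lib/kube-router/kubeconfig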

Debug process

Here’s the rough gist of what I did to figure out what was wrong:

  • Observe that pod->service communication wasn’t working
  • Check whether there are any NetworkPolicy objects involved (and blocking the requests)
  • kubectl exec into the container and do an nslookup of the service name (in the output you should see a <service>.<namespace>.svc.cluster.local entry for the service in question)
  • kubectl exec and curl the IP of a pod backing the service (an Endpoint of the Service) directly (you can get pod IPs via kubectl get pods -o wide; concrete command sketches follow after this list)
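
In command form, the checks look something like this (the pod, service, and namespace names and the pod IP are placeholders):

# check for NetworkPolicy objects that might be blocking the requests
kubectl get networkpolicy --all-namespaces

# resolve the service name from inside a pod
kubectl exec -it some-pod -- nslookup some-service.some-namespace.svc.cluster.local

# find the IP of a pod backing the service...
kubectl get pods -o wide

# ...then hit that pod directly to test pod->pod connectivity
kubectl exec -it some-pod -- curl http://10.244.1.23:8080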

If all of the above goes well, we know at this point that the problem is not DNS at least, and that pod->pod communication is working as we expect, so the problem lies elsewhere.

This is the point where I started to suspect that something with the CNI (kube-router) wasn’t working properly, so I took a look at the logs of my kube-router pod.
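
Pulling those logs looks something like the following (the k8s-app=kube-router label selector is an assumption; match whatever labels your DaemonSet actually uses):

kubectl logs --namespace kube-system --selector k8s-app=kube-router --tail=50

In my case, the output held the answer: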

E0107 19:06:25.099257       1 reflector.go:205] Failed to list *v1.NetworkPolicy: Unauthorized
I0107 19:06:25.099907       1 reflector.go:240] Listing and watching *v1.Pod from
E0107 19:06:25.100317       1 reflector.go:205] Failed to list *v1.Pod: Unauthorized
I0107 19:06:25.101082       1 reflector.go:240] Listing and watching *v1.Namespace from
E0107 19:06:25.101501       1 reflector.go:205] Failed to list *v1.Namespace: Unauthorized
I0107 19:06:26.096717       1 reflector.go:240] Listing and watching *v1.Endpoints from
E0107 19:06:26.097421       1 reflector.go:205] Failed to list *v1.Endpoints: Unauthorized

And there it is: one of the most common problems I run into, something trying to talk to the Kubernetes API and not being able to authenticate. Of course, I didn’t believe these messages immediately (“Why in the world would it be Unauthorized? kube-router definitely has credentials”), so I started going back through my configuration for the kube-router DaemonSet (ConfigMaps, etc.).

Then it dawned on me: “how does kube-router get its credentials again?…” Does it use a ServiceAccount? The serviceAccount seemed to be set correctly, but there was one thing I hadn’t considered: it turns out kube-router uses a kubeconfig. This is where I found the actual problem. I have a hostPath entry that points to a file on disk, and after the credentials were rotated (during a recent cluster upgrade/change), that file became stale, so I updated it, and everything started working again.
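
For context, the stale-file setup in my DaemonSet looked roughly like this (the /var/lib/kube-router/kubeconfig path is an assumption based on common kube-router manifests; yours may differ):

containers:
- name: kube-router
  args:
  - --kubeconfig=/var/lib/kube-router/kubeconfig
  volumeMounts:
  - name: kubeconfig
    mountPath: /var/lib/kube-router/kubeconfig
    readOnly: true
volumes:
- name: kubeconfig
  hostPath:
    # this file on the host is what went stale after credential rotation
    path: /var/lib/kube-router/kubeconfig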

I don’t particularly like how manual the solution was (copying a file over by hand), but since I plan on moving to a different CNI in the near future, it’s good enough for me, for now.