UPDATE (03/23/2021)

After trying to restart my cluster I ran into issues caused by not having the --kubeconfig option. kube-router couldn't reach 10.96.0.1 (the API server) and CoreDNS couldn't start because it couldn't find anything (since kube-router was down). It's a chicken-and-egg problem, and though this PR was supposed to solve it, it's certainly still a problem. I'm not sure if this is because I'm on older versions of kube-router (1.0.1) and Kubernetes itself (1.17), but I got some nice unscheduled downtime after a machine restart.

UPDATE (07/26/2020)

In the reddit discussion, there was a better solution suggested -- using a hard-coded kubeconfig that references the mounted pod-local service account file.
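To make that suggestion concrete, here is a minimal sketch of what such a kubeconfig could look like, written out by a small shell helper. The helper name write_sa_kubeconfig, the cluster/user/context names, and the API server address are all assumptions for illustration; the only fixed parts are the standard paths where Kubernetes mounts the service account token and CA certificate inside a pod.

```shell
# write_sa_kubeconfig API_SERVER OUT_FILE
# Hypothetical helper: writes a kubeconfig that points at the pod-mounted
# service account files instead of embedding credentials that can go stale.
write_sa_kubeconfig() {
  api_server="$1"   # assumption: something like https://<control-plane-host>:6443
  out_file="$2"
  # Standard in-pod mount location for service account credentials.
  sa_dir=/var/run/secrets/kubernetes.io/serviceaccount
  cat > "$out_file" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: cluster
  cluster:
    certificate-authority: ${sa_dir}/ca.crt
    server: ${api_server}
users:
- name: kube-router
  user:
    tokenFile: ${sa_dir}/token
contexts:
- name: kube-router-context
  context:
    cluster: cluster
    user: kube-router
current-context: kube-router-context
EOF
}
```

Because the token is referenced by path (tokenFile) rather than copied into the file, the kubeconfig itself never needs to change when credentials are rotated. In practice the generated content would likely live in a ConfigMap mounted into the kube-router pod rather than in a file on the host.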

An even better solution showed up after that, when murali-reddy hopped in to note that this was actually an upstream Kubernetes issue which has since been solved, so it was likely that kube-router could be updated to rely on the pod-mounted credentials without specifying a kubeconfig at all. I removed --kubeconfig from my DaemonSet and it worked great for me, so I can whole-heartedly recommend the solution.

tl;dr - If you’re running kube-router (I run version 1.0.1), make sure to update the kubeconfig that is being used by it after credential rotations, otherwise spooky pod->service (but not pod->pod) communication issues could occur.

Recently, while working on some unrelated issues, I discovered that the kubeconfig that kube-router uses can indeed go stale. I ran into some issues with service->service communication and root-caused the issue after a bunch of head scratching – the fix was to simply copy the newer kubeconfig over to the right directory on the host for kube-router to pick up. This wasn’t a really satisfying fix, but it certainly was enough to get me going again. I probably won’t be using kube-router for much longer (in favor of Cilium or Calico), so for now it was good enough.

Debug process

Here’s the rough gist of what I did to figure out what was wrong:

Observe that pod->service communication isn’t working
Check if there are any NetworkPolicy objects involved (and blocking the requests)
kubectl exec into the container and do an nslookup of the service name (in the output you should see a <service>.<namespace>.svc.cluster.local entry for the service in question)
kubectl exec and curl the IP of the pod backing the service (an Endpoint of the Service) directly (you can get pod IPs via kubectl get pods -o wide)
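The checks above can be sketched as a small shell function. The function name, its arguments, and the plain-HTTP port are hypothetical, and it assumes nslookup and curl are available inside the client container, which won't be true for every image.

```shell
# Allow swapping the binary (e.g. for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# check_pod_to_service NAMESPACE CLIENT_POD SERVICE BACKING_POD_IP PORT
# Hypothetical helper that runs the debugging steps in order.
check_pod_to_service() {
  ns="$1"; client_pod="$2"; svc="$3"; pod_ip="$4"; port="$5"

  # Any NetworkPolicy objects that could be blocking the requests?
  "$KUBECTL" get networkpolicy -n "$ns"

  # Does the service name resolve from inside the client pod?
  "$KUBECTL" exec -n "$ns" "$client_pod" -- nslookup "$svc.$ns.svc.cluster.local"

  # List pod IPs so the backing pod's address can be double-checked.
  "$KUBECTL" get pods -n "$ns" -o wide

  # Hit the backing pod directly, bypassing the service (5 second timeout).
  "$KUBECTL" exec -n "$ns" "$client_pod" -- curl -sS -m 5 "http://$pod_ip:$port/"
}
```

Running it with KUBECTL=echo prints the commands instead of executing them, which is a cheap way to sanity check the arguments before pointing it at a real cluster.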

If all of the above goes well, we know at this point that the problem is not DNS, and that pod->pod communication is working as we expect, so the problem lies elsewhere.

This is the point where I started to suspect that something with the CNI (kube-router) wasn’t working properly, so I took a look at the logs of my kube-router pod and found the answer:

E0107 19:06:25.099257       1 reflector.go:205] github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.NetworkPolicy: Unauthorized
I0107 19:06:25.099907       1 reflector.go:240] Listing and watching *v1.Pod from github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73
E0107 19:06:25.100317       1 reflector.go:205] github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Pod: Unauthorized
I0107 19:06:25.101082       1 reflector.go:240] Listing and watching *v1.Namespace from github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73
E0107 19:06:25.101501       1 reflector.go:205] github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Namespace: Unauthorized
I0107 19:06:26.096717       1 reflector.go:240] Listing and watching *v1.Endpoints from github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73
E0107 19:06:26.097421       1 reflector.go:205] github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Endpoints: Unauthorized
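When the log is long, it can help to boil it down to just the resource kinds that are being rejected. The filter below is a rough sketch that assumes the log lines have the same "Failed to list *v1.<Kind>: Unauthorized" shape as the ones above; the function name is made up.

```shell
# unauthorized_resources: read kube-router log lines on stdin and print the
# unique resource kinds whose list calls came back as Unauthorized.
unauthorized_resources() {
  sed -n 's/.*Failed to list \*v1\.\([A-Za-z]*\): Unauthorized.*/\1/p' | sort -u
}
```

It would be fed from something like `kubectl logs -n <namespace> <kube-router-pod> | unauthorized_resources`, with the namespace and pod name filled in for your cluster. If every kind kube-router watches shows up, the problem is almost certainly the credentials themselves rather than a missing RBAC rule for one resource.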

And there it is – one of the most common problems I run into: something trying to talk to the Kubernetes API and not being able to authenticate. Of course, I didn’t believe these messages immediately – “Why in the world would it be Unauthorized? kube-router definitely has credentials” – so I started going back and looking through my configuration for the kube-router DaemonSet (ConfigMaps, etc).

Then it dawned on me – “how does kube-router get its credentials again?…” – does it use a ServiceAccount? The serviceAccount seemed to be set correctly, but there was one thing I hadn’t considered – it turns out kube-router uses a kubeconfig. This is where I found the actual problem: I have a hostPath entry that points to a file on disk, and after the credentials were rotated (during a recent cluster upgrade/change), that file became stale. I updated it, and everything started working again.
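The manual fix boils down to "compare the two files, copy if they differ". A minimal sketch of that, with a hypothetical function name and with both paths left as arguments since they depend entirely on how the cluster was set up:

```shell
# refresh_kubeconfig FRESH TARGET
# Hypothetical helper: copies FRESH over TARGET (the file behind the hostPath
# mount) only if the contents differ or TARGET is missing. Prints what it did.
refresh_kubeconfig() {
  fresh="$1"; target="$2"
  if cmp -s "$fresh" "$target"; then
    echo "up to date"
  else
    cp "$fresh" "$target" && echo "updated"
  fi
}
```

This would need to run on every node that hosts a kube-router pod, and depending on the setup the pods may need a restart before they pick up the new file; that part is an assumption, not something I verified.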

I don’t particularly like how manual the solution was (copying over a file), but since I plan on moving to a different CNI in the near future it’s good enough for me, for now.

VADOSWARE

Living in a yak shaver's paradise.

Stale kubeconfig breaking service to service communication with kube-router
