tl;dr - Check out Kubernetes features like `PodSecurityPolicy` and `NetworkPolicy`. There are also fantastic, fun, analogy-laden talks from KubeCon 2017 (Austin) and KubeCon 2018 (Copenhagen). CIS standards for Kubernetes clusters exist, and companies like Aqua produce tools like `kube-bench` that let you test your clusters against the CIS benchmarks. It’s also important to remember to secure the machine as well as the Kubernetes cluster – so the usual Unix server administration advice applies.
While Kubernetes’s various setup methods (`kubeadm`, `kops`, `kubespray`) are doing all they can to create secure-by-construction clusters, Kubernetes is an ever-evolving platform and cluster operators must do their part to keep it as safe as possible.
This is by no means a definitive guide, but rather a stepladder of features (similar to the “operational intelligence” concept I discussed a while ago) that I think good security conscious cluster operators will have considered.
There are some amazing talks on Kubernetes security given by people smarter/more experienced than me, and you should listen to what they have to say:
In addition to this, you’ll obviously need the Kubernetes documentation and a firm grasp of the concepts, how they interconnect, and which pieces make up the Kubernetes platform.
What this looks like depends heavily on which operating system you’re using to run your cluster, but the idea here is to secure the underlying operating system as much as you can.
One corner you can cut while actually improving security is to use a minimal (and if possible security-focused) distribution like Container Linux or maybe even Alpine Linux. For a lot of reasons (one of the biggest being `kubeadm` support), you might choose Ubuntu/Debian/Fedora instead, which are fine too of course – the attack surface is a bit bigger, but they can also be hardened. There are lots of guides on hardening Ubuntu; at the very least, skim through a few of those articles to get an idea of what’s involved.
It should be obvious, but no matter which OS you use, ensure that password-based SSH authentication is disabled ASAP. Just about the only port you should need open at the beginning (before you set up a cluster) is SSH’s default port 22. I’m not a big believer in changing the default port for SSH (it smacks of security-by-obscurity), but you could do that as well, moving it to something like 2223. Usually within seconds of bringing a machine online, you’ll have bad actors probing the server to figure out what kind of software it’s running and whether it’s vulnerable-by-default.
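As a sketch of what disabling password auth looks like (the `sed` invocations assume a stock OpenSSH `sshd_config` layout; here they run against a throwaway copy rather than the real file):

```shell
# Sketch: turn off password auth and root login in an sshd_config-style file.
# Editing a throwaway copy here; on a real host you'd point this at
# /etc/ssh/sshd_config (as root) and then reload sshd.
cfg=$(mktemp)
printf '%s\n' '#PasswordAuthentication yes' 'PermitRootLogin yes' > "$cfg"
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' "$cfg"
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' "$cfg"
cat "$cfg"
```

Make sure you have working key-based auth in a second session before reloading `sshd`, or you can lock yourself out.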
Ensure your system receives proper security updates. Ideally this should be automated, but I generally just login and keep the system updated every once in a while (which obviously doesn’t scale, but I’m not managing a large amount of servers just yet).
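On Debian/Ubuntu, for example, one way to automate this is the stock unattended-upgrades mechanism (package name and config path below are the Debian defaults):

```shell
# Requires root; Debian/Ubuntu-specific.
sudo apt-get install -y unattended-upgrades
# Enable it by ensuring /etc/apt/apt.conf.d/20auto-upgrades contains:
#   APT::Periodic::Update-Package-Lists "1";
#   APT::Periodic::Unattended-Upgrade "1";
sudo dpkg-reconfigure -plow unattended-upgrades
```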
Ensure your TLS certs are set up properly. Kubernetes by default performs all its communication over TLS-protected channels, which is fantastic, but if you’re setting them up manually make sure you don’t mess up the steps (likely things won’t run if you do).
Now that you’ve got your (hopefully) secure setup going, it’s time to do what you can inside the ecosystem to ensure the cluster itself stays secure.
Keep Kubernetes up to date. Kubernetes moves fast, and unfortunately this often means going through messy upgrades. Most articles I’ve seen espouse migrating one machine at a time, with commands like `kubectl drain`, bringing new nodes into the cluster one at a time. Tools like `kops upgrade`/`kubeadm upgrade` offer a nice upgrade path and are easy to use, but otherwise this can be very painful. There’s of course the Kubernetes documentation on how to do upgrades.
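The drain-based flow looks roughly like this (the node name is a placeholder, and the flags are the ones current around the v1.10 era):

```shell
# Cordon the node and evict its workloads onto other nodes
kubectl drain node-1 --ignore-daemonsets --delete-local-data
# ...upgrade the node's kubelet/kubeadm packages, or replace the machine...
# Then let it accept workloads again (or join its replacement)
kubectl uncordon node-1
```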
Use Kubernetes’ RBAC authorization system, and heavily restrict permissions for different users/accounts. While this might be harder/impossible if you’re on an old version of Kubernetes, if you’ve upgraded past ~v1.8 where RBAC reached GA (Generally Available) status, you should be able to enable it easily (if you’re on a version where it isn’t the default already).
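For example, here’s a namespaced Role that only allows reading pods, bound to a single service account (the names and namespace are made up for illustration):

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: team-a
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: team-a
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```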
Use `NetworkPolicy` to restrict intercommunication inside the cluster. It’s always a good idea to practice “defense in depth”, which is just a fancy way to say “have multiple layers of defense” – assume your perimeter will be breached. It might also make sense to restrict access to platform-specific services (ex. AWS’s EC2 Metadata Service). This is often pretty easy if you just set up a deny-all rule in all your namespaces on the cluster, making communication-enablement explicit.
Here’s an example of a `NetworkPolicy` I use with my canal (Calico + Flannel) enabled cluster to deny all traffic in a namespace:
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: <your namespace here>
spec:
  podSelector:
    matchLabels: {}
  policyTypes:
  - Ingress
  - Egress
```
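For the platform-metadata case mentioned above, egress rules can also carve out the metadata endpoint specifically (a sketch; this requires a CNI plugin that enforces egress policies, and the namespace placeholder/CIDRs are illustrative – 169.254.169.254 is AWS’s metadata address):

```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-egress
  namespace: <your namespace here>
spec:
  podSelector:
    matchLabels: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32
```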
Use `PodSecurityPolicy` to restrict what pods can do. There are a lot of tweakable options you can employ to allow `Pod`s to perform certain actions and/or access system resources.
Configure `securityContext` for your pods.
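Here’s a `securityContext` sketch showing a few commonly recommended restrictions (the pod name and image are placeholders):

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
spec:
  containers:
  - name: app
    image: <your-registry>/<your-app>:<tag>
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```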
Limit access to your internal API server. Obviously, you’re going to want to make sure that not every pod can access the internal API server – otherwise an attacker can attack it directly after compromising some external-facing service you were running.
Ensure the `default` service token isn’t used. You can prevent the mounting of the default access token by setting `automountServiceAccountToken: false` in your Pod configurations (or using an `AdmissionController` to add it all the time).
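In manifest form that’s a single field on the pod spec (pod name and image are placeholders; the field can also be set on the ServiceAccount itself):

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: no-token-example
spec:
  automountServiceAccountToken: false
  containers:
  - name: app
    image: <your-registry>/<your-app>:<tag>
```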
Use/create advanced `AdmissionController`s. `AdmissionController`s give you the ability to solve a piece of the security puzzle by integrating with the platform itself; check out the documentation for more information.
Use a tool like Notary (with IBM’s Portieris) to ensure your container pipeline isn’t producing compromised containers/data. An easier way to achieve a similar result might be to write a small custom `AdmissionController` that only whitelists certain containers, and maybe hook it up to your registry.
Consider using alternate runtimes for sketchy workloads. Sometimes it might make sense to run potentially dangerous workloads (from services that might be quite dynamic) in more isolation than normal. Depending on which container engine you’re using, there are many options:
- `containerd`’s untrusted workloads, made available in 1.1.0
- `frakti`, a delegating layer
- `kata-containers`, the combination of Intel’s Clear Containers and Hyper’s runV projects
- `virtlet`, a CRI-compliant virtual machine runtime
- `kubevirt`, an Operator that creates and manages VMs

Consider enabling mutual TLS authentication between your services. This can be very complicated to set up, but you do have some options: you can use sidecar proxies like Envoy, bigger proxies like Linkerd, or tools that tie everything together like Istio (make sure to take a look at the SPIFFE identity standard & SPIRE). While this might seem like a LOT of trouble for very little value, it can make all the difference depending on the industry you serve (for example if HIPAA compliance is mandatory), and in the case where an attacker is already inside the cluster.
This, IMO, is pretty high up in the tree (far from the low hanging fruit), but is actually really easy to get to, thanks to automation.
Take a look at the CIS benchmark for Kubernetes. Unfortunately it’s not as straightforward as it could be to download the latest benchmark from their site, but hopefully you can figure it out. There’s a lot of useful information in those documents.
Run advanced toolsets that implement the CIS benchmark. There are a bunch of tools that actually run the CIS benchmark against your cluster:
Take a look at advanced tools made by security companies like Aqua or Twistlock. If your cluster is important enough, it may make sense to pay some of these companies. I’m not at all associated with them but have just seen them present and contribute to Open Source security efforts in the past so figured I’d note them here.
Run simple tools like `nmap` to ensure that your `Ingress`es/`Service`s are not exposing more than they should. Some mis-entered/copy-pasted configuration might be exposing your Prometheus instance to the world rather than just on a local network.
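Something as simple as the following, run from outside the cluster’s network, can catch accidental exposure (the host is a placeholder; 30000-32767 is the default `NodePort` range and 9090 Prometheus’s default port):

```shell
# Scan SSH, HTTP(S), the API server port, and the NodePort range
nmap -p 22,80,443,6443,30000-32767 <node-public-ip>
# Or check a single suspect port
nmap -p 9090 <node-public-ip>
```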
Do everything you can to protect your clusters from human error, and attacks from outside your cluster as well. This can mean so much that I consider it level 4 – and I’m not sure it ever actually ends. I’m using this as a catch-all, but some examples:
There’s always more to do. Becoming secure and staying secure is a moving target. Sorry :(
Hopefully you’ve enjoyed some of these pointers. I do ask that you take this post with a grain of salt – while I am confident what I’ve written is accurate to the best of my ability, I don’t run any production safety-critical systems, and I think you should reserve most of your without-salt consumption for people who can give more concrete experience/case-studies.