tl;dr - I kinda succeeded in getting simplistic VM-level isolation working on a container linux powered Kubernetes cluster, with lots of failures along the way. This post is cobbled-together notes from the exploration stage, which ultimately led to an extremely hackish CoreOS VM powered by qemu running inside a privileged Kubernetes pod, running on top of a CoreOS dedicated machine. These notes are pretty old (I've actually already switched to Ubuntu server for my kubernetes cluster), but I figured it was worth editing and releasing them for anyone experimenting with CoreOS Container Linux or Flatcar Linux.
A reader of this article named Fred recently contacted me to let me know that he made an ansible playbook with tasks for manipulating cloud images. In particular the cloudinit-iso.yml
and templates
folder are the places to look for some inspiration. Thanks again to Fred for pointing this out; I wanted to pass it along to anyone who is interested in doing more proper/principled cloud image modification.
This blog post is part of a multi-part series:
kata-runtime
a fair shake

One of the great things about the Kubernetes ecosystem (and containerization in general) is the innovation it's spurred in the container runtime space. The programs tasked with running containers ("container runtimes") have standardized, and new container runtimes have become more established. On the standardization front, Container <something> interfaces are pushing things forward – runtimes have the Container Runtime Interface to look to, networking has the Container Networking Interface, and storage solutions have the Container Storage Interface. It's probably worth repeating that containers are just sandboxed and isolated/restricted processes – BSD Jails, Solaris Zones, and Linux's LXC features are all ways to isolate processes, and Docker came along to put an approachable face and ergonomic CLI on all of it. Half of the time, depending on what you're trying to protect, containers don't even contain. As containers are sandboxed/isolated processes, the usual process security mechanisms apply – namely seccomp
and apparmor
, with some exciting new entries like gvisor
.
One of the most exciting things about CRI (and abstraction with interfaces in general) is the ability for different runtimes to provide different methods for the sandboxing that containers are supposed to provide. To get more concrete, the following projects offer containerization in VMs, rather than normal processes:
NOTE: While Frakti v1 actually focused on providing a shim over multiple runtimes, right now Frakti v2 is aiming to focus on being primarily a containerd plugin.
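To make that concrete, here is roughly what the containerd route looked like at the time: you point an untrusted_workload_runtime (in containerd's config.toml) at something like kata-runtime, then mark pods with an annotation so the CRI plugin routes them to it. The annotation name below is from memory, so treat this as a sketch rather than gospel:

apiVersion: v1
kind: Pod
metadata:
  name: untrusted-example
  annotations:
    # routes the pod to whatever runtime is configured under
    # [plugins.cri.containerd.untrusted_workload_runtime] in containerd's config.toml
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
    - name: app
      image: alpine
      command: ["sleep", "3600"]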
One of the many many projects I want to build is a service that makes it easy to spin up Kubernetes clusters on top of an existing Kubernetes cluster. Essentially this means building GKE, but with as much of the heavy lifting as possible done by the Kubernetes cluster itself. The smaller internal clusters would greatly benefit from being VM-level isolated, so I took some time to explore the untrusted workload space in Kubernetes and check out what my options were. I had some success, but mostly failed at the various things I wanted to try.
I first set out to survey the landscape and see what options are available to me, and here’s what I found:
LXD machine containers + a k8s Operator - Actually using distribution-level support for LXD machine containers, and creating a Kubernetes operator (that would likely need to be privileged, and have things like hostNetwork
and use hostPath
s) that could spin up LXD machine containers. Roughly the steps would look like this:
MiniCluster
, TenantCluster
, or even KubernetesCluster
lxd
machine images and sets them up initially to serve as kubernetes nodes

Runtime supported isolation - Using proper runtime support (for example containerd
’s untrusted workload options or frakti
+ kata-containers
) to run pods that are automatically run by VMs when the container runtime attempts to get them started.
QEMU-in-a-pod - Actually running qemu
in a pod. I actually didn’t know this was really viable until seeing a random dockerfile that proposed to be able to start qemu. This approach was interesting to me because VMs, of course, are processes – so it makes perfect sense that they’d be containerizable. This is likely the simplest approach conceptually; no additional runtimes needed, no lxd, just a kubernetes pod whose image
is set to one that does nothing but run qemu
. Resources given to the pod automatically become the VM's resources (so making a bigger VM isn't a matter of creating a configuration language to use with a CRD; I can just use the pod's facilities and make sure the VM uses everything).
NKube - I found a somewhat defunct-looking project called nkube that seems to handle my goals more directly. It’s great news that kubernetes-in-kubernetes has been done a lot because it made it easier to find resources:
A bunch of these approaches seem to rely heavily on docker-in-docker, and while I don’t actually use docker
as my container runtime (I use containerd
and am very happy with it) it was good to see people discuss it and learn what they went through.
After surveying the field, I figured I’d try first with Method 2 (runtime support) since it seemed the most official/correct way to get it done. Method 1 (lxc
+lxd
) was pretty attractive after watching an amazing talk on how it worked from the people behind SilphRoad but lxc
/lxd
didn’t statically compile at the time. I posted on LXD’s forum about it and also made an issue on LXC’s github about it. At the time of this post, lxc
now builds statically (the Github issue is now closed), but at the time I was exploring it was still unsupported – so basically trying to install lxc
+lxd
on container linux seemed like a death sentence. Method 3 (QEMU-in-a-pod) also looked really appealing and easy, but it didn't seem like the "right" way to do things, so I decided to give Method 2 (runtime support) a try first.
During my exploration, I tried to use runV on its own, as kata-containers
wasn’t quite ready for primetime yet and I briefly looked at the clear-containers
project but was put off that they didn’t have generic installation instructions. Along with runv
I’d of course need to install qemu
(and statically build it, because container linux is minimal), so I downloaded the qemu
source from the wiki. At first I incorrectly assumed that multi-arch’s binary static qemu
builds would be beneficial to me – they contain the user-mode stuff for qemu
but what I needed was the -system
binaries (as in qemu-system-x86_64
).
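For the record, the kind of build I was actually after looks roughly like this (a sketch from memory; configure flags and the dependency list vary by qemu version):

# inside an alpine container acting as a musl-based build environment
apk add --no-cache build-base python glib-dev glib-static zlib-dev pixman-dev
wget https://download.qemu.org/qemu-2.12.0.tar.xz && tar xf qemu-2.12.0.tar.xz && cd qemu-2.12.0
# only the full-system x86_64 emulator, statically linked
./configure --static --target-list=x86_64-softmmu --disable-gtk --disable-sdl
make -j"$(nproc)"
ls -lh x86_64-softmmu/qemu-system-x86_64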
After (wrongfully) thinking I had the qemu
stuff sorted, I went ahead trying to install runv
, which meant installing hyperstart
. Unfortunately there weren’t easy-to-find static binaries for it, and all the instructions were for other linux distributions (with no easy/generic build) - I needed to try and build it myself.
Long story short, attempting to build a completely static hyperstart
binary went absolutely terribly. My usual trick is to start with alpine linux in a container and try and work through the building instructions, fiddling with build switches to try and get things to build statically. That approach didn’t work at all. hyperstart
depends on a site.com/script | bash
type install and it basically just tells you that alpine wasn't supported, so I just threw my hands up. As I didn't want to attempt the static build on a different distribution (notably because of the lack of musl libc), I decided I'd try some of the other methods and maybe come back to runv
with proper runtime support another time. I did a cursory search on the internet for anyone that had gotten runv
/hyperstart
working on container linux and couldn’t find anything which kind of confirmed my hastily formed biases.
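For the curious, the "alpine as a throwaway build environment" trick is nothing fancy, just something like:

# throwaway musl-based build environment
docker run -it --rm -v "$PWD":/src -w /src alpine:3.7 sh
# inside the container:
apk add --no-cache build-base git autoconf automake libtool linux-headers
# ...then walk through the project's build instructions, sprinkling in static flags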
Successes: 0, Failures: 1.
This is the most difficult/code-intensive solution as it involves me writing an operator which likely must run in privileged
mode, and which will create, manage, and ultimately tear down lxc
powered machine containers on the node. I identified some gotchas up front:
After looking into it for a bit, it looked like this approach suffers from the same issues as Method 2 – I need to install lxc
and lxd
binaries on my container linux machine, and I can’t find anywhere else that’s done it successfully. At this point I started to wonder if I could mix this with Method 3, and run lxc
/lxd
from an OS that already has it (let’s say Ubuntu server), but from a privileged container with access to the host system.
Turns out that isn’t possible according to an SO post. The basic reasoning is that docker (and presumably containerd
) prevents syscalls and features that would be expressly necessary for something like LXD to work properly. The overwhelming majority of resources I could find were about running containers inside LXD and not the other way around. At this stage I really wasn't looking to fight any windmills, but I did want to give this approach a fair shake, and regardless I wanted to see if I could install lxc
/lxd
natively (as I supposedly can't just run it in a container) on container linux, so I chose to try and do that.
The lxd
Github repository was easy to access and pull down – so my first idea was to try and build it in Alpine Linux, with a healthy helping of compiler flags and musl libc. I had to hack a bit to even get started:
gettext-dev
alpine package. I needed to add some flags (export CGO_LDFLAGS="-lintl -L/usr/lib" && export CGO_GCCFLAGS="-I/usr/include"
). It took me a while to figure this out, but eventually I ran into a helpful github issue.
lxc-dev
alpine package is required (https://github.com/lxc/go-lxc/issues/44)
go
alpine package is obviously required

After going through the build, a quick check using file
revealed that I wasn’t doing it right and the generated binary @ $GOPATH/bin
was actually dynamically linked… Time to put some more elbow grease into making sure it’s static. It was surprisingly hard to find instructions on how to ensure static linking in Golang, as it had been a while since I did a golang project.
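The check itself is trivial, something like:

file "$GOPATH/bin/lxd"   # want "statically linked"; I was getting "dynamically linked"
ldd "$GOPATH/bin/lxd"    # a truly static binary has no dynamic dependencies to list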
After the refresher course on static building golang programs, I was primed to edit the Makefile
– I figured only one line really needed to change, the default
target:
CGO_ENABLED=0 GOOS=linux go install -a -ldflags '-extldflags "-static"' -v $(TAGS) -tags logdebug $(DEBUG) ./...
#go install -v $(TAGS) -tags logdebug $(DEBUG) ./...
As is tradition, this initial attempt failed; the build errored because of two packages in particular:
# github.com/lxc/lxd/shared
shared/archive_linux.go:79: undefined: DeviceTotalMemory
shared/network_linux.go:20: undefined: ExecReaderToChannel
# github.com/CanonicalLtd/go-sqlite3
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:18: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:26: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:27: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:29: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:35: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:36: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:44: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:49: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:54: undefined: SQLiteStmt
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:63: undefined: SQLiteStmt
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:36: too many errors
At this point I started to wonder if what I was doing was as reasonable as it seemed, so I decided to ask a question on the lxc
discussion forum, and wait until I got an answer to decide how to proceed. Trying to statically build go-sqlite3
and the shared stuff from lxc
seemed too much like trying to boil the ocean.
At this point since I was stuck, I got enticed by a different rabbit hole, trying to run lxc from inside a privileged container (instead of trying to build it statically to run from the system itself).
lxc
/lxd
from inside a privileged container

Since I was stuck on getting a static binary for lxc and lxd to use at the system level, I also briefly looked into whether you can run lxd
inside a container (despite what I read in the afore-mentioned SO post). Even if I can’t run the whole thing inside the container, it might be useful to run the client in the container and patch the system’s lxc
socket into the container and it could do its work.
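The rough shape of the experiment looked like this (a sketch; in reality I was poking at this from a Kubernetes pod, and since there's no systemd in the container the lxd daemon has to be started by hand):

# a privileged ubuntu container to play in
docker run -it --privileged --name lxd-playground ubuntu:16.04 /bin/bash

# inside the container:
apt-get update && apt-get install -y iproute2 lxc lxd lxd-client ca-certificates
lxd --group lxd &        # start the daemon manually
lxd init --auto          # skip the interactive prompts
lxc launch ubuntu:16.04 first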
It was relatively easy to install lxc
& lxd
inside the container, though I did run into some roadblocks:
apt-get update && apt-get install iproute2 lxc lxd lxd-client ca-certificates
was all I needed)
lxc init
command invocation.

Once I actually tried to run a container though, I ran into a cgroups issue:
root@7e301829204c:/# lxc launch ubuntu:16.04 first
Creating first
The container you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to a container, use: lxc network attach
Starting first
EROR[05-22|09:02:20] Failed starting container action=start created=2018-05-22T09:02:17+0000 ephemeral=false name=first stateful=false used=1970-01-01T00:00:00+0000
Error: Failed to run: /usr/lib/lxd/lxd forkstart first /var/lib/lxd/containers /var/log/lxd/first/lxc.conf:
Try `lxc info --show-log local:first` for more info
root@7e301829204c:/# lxc info --show-log local:first
Name: first
Remote: unix://
Architecture: x86_64
Created: 2018/05/22 09:02 UTC
Status: Stopped
Type: persistent
Profiles: default
Log:
lxc 20180522090220.864 ERROR lxc_start - start.c:lxc_spawn:1553 - Failed initializing cgroup support
lxc 20180522090220.864 ERROR lxc_start - start.c:__lxc_start:1866 - Failed to spawn container "first"
lxc 20180522090220.864 ERROR lxc_container - lxccontainer.c:wait_on_daemonized_start:824 - Received container state "ABORTING" instead of "RUNNING"
Looks like despite the fact that I started the container in privileged mode, the cgroup isolation is still interfering. At this point I decided to stop and try working with Method 3 (QEMU-in-a-pod). Hopefully someone will get back to me about building lxc/lxd statically, since it's not great that I have to do this in the first place.
Successes: 0, Failures: 2.
I took a look at the nkube
repo and all but instantly decided I didn’t want to try and use it. I figured I’d only try this IFF (if and only if) Method 3 and basically all other avenues failed.
Successes: 0, Failures: 2, Skips: 1.
As is almost always the case, the simplest/easiest method is the one that I’ve found myself reduced to pinning my hopes on. As a refresher, this approach was basically just running QEMU inside a pod. I found a dockerfile that runs qemu
so I’ll basically be doing the minimum amount of effort to get it running.
My initial revision for the pod’s resource config looked like this:
---
apiVersion: v1
kind: Pod
metadata:
name: qemu-test
spec:
containers:
- name: qemu
image: tianon/qemu
imagePullPolicy: IfNotPresent
resources:
limits:
cpu: 1000m
memory: 512Mi
requests:
cpu: 1000m
memory: 512Mi
ports:
- containerPort: 22
protocol: TCP
env:
- name: QEMU_HDA
value: /data/hda.qcow2
- name: QEMU_HDA_SIZE
value: "10G"
- name: QEMU_PU
value: "1"
- name: QEMU_RAM
value: "512"
- name: QEMU_CDROM
value: /images/debian.iso
- name: QEMU_BOOT
value: 'order=d'
- name: QEMU_PORTS
value: '2375 2376'
volumeMounts:
- name: data
mountPath: /data
- name: iso
mountPath: /images
readOnly: true
volumes:
- name: data
emptyDir: {}
- name: iso
hostPath:
path: /qemu
Note that I’m actually starting without use of /dev/kvm
, the virtualization device, which greatly speeds up VMs. Without access to it the VM is much much slower, but can still function. Other than that small hidden caveat, the resource config is pretty straightforward – I did need to download some images and host them @ /qemu
on the host system but other than that things went very smoothly. A lot of my time was spent inside the pod (via kubectl exec -it
), poking around and trying to figure out what was going on.
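Getting the images onto the host was just a matter of something like this (URL from memory; the Ubuntu cloud image below is the one that shows up later):

# on the container linux host
sudo mkdir -p /qemu && cd /qemu
sudo wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img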
Of course, things didn’t go perfectly right away, the first problem was that although qemu
was running insidethe container, there wasn’t much output, and I couldn’t SSH into the machine that was running. It turned out that instead of using a CD ROM (which you would normlaly use to install ubuntu), it would be much easier to use a Ubuntu cloud image with qemu
, and save myself a lot of time/manual work. Once I got that taken care made the output much much better, and I could skip the install-from-livecd step, to actually booting a ubuntu image itself. Here’s what the output at the end of the log looked like:
[ OK ] Started Getty on tty1.
[ OK ] Reached target Login Prompts.
[ OK ] Started LSB: automatic crash report generation.
[ OK ] Started LSB: Record successful boot for GRUB.
[ OK ] Started Authorization Manager.
[ OK ] Started Accounts Service.
Ubuntu 18.04 LTS ubuntu ttyS0
Even after this successful boot I couldn’t actually log in to the machine, even though I knew it started and was running – the reason being that cloud images don’t actually have username/password login enabled (which is nice from a security standpoint), they ONLY use ssh creds (via ~/.ssh/authorized_keys
) for SSH auth. So this meant I needed to go back to the (container building) drawing board and figure out how to make sure my SSH creds were set up inside the image itself. While it was a bit of a chore to have to go back, it was nice to get a look at how to implement something that is not too different from how bigger providers like AWS must handle it when people provision instances with SSH keys pre-selected.
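For context, the "proper" cloud way to do this is to hand the image a cloud-init user-data file (for example via a small NoCloud seed ISO) that carries the key; roughly the sketch below, though I didn't end up going this route here:

#cloud-config
# minimal user-data: wire an SSH public key up to the default user
users:
  - name: ubuntu
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-rsa AAAA... user@machine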
At this point I can confirm that booting the pre-loaded image (in the hostPath
) in qemu
is successful via the logs – now it’s time to try and ensure that the right SSH keys are inserted into the image so I can try SSHing in. I didn’t (and still don’t) have much experience with programmatic VM image modification, so I found and started looking into a program called uvtool
, and some other resources listed here:
Turns out there is a Hashicorp tool also built for this called Packer. I looked into Packer but it seemed a bit heavyweight – I didn’t want to install it on my system so I found the packer docker image that could run it. I was mostly concerned with using packer to build a qemu image so the relevant documentation was:
Looking at how much there was to learn with packer, I started to get a little skittish and look for another way. All I'm trying to do is inject a single file (~/.ssh/authorized_keys
) into an already existing ubuntu cloud image… Packer seemed at this point like overkill. As I started looking for other resources I came across a few that were enlightening:
Eventually I learned that someone’s solved this, and neither packer
or qemu-image
were necessarily the answer for the simple thing I wanted to do – libguestfs provides a suite of tools that were perfect! In particular, libguestfs's virt-copy-in
command was exactly what I was looking for. The suite of tools can also be easily used in a containerized environment so I don’t have to install it on my system. Now, instead of using uvtool
or packer
, the plan was to use the virt-copy-in
from inside a transient container to change the image (mounted into the container). The libguestfs
documentation wasn’t super clear on whether the image would change in-place or if a new one would be created but I wasn’t really too worried about it.
Before getting started on actually running virt-copy-in
though, I figured it was time to stop and have a think about which image I really wanted to use.
Ubuntu isn’t a bad image to piock, but I found myself wondering how hard it would be to run Alpine (which is likely what I’d run in production). I was away from the computer for a bit and realized that rather than building images of ubunutu, I’d much rather be building images of alpine, especially if all I’m going to use it for is to run Kubernetes. Turns out Alpine has a tool for this. Then I had an even better idea – why download the image, then start running shit to install docker
when I could just pick an image that’s small, already has docker
and other container tools, since I already know how to initialize kubernetes from relative scratch (and I've written a blog post about it)! If you're wondering, this image is CoreOS!
Turns out CoreOS actually supports booting with qemu
, which makes things pretty easy. So if you've been following along at home, we're now building a CoreOS image to run in a container on top of Kubernetes (the container orchestration system) running on top of CoreOS. I should be able to do the following:
ignition.yaml
(that I already checked will properly boot up a single node load-bearing cluster) in the afore-mentioned previous post.

The next issue I ran into was that the [qemu firmware config][qemu-firwmware-config] (which is used to load ignition config files) wasn't working as expected. I found a github issue that seemed related, which pointed at proper KVM support not being enabled as the issue. Up until now I've been using an unprivileged container, hoping that I could get all this working with minimal security impact, but it looks like I'm going to have to go with a privileged container. I even saw some indications that it might not be as insecure as I thought it might be to give a less-than-trusted VM process access to /dev/kvm
:
Along the way I also found another guide where someone was doing the exact same thing. Of course, it turns out the person attempting it, @dghubble, was a CoreOS person, and even produced a video about the experiment. It was great to find this guide since what he was trying to do is exactly what I'm doing, and he's definitely got a lot more knowledge of CoreOS, so it was easy to learn from.
Coming back to my hackfest of a project, I was still having problems with the firmware config not being properly picked up, and realized that it’s actually not possible to expose only a subset of system devices (of which /dev/kvm
is one) to a container, after reading the github issue that was filed for it. It does look like people are cognizant of the issue/need though, which is nice. Initially I thought Kubernetes's support for Device Plugins would be of help, after reading a few resources:
Unfortunately, neither of those approaches is a solution, because devices exposed that way very explicitly can't be shared amongst multiple pods. Obviously, /dev/kvm
needs to be shared. There might be something I could do with symlinks, but I doubt it would work out. Unfortunately this means that if I want passable performance I'm going to need to embrace using a privileged
container. I did post about it in the github issue, but for now I’m just going to proceed with using a privileged
pod. The good news is that it worked and the VM was way way faster – CoreOS booted in seconds, and while it didn’t fix the firmware config loading problem, it was nice to see such a speedup.
Ultimately I failed in getting qemu
to pick up my fw_cfg
option, so I resorted to writing the authorized_keys
file into the image myself. I found an amazing blog post that made it very clear, and set to writing some instructions that should have worked but didn't, because of libvirt
not being available in the build container I was using. Here’s what actually worked though:
(inside a fedora image, the coreos `toolbox` command)
# dnf install libguestfs-tools
# mkdir -p share/oem/
# cp only-ssh-ignition.json share/oem/config.ign
# virt-copy-in -a coreos_production_qemu_image.img share/ /usr/
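For reference, only-ssh-ignition.json was nothing more than a minimal Ignition config carrying an SSH key for the core user, roughly (spec version from memory):

{
  "ignition": { "version": "2.1.0" },
  "passwd": {
    "users": [
      { "name": "core", "sshAuthorizedKeys": ["ssh-rsa AAAA... user@machine"] }
    ]
  }
}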
While these commands modified the image in place (yuck), they actually didn't work. It was better for me to use the CoreOS helper script referenced in the qemu documentation. I kubectl exec
’d into the qemu
pod and ran the following:
(inside the qemu pod itself, which is an alpine image)
# apk add --update qemu-system-x86_64 bzip2 wget
# wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_qemu.sh
# chmod +x coreos_production_qemu.sh
# ln -s /images/coreos_production_qemu_image.img coreos_production_qemu_image.img
# cp /images/only-ssh-ignition.json config.ign
# # Try with just script (modified image) => NOPE
# #./coreos_production_qemu.sh -i config.ign -- -nographic
# # COPY your id_rsa.pub onto the server
# ./coreos_production_qemu.sh -a ~/.ssh/authorized_keys -- -nographic
With this I could finally build the appropriate image and get the VM running properly, which was great! It's been a long long process but I finally have a CoreOS VM running inside a pod on Kubernetes, on a CoreOS system (albeit with questionable security).
Successes: 1, Failures: 2, Skips: 1.
Before I got too excited I figured I’d try a little inception and start a docker
container inside the VM:
core@coreos_production_qemu-1688-5-3 ~ $ docker run alpine /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
ff3a5c916c92: Pull complete
Digest: sha256:7df6db5aa61ae9480f52f0b3a06a140ab98d427f86d8d5de0bedab9b8df6b1c0
Status: Downloaded newer image for alpine:latest
core@coreos_production_qemu-1688-5-3 ~ $ docker run -it alpine /bin/sh
/ #
For those of you keeping track at home, we're now in a docker container inside a CoreOS VM running inside a pod on kubernetes on a CoreOS system, and it's pretty responsive (thanks in large part to containers being so light)!
I figured all this success was too much so I decided to try without /dev/kvm
support, and found that I actually needed to enable a switch to qemu-system-x86_64
, the -s
switch, to enable qemu
to work without access to /dev/kvm
. The good news is that it worked, but the bad news is that it was terribly slow! Either way, success on two fronts, even though the non-/dev/kvm/
version is very likely unusable.
Successes: 2, Failures: 2, Skips: 1.
Now that I’ve gotten it working very experimentally (kubectl exec
ing into the pod and fiddling), the final step was to put all this into a pod spec and start it, then SSH in without any fiddling.
Here’s what the final YAML config for the pod looked like:
---
apiVersion: v1
kind: ConfigMap
metadata:
name: qemu-ssh-keys
data:
authorized_keys: |
ssh-rsa <GIBBERISH> user@machine
---
apiVersion: v1
kind: Pod
metadata:
name: qemu-test
spec:
containers:
- name: qemu
image: alpine
#image: tianon/qemu
# Really any image that builds & runs QEMU could go here,
# turns out it's not that hard necessarily, even alpine can do it
imagePullPolicy: IfNotPresent
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 1000m
memory: 1Gi
#securityContext:
# privileged: true
### doesn't work
## securityContext:
## runAsUser: 500 # on the machine, the core user is 500
## runAsGroup: 78 # on the machine, the kvm group happens to be 78
ports:
- containerPort: 22
protocol: TCP
## never got the fw_cfg cmd to work properly :(
# command:
# - start-qemu
# - -fw_cfg
# - "name=opt/com.coreos/config,file=/images/only-ssh-ignition.json"
# - -nographic
command: [ "/bin/ash", "-c", "--" ]
args: [ "while true; do sleep 30; done;" ]
env:
- name: QEMU_HDA
# value: /data/hda.qcow2
# value: /images/bionic-server-cloudimg-amd64.img
value: /images/coreos_production_qemu_image.img
- name: QEMU_HDA_SIZE
value: "10G"
- name: QEMU_PU
value: "1"
- name: QEMU_RAM
value: "1024"
# - name: QEMU_CDROM
# value: /images/debian.iso
# - name: QEMU_BOOT
# value: 'order=d'
- name: QEMU_NO_SERIAL
value: "1"
- name: QEMU_PORTS
value: '2375 2376'
##
## CoreOS Ignition config can be set, using the firmware config
## don't even have to copy in SSH configs!
##
volumeMounts:
- name: data
mountPath: /data
- name: ssh-keys
mountPath: /ssh-keys
- name: iso
mountPath: /images
- name: kvm-device
mountPath: /dev/kvm
# readOnly: true
volumes:
- name: data
emptyDir: {}
- name: ssh-keys
configMap:
name: qemu-ssh-keys
- name: iso
hostPath:
path: /qemu
- name: kvm-device
hostPath:
path: /dev/kvm
## TO RUN THIS:
#apk add --update qemu-system-x86_64 bzip2 wget
#wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_qemu.sh
#chmod +x coreos_production_qemu.sh
#ln -s /images/coreos_production_qemu_image.img coreos_production_qemu_image.img
# mkdir -p ~/.ssh
#cat /ssh-keys/authorized_keys >> ~/.ssh/authorized_keys
After starting the pod it was as easy as running the following commands:
$ kubectl port-forward qemu-test 2222:22
$ ssh localhost -p 2222
As you might have guessed, it worked! After all the experimentation I did inside the container, all it took was making sure the changes to the way I was doing things stuck, and sure enough, QEMU booted up.
Theoretically, this setup should be safer than a regular pod because the user is exposed to the VM INSIDE the pod, so there would be two levels of security to escape. In addition to the usual best practices for kubernetes pod security (PodSecurityPolicy
, seccomp
, apparmor
, etc), it might even be production-worthy. However, it’s terrible that we have to use a privileged
pod – it makes it so much worse because a container + VM escape now equals a node-compromise.
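If I do end up keeping the privileged pod around, a PodSecurityPolicy can at least pin down what it's allowed to touch; a rough, untested sketch against the policy/v1beta1 API:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: qemu-vm-runner
spec:
  privileged: true               # sadly required for the /dev/kvm setup above
  allowPrivilegeEscalation: true
  volumes: ["configMap", "emptyDir", "hostPath"]
  allowedHostPaths:
    - pathPrefix: "/qemu"        # the disk images
    - pathPrefix: "/dev/kvm"     # the kvm device
  hostNetwork: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny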
At first I wasn’t sure how to benchmark the performance, since I haven’t done much work on this side of the ops spectrum – I figured the important things to check were:
I started looking around for tools to test these things, and found a few. For testing the network I figured I could use siege
, maybe spinning up a very simple web server and pinging it. For device access, I found lots more resources:
fio
)

But it was really when searching for how to test CPU ops that I found the best tool for the job so far. I came across a Stack Exchange question which led me to stress-ng
, which is a fantastic tool that checks CPU and IO metrics. It looked like the command I needed to run was something like this:
stress-ng --cpu 1 --io 2 --vm-bytes 1G --timeout 60s --metrics-brief
So I did, for all three environments:
Regular resource-constrained pod run in k8s on the machine (machine -> k8s -> container, NO qemu)
CMD: stress-ng --cpu 1 --io 2 --vm-bytes 1G --timeout 60s --metrics-brief
RESULTS:
stress-ng: info: [52] dispatching hogs: 1 cpu, 2 io
stress-ng: info: [52] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info: [52] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [52] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [52] cpu 15169 60.04 58.31 0.00 252.63 260.14
stress-ng: info: [52] io 4342 60.07 0.00 0.68 72.29 6385.29
This result is really weird; while I kind of believe the large jump in cpu ops, the dip in IO is hard to explain… I need to understand the tool more
QEMU VM in the resource constrained pod with /dev/kvm support (machine > k8s > container > qemu vm)
CMD: ./stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
stress-ng: info: [1285] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [1290] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info: [1285] successful run completed in 60.09s (1 min, 0.09 secs)
stress-ng: info: [1285] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [1285] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [1285] cpu 3515 60.09 14.62 0.34 58.50 234.96
stress-ng: info: [1285] io 147104 60.00 0.08 3.85 2451.73 37431.04
stress-ng: info: [1285] vm 0 10.09 0.00 0.00 0.00 0.00
I should note that I copied the stress-ng
executable out of a container built from a Dockerfile in the stress-ng
project. As usual, all I had to do was build that Dockerfile and copy the static binary out of it and run it on the CoreOS VM inside the pod.
Container inside the VM without KVM support (machine > k8s > container > VM (no /dev/kvm) > stress-ng container)
CMD: docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
core@coreos_production_qemu-1688-5-3 ~ $ docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
Unable to find image 'alexeiled/stress-ng:latest' locally
latest: Pulling from alexeiled/stress-ng
1160f4abea84: Pull complete
110786018a74: Pull complete
Digest: sha256:105518acaa868016746e0bd6d58e9145a3a437792971409daf37490dbfc24ea2
Status: Downloaded newer image for alexeiled/stress-ng:latest
stress-ng: info: [1] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [9] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info: [1] successful run completed in 60.25s (1 min, 0.25 secs)
stress-ng: info: [1] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [1] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [1] cpu 96 60.07 7.86 4.90 1.60 7.52
stress-ng: info: [1] io 21908 60.00 0.21 22.63 365.12 959.19
stress-ng: info: [1] vm 0 10.38 0.00 0.03 0.00 0.00
Container inside the VM with KVM support in the machine (machine > k8s > container > VM (+/dev/kvm) > stress-ng container)
CMD: docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
core@coreos_production_qemu-1688-5-3 ~ $ docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
stress-ng: info: [1] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [11] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info: [1] successful run completed in 60.04s (1 min, 0.04 secs)
stress-ng: info: [1] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [1] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [1] cpu 3508 60.03 14.54 0.28 58.43 236.71
stress-ng: info: [1] io 104088 60.00 0.06 3.90 1734.80 26284.85
stress-ng: info: [1] vm 0 10.01 0.00 0.00 0.00 0.00
There’s a lot of caveats to the testing I’ve done and the numbers here, so take them with a boulder of salt but here are some observations/highlights:
qemu
to use.
/dev/kvm
in a pod vs different stage1 container).
--device
command like in docker will be supported in kubernetes. It's probably too insecure to run privileged containers.

While I was neckdeep in trying to figure things out, I was reminded of the fact that rkt
(an alternative container runtime, the first reasonable alternative to docker) actually supports multiple "stage 1" images, which means that it could support VMs! Trying to make it work is more like Method 2 (runtime support), but I realized that this would be another avenue to explore at some point if/when I work back around to getting a more legitimate/safe setup working, as running qemu
as the main process in a privileged pod is probably not advisable long term.
Turns out rkt
, the very first runtime I started a Kubernetes cluster with, supports alternative stage1s via annotations now! This wasn't the case when I first looked at it, but it's huge because it means I could explore that avenue for running untrusted workloads – I just have to switch out containerd
for rkt
and then use the annotations provided, and get easy runtime-level support working, theoretically. The PR that introduced the support was nice to read through as well.
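If I'm remembering the rktnetes docs right, that would look something like the pod below (the annotation name and stage1 image name are from memory, so double-check them against the PR):

apiVersion: v1
kind: Pod
metadata:
  name: qemu-test-rkt
  annotations:
    # ask rkt to use the KVM-based stage1 instead of the default container stage1
    rkt.alpha.kubernetes.io/stage1-name-override: coreos.com/rkt/stage1-kvm
spec:
  containers:
    - name: app
      image: alpine
      command: ["sleep", "3600"]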
If this works it would be much better than the current working solution (running qemu
in a privileged pod), but I’ll leave that for Part 2 since this post is already way too long!
It’s been a long wild ride, but in the end the only way I could get started even getting a glimpse at running a VM through Kubernetes on my container linux machine was through the dirty dirty hack of running a qemu
in a privileged pod, with lots of trial and error. I went on a wild goose chase through the internet, but ultimately got at least one super hacky method of running untrusted workloads on Kubernetes. It's a bit dizzying standing on the tower of abstractions being used in this post, but as long as the tower isn't crumbling I think it was a good build.
In the end, the easiest thing to get working was running QEMU from a regular pod, but the tradeoff of security vs performance is too great, so I need to look into other solutions. I'm finding that I'm somewhat restricted in my use of CoreOS, and in the future I think I'm going to go with some distributions that support more of the technology that I think are options (namely lxc
/lxd
) without so much hard work (SPOILER: I DID).