Running Untrusted Workloads on K8s with Container Linux (Part 3)

Third and final part of my adventures and mistakes in trying to get untrusted workloads running on k8s (with container linux underneath)


CoreOS logo + Kubernetes logo + kata-containers logo

tl;dr - After struggling through setting up containerd’s untrusted workload runtime, and building a static kata-runtime and a neutered-but-static qemu-system-x86_64 for it to use, I succeeded in hooking up containerd to kata-runtime only to fail @ the last step: the pods that were created ran qemu properly but couldn’t be communicated with, and would immediately make the k8s node they were running on go NotReady due to PLEG errors. I did a lot of work to partially succeed (if you want to run QEMU on container linux, this is the post for you), and hopefully these notes will help someone else out. Also, you’ll hate me for this, but I didn’t make a Dockerfile and a gitlab project like I did for static-kata-runtime, so the operational knowledge is scattered throughout this post. Basically, if you’re here from a google search trying to make things work, welcome.

UPDATE (09/16/2018)

If you're actually trying to install kata-containers, use kata-deploy -- they've done what I did here (except successfully), and it's *super* easy to install, utilizing non-privileged DaemonSets and node tagging on your cluster. It's super slick. It seems reasonably likely to work on container linux, given that they've statically compiled qemu (and kata-runtime is pretty easy to build statically because it's golang) -- YMMV!

This blog post is part of a multi-part series:

  1. Part 1 - Hacking my way to container+VM inception
  2. Part 2 - rkt alternate stage1 experiments
  3. Part 3 - giving kata-runtime a fair shake (this post)

After running into the issues mentioned in part two with using rkt as my primary runtime, I decided to go ahead and give containerd (with its support for untrusted workloads) along with kata-runtime a try. This exploration actually happened during Part 2, but I figured it was so far off in left field that it needed to be a separate post. If you look back at part one, proper runtime support was “Method 2”, and it’s still the most desirable approach as it is the most properly integrated. I originally tried “Method 2” with just runv, but couldn’t install hyperstart; this time, we’re going to use kata-containers, which is the merging of the Intel clear-containers and runv projects.

A refreshing surprise was that kata-containers was very easily buildable, thanks probably in large part to the fact that it’s a golang application! While there wasn’t a dedicated build command for fully static builds, the kata-containers project’s build process was pretty easy to follow, so I hacked it and filed an issue to make static builds officially easier to do. To elaborate, the usual tricks – doing a Golang static build, and doing it in Alpine Linux (using musl libc) – were all I needed to get a binary that I could SCP right over to the container linux box. I didn’t get to test it right away due to not having any OCI bundles on hand (I actually had a hard time finding any to use for testing, which was inconvenient), so I opted instead to test by spinning up the whole cluster and configuring containerd to actually use untrusted runtimes. That’s kind of insane, given how large a gamble it is and the amount of complexity from other systems I’m pulling in before I can validate the first step, but whatever.

Step 0: Going back to containerd

At this point I’d actually been using rkt as my runtime after the experimentation in part two. I consider rkt to be the safer runtime when compared to both docker and containerd, as it has much safer defaults around privileges, for example (which ironically is exactly why rook didn’t work). However, for this experiment, I’d need to switch back to my containerd setup, which meant rebuilding the machine with CoreOS’s ignition. I’ve previously written about the ignition config, so feel free to check that out.

Step 1: Standardizing the static build of kata-runtime

After doing a bunch of experimentation (which I’ll elide here) getting kata-runtime to build properly, I wanted to standardize the process, and building a simple container was the perfect, easy way to do that. By writing a little Dockerfile, I could do the static build in an easily portable container which when finished would contain the binaries needed by container linux.

I say it often on the blog, but I absolutely love Gitlab. One of the many reasons I love Gitlab (in this instance) is that they support releases, which makes versioning and publishing the static binaries way easier – unfortunately that feature isn’t available in Gitlab Community Edition. I don’t pay for a higher tier of Gitlab, but for now I can easily replicate the functionality by just making release branches that contain only the built binary. It’s a bit of a hack, but it makes it easy to download the releases properly.

The result of that work was the static-kata-runtime gitlab repository! One of the more interesting pieces of code to me is the Makefile, because it encapsulates how to do the releases hack:

.PHONY: check-tool-docker image retrieve-artifacts release

all: image retrieve-artifacts

VERSION := 1.0.0
DOCKER := $(shell command -v docker 2> /dev/null)
IMAGE_NAME := static-kata-container

check-tool-docker:
ifndef DOCKER
        $(error "`docker` is not available please install docker (https://docs.docker.com/install/)")
endif

image: check-tool-docker
    $(DOCKER) build -t $(IMAGE_NAME) .

retrieve-artifacts: check-tool-docker
    $(DOCKER) run --rm --entrypoint cat $(IMAGE_NAME) /kata-runtime > kata-runtime
    $(DOCKER) run --rm --entrypoint cat $(IMAGE_NAME) /kata-runtime.sha512 > kata-runtime.sha512

# The below target is only to be run on version (`vX.X.X`) branches -
# command runs with the expectation that the resulting binary will be downloaded over HTTPS from some source-control mechanism
release: check-tool-docker image retrieve-artifacts
    rm -rf Dockerfile Makefile README.md

Of course, the (likely more interesting) bits are in the Dockerfile as that actually contains everything you need to statically build the binary.

Now that we have a static kata-runtime binary, it’s time to try and use it with containerd!

Step 3: Setting up containerd to use the alternate runtime

Support for running untrusted workloads on a CRI-compliant runtime of your choice was added to containerd in version v1.1.0. It’s driven by annotations and looks pretty simple to use, requiring only an update to containerd’s configuration. Up until now I actually hadn’t had to configure containerd at all, and didn’t even have a config file, which meant I needed to create one that matched what containerd was expecting. The default configuration file can be found @ /etc/containerd/config.toml (according to the documentation), so that is where I started. Unfortunately I didn’t keep a spare copy of what the configuration file looked like when all was said and done (I also remember having some slight trouble figuring out what it should look like), but hopefully it won’t be too hard for you.
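Since I can’t reproduce my exact file, here’s a hedged sketch of what the relevant stanza looks like per the containerd v1.1 CRI plugin documentation – note that the kata-runtime path below is an assumption based on where I keep binaries on container linux, not a record of my lost config:

```toml
# /etc/containerd/config.toml (sketch, not my exact lost config)
# Pods annotated as untrusted get handed to this runtime instead of runc.
[plugins.cri.containerd.untrusted_workload_runtime]
  runtime_type = "io.containerd.runtime.v1.linux"
  runtime_engine = "/opt/bin/kata-runtime"  # assumed install path
```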

Eventually the file would have made a great addition to the ignition configuration I used for the machine, but in the moment I just did it live and added the file to the box itself. Either way, it’s almost too easy to get containerd to start using kata-runtime… Let’s see if we can test it all out.

Step 4: Testing it all out

Theoretically, with annotation-based support, I should be able to start two otherwise identical pods, and if I ensure one has the appropriate annotation, one will be in a regular container (a sandboxed/isolated process) and the other will be in a full-blown qemu-powered VM. If all goes well (SPOILER: it didn’t really), I should be able to shell into either of them using the usual kubectl exec -it <pod> /bin/sh and see different outputs for a command like uname -a.

Here’s the YAML config for the pod (the one with the annotation):

---
apiVersion: v1
kind: Pod
metadata:
  name: vm-shell
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
  - name: shell
    image: alpine:3.7
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 2
        memory: 2Gi
    command: [ "/bin/ash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]

After running the usual kubectl run, things seemed to start fine, but I was greeted with an error when I checked the kubectl describe output for the pod:

Normal   Scheduled               25s                default-scheduler   Successfully assigned vm-shell-pod to localhost
Normal   SuccessfulMountVolume   25s                kubelet, localhost  MountVolume.SetUp succeeded for volume "default-token-j9x2h"
Warning  FailedCreatePodSandBox  11s (x3 over 25s)  kubelet, localhost  Failed create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for untrusted workload is configured

At this point I realized that my configuration for containerd must have been wrong – the annotation was being read, but containerd (underneath kubernetes) couldn’t find the sandbox runtime that I specified. This meant that despite putting what I thought was the right TOML (again, sorry I can’t reproduce it here) @ /etc/containerd/config.toml, the configuration still wasn’t being picked up. I quickly realized that I hadn’t also updated the systemd service configuration I was using for containerd itself, @ /etc/systemd/system/containerd.service.d! After some fiddling with the service parameters I made some progress, leading to my next batch of error output, this time from the containerd service itself:

May 27 14:25:10 localhost containerd[25158]: time="2018-05-27T14:25:10Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:vm-shell-pod,Uid:40ee2d9a-61b8-11e8-8de7-8c89a517d15e,Namespace:default,Attempt:0,} failed, error" error="failed to start sandbox container:
failed to create containerd task: OCI runtime create failed: Cannot find usable config file (config file "/etc/kata-containers/configuration.toml" unresolvable: file /etc/kata-containers/configuration.toml does not exist, config file "/usr/share/defaults/kata-containers/configuration.toml" unresolvable: file /usr/share/defaults/kata-containers/configuration.toml does not exist): unknown"

OK, so it looks like Kubernetes is now calling containerd correctly, and containerd is finding the right sandbox runtime (kata-runtime), but I haven’t configured how to call kata-runtime properly – in particular, the files that configure kata-runtime, which would have existed if I’d installed it on any other distribution, are missing. I didn’t have a good idea what these configurations were supposed to look like, but luckily I was able to find a very long example in the kata-containers/runtime repo, a file called configuration.toml.in. This file is input to some transformation process, but it was clear enough for me to figure out what needed to be configured. I crafted as minimal a configuration as I could (sorry, I didn’t save this either :( ), and tried again.
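I didn’t save the minimal config, but reconstructing from configuration.toml.in, the skeleton I was aiming at looked roughly like this – every path here is an assumption based on my layout, and the proxy/shim sections only came into play later in this post:

```toml
# /etc/kata-containers/configuration.toml (rough reconstruction, paths assumed)
[hypervisor.qemu]
path = "/opt/bin/qemu-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinuz.container"
image = "/usr/share/kata-containers/container.img"
default_vcpus = 1
default_memory = 2048

[proxy.kata]
path = "/opt/bin/kata-proxy"

[shim.kata]
path = "/opt/bin/kata-shim"

[agent.kata]

[runtime]
```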

I had also forgotten to get the statically built qemu binaries from multiarch/qemu-user-static. In particular, the file I was looking for was qemu-x86_64-static.

NOTE FROM THE FUTURE: I’m still dealing with the wrong qemu binaries here – these are the “user” qemu binaries (i.e. qemu-x86_64-static) when I should be using qemu-system-x86_64(-static).

With the minimal kata-runtime config in place and the missing qemu binary installed I tried again:

May 27 14:38:05 localhost containerd[25158]: time="2018-05-27T14:38:05Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:vm-shell-pod,Uid:40ee2d9a-61b8-11e8-8de7-8c89a517d15e,Namespace:default,Attempt:0,} failed, error" error="failed to start sandbox container: failed to create containerd task: OCI runtime create failed: /etc/kata-containers/configuration.toml: file /usr/share/kata-containers/vmlinuz.container does not exist: unknown"

Unfortunately I needed to specify a few configuration options that I was hoping to be able to leave blank – in particular I needed to generate a rootfs image and a kernel image. I found clearcontainers/osbuilder and set out to use it, but was pretty put off by the amount of software I seemingly needed to add to my system to build everything. Luckily, they have a docker-based approach at the bottom of the README, so that’s what I used. Here’s what my work directory looked like at the start (I think I ran make or something):

$ ls workdir/
clear-dnf.conf  container.img  image_info  img  rootfs

Here’s what the commands I ran looked like:

$ mkdir /tmp/image-builder && cd /tmp/image-builder
$ export USE_DOCKER=true
$ scripts/kernel_builder.sh prepare
$ sudo mv /tmp/image-builder/linux /tmp/image-builder/workdir/linux # the script is written just slightly incorrectly I think, need to copy the linux folder into workdir to make sure container can see it
$ sudo -E make rootfs # produces 'rootfs' (folder), along with 'workdir'
$ mv linux workdir # make sure that the linux that was retrieved is in the workdir folder for the kernel generation step to use
$ # edit scripts/Dockerfile to include `elfutils-libelf-devel` in the `dnf install` @ the start
$ sudo -E make kernel # produces 'vmlinuz.container'
$ sudo -E make image # produces container.img

OK, so now I had a kernel image (vmlinuz.container) and an image (container.img, which I assumed contained a root fs), so I copied them over to the server. I was a little unclear on the difference between the two files, and file came to my rescue:

$ file workdir/vmlinuz.container
workdir/vmlinuz.container: Linux kernel x86 boot executable bzImage, version 4.14.22 (root@) #1 SMP Sun May 27 15:05:48 UTC 2018, RO-rootFS, swap_dev 0x5, Normal VGA
$ file workdir/vmlinux.container
workdir/vmlinux.container: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=2b4cc92cdc5fa2cc64f9421253265c628319c3e8, not stripped
$ file workdir/container.img
workdir/container.img: DOS/MBR boot sector; partition 1 : ID=0xee, start-CHS (0x0,0,2), end-CHS (0x3ff,254,63), startsector 1, 262143 sectors, extended partition table (last)

As you can see, vmlinuz.container is definitely the kernel. At this point I wasn’t absolutely sure how I was supposed to use vmlinux.container, but a quick check of the documentation revealed that one is just a compressed version of the other (vmlinuz.container is the compressed form of vmlinux.container).

With these files copied over and put in the right place in the configuration for kata-runtime I was able to make some more progress (logs are from containerd I believe):

May 27 15:15:55 localhost kata-runtime[16254]: time="2018-05-27T15:15:55.734367385Z" level=error msg="Invalid config type" command=create name=kata-runtime pid=16254 source=runtime
May 27 15:16:11 localhost kata-runtime[16465]: time="2018-05-27T15:16:11.733472292Z" level=info msg="loaded configuration" command=create file=/etc/kata-containers/configuration.toml format=TOML name=kata-runtime pid=16465 source=runtime
May 27 15:16:11 localhost kata-runtime[16465]: time="2018-05-27T15:16:11.733608005Z" level=info arguments="\"create --bundle /run/containerd/io.containerd.runtime.v1.linux/k8s.io/ac17e9aeb57c0bb68470e6ac32e672083ec4543be11e91bd6b55796de6c0410d --pid-file /run/containerd/io.containerd.runtime.v1.linux/k8s.io/ac17e9aeb57c0bb68470e6ac32e672083ec4543be11e91bd6b55796de6c0410d/init.pid ac17e9aeb57c0bb68470e6ac32e672083ec4543be11e91bd6b55796de6c0410d\"" command=create commit=9fb0b337ef997079b304fe895dacd5d96d6f2fb6-dirty name=kata-runtime pid=16465 source=runtime version=1.0.0
May 27 15:16:11 localhost kata-runtime[16465]: time="2018-05-27T15:16:11.746939364Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=mon-35406192-727e-418e-ba72-c6 original-name=mon-35406192-727e-418e-ba72-c67c24daf587 pid=16465 source=virtcontainers subsystem=qemu
May 27 15:16:11 localhost kata-runtime[16465]: time="2018-05-27T15:16:11.746987164Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=ctl-35406192-727e-418e-ba72-c6 original-name=ctl-35406192-727e-418e-ba72-c67c24daf587 pid=16465 source=virtcontainers subsystem=qemu
May 27 15:16:11 localhost kata-runtime[16465]: time="2018-05-27T15:16:11.747126345Z" level=error msg="Create new sandbox failed" arch=amd64 error="Invalid config type" name=kata-runtime pid=16465 sandbox-id=ac17e9aeb57c0bb68470e6ac32e672083ec4543be11e91bd6b55796de6c0410d sandboxid=ac17e9aeb57c0bb68470e6ac32e672083ec4543be11e91bd6b55796de6c0410d source=virtcontainers subsystem=sandbox

So again, this was a problem with the configuration (this time, of kata-runtime), but everything seemed right after double-checking, so I enabled the runtime.debug option and got a LOT more information out of containerd:

May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.728528994Z" level=info msg="loaded configuration" command=create file=/etc/kata-containers/configuration.toml format=TOML name=kata-runtime pid=23186 source=runtime
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.728653637Z" level=info arguments="\"create --bundle /run/containerd/io.containerd.runtime.v1.linux/k8s.io/e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8 --pid-file /run/containerd/io.containerd.runtime.v1.linux/k8s.io/e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8/init.pid e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8\"" command=create commit=9fb0b337ef997079b304fe895dacd5d96d6f2fb6-dirty name=kata-runtime pid=23186 source=runtime version=1.0.0
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.728751825Z" level=debug msg="converting /run/containerd/io.containerd.runtime.v1.linux/k8s.io/e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8/config.json" name=kata-runtime pid=23186 source=virtcontainers/oci
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.729637389Z" level=debug msg="container rootfs: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8/rootfs" name=kata-runtime pid=23186 source=virtcontainers/oci
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.729847179Z" level=debug msg="Creating bridges" arch=amd64 name=kata-runtime pid=23186 source=virtcontainers subsystem=qemu
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.72989443Z" level=debug msg="Creating UUID" arch=amd64 name=kata-runtime pid=23186 source=virtcontainers subsystem=qemu
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.741793676Z" level=debug msg="Disable nesting environment checks" arch=amd64 inside-vm=false name=kata-runtime pid=23186 source=virtcontainers subsystem=qemu
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.741898982Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=mon-f7c46de3-533d-4ee9-84a3-bb original-name=mon-f7c46de3-533d-4ee9-84a3-bb7489376242 pid=23186 source=virtcontainers subsystem=qemu
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.741940933Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=ctl-f7c46de3-533d-4ee9-84a3-bb original-name=ctl-f7c46de3-533d-4ee9-84a3-bb7489376242 pid=23186 source=virtcontainers subsystem=qemu
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.742078196Z" level=error msg="Create new sandbox failed" arch=amd64 error="Invalid config type" name=kata-runtime pid=23186 sandbox-id=e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8 sandboxid=e4218ad3719bf72259986a23459ef8f164f048607dbd5db900c4fb7a4a46abc8 source=virtcontainers subsystem=sandbox
May 27 15:22:20 localhost kata-runtime[23186]: time="2018-05-27T15:22:20.742123471Z" level=error msg="Invalid config type" command=create name=kata-runtime pid=23186 source=runtime

After reading through the output for a while, it looked like most options were actually just fine, which made me even more confused. While thinking about it, I did figure out that it’s actually either/or between the initrd and the image, thanks to a random commit in the kata-containers/runtime code.

While trying to figure out what other configuration I was missing, I searched on Github to find out whether the agent configuration was required, and found a place of interest: kata_agent.go, which made it very clear that agent configuration was indeed required – which makes sense, given that kata-agent is the thing that helps your VM talk to the outside world (I didn’t know that early on).

The config for kata-agent actually gets generated for you, according to the kata-containers developer guide – to get it I just had to head back to my previous work on static-kata-runtime, docker run -it <container> /bin/bash, and complete the make install step to generate the file from the static build. At this point I realized that not understanding how kata-agent interacted with the rest of the system was probably a big red flag that I hadn’t done enough research, so I wondered if I’d have the same problems with the proxy (kata-proxy) and shim (kata-shim) configs, but figured I’d try without them first, to at least see some more progress.

After some more config fiddling I made some more progress: kata-runtime was actually trying to run qemu now! Here’s the output from containerd, filtered with journalctl -xef -u containerd | grep "kata-container":

May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.747857731Z" level=info msg="loaded configuration" command=create file=/etc/kata-containers/configuration.toml format=TOML name=kata-runtime pid=10639 source=runtime
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.747986078Z" level=info arguments="\"create --bundle /run/containerd/io.containerd.runtime.v1.linux/k8s.io/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07 --pid-file /run/containerd/io.containerd.runtime.v1.linux/k8s.io/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/init.pid 52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07\"" command=create commit=9fb0b337ef997079b304fe895dacd5d96d6f2fb6-dirty name=kata-runtime pid=10639 source=runtime version=1.0.0
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.748079309Z" level=debug msg="converting /run/containerd/io.containerd.runtime.v1.linux/k8s.io/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/config.json" name=kata-runtime pid=10639 source=virtcontainers/oci
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.748941795Z" level=debug msg="container rootfs: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/rootfs" name=kata-runtime pid=10639 source=virtcontainers/oci
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.749137875Z" level=debug msg="Creating bridges" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qemu
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.749177883Z" level=debug msg="Creating UUID" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qemu
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.760881384Z" level=debug msg="Disable nesting environment checks" arch=amd64 inside-vm=false name=kata-runtime pid=10639 source=virtcontainers subsystem=qemu
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.760981229Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=mon-a66a34e8-9787-4041-b468-e6 original-name=mon-a66a34e8-9787-4041-b468-e634d94a15fe pid=10639 source=virtcontainers subsystem=qemu
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.761022484Z" level=warning msg="shortening QMP socket name" arch=amd64 name=kata-runtime new-name=ctl-a66a34e8-9787-4041-b468-e6 original-name=ctl-a66a34e8-9787-4041-b468-e634d94a15fe pid=10639 source=virtcontainers subsystem=qemu
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.761088153Z" level=debug msg="Could not retrieve anything from storage" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=kata_agent
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.761561131Z" level=info msg="Attaching virtual endpoint" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=network
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.76751231Z" level=info msg="Starting VM" arch=amd64 name=kata-runtime pid=10639 sandbox-id=52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07 source=virtcontainers subsystem=sandbox
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.767684044Z" level=info msg="Adding extra file [0xc42000ec98 0xc42000eca0 0xc42000eca8 0xc42000ecb0 0xc42000ecb8 0xc42000ecc0 0xc42000ecc8 0xc42000ecd0 0xc42000ec58 0xc42000ec60 0xc42000ec68 0xc42000ec70 0xc42000ec78 0xc42000ec80 0xc42000ec88 0xc42000ec90]" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qmp
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.767749592Z" level=info msg="launching qemu with: [-name sandbox-52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07 -uuid a66a34e8-9787-4041-b468-e634d94a15fe -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/mon-a66a34e8-9787-4041-b468-e6,server,nowait -qmp unix:/run/vc/sbs/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/ctl-a66a34e8-9787-4041-b468-e6,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/home/core/clear-container-image/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07/kata.sock,server,nowait -device virtio-9p-pci,fsdev=extra-9p-kataShared,mount_tag=kataShared -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=da:c3:4f:c8:08:8c,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /home/core/clear-container-image/vmlinuz.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 
noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=4 ip=::::::52891c2ef940064ab2b1652e2015f9fb546bc814ea397c4cc025cbbd08f9af07::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket -smp 1,cores=1,threads=1,sockets=1,maxcpus=4]" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qmp
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.768894409Z" level=error msg="Unable to launch qemu: exit status 1" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qmp
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.76897353Z" level=error msg="qemu: unknown option 'name'\n" arch=amd64 name=kata-runtime pid=10639 source=virtcontainers subsystem=qmp
May 27 15:40:21 localhost kata-runtime[10639]: time="2018-05-27T15:40:21.76905962Z" level=error msg="qemu: unknown option 'name'\n" command=create name=kata-runtime pid=10639 source=runtime

From this output, it looks like the very first option being passed to qemu (-name) is unknown to it – my first thought was that maybe the proxy was actually needed, so I decided to just stop, go back to static-kata-runtime, and ensure the proxy and shim were also built, installed on the machine, and configured properly.

Step 6: Backtracking to install static versions of kata-proxy and kata-shim

I needed to change the static-kata-runtime repo I set up to include static builds for kata-proxy and kata-shim, as they seemed to be needed. I did the same sort of hacks as for kata-runtime, with some minor modifications due to -ldflags already being used in the commands. The command looked something like this:

go build -o kata-proxy -ldflags "-linkmode external -extldflags '-static' -X main.version=1.0.0-a69326b63802952b14203ea9c1533d4edb8c1d64-dirty"

After doing that, the ldd output is empty (as it should be for a statically built program):

~/go/src/github.com/kata-containers/shim # ldd /root/go/src/github.com/kata-containers/shim/kata-shim
     ldd (0x7f5921fe4000)

After making sure the kata-shim was built statically as well I copied both binaries (kata-proxy and kata-shim) to the actual machine and tried again.

Step 7: Trying again, w/ kata-shim & kata-proxy

So at this point, I went back into the configuration, uncommented all the commented lines, and pointed at the necessary kata-proxy and kata-shim binaries in /opt/bin (in container linux a bunch of system directories are read-only). Upon doing this, the errors didn’t change, and as I looked into the qemu setup I realized that seemingly valid options were getting rejected by qemu, which seemed weird, so I started taking a look at qemu itself. First I checked the version (2.12.0, which was recent at the time), and confirmed with the qemu documentation that the arguments being passed were valid, but eventually I started wondering if I had the right binary in the first place (obtaining it did seem too easy):

$ which qemu
/opt/bin/qemu
$ qemu --version
qemu-x86_64 version 2.12.0 (qemu-2.12.0-1.fc29)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

One thing I tried was to open up the CoreOS toolbox (using the toolbox command) and run dnf install -y qemu, and it downloaded a TON of stuff that I didn’t realize was needed. Considering the qemu binary that I had downloaded was <5MB, I figured there was no way I had gotten the right thing. This is the point at which I actually realized that I had downloaded the wrong qemu – I had the qemu user binary but I needed the system binary, i.e. qemu-system-x86_64. I also found a really informative post about everything.
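The naming is the tell: qemu-x86_64 is the user-mode emulator (it runs a single foreign-architecture binary), while qemu-system-x86_64 emulates a whole machine, which is what kata-runtime needs to boot its VM. The version banners make the two easy to tell apart (output shape from memory, version numbers will vary):

```
$ qemu-x86_64 --version
qemu-x86_64 version 2.12.0 ...
$ qemu-system-x86_64 --version
QEMU emulator version 2.12.0 ...
```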

At this point, you should already know what’s coming next – I’m going to try and compile qemu-system-x86_64 from source.

Step 8: Building qemu-system-x86_64 from source

It was great to see that Alpine already had support for QEMU – this made me hopeful that a static binary (using musl libc) was possible, even if it wasn’t necessarily going to be easy to get. First, I installed the qemu (userland) package on Alpine to see what all the userland pieces looked like – those pieces are actually required by the qemu system packages, which were what I was after. The next step was to take a look at what the installed binary actually links against:

# file /usr/bin/qemu-system-x86_64
/usr/bin/qemu-system-x86_64: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-musl-x86_64.so.1, stripped
# ldd /usr/bin/qemu-system-x86_64
   /lib/ld-musl-x86_64.so.1 (0x7ff3c0292000)
   libepoxy.so.0 => /usr/lib/libepoxy.so.0 (0x7ff3bef1f000)
   libgbm.so.1 => /usr/lib/libgbm.so.1 (0x7ff3bed12000)
   libz.so.1 => /lib/libz.so.1 (0x7ff3beafb000)
   libaio.so.1 => /usr/lib/libaio.so.1 (0x7ff3be8f9000)
   libnfs.so.11 => /usr/lib/libnfs.so.11 (0x7ff3be6bd000)
   libcurl.so.4 => /usr/lib/libcurl.so.4 (0x7ff3be454000)
   libssh2.so.1 => /usr/lib/libssh2.so.1 (0x7ff3be22c000)
   libbz2.so.1 => /usr/lib/libbz2.so.1 (0x7ff3be01f000)
   libpixman-1.so.0 => /usr/lib/libpixman-1.so.0 (0x7ff3bdd8f000)
   libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x7ff3bdb37000)
   libasound.so.2 => /usr/lib/libasound.so.2 (0x7ff3bd844000)
   libvdeplug.so.3 => /usr/lib/libvdeplug.so.3 (0x7ff3bd63e000)
   libpng16.so.16 => /usr/lib/libpng16.so.16 (0x7ff3bd410000)
   libjpeg.so.8 => /usr/lib/libjpeg.so.8 (0x7ff3bd1b1000)
   libnettle.so.6 => /usr/lib/libnettle.so.6 (0x7ff3bcf7d000)
   libgnutls.so.30 => /usr/lib/libgnutls.so.30 (0x7ff3bcc45000)
   liblzo2.so.2 => /usr/lib/liblzo2.so.2 (0x7ff3bca28000)
   libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7ff3bc81f000)
   libspice-server.so.1 => /usr/lib/libspice-server.so.1 (0x7ff3bc526000)
   libusb-1.0.so.0 => /usr/lib/libusb-1.0.so.0 (0x7ff3bc310000)
   libusbredirparser.so.1 => /usr/lib/libusbredirparser.so.1 (0x7ff3bc109000)
   libglib-2.0.so.0 => /usr/lib/libglib-2.0.so.0 (0x7ff3bbe18000)
   libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7ff3bbc06000)
   libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7ff3c0292000)
   libexpat.so.1 => /usr/lib/libexpat.so.1 (0x7ff3bb9e5000)
   libwayland-client.so.0 => /usr/lib/libwayland-client.so.0 (0x7ff3bb7d7000)
   libwayland-server.so.0 => /usr/lib/libwayland-server.so.0 (0x7ff3bb5c6000)
   libdrm.so.2 => /usr/lib/libdrm.so.2 (0x7ff3bb3b6000)
   libssl.so.44 => /lib/libssl.so.44 (0x7ff3bb16a000)
   libcrypto.so.42 => /lib/libcrypto.so.42 (0x7ff3badc4000)
   libp11-kit.so.0 => /usr/lib/libp11-kit.so.0 (0x7ff3bab68000)
   libunistring.so.2 => /usr/lib/libunistring.so.2 (0x7ff3ba804000)
   libtasn1.so.6 => /usr/lib/libtasn1.so.6 (0x7ff3ba5f4000)
   libhogweed.so.4 => /usr/lib/libhogweed.so.4 (0x7ff3ba3c1000)
   libgmp.so.10 => /usr/lib/libgmp.so.10 (0x7ff3ba15d000)
   libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x7ff3b9e0b000)
   libcelt051.so.0 => /usr/lib/libcelt051.so.0 (0x7ff3b9bfe000)
   libopus.so.0 => /usr/lib/libopus.so.0 (0x7ff3b99aa000)
   libgio-2.0.so.0 => /usr/lib/libgio-2.0.so.0 (0x7ff3b963a000)
   libgobject-2.0.so.0 => /usr/lib/libgobject-2.0.so.0 (0x7ff3b93f9000)
   libsasl2.so.3 => /usr/lib/libsasl2.so.3 (0x7ff3b91e0000)
   libpcre.so.1 => /usr/lib/libpcre.so.1 (0x7ff3b8f85000)
   libintl.so.8 => /usr/lib/libintl.so.8 (0x7ff3b8d77000)
   libffi.so.6 => /usr/lib/libffi.so.6 (0x7ff3b8b6f000)
   libgmodule-2.0.so.0 => /usr/lib/libgmodule-2.0.so.0 (0x7ff3b896b000)
   libmount.so.1 => /lib/libmount.so.1 (0x7ff3b8720000)
   libblkid.so.1 => /lib/libblkid.so.1 (0x7ff3b84dc000)
   libuuid.so.1 => /lib/libuuid.so.1 (0x7ff3b82d6000)

Believe it or not, this list was actually encouraging – it was way smaller than when I did the same exploration on Fedora (inside the container launched by the CoreOS toolbox command). Time to try and statically build this beast.

Step 9: Building qemu-system-x86_64 from tarball source

My first instinct was to try to build the binary starting from the qemu source code. Here’s what my shell-based exploration looked like:

# pull required packages on alpine
$ apk add --update python alpine-sdk linux-headers zlib-dev glib-dev pixman-dev
# run configure and make in the qemu source code directory
./configure && make # don't forget -j to speed things up

This process failed, with the output below:

/root/qemu-2.12.0/linux-user/syscall.c:6542:22: error: 'F_EXLCK' undeclared here (not in a function)
     TRANSTBL_CONVERT(F_EXLCK),
                      ^
/root/qemu-2.12.0/linux-user/syscall.c:6537:51: note: in definition of macro 'TRANSTBL_CONVERT'
 #define TRANSTBL_CONVERT(a) { -1, TARGET_##a, -1, a }
                                                   ^
/root/qemu-2.12.0/linux-user/syscall.c:6543:22: error: 'F_SHLCK' undeclared here (not in a function)
     TRANSTBL_CONVERT(F_SHLCK),
                      ^
/root/qemu-2.12.0/linux-user/syscall.c:6537:51: note: in definition of macro 'TRANSTBL_CONVERT'
 #define TRANSTBL_CONVERT(a) { -1, TARGET_##a, -1, a }
                                                   ^
/root/qemu-2.12.0/linux-user/syscall.c: In function 'target_to_host_sigevent':
/root/qemu-2.12.0/linux-user/syscall.c:7132:14: error: 'struct sigevent' has no member named '_sigev_un'; did you mean 'sigev_value'?
     host_sevp->_sigev_un._tid = tswap32(target_sevp->_sigev_un._tid);
              ^~
/root/qemu-2.12.0/linux-user/syscall.c:7132:25: error: '(const bitmask_transtbl *)&<erroneous-expression>' is a pointer; did you mean to use '->'?
     host_sevp->_sigev_un._tid = tswap32(target_sevp->_sigev_un._tid);
                         ^
                         ->
/root/qemu-2.12.0/linux-user/syscall.c:7132:5: warning: statement with no effect [-Wunused-value]
     host_sevp->_sigev_un._tid = tswap32(target_sevp->_sigev_un._tid);
     ^~~~~~~~~
make[1]: *** [/root/qemu-2.12.0/rules.mak:66: linux-user/syscall.o] Error 1
make: *** [Makefile:478: subdir-aarch64-linux-user] Error 2

Turns out there’s an issue with compiling qemu with musl libc. Thanks a lot to Natanael Copa though – he shows up all over the place doing the gymnastics necessary to get things building on Alpine. Rather than trying to patch things up myself, I decided to get the source from aports and try with that instead, since someone’s already done the hard work (patching).

Building a static qemu-system from (aports) source

I’m going to rely heavily on aports like I did before – knowing that the qemu-system stuff is an alpine package means that other folks have done a lot of the hard work already that I can take advantage of, rather than downloading the QEMU source tarball off the bat. So basically, I’ll be trying to rebuild the qemu-system-x86_64 package. The alpine documentation on how to work with the aports tree was fantastic to skim through to get a feel for how things work. In addition to reading up on how to use aports, I needed to do some user trickery, since normally root is not allowed to do aports-based builds (using abuild). In particular I needed to add a user to do builds with:

$ adduser <user>
$ chown -R <user> <aports dir>
$ chgrp -R <user> <aports dir>
$ addgroup <user> wheel
$ addgroup <user> abuild
$ su - <user>

While you can use abuild as root with the -F option, you have to put it on every command and things still go a little wonky, so I didn’t bother and just added a whole separate user. After making this user I went into main/qemu and ran abuild -r, just to make sure things worked (they should, since this is obviously a working package in the alpine linux repos). There was a bunch of patching required (basically the fixes needed to overcome the musl libc issues), and it was nice to have the patches available to look at (they’re part of the aports package distribution). After the basic no-modification build was done, qemu-system was installed @ ~/packages.

Now that the regular basic build was working, it was time to try and get the thing to statically compile. After glancing at a few files in the source, it looked like there were two big options determining how things built – --enable-user and --disable-system seemed to be what toggled user/system builds, and --enable-static seemed to be the option that enabled static builds, though it was only used on the user side (which explains why it was so easy to get the user binaries). My first instinct was to try --enable-static on the system side. I also noticed that there were a ton of architectures that it built for by default – knowing that I was working on a pretty standard x86_64 machine, and probably always would be, I limited the architectures by modifying the subsystems variable in the APKBUILD (the build recipe file in the aports package).

Next thing I needed to do was add some more dependencies, here’s a consolidated list:

$ apk add lzo-dev libseccomp-dev gtk+3.0-dev libcap-ng-dev alsa-lib-dev snappy-dev xen-dev cyrus-sasl-dev xfsprogs-dev jpeg-dev vde2-dev bluez-dev

I thought I was ready to do another build, but using abuild for the second build was a little confusing: if you delete the contents of ~/packages and rebuild, the output doesn’t go there the second time (or it didn’t for me). I had to go into the src directory generated by abuild -r and set configuration directly there. Since that was basically just the raw codebase, I figured I’d deal with getting abuild to work properly later and just see if I could get qemu to build. Of course, I’d be exactly where I was before if I didn’t apply the patches developed by the Alpine people – so after some reading I found I could use abuild prepare to apply the patches I cared about. The next step is to actually run the configure part of the build:

$ ./configure --static --cpu="x86_64"

After running this I got tons of errors as various static dependencies weren’t found. Yes, you read that right – errors in the configure step, which is normally fire-and-forget. Here’s an example of one:

cc -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -L/usr/lib -Wendif-labels -Wno-shift-negative-value -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-strong -o config-temp/qemu-conf.exe config-temp/qemu-conf.c -m64 -static -g -lsnappy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lsnappy
collect2: error: ld returned 1 exit status

I checked and found that /usr/lib contained libsnappy.so, but what I needed was a statically linked version of snappy (i.e. libsnappy.a). But before jumping into the rabbit hole of trying to statically build that library or any of the others, I wondered if I could just cut out more unneeded functionality to avoid building stuff I didn’t need. Here’s what the list of not-properly-linked libraries looked like:

aac5c740b857:/aports/main/qemu/src/qemu-2.12.0# cat config.log  | grep "cannot find"
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lsnappy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lsasl2
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lncursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lncursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lncursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lncursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcursesw
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lssh2
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lssh2
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lbluetooth
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgthread-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lglib-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lintl
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lintl
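
To pull just the library names out of a wall of errors like that, a tiny pipeline does the trick – sketched here with an inline sample standing in for config.log (point it at the real file in practice):

```shell
# Reduce "cannot find -lfoo" linker errors to a unique list of library names.
# The printf lines stand in for `cat config.log`.
printf '%s\n' \
  'ld: cannot find -lsnappy' \
  'ld: cannot find -lncursesw' \
  'ld: cannot find -lncursesw' \
  'ld: cannot find -lintl' \
  | grep -o 'cannot find -l[[:alnum:]_.+-]*' \
  | sed 's/.*-l//' \
  | sort -u
```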

That’s a bunch, but like half of them are ncurses related, and it turns out container linux actually has some of these dependencies already installed. Looking at the list of shared libraries required by qemu-system but not provided by container linux already, here’s the list of things I have to actually statically compile:

  • libepoxy
  • libgbm
  • libpixman
  • libasound
  • libvdeplug
  • libpng
  • libjpeg
  • libnettle
  • libgnutls
  • libsnappy
  • libspice-server
  • libusbredirparser
  • libwayland-*
  • libdrm
  • libp11-kit
  • libunistring
  • libtasn1
  • libhogweed
  • libgmp
  • libcelt051
  • libintl
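
As a side note, this kind of "needed minus already-on-the-host" list can be computed with comm instead of eyeballed. A runnable sketch (the inline sample lists stand in for real ldd output and a listing of container linux’s /usr/lib):

```shell
# Libraries the binary needs, minus libraries the host already ships.
# comm requires sorted input; -23 suppresses columns 2 (host-only) and
# 3 (common), leaving only lines unique to needed.txt.
printf '%s\n' libdrm.so.2 libepoxy.so.0 libz.so.1 | sort > needed.txt
printf '%s\n' libcurl.so.4 libz.so.1              | sort > host.txt
comm -23 needed.txt host.txt
rm -f needed.txt host.txt
```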

At this point I started wondering just how many of these I actually needed to get a qemu-system that would boot up. Obviously stuff like libgnutls and libdrm would be necessary, but I wasn’t sure about libasound or libwayland-*. The list kind of freaked me out, so I started removing flags (functionality of qemu-system) from the build to lessen the load, then my mind wandered….

SIDETRACK: Wondering if I should try rkt again

At this point everything was getting pretty heavy and it seemed like I’d never get anything working, so I started wondering if I had given rkt enough of a shot. It really seemed like running an alternative stage1 was going to be so much easier. Well, it turns out that even that avenue of retreat was impossible, because I discovered that using rkt for untrusted runtimes seemed to require securityContext: privileged, supposedly to allow people to disable the feature. This is about where I discovered rktlet’s getting started guide, but I felt like I was going to run into the exact same restrictive-permissions issues as in the second post (which, again, would be more secure to solve properly than ignore, but I was looking for a quick win, not death by 1000 cuts just yet).

Step 10: Back to trying to build a statically linked (or at least minimally dynamically linked) qemu-system-x86_64

OK, after that brief bout of panic, let’s get back to building qemu-system. After running configure (as I wrote out earlier), the next build actually locked up my relatively beefy desktop – this time I had added -j to make, and that was enough to hobble it. With this ominous start I took some time to think of other ways to possibly get what I wanted.

SIDETRACK/DIGRESSION ALERT One thing I wondered was whether it was possible to give access to /dev/kvm to a non-root user – producing a way for me to “share” the device amongst containers running on the same system without giving them root privileges on the system. I found a stack exchange post that seemed to point to being able to authorize groups, but at the time I looked into it, runAsGroup feature flag support was still in alpha, so I passed on it. This could also be a really, really easy way forward for others.
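
For anyone who wants to chase that route, the usual shape is a dedicated group plus group read/write on the device node – sketched below on a throwaway file, since the real thing needs root and an actual /dev/kvm (the kvm group name is the common convention; I didn’t verify it on container linux):

```shell
# On a real host (as root) this would be:
#   chown root:kvm /dev/kvm && chmod 660 /dev/kvm
#   addgroup <user> kvm
# Demonstrated on a temp file so the example runs anywhere.
dev=$(mktemp)
chmod 660 "$dev"        # rw for owner and group, nothing for others
stat -c '%a' "$dev"     # prints 660
rm -f "$dev"
```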

OK, so back to the grind again – I found an excellent resource on how to build qemu for only x86_64, which gave me some confidence. After skimming it I changed my configure command to:

$ ./configure --target-list=x86_64-softmmu --enable-debug

I was wrong earlier – it wasn’t the --cpu flag that I needed to set. After doing this, I found out that I could fix just about all my problems running configure by installing glib-static. With this, the above configure command ran wonderfully! Unfortunately, there were still problems with the actual make build; here’s the updated list of missing libraries, from the make command (the actual build) this time:

/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -latk-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -latk-bridge-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -latspi
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcairo
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcairo-gobject
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -ldbus-1
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -ldrm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lepoxy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lfontconfig
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lfreetype
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgbm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgdk-3
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgdk_pixbuf-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgraphite2
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgtk-3
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lharfbuzz
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpango-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpangocairo-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpangoft2-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

Again, as weird as it sounds, this actually encouraged me at this point – this list is also not so bad! Let’s just go off and build all the things! Here’s a quick rundown of the commands to get to this point:

$ apk add --update python alpine-sdk linux-headers zlib-dev glib-dev glib-static pixman-dev lzo-dev libseccomp-dev gtk+3.0-dev libcap-ng-dev alsa-lib-dev snappy-dev xen-dev cyrus-sasl-dev xfsprogs-dev jpeg-dev vde2-dev bluez-dev
$ git clone git://dev.alpinelinux.org/aports
$ cd aports/main/qemu
$ abuild -rF unpack
$ abuild -F prepare
$ cd src/qemu-2.12.0
$ ./configure --target-list=x86_64-softmmu --enable-debug --cpu=x86_64 --static

Static build: atk

First up is atk, which as far as I can tell is an accessibility toolkit for GTK. I couldn’t find a static package (something like atk-static) for alpine, but I was able to find atk-dev, so I started from there. Luckily, building the atk library statically was pretty easy; here’s the step-by-step:

  • go into aports, look under main/atk
  • run abuild -F unpack to get the source
  • run abuild -F prepare (just in case; there aren’t really any patches)
  • go into the source and run ./configure --enable-static
  • run make
  • cp ./atk/.libs/* /usr/lib (make install does not copy the static libs to /usr/lib, I had to find . -name "*.a" to find them)
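
That last step (hunting down the .a files that make install skips) repeats for every static build below, so here’s the find-and-copy idiom on a synthetic tree (the atk paths are just illustrative):

```shell
# make install often omits static archives; find and copy them yourself.
# A fake build tree stands in for a real library source directory.
mkdir -p demo/atk/.libs demo/usr-lib
touch demo/atk/.libs/libatk-1.0.a demo/atk/.libs/libatk-1.0.so
find demo/atk -name '*.a' -exec cp {} demo/usr-lib \;
ls demo/usr-lib         # only the .a archive is copied
rm -rf demo
```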

After installing, here’s the list, down by one, indicating slow-but-steady progress:

~/aports/main/qemu/src/qemu-2.12.0 # make -j4 2>&1 | grep "cannot find" | sort | uniq
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -latk-bridge-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -latspi
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcairo
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcairo-gobject
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -ldbus-1
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -ldrm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lepoxy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lfontconfig
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lfreetype
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgbm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgdk-3
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgdk_pixbuf-2.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgraphite2
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgtk-3
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lharfbuzz
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpango-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpangocairo-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpangoft2-1.0
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

One down, 19 to go! After this, I did take some time to think about whether I actually needed GTK support or could disable it, since I’m not going to be using any of those features on an OS that’s primarily going to run containers. After taking a look at the build code, I came up with the following configure command:

./configure --target-list=x86_64-softmmu --enable-debug --cpu=x86_64 --static --disable-gtk --disable-user

This made things WAY EASIER – check out the updated error list:

~/aports/main/qemu/src/qemu-2.12.0 # make -j4 2>&1 | grep "cannot find" | sort | uniq
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -ldrm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lepoxy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgbm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

Yeah, so at this point I was internally and externally yelling “FUCK GTK, THEN!” despite having nothing against GTK. I did waste time building atk, but the time I saved by not building all the other things that were required because of it was a serious weight off my shoulders.

Static build: libdrm

Top of the updated list of missing libraries is libdrm so I set off to start building it. Here’s the step-by-step:

  • downloaded the source code
  • configure --enable-static errored, complaining about not finding libpciaccess!
  • apk add libpciaccess-dev cleared that up… but it seems like when I build statically it’s going to be an issue (fractally)
  • ran make

While things ran smoothly, I found a few more archives than I expected:

~/aports/main/libdrm/src/libdrm-2.4.89 # find . -name "*.a"
./libkms/.libs/libkms.a
./radeon/.libs/libdrm_radeon.a
./nouveau/.libs/libdrm_nouveau.a
./.libs/libdrm.a # <--- there it is
./intel/.libs/libdrm_intel.a
./amdgpu/.libs/libdrm_amdgpu.a
./tests/util/.libs/libutil.a
./tests/kms/.libs/libkms-test.a

I figured I’d only copy the one I needed for the build, so I copied libdrm.a (cp .libs/libdrm.a /usr/lib) and was on my way. As you’d expect, the build of qemu-system had one less error!

~/aports/main/qemu/src/qemu-2.12.0 # make -j4 2>&1 | grep "cannot find" | sort | uniq
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lepoxy
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgbm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

Static build: libepoxy

Next up is libepoxy – at this point you likely know the drill so I’ll just boil it down to the hiccups:

  • autoreconf binary wasn’t installed, because somehow autoconf wasn’t installed (apk add autoconf)
  • aclocal binary wasn’t installed, b/c automake wasn’t installed (apk add automake)
  • xorg-macros binary wasn’t installed, b/c util-macros package wasn’t installed (apk add util-macros)
  • libtoolize binary wasn’t installed, b/c libtool package wasn’t installed (apk add libtool)

After this everything was the same: running ./configure --enable-static and make worked, and the static library went to .libs/libepoxy.a. As you’d expect, qemu-system had one less error for me:

~/aports/main/qemu/src/qemu-2.12.0 # make -j4 2>&1 | grep "cannot find" | sort | uniq
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lgbm
/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

Static build: libgbm (AKA mesa-gbm)

Despite the linker error calling it lgbm, the code I was looking for was mesa-gbm – another seemingly unnecessary library, but I’ll try and build it anyway. Most of it was the same – IIRC I ran abuild unpack and abuild prepare, but I ran into an issue with the --enable-llvm flag being required when building for r300-series discrete GFX chips (I have a Radeon r370 in my desktop and recognized the designation). Since it really didn’t matter whether my qemu-system-x86_64 could support advanced graphics, I had two choices: remove the device support or enable LLVM. I found a random post on phoronix that left a good hint on where to go. Rather than try to get it to work, I just removed that from the mesa-gbm build by editing the configure command I was using:

./configure --enable-static --with-gallium-drivers=

After this I ran into another issue – it turns out the DRI libraries can’t be built statically, like it’s literally impossible:

checking for LIBDRM... yes
configure: error: DRI cannot be build as static library

That’s all I got from configure, which was pretty disheartening. After reading the Arch Linux qemu documentation I found out that the gallium drivers (which I had disabled) are what support virtio’s operation, so the configure command had to change again:

./configure --enable-static --with-dri-drivers= --with-gallium-drivers=virgl --disable-shared --disable-driglx-direct

Running this produced an error, however:

configure: error: gbm cannot be build as static library

At this point I thought building mesa-gbm/gbm statically was fucked, but I came across a build file from yocto linux’s build of QEMU, and was able to find the configure command that got me across the finish line by disabling OpenGL (again something I don’t need on a VM that’s just going to run non-graphically 100% of the time):

./configure --target-list=x86_64-softmmu --enable-debug --cpu=x86_64 --static --disable-gtk --disable-user --disable-opengl

After this, we’re only down to ONE missing library in the output of the build from qemu-system-x86_64!

/usr/lib/gcc/x86_64-alpine-linux-musl/6.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lpixman-1

Static build: pixman-dev

Building this was super easy to do statically, very similar to libepoxy and atk, so I won’t even write it up here.

Step 11: Cautious optimism and some double checking

Well, at this point skepticism was through the roof, because I’d seemingly done it – resolved all the dependencies – and was only a make command away from having a statically compiled qemu-system-x86_64. So I did what you do when you’re standing on the precipice – I jumped off, ran make, and produced a binary that should have been statically linked. Of course, when things finished successfully I couldn’t believe it, so I turned to file and ldd:

~/aports/main/qemu/src/qemu-2.12.0 # file x86_64-softmmu/qemu-system-x86_64
x86_64-softmmu/qemu-system-x86_64: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped
~/aports/main/qemu/src/qemu-2.12.0 # ldd x86_64-softmmu/qemu-system-x86_64
        ldd (0x7f1cbd295000)

This is pretty awesome – while this is a heavily neutered version of qemu-system, I’m super happy it built, and the binary itself was only ~45MB:

$ du -hs qemu-system-x86_64
45M     qemu-system-x86_64
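
As a scriptable sanity check, counting resolved (`=>`) entries in ldd output distinguishes dynamic from (near-)static binaries. Checked against /bin/sh below purely so the snippet runs anywhere – substitute the freshly built qemu-system-x86_64 in practice:

```shell
# A fully static binary resolves no shared libraries, so ldd output has no
# "=>" lines (musl's ldd prints only the loader line, as above).
bin=/bin/sh
deps=$(ldd "$bin" 2>/dev/null | grep -c '=>' || true)
echo "dynamic deps: ${deps:-0}"
```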

So at this point paranoia washed over me, and I felt I needed to save my work in a Dockerfile as quickly as possible so I could build everything easily and repeatably. Unfortunately I didn’t act on that instinct, so I don’t have a Dockerfile I can point you at now, but I copied the statically built qemu-system-x86_64 to the server directly. First, I tried to test whether the binary actually worked, following the qemu documentation on testing system images:

$ core@localhost ~ $ ./qemu-system-x86_64 or1k-linux-4.10
WARNING: Image format was not specified for 'or1k-linux-4.10' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
qemu: could not load PC BIOS 'bios-256k.bin'
$ core@localhost ~ $ wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_iso_image.iso
--2018-05-29 05:51:32--  https://stable.release.core-os.net/amd64-usr/current/coreos_production_iso_image.iso
Resolving stable.release.core-os.net... 104.16.21.26, 104.16.20.26, 2400:cb00:2048:1::6810:141a, ...
Connecting to stable.release.core-os.net|104.16.21.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 378535936 (361M) [application/x-iso9660-image]
Saving to: 'coreos_production_iso_image.iso'
coreos_production_iso_image.iso                                      100%[====================================================================================================================================================================>] 361.00M   113MB/s    in 3.4s
2018-05-29 05:51:35 (106 MB/s) - 'coreos_production_iso_image.iso' saved [0/0]
$ core@localhost ~ $ ls
coreos_production_iso_image.iso  or1k-linux-4.10  qemu-system-x86_64
$ core@localhost ~ $ ./qemu-system-x86_64 -cdrom coreos_production_iso_image.iso
qemu: could not load PC BIOS 'bios-256k.bin'

As you can see, I tried with the or1k image and also the coreos production ISO in CD-ROM mode, and got the same BIOS error both times, so as far as I could tell QEMU itself was working. This is flimsy logic, but not seeing a segfault or any other more serious problem was pretty encouraging at the time. I also found more information on the error that occurred, and that was even more encouraging – it happened because I was simply missing things qemu required, not because the qemu binary was broken.

Step 12: Debugging QEMU as it tries to get running

After going through all the steps above again (I still don’t have a proper Dockerfile) to generate all the executables and pieces you need (VM image, VM kernel, kata-runtime configs, etc), it was time to actually start trying to run an untrusted container. I set up all the config for containerd and kata-runtime (so they could talk to each other), pointed everything at the right places, created a pod with the right annotation, and intently watched kubelet (journalctl -xef -u kubelet) and containerd (journalctl -xef -u containerd) to start the debug loop.

DEBUG: can’t find the bios

The first issue I ran into was similar to the one from running qemu alone – I needed a BIOS for this VM image! The error from inside kata-runtime (surfaced through containerd):

ERROR: OCI runtime create failed: qemu: could not load PC BIOS 'bios-256k.bin': unknown"

I was able to find a bios in the qemu repository @ https://github.com/qemu/qemu/blob/master/pc-bios/bios-256k.bin. To make use of it, I created a wrapper script so I could directly modify what happens when kata-runtime tries to start qemu. Here’s how the script started out:

#!/bin/bash

/opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios $@

As you can see, the simplest version just takes the arguments that kata-runtime would normally pass to qemu and injects the -L flag specifying where the BIOS lives.

DEBUG: no_timer_check file/directory missing

The second issue I ran into was a little more obscure, but the containerd logs were great at showing what went wrong:

Error: qemu-system-x86_64: -append tsc=reliable: Could not open 'no_timer_check': No such file or directory: unknown
May 29 11:48:17 localhost kata-runtime[25014]: time="2018-05-29T11:48:17.564507816Z" level=info msg="launching qemu with: [-name sandbox-497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f -uuid 295a9200-e7fc-470f-a83b-62665cd2192f -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f/mon-295a9200-e7fc-470f-a83b-62,server,nowait -qmp unix:/run/vc/sbs/497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f/ctl-295a9200-e7fc-470f-a83b-62,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/var/lib/kata-containers/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f/kata.sock,server,nowait -device virtio-9p-pci,fsdev=extra-9p-kataShared,mount_tag=kataShared -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=8e:0d:dd:b9:e0:db,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /var/lib/kata-containers/vmlinuz.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k 
console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=12 ip=::::::497560af811a91e8ae0d92fd1a43aee8a26a10d380a6e4295d4bcbcebb949d8f::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket -smp 1,cores=1,threads=1,sockets=1,maxcpus=12]" arch=amd64 name=kata-runtime pid=25014 source=virtcontainers subsystem=qmp

Fixing this meant extending the bash script hack – clearly I was going to have to get more and more familiar with how kata-runtime was trying to start qemu. There’s a TON of flags in there; hopefully I wouldn’t have to investigate every single one.
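Incidentally, the root cause here appears to be plain shell word splitting: kata-runtime hands qemu the whole kernel command line as a *single* -append argument, but the wrapper's unquoted `$@` re-splits it into separate words, so qemu sees `-append tsc=reliable` and then treats the stray `no_timer_check` as a positional (disk image) argument. A minimal demonstration of the difference:

```shell
# Simulate the argv kata-runtime passes: -append plus ONE string.
set -- -append "tsc=reliable no_timer_check quiet"
unquoted=($@)    # word-split: -append tsc=reliable no_timer_check quiet
quoted=("$@")    # preserved:  -append "tsc=reliable no_timer_check quiet"
echo "${#unquoted[@]} vs ${#quoted[@]}"   # → 4 vs 2
```

Which is why the fixed-up wrapper below has to reassemble everything after -append back into one quoted argument.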

DEBUG: Missing bios files

I realized that I needed to copy all of the bios files from qemu’s pc-bios folder, out of the docker container where I did the build.
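As a small helper, the copy boils down to this (the `docker cp` source path and container name in the comments are placeholders – they're wherever your build container left qemu's source tree):

```shell
# Copy every *.bin from qemu's pc-bios directory into the directory
# the wrapper's -L flag points at.
install_bios() {
  local src=$1 dst=$2
  mkdir -p "$dst"
  cp "$src"/*.bin "$dst"/
}
# e.g., after: docker cp <qemu-build-container>:/path/to/qemu/pc-bios /tmp/pc-bios
# install_bios /tmp/pc-bios /var/lib/kata-containers/bios
```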

DEBUG: Getting all the hacks into one spot

After lots more hacking here’s what the script looked like:

SCRIPT:
#!/bin/bash

# Flatten the argv kata-runtime hands us into one string (quoting is
# lost here, which is why -append is reassembled into one argument below)
args=$@
# Everything before -append...
pre_append=$(echo $args | sed 's/\-append.*//')
# ...minus the virtio-9p-pci device this qemu build doesn't have
pre_append=$(echo $pre_append | sed 's/\-device virtio-9p-pci,fsdev=extra-9p-kataShared,mount_tag=kataShared//')
# Everything after -append is the kernel command line
post_append=$(echo $args | sed 's/.*\-append//')
# Log every munged invocation for debugging
echo -e "$pre_append $post_append" >> /tmp/qemu-calls
# Inject the BIOS dir, then hand the kernel command line back as ONE argument
/opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios $pre_append -append "$post_append"

This is what madness looks like – I’m literally intercepting the flags passed by kata-runtime and doing string munging to turn them into flags my qemu build will accept.
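For what it's worth, a less fragile version of the same hack would walk the argv as a bash array instead of sed-ing a flattened string, so spacing inside other flags can't break the munging. This is a sketch of what I'd write today, not what I actually ran:

```shell
# Rebuild qemu's argv: drop the -device virtio-9p-pci entry this build
# lacks, and hand everything after -append back as ONE argument.
rewrite_args() {
  local pre=() append=()
  while [ $# -gt 0 ]; do
    if [ "$1" = "-append" ]; then
      shift; append=("$@"); break              # the rest is the kernel cmdline
    elif [ "$1" = "-device" ] && [ "${2#virtio-9p-pci}" != "$2" ]; then
      shift 2                                  # skip the flag and its value
    else
      pre+=("$1"); shift
    fi
  done
  printf '%s\n' -L /var/lib/kata-containers/bios "${pre[@]}" -append "${append[*]}"
}
# The real wrapper would exec qemu with these arrays instead of printing:
#   exec /opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios \
#     "${pre[@]}" -append "${append[*]}"
```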

Step 13: More cautious optimism along with partial failure (but also partial success?)

After all this hacking, I finally ran it, and… it starts up! But while k8s thinks the pod started, it isn’t reporting on it properly. A look at the logs shows nothing immediately wrong:

     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.63968976Z" level=info msg="launching qemu with: [-name sandbox-6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a -uuid 0ae6d9a1-66b4-4e3d-b9f4-7578ae12e14e -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/mon-0ae6d9a1-66b4-4e3d-b9f4-75,server,nowait -qmp unix:/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/ctl-0ae6d9a1-66b4-4e3d-b9f4-75,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/var/lib/kata-containers/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/kata.sock,server,nowait -device virtio-9p-pci,fsdev=extra-9p-kataShared,mount_tag=kataShared -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=96:09:e1:57:c8:f8,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /var/lib/kata-containers/vmlinuz.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp 
reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=12 ip=::::::6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket no_timer_check -smp 1,cores=1,threads=1,sockets=1,maxcpus=12]" arch=amd64 name=kata-runtime pid=24041 source=virtcontainers subsystem=qmp
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.700216428Z" level=info msg="{\"QMP\": {\"version\": {\"qemu\": {\"micro\": 0, \"minor\": 12, \"major\": 2}, \"package\": \"\"}, \"capabilities\": []}}" arch=amd64 name=kata-runtime pid=24041 source=virtcontainers subsystem=qmp
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.700407212Z" level=info msg="QMP details" arch=amd64 name=kata-runtime pid=24041 qmp-capabilities= qmp-major-version=2 qmp-micro-version=0 qmp-minor-version=12 source=virtcontainers subsystem=qemu
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.700473689Z" level=info msg="{\"execute\":\"qmp_capabilities\"}" arch=amd64 name=kata-runtime pid=24041 source=virtcontainers subsystem=qmp
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.700838482Z" level=info msg="{\"return\": {}}" arch=amd64 name=kata-runtime pid=24041 source=virtcontainers subsystem=qmp
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.700919512Z" level=info msg="VM started" arch=amd64 name=kata-runtime pid=24041 sandbox-id=6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a source=virtcontainers subsystem=sandbox
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.701388061Z" level=info msg="proxy started" arch=amd64 name=kata-runtime pid=24041 proxy-pid=24077 proxy-url="unix:///run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/proxy.sock" sandbox-id=6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a source=virtcontainers subsystem=kata_agent
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.701489259Z" level=warning msg="unsupported address" address="fe80::9409:e1ff:fe57:c8f8/64" arch=amd64 name=kata-runtime pid=24041 source=virtcontainers subsystem=kata_agent unsupported-address-type=ipv6
     May 29 13:39:56 localhost kata-runtime[24041]: time="2018-05-29T13:39:56.701609103Z" level=warning msg="unsupported route" arch=amd64 destination="fe80::/64" name=kata-runtime pid=24041 source=virtcontainers subsystem=kata_agent unsupported-route-type=ipv6

Weirdly enough, I see 3 processes when I pgrep for qemu:

     core@localhost ~ $ pgrep -a qemu
     23507 /opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios -name sandbox-70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c -uuid 048d74e0-4ae7-4bb7-8d11-5fb3416854bb -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c/mon-048d74e0-4ae7-4bb7-8d11-5f,server,nowait -qmp unix:/run/vc/sbs/70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c/ctl-048d74e0-4ae7-4bb7-8d11-5f,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/var/lib/kata-containers/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c/kata.sock,server,nowait -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=62:b4:c0:a3:25:4d,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /var/lib/kata-containers/vmlinuz.container -append  tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 
rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=12 ip=::::::70e865fe9662e87889534dcfc5486868bdc50ac7c948c6aa42829ac36a5a455c::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket no_timer_check -smp 1,cores=1,threads=1,sockets=1,maxcpus=12
     23827 /opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios -name sandbox-beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae -uuid fa8e69c7-428e-4779-b1e7-ee9ae9b37527 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae/mon-fa8e69c7-428e-4779-b1e7-ee,server,nowait -qmp unix:/run/vc/sbs/beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae/ctl-fa8e69c7-428e-4779-b1e7-ee,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/var/lib/kata-containers/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae/kata.sock,server,nowait -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=ea:68:5e:95:ac:7c,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /var/lib/kata-containers/vmlinuz.container -append  tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 
rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=12 ip=::::::beb0cfa06a6ce4208484239fbf935667d982718b12ce6e67eb109d6000bb30ae::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket no_timer_check -smp 1,cores=1,threads=1,sockets=1,maxcpus=12
     24062 /opt/bin/qemu-system-x86_64 -L /var/lib/kata-containers/bios -name sandbox-6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a -uuid 0ae6d9a1-66b4-4e3d-b9f4-7578ae12e14e -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/mon-0ae6d9a1-66b4-4e3d-b9f4-75,server,nowait -qmp unix:/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/ctl-0ae6d9a1-66b4-4e3d-b9f4-75,server,nowait -m 2048M,slots=2,maxmem=25121M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2 -device virtio-serial-pci,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/var/lib/kata-containers/container.img,size=134217728 -device virtio-scsi-pci,id=scsi0 -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/sbs/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a/kata.sock,server,nowait -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3:4:5:6:7:8:9:10,fds=11:12:13:14:15:16:17:18 -device driver=virtio-net-pci,netdev=network-0,mac=96:09:e1:57:c8:f8,mq=on,vectors=18 -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -kernel /var/lib/kata-containers/vmlinuz.container -append  tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 
rootflags=dax,data=ordered,errors=remount-ro rw rootfstype=ext4 quiet systemd.show_status=false panic=1 initcall_debug nr_cpus=12 ip=::::::6461cd25700b6f8c285e784bfe4e3df204bbb6feeda709100431d15df952061a::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket no_timer_check -smp 1,cores=1,threads=1,sockets=1,maxcpus=12

This is awesome, because it means it’s definitely running, even though kubernetes can’t actually access the pod properly. I’ve succeeded… a little bit? I tried deleting the old pods to see if the extra qemu processes would go away, and this is when I first faced kubernetes’ PLEG error – the node had become NotReady because it couldn’t access and properly control the untrusted-workload pods it had started. In the meantime I sudo pkill qemu’d the processes. For future reference, the kubernetes errors I was seeing looked like this:

Ready            False   Tue, 29 May 2018 22:49:26 +0900   Tue, 29 May 2018 22:43:05 +0900   KubeletNotReady              PLEG is not healthy: pleg was last seen active 9m29.776835455s ago; threshold is 3m0s

After a restart, the kubelet was able to recover from the botched pod deletion. While I’d had some small success (the processes were running), I wasn’t sure what was wrong, so I wanted to look back through the hacks I was applying and see if any of them were the culprit:

INVESTIGATION: Fix/Replace the virtio-9p-pci driver

I thought one of the reasons behind the issue might be the virtio-9p-pci driver that I wasn’t yet using, but which IIRC kata-runtime requires (the command comes in with it specified). If this was true, I had two options:

  • Build qemu with virtio-9p-pci driver support (assuming it’s something you can build in)
  • Replace the virtio-9p-pci driver in the command with another driver that was included with the binary I built (turns out you can list the supported drivers with qemu-system-x86_64 --device help)
$ qemu-system-x86_64 --device help
.... lots more output ...
Storage devices:
name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter"
name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter"
name "floppy", bus floppy-bus, desc "virtual floppy drive"
name "ich9-ahci", bus PCI, alias "ahci"
name "ide-cd", bus IDE, desc "virtual IDE CD-ROM"
name "ide-drive", bus IDE, desc "virtual IDE disk or CD-ROM (legacy)"
name "ide-hd", bus IDE, desc "virtual IDE disk"
name "isa-fdc", bus ISA
name "isa-ide", bus ISA
name "lsi53c810", bus PCI
name "lsi53c895a", bus PCI, alias "lsi"
name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078"
name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108"
name "nvme", bus PCI, desc "Non-Volatile Memory Express"
name "piix3-ide", bus PCI
name "piix3-ide-xen", bus PCI
name "piix4-ide", bus PCI
name "pvscsi", bus PCI
name "scsi-block", bus SCSI, desc "SCSI block device passthrough"
name "scsi-cd", bus SCSI, desc "virtual SCSI CD-ROM"
name "scsi-disk", bus SCSI, desc "virtual SCSI disk or CD-ROM (legacy)"
name "scsi-generic", bus SCSI, desc "pass through generic scsi device (/dev/sg*)"
name "scsi-hd", bus SCSI, desc "virtual SCSI disk"
name "sdhci-pci", bus PCI
name "usb-bot", bus usb-bus
name "usb-mtp", bus usb-bus, desc "USB Media Transfer Protocol device"
name "usb-storage", bus usb-bus
name "usb-uas", bus usb-bus
name "vhost-scsi", bus virtio-bus
name "vhost-scsi-pci", bus PCI
name "vhost-user-blk", bus virtio-bus
name "vhost-user-blk-pci", bus PCI
name "vhost-user-scsi", bus virtio-bus
name "vhost-user-scsi-pci", bus PCI
name "virtio-blk-device", bus virtio-bus
name "virtio-blk-pci", bus PCI, alias "virtio-blk"
name "virtio-scsi-device", bus virtio-bus
name "virtio-scsi-pci", bus PCI, alias "virtio-scsi"
.... lots more output ...

I figured I’d try to build with virtfs included first – it turns out there’s a similar-looking feature (that I previously disabled) in the configure script called virtfs. It’s based on VirtFS, which in turn is based on 9p (as in virtio-9p-pci), so I might be on the right track. The build failed immediately, but that’s good, since I was expecting some change:

~/aports/main/qemu/src/qemu-2.12.0 # ./configure --target-list=x86_64-softmmu --enable-debug --cpu=x86_64 --static --disable-gtk --disable-user --disable-opengl --enable-virtfs

ERROR: VirtFS requires libcap devel and libattr devel

And now it was time to install the libcap and libattr development headers – luckily I accomplished this by simply running apk add libcap-dev! The build after that went smoothly, and when I next ran the device listing command I got the following output:

      Storage devices:
      name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter"
      name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter"
      name "floppy", bus floppy-bus, desc "virtual floppy drive"
      name "ich9-ahci", bus PCI, alias "ahci"
      name "ide-cd", bus IDE, desc "virtual IDE CD-ROM"
      name "ide-drive", bus IDE, desc "virtual IDE disk or CD-ROM (legacy)"
      name "ide-hd", bus IDE, desc "virtual IDE disk"
      name "isa-fdc", bus ISA
      name "isa-ide", bus ISA
      name "lsi53c810", bus PCI
      name "lsi53c895a", bus PCI, alias "lsi"
      name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078"
      name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108"
      name "nvme", bus PCI, desc "Non-Volatile Memory Express"
      name "piix3-ide", bus PCI
      name "piix3-ide-xen", bus PCI
      name "piix4-ide", bus PCI
      name "pvscsi", bus PCI
      name "scsi-block", bus SCSI, desc "SCSI block device passthrough"
      name "scsi-cd", bus SCSI, desc "virtual SCSI CD-ROM"
      name "scsi-disk", bus SCSI, desc "virtual SCSI disk or CD-ROM (legacy)"
      name "scsi-generic", bus SCSI, desc "pass through generic scsi device (/dev/sg*)"
      name "scsi-hd", bus SCSI, desc "virtual SCSI disk"
      name "sdhci-pci", bus PCI
      name "usb-bot", bus usb-bus
      name "usb-mtp", bus usb-bus, desc "USB Media Transfer Protocol device"
      name "usb-storage", bus usb-bus
      name "usb-uas", bus usb-bus
      name "vhost-scsi", bus virtio-bus
      name "vhost-scsi-pci", bus PCI
      name "vhost-user-blk", bus virtio-bus
      name "vhost-user-blk-pci", bus PCI
      name "vhost-user-scsi", bus virtio-bus
      name "vhost-user-scsi-pci", bus PCI
      name "virtio-9p-device", bus virtio-bus
      name "virtio-9p-pci", bus PCI, alias "virtio-9p"
      name "virtio-blk-device", bus virtio-bus
      name "virtio-blk-pci", bus PCI, alias "virtio-blk"
      name "virtio-scsi-device", bus virtio-bus
      name "virtio-scsi-pci", bus PCI, alias "virtio-scsi"

This is awesome, because now I can remove some hacks – no more string munging to cut out the -device flag pointing at virtio-9p-pci.

INVESTIGATION: Rebuilding the image, kernel and initrd with kata-containers/osbuilder

Around this point I figured out that clearcontainers/osbuilder was actually no longer the builder to be used for kata-containers – huge thanks to @grahamwhaley for letting me know and pointing me in the right direction. I switched over to kata-containers/osbuilder and found everything pretty easy to build, although the commands only ran properly once. When I tried to run them again I got an error noting that /sbin/init wasn’t in the rootfs for some reason. I’m not sure what that’s about, but once is good enough for me!

kata-containers/osbuilder didn’t (at the time) provide a kernel, so I had to use the kernel I got from the old clear-containers/osbuilder (I found a github issue about this on kata-containers/osbuilder). In the meantime, I figured out that to re-run the tools you have to delete the centos_rootfs directory first.

OK, so at this point I generated all the new image/kernel/initrd pieces and tried again, but got another error about the kataShared device. My working understanding is that this is the device that lets kata-agent share information with the host, which is obviously necessary for Kubernetes to communicate with the pods it spawns (that kata-runtime launches/controls). The error I saw was:

ERROR:
May 30 02:54:46 localhost containerd[2485]: time="2018-05-30T02:54:46Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:vm-shell,Uid:f992bbf3-63b3-11e8-9ee1-8c89a517d15e,Namespace:default,Attempt:0,} failed, error" error="failed to start sandbox container: failed to create containerd task: OCI runtime create failed: rpc error: code = Internal desc = Could not mount kataShared to /run/kata-containers/shared/containers/: no such file or directory: unknown"

The fix for this was that I was doing the OS building in the wrong order – you’re meant to generate the rootfs first and then generate the image from it (which is obvious in retrospect).
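In other words, the rootfs (with its /sbin/init) has to exist before image building starts. A small guard like this would have caught both failure modes earlier – the `build_image` helper is my own sketch, and the commented image_builder.sh path is how I recall kata-containers/osbuilder laying things out:

```shell
# Refuse to build the image until the rootfs exists and has an init,
# which covers both the ordering mistake and the missing /sbin/init.
build_image() {
  local rootfs=$1
  if [ ! -e "$rootfs/sbin/init" ]; then
    echo "error: no /sbin/init under $rootfs -- generate the rootfs first" >&2
    return 1
  fi
  echo "rootfs looks sane; ok to run image_builder.sh against $rootfs"
  # sudo ./image-builder/image_builder.sh "$rootfs"
}
```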

DEBUG: Why is Kubernetes pod dying as soon as it’s created?

After all this, the pods that were started were now dying as soon as they were created. This was a step forward in that I got a different error, but obviously pods being DOA was less than ideal. Here’s the output I saw:

May 30 03:13:57 localhost kata-runtime[28231]: time="2018-05-30T03:13:57.758452267Z" level=info msg="VM started" arch=amd64 name=kata-runtime pid=28231 sandbox-id=bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d source=virtcontainers subsystem=sandbox
May 30 03:13:57 localhost kata-runtime[28231]: time="2018-05-30T03:13:57.758808113Z" level=info msg="proxy started" arch=amd64 name=kata-runtime pid=28231 proxy-pid=28266 proxy-url="unix:///run/vc/sbs/bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d/proxy.sock" sandbox-id=bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d source=virtcontainers subsystem=kata_agent
May 30 03:13:57 localhost kata-runtime[28231]: time="2018-05-30T03:13:57.758898027Z" level=warning msg="unsupported address" address="fe80::94:22ff:fe6f:e50a/64" arch=amd64 name=kata-runtime pid=28231 source=virtcontainers subsystem=kata_agent unsupported-address-type=ipv6
May 30 03:13:57 localhost kata-runtime[28231]: time="2018-05-30T03:13:57.758997876Z" level=warning msg="unsupported route" arch=amd64 destination="fe80::/64" name=kata-runtime pid=28231 source=virtcontainers subsystem=kata_agent unsupported-route-type=ipv6
May 30 03:13:58 localhost containerd[2485]: time="2018-05-30T03:13:58Z" level=info msg="shim reaped" id=bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d
May 30 03:13:58 localhost containerd[2485]: 2018-05-30 03:13:58.941 [INFO][28300] utils.go 379: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d CNI_NETNS=/var/run/netns/cni-0c73f075-4009-af02-c329-e172f17d30ee CNI_ARGS=IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=vm-shell;K8S_POD_INFRA_CONTAINER_ID=bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d;IgnoreUnknown=1 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin INVOCATION_ID=dee1797538c244ccb7107bf2ffe43b7e JOURNAL_STREAM=9:4414399 DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
May 30 03:13:58 localhost containerd[2485]: 2018-05-30 03:13:58.951 [INFO][28300] calico.go 431: Extracted identifiers ContainerID="bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d" Node="localhost" Orchestrator="k8s" WorkloadEndpoint="localhost-k8s-vm--shell-eth0"
May 30 03:13:58 localhost containerd[2485]: 2018-05-30 03:13:58.985 [WARNING][28300] workloadendpoint.go 72: Operation Delete is not supported on WorkloadEndpoint type
May 30 03:13:58 localhost containerd[2485]: 2018-05-30 03:13:58.985 [INFO][28300] k8s.go 361: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"localhost-k8s-vm--shell-eth0", GenerateName:"", Namespace:"default", SelfLink:"", UID:"7ba861c2-63b7-11e8-9ee1-8c89a517d15e", ResourceVersion:"118446", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63663246833, loc:(*time.Location)(0x1ec5320)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"projectcalico.org/namespace":"default", "projectcalico.org/orchestrator":"k8s"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"localhost", ContainerID:"", Pod:"vm-shell", Endpoint:"eth0", IPNetworks:[]string{"10.244.0.10/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.default"}, InterfaceName:"cali4a530075471", MAC:"", Ports:[]v3.EndpointPort(nil)}}
May 30 03:13:58 localhost containerd[2485]: Calico CNI releasing IP address
May 30 03:13:58 localhost containerd[2485]: 2018-05-30 03:13:58.985 [INFO][28300] utils.go 149: Using a dummy podCidr to release the IP ContainerID="bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d" podCidr="0.0.0.0/0"
May 30 03:13:58 localhost containerd[2485]: Calico CNI deleting device in netns /var/run/netns/cni-0c73f075-4009-af02-c329-e172f17d30ee
May 30 03:13:58 localhost containerd[2485]: time="2018-05-30T03:13:58Z" level=error msg="Failed to destroy network for sandbox "bc375a63edcd3dc5dee09b618455bb47bf0ae65d313e1cece50aeffa7d80325d"" error="failed to get IP addresses for "eth0": <nil>"
May 30 03:13:59 localhost containerd[2485]: time="2018-05-30T03:13:58Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:vm-shell,Uid:7ba861c2-63b7-11e8-9ee1-8c89a517d15e,Namespace:default,Attempt:0,} failed, error" error="failed to start sandbox container: failed to create containerd task: OCI runtime create failed: rpc error: code = Internal desc = Could not mount kataShared to /run/kata-containers/shared/containers/: no such file or directory: unknown"

I didn’t get this error if I used the version of the rootfs/image where the agent is what runs as init – but then the pod got stuck forever in the Creating state and never reached Running. I figured I should check kubelet’s logs to see if there was anything there, and I saw this telling line of output:

May 30 04:09:06 localhost kubelet[2308]: I0530 04:09:06.200582    2308 kubelet_node_status.go:811] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-05-30 04:09:06.200546887 +0000 UTC m=+820.245566857 LastTransitionTime:2018-05-30 04:09:06.200546887 +0000 UTC m=+820.245566857 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m11.681064631s ago; threshold is 3m0s}

This is exactly where the node becoming NotReady (the PLEG error) was taking hold – the Pod Lifecycle Event Generator still loses track of the pods that kata-runtime starts, despite everything starting up and seeming fine.

Step 14: Cliffhanger

Unfortunately, the previous step is as far as I got. After all the work – getting containerd to call kata-runtime to call qemu-system-x86_64, all the way down to getting a kubernetes pod actually running – I stopped trying to fix things once I hit the PLEG errors, which seemed to be taking out the whole node.

Weirdly enough, all external indicators pointed to the pod running, qemu running, and everything being in order, but Kubernetes just couldn’t manage it (pun intended). I think with a little more work I could have gotten it completely working, but I’ll leave that to future explorers – or maybe myself, if I ever come back to Container Linux (or maybe Flatcar Linux?) for a third time.

Wrapup

Thanks for reading along – this post has been a doozy to write, and it’s been a long time coming (I left it on ice for a while). Hopefully, if you’re out there experimenting with these technologies on top of Container Linux, you’ll find this post (and the F/OSS repos included within) useful!

This is the last post in the series but certainly not my last time experimenting with untrusted container runtimes! I switched off of Container Linux (to Ubuntu server) because I wasn’t down to experience this much difficulty, so the next time I approach this I expect to glide through everything super easily! Maybe I can even use an Operator like kubevirt (it’s got some nice documentation as well). Either way, the sky is the limit (it feels like) going back to the mainstream and running my Kubernetes cluster on Ubuntu server. This is probably the last post I’ll write about Container Linux for a long time, but it was fun while it lasted, and I’m glad I did this since I learned a lot along the way.

Did you find this read beneficial? Send me questions/comments/clarifications.
Want my expertise on your team/project? Send me interesting opportunities!