Upgrading Cert Manager From 0.4.0 to 0.16


tl;dr - I go through upgrading cert-manager (the successor to kube-lego) from version 0.4.0 to 0.9.0 (due to deprecations of cert-manager 0.8.1 and lower) and then on to 0.16.0. After upgrading, well-known issues with the upgrade from v0.15 to v0.16 made me downgrade to v0.15.

Background

Let’s Encrypt is legitimately one of the best things to happen to the internet in the last decade. For those who like to build distributed systems or over-invest in building platforms to deploy only a handful of apps, Kubernetes has changed the ecosystem (there are other container orchestrators, but Kubernetes is best-in-class as of the writing of this post). Kubernetes Ingress resources represent access from the outside world to your application (mediated by an Ingress controller, for example nginx or traefik), and adding TLS (HTTPS) support is one of the first things to do to get your app ready for production. Jetstack saw the need for HTTPS certs on dynamically allocated k8s resources and filled it relatively early on, with a project called cert-manager. cert-manager (the successor to kube-lego) has been in my cluster almost since inception and has been a really crucial piece of it, one of the things that made Kubernetes more substance than hype.

Before k8s (+/- cert-manager), I wrote systemd timers to manage running certbot, and provisioned machines with ansible – with cert-manager and k8s this is handled by the platform. It wasn’t necessarily “hard” once you had the unit file(s) down, but I greatly enjoy just writing my Deployment, Service and Ingress resources and not worrying about TLS at all.

Upgrading

A while back I installed v0.4.0 (quay.io/jetstack/cert-manager-controller:v0.4.0) and haven’t touched it since, so naturally I needed to start poring over the releases that came after. Here’s what the original Deployment looks like (it’s not even a DaemonSet!):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager
  namespace: kube-system
  labels:
    app: cert-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cert-manager
  template:
    metadata:
      labels:
        app: cert-manager
    spec:
      serviceAccountName: cluster-cert-manager
      containers:
      - name: mgr
        image: quay.io/jetstack/cert-manager-controller:v0.4.0
        imagePullPolicy: IfNotPresent
        args:
          - --cluster-resource-namespace=$(POD_NAMESPACE)
          - --leader-election-namespace=$(POD_NAMESPACE)
          - --default-issuer-name=letsencrypt-prod
          - --default-issuer-kind=ClusterIssuer
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          requests:
            cpu: 10m
            memory: 32Mi

v0.4.1

Release link

This was a bugfix release, so thankfully there’s not much to see here. I updated the image tag to v0.4.1 and restarted with no errors.

After updating the Deployment, logs looked normal:

$ k logs -f cert-manager-785fdfbc8-cbdrk -n kube-system
I0815 04:05:10.388425       1 start.go:63] starting cert-manager v0.4.1 (revision ad30555d3aebaafa1524272a44ba80ffcdc01d2f)
I0815 04:05:10.391163       1 server.go:68] Listening on http://0.0.0.0:9402
I0815 04:05:10.391909       1 controller.go:111] Using the following nameservers for DNS01 checks: [10.96.0.10:53]
I0815 04:05:10.392795       1 leaderelection.go:175] attempting to acquire leader lease  kube-system/cert-manager-controller...
I0815 04:05:30.099395       1 leaderelection.go:184] successfully acquired lease kube-system/cert-manager-controller
I0815 04:05:30.099658       1 controller.go:53] Starting certificates controller
I0815 04:05:30.099725       1 controller.go:53] Starting clusterissuers controller
I0815 04:05:30.099799       1 controller.go:53] Starting ingress-shim controller
I0815 04:05:30.100325       1 controller.go:53] Starting issuers controller
I0815 04:05:30.199952       1 controller.go:138] clusterissuers controller: syncing item 'letsencrypt-prod'
I0815 04:05:30.199987       1 controller.go:138] clusterissuers controller: syncing item 'letsencrypt-staging'
... more logs ...

v0.5.0

Release link

This was a somewhat bigger release with a bunch of additions, most notably for me:

  • New fields for Certificate resource (which makes cert-manager better at managing various situations where certificates are useful)
  • Support for acme-dns as a provider which is awesome. AcmeDNS is a fantastic project and I have been seriously considering running it locally in my own cluster for do-it-yourself DNS, or writing some sort of light controller around it.
  • Support RFC2136-compliant nameservers – most famous of which is the industrial-grade BIND/named

None of these changes look particularly dangerous for me – Certificate might pose some risk if the old versions that I undoubtedly am using have translation issues. It wasn’t hard to get the list of certificates I had:

$ k get certificate --all-namespaces
NAMESPACE        NAME                        AGE
vadosware-blog   vadosware-blog-blog-tls     67d
... other entries ...

And you can get a look at one of them with kubectl edit in your $EDITOR of choice (and just don’t change anything):

$ k edit certificate vadosware-blog-blog-tls -n vadosware-blog

After updating the Deployment, logs had some errors:

I0815 04:20:12.749253       1 controller.go:140] clusterissuers controller: syncing item 'letsencrypt-prod'
I0815 04:20:12.749492       1 logger.go:88] Calling GetAccount
I0815 04:20:12.887645       1 controller.go:140] clusterissuers controller: syncing item 'letsencrypt-staging'
I0815 04:20:12.887929       1 logger.go:88] Calling GetAccount
I0815 04:20:13.335734       1 sync.go:71] Error initializing issuer: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
E0815 04:20:13.335791       1 controller.go:149] clusterissuers controller: Re-queuing item "letsencrypt-prod" due to error processing: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
I0815 04:20:13.475878       1 sync.go:71] Error initializing issuer: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
E0815 04:20:13.475933       1 controller.go:149] clusterissuers controller: Re-queuing item "letsencrypt-staging" due to error processing: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
I0815 04:20:16.956512       1 controller.go:171] certificates controller: syncing item 'gaisma-blog/gaisma-blog-blog-tls'
I0815 04:20:16.956559       1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-blog-fathom-tls'
I0815 04:20:16.956573       1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956594       1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.956621       1 controller.go:180] certificates controller: Re-queuing item "gaisma-blog/gaisma-blog-blog-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956643       1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-blog-blog-tls'
E0815 04:20:16.956681       1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-blog-fathom-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956693       1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956713       1 controller.go:171] certificates controller: syncing item 'monitoring/statping-tls'
E0815 04:20:16.956751       1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-blog-blog-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956684       1 controller.go:171] certificates controller: syncing item 'totejo/totejo-fathom-tls'
I0815 04:20:16.956760       1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956815       1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.956867       1 controller.go:180] certificates controller: Re-queuing item "monitoring/statping-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956817       1 controller.go:171] certificates controller: syncing item 'totejo/next-totejo-tls'
E0815 04:20:16.956900       1 controller.go:180] certificates controller: Re-queuing item "totejo/totejo-fathom-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956918       1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956926       1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-mailtrain-tls'
E0815 04:20:16.956954       1 controller.go:180] certificates controller: Re-queuing item "totejo/next-totejo-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956967       1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956994       1 controller.go:171] certificates controller: syncing item 'rrc/rrc-tls'
E0815 04:20:16.957002       1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-mailtrain-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.957027       1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.957066       1 controller.go:180] certificates controller: Re-queuing item "rrc/rrc-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.957027       1 controller.go:171] certificates controller: syncing item 'totejo/totejo-tls'
I0815 04:20:16.957115       1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.957147       1 controller.go:180] certificates controller: Re-queuing item "totejo/totejo-tls" due to error processing: Issuer letsencrypt-prod not ready

And it turns out this is an issue because Let’s Encrypt is blocking versions lower than 0.8.0 (well other than 0.4.0 and 0.4.1, I guess). There are a few more areas where this is documented/discussed:

Looks like I’ll need to revert this to 0.4.1 to at least have a working setup, then jump straight to at least 0.8.1 (see the GitHub issue, v0.8.0 and v0.8.1 send excessive traffic). I did look through the releases in between, and here are a few highlights that I was personally interested in or that could have posed an issue:

  • v0.6.0-alpha.0 Add CRD for Order and Challenge, big refactoring of how ACME certs are handled, mostly internal change
  • v0.6.0-alpha.0 ECDSA key support (here are those changes that came about earlier – keyAlgorithm was one of the fields in v0.5.0)
  • v0.6.0-alpha.0 CNAME following while presenting DNS01 challenges broke for Route53, and got reverted, cnameStrategy was introduced to fix it
  • v0.6.0-alpha.0 DigitalOcean added as a DNS provider
  • v0.6.0-alpha.1 --namespace option to limit scope to a single namespace (so now you can run one cert-manager per namespace!)
  • v0.6.0-alpha.1 Add cert chains to the secret for the ca issuer
  • v0.6.0-beta.0 Prometheus metrics support for ACME HTTP client
  • v0.6.0 Better ratelimiting
  • v0.6.0 Validating webhook component by default – whenever API gets a resource misconfigurations will be checked ahead of time
  • v0.6.1 x509 cert duration to 1y for webhook component certs
  • v0.6.1 Documentation overhaul, more content
  • v0.7.0 Helm chart changes (I don’t use helm :)
  • v0.7.0 Venafi Issuer was added (Venafi would later go on to acquire Jetstack in 2020)
  • v0.7.0 cainjector controller which adds CA bundles to (Validating|Mutating)WebhookConfiguration and APIService
  • v0.7.0 Replace ca-sync CronJob with a controller to manage it
  • v0.7.0 Experimental support for ARM
  • v0.7.0 Easier self check debugging via kubectl describe (more output to the events list)
  • v0.7.2 fix update loop (that sounds bad), fixes to the cainjector controller
  • v0.8.0-alpha.0 issuer.spec.acme.solvers field replaces certificates.spec.acme which makes all cert resources portable between issuers
  • v0.8.0-alpha.0 Build under MacOS (it didn’t before, I guess)
  • v0.8.0-alpha.0 Add webhook based DNS01 provider (this should make it easy to support lots of DNS providers!)
  • v0.8.0-alpha.0 Add DNS01 provider conformance test suite (that warm fuzzy good engineering feel)
  • v0.8.0-alpha.0 Switched to golang’s (then new) module system
  • v0.8.0-alpha.0 Serve metrics from non-leader cert-manager instances (I only have one so…)
  • v0.8.0-alpha.0 Structured logging using logr-compliant logging (there are some interesting notes on logging philosophy worth reading in here, around logr)
  • v0.8.0-beta.0 email address is now optional in ACME issuers (some issuers don’t need it I guess, maybe they send mail by carrier pigeon instead)
  • v0.8.0 email address is now optional in ACME issuers (some issuers don’t need it I guess, maybe they send mail by carrier pigeon instead)
  • v0.8.0 acme field moved from Certificate to Issuer
  • v0.9.0 Introduction of CertificateRequest resource for raw x509 CSRs
  • v0.9.0 Multiple DNS zone support w/ some specificity rules
  • v0.9.0 ACMEv2 Post-as-GET support (read more about this on the let’s encrypt forum)
  • v0.9.0 Common name length limit, anything over 63 characters will be rejected (RFC5280)
  • v0.9.0 cert-manager images are now based on Distroless
  • v0.9.0 CSRs in Order resources were DER encoded and are now PEM encoded

v0.9.0

Release link

After updating the Deployment, logs looked like the following:

$ k logs -f cert-manager-65dcf68454-gz42m -n kube-system
I0817 02:43:57.662429       1 start.go:76] cert-manager "level"=0 "msg"="starting controller"  "git-commit"="5d6f92cc" "version"="v0.9.0"
I0817 02:43:57.663691       1 controller.go:169] cert-manager/controller/build-context "level"=0 "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"]
I0817 02:43:57.663873       1 controller.go:134] cert-manager/controller "level"=0 "msg"="starting leader election"
I0817 02:43:57.664157       1 metrics.go:203] cert-manager/metrics "level"=0 "msg"="listening for connections on" "address"="0.0.0.0:9402"
I0817 02:43:57.665346       1 leaderelection.go:235] attempting to acquire leader lease  kube-system/cert-manager-controller...
I0817 02:45:22.750048       1 leaderelection.go:245] successfully acquired lease kube-system/cert-manager-controller
I0817 02:45:22.750307       1 controller.go:91] cert-manager/controller "level"=0 "msg"="not starting controller as it's disabled" "controller"="certificaterequests-issuer-ca"
I0817 02:45:22.750531       1 base_controller.go:132] cert-manager/controller/issuers "level"=0 "msg"="starting control loop"
I0817 02:45:22.852724       1 base_controller.go:132] cert-manager/controller/challenges "level"=0 "msg"="starting control loop"
I0817 02:45:22.852705       1 controller.go:91] cert-manager/controller "level"=0 "msg"="not starting controller as it's disabled" "controller"="certificates-experimental"
I0817 02:45:22.852753       1 base_controller.go:132] cert-manager/controller/orders "level"=0 "msg"="starting control loop"
I0817 02:45:22.852888       1 base_controller.go:132] cert-manager/controller/certificates "level"=0 "msg"="starting control loop"
I0817 02:45:22.852955       1 base_controller.go:132] cert-manager/controller/ingress-shim "level"=0 "msg"="starting control loop"
I0817 02:45:22.852982       1 base_controller.go:132] cert-manager/controller/clusterissuers "level"=0 "msg"="starting control loop"

E0817 02:45:22.856712       1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Challenge: challenges.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "challenges" in API group "certmanager.k8s.io" at the cluster scope
E0817 02:45:22.857090       1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Order: orders.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "orders" in API group "certmanager.k8s.io" at the cluster scope

I0817 02:45:22.953160       1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="vadosware-blog/fathom"
I0817 02:45:22.953196       1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="totejo/fathom"
I0817 02:45:22.953156       1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="monitoring/statping"
I0817 02:45:22.953234       1 base_controller.go:187] cert-manager/controller/clusterissuers "level"=0 "msg"="syncing item" "key"="letsencrypt-staging"
I0817 02:45:22.953240       1 base_controller.go:187] cert-manager/controller/clusterissuers "level"=0 "msg"="syncing item" "key"="letsencrypt-prod"
I0817 02:45:22.953178       1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="rrc/rrc"
I0817 02:45:22.953171       1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="gaisma-blog/blog"

I0817 02:45:22.953323       1 sync.go:205] cert-manager/controller/ingress-shim "level"=0 "msg"="certificate already exists for ingress resource, ensuring it is up to date" "related_resource_kind"="Certificate" "related_resource_name"="vadosware-blog-fathom-tls" "related_resource_namespace"="vadosware-blog" "resource_kind"="Ingress" "resource_name"="fathom" "resource_namespace"="vadosware-blog"


I0817 02:45:24.757057       1 setup.go:228] cert-manager/controller/clusterissuers "level"=0 "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-prod" "related_resource_namespace"="kube-system" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-prod" "resource_namespace"=""
I0817 02:45:24.757599       1 base_controller.go:193] cert-manager/controller/clusterissuers "level"=0 "msg"="finished processing work item" "key"="letsencrypt-staging"


E0817 02:45:28.866502       1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Challenge: challenges.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "challenges" in API group "certmanager.k8s.io" at the cluster scope
E0817 02:45:28.867138       1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Order: orders.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "orders" in API group "certmanager.k8s.io" at the cluster scope

So basically, setup went in this order:

  • Attempt and succeed at getting leader lease
  • Start some controllers for various CRDs
  • Start verifying existing certs
  • (problem) Fail to list Challenge and Order CRDs a lot

The problems that we see in the log are caused by me not updating the corresponding resource configurations I have for the CRDs (Challenge and Order) and corresponding RBAC rules. To fix this, I went to the v0.10.0 tag on GitHub and navigated to the deploy/manifests/00-crds.yaml file, which has become very long (compared to what I used to have). One cool thing is that cert-manager generates the deployment files for you instead of requiring you to use helm.

For fixing the RBAC, I basically translated the rules that were listed in templates/rbac.yaml to fit my much smaller (and less segmented) ClusterRoles.

After updating the CRDs and RBAC the errors around polling for Challenge and Order were resolved.
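
To make that concrete, here is a minimal sketch of the kind of rules that clear those "forbidden" list errors. It is illustrative rather than the exact manifest used here: the ClusterRole and ClusterRoleBinding names are assumptions, the verb list is deliberately broad, and the upstream templates/rbac.yaml for your version remains the authoritative source. The ServiceAccount name and namespace are taken from the error messages above.

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-cert-manager        # hypothetical name
rules:
  # Resources the older controller already needed
  - apiGroups: ["certmanager.k8s.io"]
    resources: ["certificates", "issuers", "clusterissuers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Order and Challenge CRDs, which the v0.9.0 controller lists at cluster scope
  - apiGroups: ["certmanager.k8s.io"]
    resources: ["orders", "challenges"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-cert-manager        # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-cert-manager
subjects:
  - kind: ServiceAccount
    name: cluster-cert-manager      # from the log: system:serviceaccount:kube-system:cluster-cert-manager
    namespace: kube-system
```

The controller also touches core resources (Secrets, Events, Ingresses and so on), which are left out of this sketch since they were not part of the errors above.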

v0.10.0

Release link

One thing that was nice about this release is that it has nothing in the Action Required section of the release notes, so it should be a pretty easy upgrade.

After updating the Deployment, logs looked like the following:

I0817 03:09:54.935176       1 start.go:76] cert-manager "level"=0 "msg"="starting controller"  "git-commit"="f1d591a53" "version"="v0.10.0"
I0817 03:09:54.936390       1 controller.go:184] cert-manager/controller/build-context "level"=0 "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"]
I0817 03:09:54.936570       1 controller.go:149] cert-manager/controller "level"=0 "msg"="starting leader election"
I0817 03:09:54.936719       1 metrics.go:201] cert-manager/metrics "level"=0 "msg"="listening for connections on" "address"="0.0.0.0:9402"
I0817 03:09:54.937525       1 leaderelection.go:235] attempting to acquire leader lease  kube-system/cert-manager-controller...

v0.11.0

Release link

This release was much bigger – lots of changes and features here which make the migration harder, most notably:

  • Rename of API group to cert-manager.io & bumping API version from v1alpha1 to v1alpha2
  • Removing some deprecated fields (issuer.spec.http01 and issuer.spec.dns01)
  • Use of CertificateRequest internally for issuance

The release notes call out this need for manual intervention as well:

You will also need to manually update all your backed up cert-manager resource types to use the new apiVersion setting.

There are at least the following places that need to be updated on my side:

  • Move cert-manager from kube-system to its own namespace
  • CRDs (use newer version of compiled 00-crds.yaml)
  • RBAC (update the apiVersions)
  • Resource versions of existing Certificate, Issuer, ClusterIssuer, etc to have the right apiVersion
  • Update annotations in every project where I use Ingress annotations for the new domain cert-manager.io (luckily for me I didn’t do this at all)
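
As a rough illustration of the apiVersion change in the list above, here is what a migrated Certificate might look like. The resource name and namespace come from the certificate listing earlier, but the rest of the spec is an assumption made for the example (in particular the secretName and the placeholder hostname), not a dump of the real resource.

```yaml
---
# Before the v0.11.0 upgrade this would have been:
#   apiVersion: certmanager.k8s.io/v1alpha1
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: vadosware-blog-blog-tls
  namespace: vadosware-blog
spec:
  secretName: vadosware-blog-blog-tls   # assumption: Secret named after the Certificate
  dnsNames:
    - blog.example.com                  # placeholder hostname
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

Only the apiVersion line changes for a simple resource like this. Anything that still carried ACME configuration on the Certificate itself would also need that configuration moved to the issuer, as noted in the v0.8.0 highlights.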

This must have been super painful to upgrade to for people with bigger clusters. One thing I’m wondering is if I can actually just upgrade cert-manager and let it just… re-do the process for every one of my sites… It seems that as long as the CRDs and RBAC stuff is settled, the old resource versions would just not be recognized and new ones would be made (since the controller is sitting there spinning forever).

After actually trying it, the following issues came up:

  • acme.cert-manager.io is now where Challenge and Order CRDs are stored, so RBAC must be updated accordingly
  • "issuers/status", "clusterissuers/status" and other similar resources are now strictly necessary
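
A sketch of how those two points could translate into RBAC rules follows. Again this is an approximation under assumptions: the ClusterRole and binding names are hypothetical, the verbs are broader than strictly necessary, and the upstream rbac templates list further resources (finalizers, for example) that are omitted here. The ServiceAccount namespace reflects the move out of kube-system.

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-cert-manager          # hypothetical name
rules:
  # Core cert-manager resources now live in the cert-manager.io group,
  # and the /status subresources need to be granted explicitly
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
      - certificates/status
      - certificaterequests
      - certificaterequests/status
      - issuers
      - issuers/status
      - clusterissuers
      - clusterissuers/status
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Order and Challenge moved to their own acme.cert-manager.io group
  - apiGroups: ["acme.cert-manager.io"]
    resources:
      - orders
      - orders/status
      - challenges
      - challenges/status
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-cert-manager          # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-cert-manager
subjects:
  - kind: ServiceAccount
    name: cluster-cert-manager
    namespace: cert-manager           # the new dedicated namespace
```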

After mucking with RBAC a lot more, the errors were all resolved and things were ready to go – Orders were being processed and the logs looked like this:

I0817 03:38:26.203230       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203306       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-ca "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203354       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-venafi "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203366       1 controller.go:129] cert-manager/controller/certificates "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls"
I0817 03:38:26.203392       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-selfsigned "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203402       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-vault "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203368       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203480       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-ca "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203494       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-venafi "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203505       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-vault "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203520       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-selfsigned "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203684       1 acme.go:178] cert-manager/controller/certificaterequests-issuer-acme/sign "level"=0 "msg"="acme Order resource is not in a ready state, waiting..." "related_resource_kind"="Order" "related_resource_name"="totejo-fathom-tls-1199857171-851845284" "related_resource_namespace"="totejo" "resource_kind"="CertificateRequest" "resource_name"="totejo-fathom-tls-1199857171" "resource_namespace"="totejo"
I0817 03:38:26.203757       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.204182       1 sync.go:379] cert-manager/controller/certificates "level"=0 "msg"="validating existing CSR data" "related_resource_kind"="CertificateRequest" "related_resource_name"="totejo-fathom-tls-1199857171" "related_resource_namespace"="totejo" "resource_kind"="Certificate" "resource_name"="totejo-fathom-tls" "resource_namespace"="totejo"
I0817 03:38:26.204264       1 sync.go:479] cert-manager/controller/certificates "level"=0 "msg"="CertificateRequest is not in a final state, waiting until CertificateRequest is complete" "related_resource_kind"="CertificateRequest" "related_resource_name"="totejo-fathom-tls-1199857171" "related_resource_namespace"="totejo" "resource_kind"="Certificate" "resource_name"="totejo-fathom-tls" "resource_namespace"="totejo" "state"="Pending"
I0817 03:38:26.204565       1 conditions.go:155] Setting lastTransitionTime for Certificate "totejo-fathom-tls" condition "Ready" to 2020-08-17 03:38:26.204561002 +0000 UTC m=+575.115294188
E0817 03:38:26.205368       1 controller.go:131] cert-manager/controller/certificates "msg"="re-queuing item  due to error processing" "error"="certificates.cert-manager.io \"totejo-fathom-tls\" is forbidden: User \"system:serviceaccount:cert-manager:cluster-cert-manager\" cannot update resource \"certificates/status\" in API group \"cert-manager.io\" in the namespace \"totejo\"" "key"="totejo/totejo-fathom-tls"
I0817 03:38:26.347853       1 sync.go:56] cert-manager/controller/orders "level"=0 "msg"="updating Order resource status" "resource_kind"="Order" "resource_name"="totejo-fathom-tls-1199857171-851845284" "resource_namespace"="totejo"
I0817 03:38:26.352787       1 controller.go:135] cert-manager/controller/orders "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.352848       1 controller.go:129] cert-manager/controller/orders "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.352854       1 controller.go:129] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
E0817 03:38:26.353066       1 sync.go:103] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="totejo-fathom-tls-1199857171-851845284" "resource_namespace"="totejo"
I0817 03:38:26.353125       1 controller.go:135] cert-manager/controller/orders "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.353134       1 acme.go:178] cert-manager/controller/certificaterequests-issuer-acme/sign "level"=0 "msg"="acme Order resource is not in a ready state, waiting..." "related_resource_kind"="Order" "related_resource_name"="totejo-fathom-tls-1199857171-851845284" "related_resource_namespace"="totejo" "resource_kind"="CertificateRequest" "resource_name"="totejo-fathom-tls-1199857171" "resource_namespace"="totejo"
I0817 03:38:26.353246       1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"

As all my certs are currently

v0.12.0

Release link

After the pain of the v0.11.0 release I think they stepped off the gas a little for this release, so it was much much easier, with a usability focused update as they mentioned in the release notes:

The rest of the notable features below are all focused on usability, and as such, the upgrade process from v0.11 should be nice and easy :holiday:.

After updating the Deployment, logs were wonderfully empty of errors, except for the following:

E0817 03:44:41.801154       1 sync.go:111] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="next-totejo-tls-2366429395-1688760916" "resource_namespace"="totejo"

I’m going to ignore this error for now – not having any configured challenge solvers is something I’ll work through when testing that the setup still works for new certificates.

v0.13.0

Release link

The v0.13.0 release also didn’t require any special upgrade steps, though it did add some features. This also promises to be a relatively “free” upgrade.

After updating the Deployment, logs were wonderfully empty of errors (except the one we know about regarding challenge solvers).

v0.14.0

Release link

The v0.14.0 release was another relatively large release with some changes required for deployment. I updated the CRDs and updated the deployment and was greeted with a particular new log line prior to leader election:

W0818 01:33:54.036836       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0818 01:33:54.037933       1 controller.go:167] cert-manager/controller/build-context "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"]

It wasn’t really mentioned in the release notes (and a quick search on the releases page didn’t show any results from the alphas), but it looks like cert-manager still works all the same (the errors from before are still errors):

I0818 01:34:56.161376       1 controller.go:138] cert-manager/controller/orders "msg"="syncing item" "key"="rrc/rrc-tls-3757528704-2455984007"
E0818 01:34:56.161518       1 sync.go:111] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="vadosware-blog-blog-tls-2780654544-3641557672" "resource_namespace"="vadosware-blog"

v0.15.0

Release link

The v0.15.0 release is a somewhat large release with new experimental features, but I’m not intending on turning any of them on, and a lot of the other features are for people using Red Hat or other special features so this one should also be relatively easy to upgrade to. They also added installing the CRDs to the helm chart but of course I don’t use helm, and I install the CRDs myself so that doesn’t affect me.

There’s a bit of a wrinkle here though – cert-manager stopped including the built-out CRDs in their manifests directory… So now I either have to build the CRDs with Bazel or commit to Helm. I’m going to do neither and just see if upgrading still works… Not sure why installing the CRDs isn’t part of the operator… My only other option is to just stay on v0.14.0 for a while and ask about whether they have plans to put the compiled CRDs in the releases again (or make sure they’re installed in the operator).

Despite basically ignoring the release notes, it looks like the v0.15.0 installation has gone well – the same errors are present and there wasn’t anything catastrophic. The big worry I’ve introduced here is CRD drift: if I don’t start compiling their templates (with Bazel I guess, since I’m not touching Helm with an 8 ft pole), the CRDs the controller expects to see and the ones it actually sees are going to diverge, which could lead to crazy problems, never mind what happens if they add any new CRDs.

I’m going to ignore these issues for now and wait until I test out the end to end flow to make sure that everything still works to resolve them.

v0.16.0

Release link

At this point I’m barely reading the release notes – the v0.16.0 release doesn’t seem to have any huge features in it – new API version (but not required yet), there’s an issue with old versions of kubectl/helm, and more information being surfaced but nothing world shattering for me it seems.

The logs look good outside of the usual errors, so let’s take some time and fix that error.

Fixing the missing challenger error

The errors we’ve been seeing since a while back look like this:

E0818 02:18:51.210020       1 sync.go:108] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="vadosware-blog-fathom-tls-835763303-941936022" "resource_namespace"="vadosware-blog"
I0818 02:18:51.210072       1 controller.go:162] cert-manager/controller/orders "msg"="finished processing work item" "key"="vadosware-blog/vadosware-blog-fathom-tls-835763303-941936022"

They’re not fatal errors, but they clearly point at a big gap in what the controller needs in order to work – there are no solvers to resolve the ACME challenges, and that’s basically the most important part! Luckily for me there’s a page describing how to configure the solvers which I ignored until now, so let’s update the issuers I have set. Here’s what the YAML looks like for the Let’s Encrypt production issuer:

---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: "<email address>"
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik

I use traefik for my ingress, and its ingress class name is traefik, so it’s pretty easy to configure.
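
For comparison, a staging variant of the issuer might look roughly like the sketch below. The resource name letsencrypt-staging and the matching secret name are my assumptions (neither is shown in this post); the server URL is Let’s Encrypt’s public staging ACME endpoint, and everything else mirrors the production issuer above.

```yaml
---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  # assumed name, not taken from the post
  name: letsencrypt-staging
spec:
  acme:
    # Let's Encrypt staging endpoint: issues untrusted test certs
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: "<email address>"
    privateKeySecretRef:
      # assumed name of the secret that stores the ACME account key
      name: letsencrypt-staging
    solvers:
      # a solver with no selector applies to every domain
      - http01:
          ingress:
            class: traefik
```

Pointing test Ingresses at a staging issuer first is a cheap way to exercise the whole flow without touching the production rate limits.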

After updating my issuers, the logs exploded with content (I was tailing them via kubectl logs .... -f) and it looks like everything is fine now:

I0818 02:28:40.917802       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="mailtrain.vadosware.io" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-6fqc9" "related_resource_namespace"="vadosware-blog" "resource_kind"="Challenge" "resource_name"="vadosware-mailtrain-tls-653075268-969403002-942001616" "resource_namespace"="vadosware-blog" "type"="http-01"
I0818 02:28:40.917890       1 ingress.go:91] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="mailtrain.vadosware.io" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-pgfbk" "related_resource_namespace"="vadosware-blog" "resource_kind"="Challenge" "resource_name"="vadosware-mailtrain-tls-653075268-969403002-942001616" "resource_namespace"="vadosware-blog" "type"="http-01"

... lots more output ...

I0818 02:29:25.623677       1 controller.go:152] cert-manager/controller/orders "msg"="syncing item" "key"="totejo/totejo-tls-3677999627-3448051846"
I0818 02:29:25.623734       1 sync.go:99] cert-manager/controller/orders "msg"="Order has already been completed, cleaning up any owned Challenge resources" "resource_kind"="Order" "resource_name"="totejo-tls-3677999627-3448051846" "resource_namespace"="totejo"
I0818 02:29:25.624334       1 controller.go:162] cert-manager/controller/orders "msg"="finished processing work item" "key"="totejo/totejo-tls-3677999627-3448051846"

Looks like the work orders have been processed and sites are getting their certs so I think I’m all good!

Testing cert-manager still works

Updating the versions and not seeing any errors in the log is great and all, but I’ve made so many changes and cut so many corners (most prominently when doing the v0.10.0 migration) that it would be a terrible idea to not at least test the certificate flow for a new domain! So I’ll:

  • Redirect a new subdomain (ex. cert-manager-test.vadosware.io)
  • Create a new Ingress for that domain
  • Ensure that the new domain gets HTTPS support

Ensuring that the flow works for new domains is crucial – I haven’t been as diligent as I could have been, so a sanity check to make sure this system still works is a good idea.
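
The steps above boil down to an Ingress roughly like the sketch below. The Ingress name, namespace and TLS secret name match what shows up in the logs later in this post; the hostname is the example subdomain from the list, and the backend Service name and port are hypothetical placeholders. The apiVersion assumes an older cluster; newer clusters use networking.k8s.io/v1, which has a different backend shape.

```yaml
---
apiVersion: networking.k8s.io/v1beta1 # assumption: pre-v1 Ingress API
kind: Ingress
metadata:
  name: test
  namespace: vadosware-blog
  annotations:
    kubernetes.io/ingress.class: traefik
    # asks cert-manager's ingress-shim to create a Certificate using this ClusterIssuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - cert-manager-test.vadosware.io
      # cert-manager should store the issued cert in this secret
      secretName: testproject-tls
  rules:
    - host: cert-manager-test.vadosware.io
      http:
        paths:
          - path: /
            backend:
              serviceName: test-svc # hypothetical Service
              servicePort: 80       # hypothetical port
```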

After making a cert, everything seemed to go well, and the work tasks got resolved by the cert-manager controller, but the actual domain still had a self-signed cert. First place to check is the Ingress – let’s see if there is any output worth noting in the Events section:

  Normal  CreateCertificate  5m54s  cert-manager  Successfully created Certificate "testproject-tls"

Well that looks good… a Certificate supposedly was created, and I see a secret, but the secret has the wrong name:

$ k get secrets -n vadosware-blog
NAME                                     TYPE                                  DATA   AGE
default-token-dgs2j                      kubernetes.io/service-account-token   3      617d
mailtrain-mailtrain-secrets-fk729cfghk   Opaque                                1      190d
mailtrain-mariadb-secrets-ftt9hfdggc     Opaque                                3      190d
mailtrain-mariadb-secrets-mm5bc8c577     Opaque                                2      190d
subscribe-sidecar-secrets-c26tgm5bch     Opaque                                2      177d
testproject-tls-2nqnd                    Opaque                                1      9m56s
vadosware-blog-blog-tls                  kubernetes.io/tls                     3      617d
vadosware-blog-fathom-tls                kubernetes.io/tls                     3      617d
vadosware-infra-gitlab-registry          kubernetes.io/dockercfg               1      617d
vadosware-mailtrain-tls                  kubernetes.io/tls                     3      190d

Also, if I list the certificates, I don’t see the right one:

$ k get certificates -n vadosware-blog
NAME                        READY   SECRET                      AGE
vadosware-blog-blog-tls     True    vadosware-blog-blog-tls     70d
vadosware-blog-fathom-tls   True    vadosware-blog-fathom-tls   582d
vadosware-mailtrain-tls     True    vadosware-mailtrain-tls     174d

Welp, looks like something is super wrong – it’s not immediately clear why I’m getting a weirdly named secret to begin with…
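
For reference, this is roughly the Certificate I would have expected to find in that list. It’s a hand-written sketch of the v1alpha2 shape, not output from the cluster, and the hostname is an assumption (the example subdomain from earlier).

```yaml
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: testproject-tls
  namespace: vadosware-blog
spec:
  # the kubernetes.io/tls secret the Ingress refers to
  secretName: testproject-tls
  dnsNames:
    - cert-manager-test.vadosware.io # assumed test hostname
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

A healthy run should end with a Certificate like this reporting READY as True and a kubernetes.io/tls secret named testproject-tls, like the older entries in the listings above.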

After uninstalling the Ingress here are the errors I get (which make sense, there is no Certificate):

E0818 02:57:03.239917       1 controller.go:156] ingress 'vadosware-blog/test' in work queue no longer exists
E0818 02:57:03.797614       1 requestmanager_controller.go:127] cert-manager/controller/CertificateRequestManager "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797631       1 readiness_controller.go:130] cert-manager/controller/CertificateReadiness "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797663       1 issuing_controller.go:152] cert-manager/controller/CertificateIssuing "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797736       1 keymanager_controller.go:137] cert-manager/controller/CertificateKeyManager "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797643       1 trigger_controller.go:142] cert-manager/controller/CertificateTrigger "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"

Now if I install the Ingress again, these are the errors that come out:

E0818 02:58:35.044655       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"testproject-tls\": the object has been modified; please apply your changes to the latest version and try again" "key"="vadosware-blog/testproject-tls"
E0818 02:58:35.298214       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"testproject-tls\": the object has been modified; please apply your changes to the latest version and try again" "key"="vadosware-blog/testproject-tls"

So it looks like the controller is not working properly with the Certificate CRD at all… Considering it thinks it succeeded, I’m thinking I ran into the known issue with upgrading from v0.15 to v0.16. This post is long enough – I don’t think I’ll undertake upgrading my cluster and kubectl in the middle of this. So for now I’ll downgrade to v0.15.
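
The downgrade itself is mostly a matter of pinning the controller image back, something like the fragment below. This is only the relevant slice of the Deployment, not a complete manifest; the container name and the exact v0.15 patch version are assumptions.

```yaml
# Fragment of the cert-manager controller Deployment (not a full manifest)
spec:
  template:
    spec:
      containers:
        - name: cert-manager # assumed container name
          # pinned back to the v0.15 line; exact patch version is an assumption
          image: quay.io/jetstack/cert-manager-controller:v0.15.0
```

If the webhook and cainjector are deployed as well, their image tags would presumably need the same treatment so that all the components stay on one version.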

Wrapup

Well this was a grueling process, but it was made much easier by the fantastic work of Jetstack (now Venafi) and the maintainers and contributors behind cert-manager – I can continue not worrying about creating and managing TLS certs on the websites deployed to my Kubernetes cluster.