tl;dr - I go through upgrading cert-manager (the successor to kube-lego) from version 0.4.0 to 0.9.0 (due to deprecations of cert-manager 0.8.1 and lower) and then on to 0.16.0. After upgrading, well-known issues with the upgrade from v0.15 to v0.16 made me downgrade to v0.15.
Let’s Encrypt is legitimately one of the best things to happen to the internet in the last decade. For those who like to build distributed systems or over-invest in building platforms to deploy only a handful of apps, Kubernetes has changed the ecosystem (there are other container orchestrators, but Kubernetes is best-in-class as of the writing of this post). Kubernetes Ingress resources represent access from the outside world to your application (mediated by an IngressController, for example nginx or traefik), and adding TLS (HTTPS) support is one of the first things to do to get your app ready for production. Jetstack saw the need for HTTPS certs on dynamically allocated k8s resources and filled it relatively early on, with a project called cert-manager. cert-manager (the successor to kube-lego) has been in my cluster almost since inception and has been a really crucial piece of it, one of the things that made Kubernetes more substance than hype.
Before k8s (+/- cert-manager), I wrote systemd timers to manage running certbot, and provisioned machines with ansible; with cert-manager and k8s this is handled by the platform. It wasn’t necessarily “hard” once you had the unit file(s) down, but I greatly enjoy just writing my Deployment, Service and Ingress resources and not worrying about TLS at all.
A while back I installed v0.4.0 (quay.io/jetstack/cert-manager-controller:v0.4.0) and haven’t looked back since, so naturally I needed to start poring over the releases since then. Here’s what the original Deployment looks like (it’s not even a DaemonSet!):
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager
  namespace: kube-system
  labels:
    app: cert-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cert-manager
  template:
    metadata:
      labels:
        app: cert-manager
    spec:
      serviceAccountName: cluster-cert-manager
      containers:
        - name: mgr
          image: quay.io/jetstack/cert-manager-controller:v0.4.1
          imagePullPolicy: IfNotPresent
          args:
            # Keep cluster-scoped resources and leader election in the pod's own namespace
            - --cluster-resource-namespace=$(POD_NAMESPACE)
            - --leader-election-namespace=$(POD_NAMESPACE)
            # Issuer used by ingress-shim when an Ingress does not name one
            - --default-issuer-name=letsencrypt-prod
            - --default-issuer-kind=ClusterIssuer
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
v0.4.1 was a bugfix release, so thankfully there was not much to see here; I updated the image tag to 0.4.1 and restarted with no errors.
After updating the Deployment, logs looked normal:
$ k logs -f cert-manager-785fdfbc8-cbdrk -n kube-system
I0815 04:05:10.388425 1 start.go:63] starting cert-manager v0.4.1 (revision ad30555d3aebaafa1524272a44ba80ffcdc01d2f)
I0815 04:05:10.391163 1 server.go:68] Listening on http://0.0.0.0:9402
I0815 04:05:10.391909 1 controller.go:111] Using the following nameservers for DNS01 checks: [10.96.0.10:53]
I0815 04:05:10.392795 1 leaderelection.go:175] attempting to acquire leader lease kube-system/cert-manager-controller...
I0815 04:05:30.099395 1 leaderelection.go:184] successfully acquired lease kube-system/cert-manager-controller
I0815 04:05:30.099658 1 controller.go:53] Starting certificates controller
I0815 04:05:30.099725 1 controller.go:53] Starting clusterissuers controller
I0815 04:05:30.099799 1 controller.go:53] Starting ingress-shim controller
I0815 04:05:30.100325 1 controller.go:53] Starting issuers controller
I0815 04:05:30.199952 1 controller.go:138] clusterissuers controller: syncing item 'letsencrypt-prod'
I0815 04:05:30.199987 1 controller.go:138] clusterissuers controller: syncing item 'letsencrypt-staging'
... more logs ...
This was a bit of a bigger release with a bunch of things added, most notably for me:
- Certificate resource (which makes cert-manager better at managing various situations where certificates are useful)
- acme-dns as a provider, which is awesome. AcmeDNS is a fantastic project and I have been seriously considering running it locally in my own cluster for do-it-yourself DNS, or writing some sort of light controller around it.
- named
None of these changes look particularly dangerous for me, though Certificate might pose some risk if the old versions that I am undoubtedly using have translation issues. It wasn’t hard to get the list of certificates I had:
$ k get certificate --all-namespaces
NAMESPACE NAME AGE
vadosware-blog vadosware-blog-blog-tls 67d
... other entries ...
And you can get a look at one of them with kubectl edit in your $EDITOR of choice (just don’t change anything):
$ k edit certificate vadosware-blog-blog-tls -n vadosware-blog
After updating the Deployment, logs had some errors:
I0815 04:20:12.749253 1 controller.go:140] clusterissuers controller: syncing item 'letsencrypt-prod'
I0815 04:20:12.749492 1 logger.go:88] Calling GetAccount
I0815 04:20:12.887645 1 controller.go:140] clusterissuers controller: syncing item 'letsencrypt-staging'
I0815 04:20:12.887929 1 logger.go:88] Calling GetAccount
I0815 04:20:13.335734 1 sync.go:71] Error initializing issuer: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
E0815 04:20:13.335791 1 controller.go:149] clusterissuers controller: Re-queuing item "letsencrypt-prod" due to error processing: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
I0815 04:20:13.475878 1 sync.go:71] Error initializing issuer: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
E0815 04:20:13.475933 1 controller.go:149] clusterissuers controller: Re-queuing item "letsencrypt-staging" due to error processing: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
I0815 04:20:16.956512 1 controller.go:171] certificates controller: syncing item 'gaisma-blog/gaisma-blog-blog-tls'
I0815 04:20:16.956559 1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-blog-fathom-tls'
I0815 04:20:16.956573 1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956594 1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.956621 1 controller.go:180] certificates controller: Re-queuing item "gaisma-blog/gaisma-blog-blog-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956643 1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-blog-blog-tls'
E0815 04:20:16.956681 1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-blog-fathom-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956693 1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956713 1 controller.go:171] certificates controller: syncing item 'monitoring/statping-tls'
E0815 04:20:16.956751 1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-blog-blog-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956684 1 controller.go:171] certificates controller: syncing item 'totejo/totejo-fathom-tls'
I0815 04:20:16.956760 1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956815 1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.956867 1 controller.go:180] certificates controller: Re-queuing item "monitoring/statping-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956817 1 controller.go:171] certificates controller: syncing item 'totejo/next-totejo-tls'
E0815 04:20:16.956900 1 controller.go:180] certificates controller: Re-queuing item "totejo/totejo-fathom-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956918 1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956926 1 controller.go:171] certificates controller: syncing item 'vadosware-blog/vadosware-mailtrain-tls'
E0815 04:20:16.956954 1 controller.go:180] certificates controller: Re-queuing item "totejo/next-totejo-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.956967 1 sync.go:120] Issuer letsencrypt-prod not ready
I0815 04:20:16.956994 1 controller.go:171] certificates controller: syncing item 'rrc/rrc-tls'
E0815 04:20:16.957002 1 controller.go:180] certificates controller: Re-queuing item "vadosware-blog/vadosware-mailtrain-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.957027 1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.957066 1 controller.go:180] certificates controller: Re-queuing item "rrc/rrc-tls" due to error processing: Issuer letsencrypt-prod not ready
I0815 04:20:16.957027 1 controller.go:171] certificates controller: syncing item 'totejo/totejo-tls'
I0815 04:20:16.957115 1 sync.go:120] Issuer letsencrypt-prod not ready
E0815 04:20:16.957147 1 controller.go:180] certificates controller: Re-queuing item "totejo/totejo-tls" due to error processing: Issuer letsencrypt-prod not ready
And it turns out this is an issue because Let’s Encrypt is blocking versions lower than 0.8.0 (well, other than 0.4.0 and 0.4.1, I guess). There are a few more areas where this is documented/discussed:
Looks like I’ll need to revert this to 0.4.1 to at least have a working setup, then jump straight to at least 0.8.1 (see the GitHub issue, v0.8.0 and v0.8.1 send excessive traffic). I did look through the releases and here are a few highlights that I personally was interested in or that could have posed an issue:
- v0.6.0-alpha.0: Add CRD for Order and Challenge, big refactoring of how ACME certs are handled, mostly internal change
- v0.6.0-alpha.0: ECDSA key support (here are those changes that came about earlier; keyAlgorithm was one of the fields in v0.5.0)
- v0.6.0-alpha.0: CNAME following while presenting DNS01 challenges broke for Route53 and got reverted; cnameStrategy was introduced to fix it
- v0.6.0-alpha.0: DigitalOcean added as a DNS provider
- v0.6.0-alpha.1: --namespace option to limit scope to a single namespace (so now you can run one cert-manager per namespace!)
- v0.6.0-alpha.1: Add cert chains to the secret for the ca issuer
- v0.6.0-beta.0: Prometheus metrics support for the ACME HTTP client
- v0.6.0: Better rate limiting
- v0.6.0: Validating webhook component by default; whenever the API gets a resource, misconfigurations will be checked ahead of time
- v0.6.1: x509 cert duration to 1y for webhook component certs
- v0.6.1: Documentation overhaul, more content
- v0.7.0: Helm chart changes (I don’t use helm :)
- v0.7.0: Venafi Issuer was added (Venafi would later go on to acquire Jetstack in 2020)
- v0.7.0: cainjector controller which adds CA bundles to (Validating|Mutating)WebhookConfiguration and APIService
- v0.7.0: Replace ca-sync CronJob with a controller to manage it
- v0.7.0: Experimental support for ARM
- v0.7.0: Easier self check debugging via kubectl describe (more output to the events list)
- v0.7.2: fix update loop (that sounds bad), fixes to the cainjector controller
- v0.8.0-alpha.0: issuer.spec.acme.solvers field replaces certificates.spec.acme, which makes all cert resources portable between issuers
- v0.8.0-alpha.0: Build under macOS (it didn’t before, I guess)
- v0.8.0-alpha.0: Add webhook based DNS01 provider (this should make it easy to support lots of DNS providers!)
- v0.8.0-alpha.0: Add DNS01 provider conformance test suite (that warm fuzzy good engineering feel)
- v0.8.0-alpha.0: Switched to golang’s (then new) module system
- v0.8.0-alpha.0: Serve metrics from non-leader cert-manager instances (I only have one so…)
- v0.8.0-alpha.0: Structured logging using logr-compliant logging (there are some interesting notes on logging philosophy worth reading in here, around logr)
- v0.8.0-beta.0: email address is now optional in ACME issuers (some issuers don’t need it I guess, maybe they send mail by carrier pigeon instead)
- v0.8.0: acme field moved from Certificate to Issuer
- v0.9.0: Introduction of CertificateRequest resource for raw x509 CSRs
- v0.9.0: Multiple DNS zone support w/ some specificity rules
- v0.9.0: ACMEv2 POST-as-GET support (read more about this on the Let’s Encrypt forum)
- v0.9.0: Common name length limit, anything over 63 characters will be rejected (RFC5280)
- v0.9.0: cert-manager images are now Distroless-based
- v0.9.0: CSRs in Order resources were DER encoded and are now PEM encoded
After updating the Deployment, logs looked like the following:
$ k logs -f cert-manager-65dcf68454-gz42m -n kube-system
I0817 02:43:57.662429 1 start.go:76] cert-manager "level"=0 "msg"="starting controller" "git-commit"="5d6f92cc" "version"="v0.9.0"
I0817 02:43:57.663691 1 controller.go:169] cert-manager/controller/build-context "level"=0 "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"]
I0817 02:43:57.663873 1 controller.go:134] cert-manager/controller "level"=0 "msg"="starting leader election"
I0817 02:43:57.664157 1 metrics.go:203] cert-manager/metrics "level"=0 "msg"="listening for connections on" "address"="0.0.0.0:9402"
I0817 02:43:57.665346 1 leaderelection.go:235] attempting to acquire leader lease kube-system/cert-manager-controller...
I0817 02:45:22.750048 1 leaderelection.go:245] successfully acquired lease kube-system/cert-manager-controller
I0817 02:45:22.750307 1 controller.go:91] cert-manager/controller "level"=0 "msg"="not starting controller as it's disabled" "controller"="certificaterequests-issuer-ca"
I0817 02:45:22.750531 1 base_controller.go:132] cert-manager/controller/issuers "level"=0 "msg"="starting control loop"
I0817 02:45:22.852724 1 base_controller.go:132] cert-manager/controller/challenges "level"=0 "msg"="starting control loop"
I0817 02:45:22.852705 1 controller.go:91] cert-manager/controller "level"=0 "msg"="not starting controller as it's disabled" "controller"="certificates-experimental"
I0817 02:45:22.852753 1 base_controller.go:132] cert-manager/controller/orders "level"=0 "msg"="starting control loop"
I0817 02:45:22.852888 1 base_controller.go:132] cert-manager/controller/certificates "level"=0 "msg"="starting control loop"
I0817 02:45:22.852955 1 base_controller.go:132] cert-manager/controller/ingress-shim "level"=0 "msg"="starting control loop"
I0817 02:45:22.852982 1 base_controller.go:132] cert-manager/controller/clusterissuers "level"=0 "msg"="starting control loop"
E0817 02:45:22.856712 1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Challenge: challenges.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "challenges" in API group "certmanager.k8s.io" at the cluster scope
E0817 02:45:22.857090 1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Order: orders.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "orders" in API group "certmanager.k8s.io" at the cluster scope
I0817 02:45:22.953160 1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="vadosware-blog/fathom"
I0817 02:45:22.953196 1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="totejo/fathom"
I0817 02:45:22.953156 1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="monitoring/statping"
I0817 02:45:22.953234 1 base_controller.go:187] cert-manager/controller/clusterissuers "level"=0 "msg"="syncing item" "key"="letsencrypt-staging"
I0817 02:45:22.953240 1 base_controller.go:187] cert-manager/controller/clusterissuers "level"=0 "msg"="syncing item" "key"="letsencrypt-prod"
I0817 02:45:22.953178 1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="rrc/rrc"
I0817 02:45:22.953171 1 base_controller.go:187] cert-manager/controller/ingress-shim "level"=0 "msg"="syncing item" "key"="gaisma-blog/blog"
I0817 02:45:22.953323 1 sync.go:205] cert-manager/controller/ingress-shim "level"=0 "msg"="certificate already exists for ingress resource, ensuring it is up to date" "related_resource_kind"="Certificate" "related_resource_name"="vadosware-blog-fathom-tls" "related_resource_namespace"="vadosware-blog" "resource_kind"="Ingress" "resource_name"="fathom" "resource_namespace"="vadosware-blog"
I0817 02:45:24.757057 1 setup.go:228] cert-manager/controller/clusterissuers "level"=0 "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-prod" "related_resource_namespace"="kube-system" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-prod" "resource_namespace"=""
I0817 02:45:24.757599 1 base_controller.go:193] cert-manager/controller/clusterissuers "level"=0 "msg"="finished processing work item" "key"="letsencrypt-staging"
E0817 02:45:28.866502 1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Challenge: challenges.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "challenges" in API group "certmanager.k8s.io" at the cluster scope
E0817 02:45:28.867138 1 reflector.go:125] pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Order: orders.certmanager.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-cert-manager" cannot list resource "orders" in API group "certmanager.k8s.io" at the cluster scope
So basically, setup went in this order:
- Challenge and Order CRDs a lot
The problems that we see in the log are caused by me not updating the resource configurations I have for the CRDs (Challenge and Order) and the corresponding RBAC rules. To fix this, I went to the v0.10.0 tag on GitHub and navigated to the deploy/manifests/00-crds.yaml file, which has become very long (compared to what I used to have). One cool thing is that cert-manager generates the deployment files for you instead of requiring you to use helm.
For fixing the RBAC, I basically translated the rules that were listed in templates/rbac.yaml to fit my much smaller (and less segmented) ClusterRoles.
After updating the CRDs and RBAC, the errors around polling for Challenge and Order were resolved.
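As a rough illustration of the kind of RBAC change described above, here is a minimal sketch of a single, unsegmented ClusterRole that lets the controller work with the Order and Challenge CRDs in the certmanager.k8s.io API group (the group named in the "forbidden" log lines). The role name and the broad verb list are my own assumptions for illustration; the upstream templates/rbac.yaml splits these permissions much more finely.

```yaml
---
# Sketch only: one catch-all ClusterRole, not the upstream rules.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-cert-manager   # hypothetical name
rules:
  # Pre-existing resource types plus the Order and Challenge CRDs
  - apiGroups: ["certmanager.k8s.io"]
    resources:
      - certificates
      - issuers
      - clusterissuers
      - orders       # newly needed: the controller lists/watches these
      - challenges   # newly needed: the controller lists/watches these
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

The role still has to be bound to the controller's ServiceAccount (cluster-cert-manager in kube-system, per the logs) through a ClusterRoleBinding.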
One thing that was nice about the v0.10.0 release is that it has nothing in the Action Required section of the release notes, so it should be a pretty easy upgrade.
After updating the Deployment, logs looked like the following:
I0817 03:09:54.935176 1 start.go:76] cert-manager "level"=0 "msg"="starting controller" "git-commit"="f1d591a53" "version"="v0.10.0"
I0817 03:09:54.936390 1 controller.go:184] cert-manager/controller/build-context "level"=0 "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"]
I0817 03:09:54.936570 1 controller.go:149] cert-manager/controller "level"=0 "msg"="starting leader election"
I0817 03:09:54.936719 1 metrics.go:201] cert-manager/metrics "level"=0 "msg"="listening for connections on" "address"="0.0.0.0:9402"
I0817 03:09:54.937525 1 leaderelection.go:235] attempting to acquire leader lease kube-system/cert-manager-controller...
The v0.11.0 release was much bigger, with lots of changes and features that make the migration harder, most notably:
- Renaming the API group to cert-manager.io & bumping the API version from v1alpha1 to v1alpha2
- Removal of the deprecated issuer.spec.http01 and issuer.spec.dns01 fields
- Use of CertificateRequest internally for issuance
The release notes call out this need for manual intervention as well:
You will also need to manually update all your backed up cert-manager resource types to use the new apiVersion setting.
There are at least the following places that need to be updated on my side:
- Moving cert-manager from kube-system to its own namespace
- CRDs (00-crds.yaml)
- apiVersions
- Certificate, Issuer, ClusterIssuer, etc. to have the right apiVersion
- Ingress annotations for the new domain cert-manager.io (luckily for me I didn’t do this at all)
This must have been super painful to upgrade to for people with bigger clusters. One thing I’m wondering is if I can actually just upgrade cert-manager and let it just… re-do the process for every one of my sites. It seems that as long as the CRDs and RBAC stuff is settled, the old resource versions would just not be recognized and new ones would be made (since the controller is sitting there spinning forever).
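For anyone who does migrate backed-up resources by hand rather than letting the controller re-issue everything, the apiVersion change looks roughly like the sketch below. The resource name and namespace come from the certificate listing earlier in this post; the secretName and the dnsNames entry are placeholders I made up, not values from my cluster.

```yaml
---
# Sketch only: a Certificate after the API group rename.
# Before v0.11 the first line would have read: apiVersion: certmanager.k8s.io/v1alpha1
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: vadosware-blog-blog-tls
  namespace: vadosware-blog
spec:
  secretName: vadosware-blog-blog-tls   # assumed; the Secret the cert is written to
  dnsNames:
    - blog.example.com                  # placeholder hostname
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

Issuer and ClusterIssuer resources need the same apiVersion swap, and any cert-manager annotations on Ingress resources move from the certmanager.k8s.io/ prefix to cert-manager.io/.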
After actually trying it, the following issues came up:
- acme.cert-manager.io is now where Challenge and Order CRDs are stored, so RBAC must be updated accordingly
- Permissions on "issuers/status", "clusterissuers/status" and other similar resources are now strictly necessary
After mucking with RBAC a lot more, the errors were all resolved and things were ready to go: Orders were being processed and the logs looked like this:
I0817 03:38:26.203230 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203306 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-ca "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203354 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-venafi "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203366 1 controller.go:129] cert-manager/controller/certificates "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls"
I0817 03:38:26.203392 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-selfsigned "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203402 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-vault "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203368 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203480 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-ca "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203494 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-venafi "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203505 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-vault "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203520 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-selfsigned "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.203684 1 acme.go:178] cert-manager/controller/certificaterequests-issuer-acme/sign "level"=0 "msg"="acme Order resource is not in a ready state, waiting..." "related_resource_kind"="Order" "related_resource_name"="totejo-fathom-tls-1199857171-851845284" "related_resource_namespace"="totejo" "resource_kind"="CertificateRequest" "resource_name"="totejo-fathom-tls-1199857171" "resource_namespace"="totejo"
I0817 03:38:26.203757 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
I0817 03:38:26.204182 1 sync.go:379] cert-manager/controller/certificates "level"=0 "msg"="validating existing CSR data" "related_resource_kind"="CertificateRequest" "related_resource_name"="totejo-fathom-tls-1199857171" "related_resource_namespace"="totejo" "resource_kind"="Certificate" "resource_name"="totejo-fathom-tls" "resource_namespace"="totejo"
I0817 03:38:26.204264 1 sync.go:479] cert-manager/controller/certificates "level"=0 "msg"="CertificateRequest is not in a final state, waiting until CertificateRequest is complete" "related_resource_kind"="CertificateRequest" "related_resource_name"="totejo-fathom-tls-1199857171" "related_resource_namespace"="totejo" "resource_kind"="Certificate" "resource_name"="totejo-fathom-tls" "resource_namespace"="totejo" "state"="Pending"
I0817 03:38:26.204565 1 conditions.go:155] Setting lastTransitionTime for Certificate "totejo-fathom-tls" condition "Ready" to 2020-08-17 03:38:26.204561002 +0000 UTC m=+575.115294188
E0817 03:38:26.205368 1 controller.go:131] cert-manager/controller/certificates "msg"="re-queuing item due to error processing" "error"="certificates.cert-manager.io \"totejo-fathom-tls\" is forbidden: User \"system:serviceaccount:cert-manager:cluster-cert-manager\" cannot update resource \"certificates/status\" in API group \"cert-manager.io\" in the namespace \"totejo\"" "key"="totejo/totejo-fathom-tls"
I0817 03:38:26.347853 1 sync.go:56] cert-manager/controller/orders "level"=0 "msg"="updating Order resource status" "resource_kind"="Order" "resource_name"="totejo-fathom-tls-1199857171-851845284" "resource_namespace"="totejo"
I0817 03:38:26.352787 1 controller.go:135] cert-manager/controller/orders "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.352848 1 controller.go:129] cert-manager/controller/orders "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.352854 1 controller.go:129] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="syncing item" "key"="totejo/totejo-fathom-tls-1199857171"
E0817 03:38:26.353066 1 sync.go:103] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="totejo-fathom-tls-1199857171-851845284" "resource_namespace"="totejo"
I0817 03:38:26.353125 1 controller.go:135] cert-manager/controller/orders "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171-851845284"
I0817 03:38:26.353134 1 acme.go:178] cert-manager/controller/certificaterequests-issuer-acme/sign "level"=0 "msg"="acme Order resource is not in a ready state, waiting..." "related_resource_kind"="Order" "related_resource_name"="totejo-fathom-tls-1199857171-851845284" "related_resource_namespace"="totejo" "resource_kind"="CertificateRequest" "resource_name"="totejo-fathom-tls-1199857171" "resource_namespace"="totejo"
I0817 03:38:26.353246 1 controller.go:135] cert-manager/controller/certificaterequests-issuer-acme "level"=0 "msg"="finished processing work item" "key"="totejo/totejo-fathom-tls-1199857171"
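The "certificates/status ... forbidden" line in the output above is the sort of thing the RBAC rework has to cover. A minimal sketch of rules for the renamed API groups, including the status subresources, might look like the following. The role name, the grouping of rules and the verb lists are assumptions for illustration rather than a copy of the upstream manifests.

```yaml
---
# Sketch only: rules for the renamed API groups, not the upstream manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-cert-manager   # hypothetical name
rules:
  # Core cert-manager resources now live in cert-manager.io
  - apiGroups: ["cert-manager.io"]
    resources: ["certificates", "certificaterequests", "issuers", "clusterissuers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Status subresources need their own grant
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates/status
      - certificaterequests/status
      - issuers/status
      - clusterissuers/status
    verbs: ["update"]
  # Order and Challenge moved to the acme.cert-manager.io group
  - apiGroups: ["acme.cert-manager.io"]
    resources: ["orders", "challenges"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["acme.cert-manager.io"]
    resources: ["orders/status", "challenges/status"]
    verbs: ["update"]
```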
As all my certs are currently
After the pain of the v0.11.0 release I think they stepped off the gas a little for v0.12.0, so it was much, much easier, with a usability-focused update as they mentioned in the release notes:
The rest of the notable features below are all focused on usability, and as such, the upgrade process from v0.11 should be nice and easy :holiday:.
After updating the Deployment, logs were wonderfully empty of errors, except for the following:
E0817 03:44:41.801154 1 sync.go:111] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="next-totejo-tls-2366429395-1688760916" "resource_namespace"="totejo"
I’m going to ignore this error for now – not having any configured challenge solvers is something I’ll work through when testing that the setup still works for new certificates.
The v0.13.0 release also didn’t require any special upgrade steps, though it did add some features. This also promises to be a relatively “free” upgrade.
After updating the Deployment, logs were wonderfully empty of errors (except the one we know about regarding challenge solvers).
The v0.14.0 release was another relatively large release with some changes required for deployment. I updated the CRDs and the Deployment and was greeted with a particular new log line prior to leader election:
W0818 01:33:54.036836 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0818 01:33:54.037933 1 controller.go:167] cert-manager/controller/build-context "msg"="configured acme dns01 nameservers" "nameservers"=["10.96.0.10:53"
It wasn’t really mentioned in the release notes (and a quick search on the releases page didn’t show any results from the alphas), but it looks like cert-manager still works all the same (the errors from before are still errors):
I0818 01:34:56.161376 1 controller.go:138] cert-manager/controller/orders "msg"="syncing item" "key"="rrc/rrc-tls-3757528704-2455984007"
E0818 01:34:56.161518 1 sync.go:111] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="vadosware-blog-blog-tls-2780654544-3641557672" "resource_namespace"="vadosware-blog"
The v0.15.0 release is a somewhat large release with new experimental features, but I’m not intending on turning any of them on, and a lot of the other features are for people using Red Hat or other special features, so this one should also be relatively easy to upgrade to. They also added installing the CRDs to the helm chart, but of course I don’t use helm, and I install the CRDs myself, so that doesn’t affect me.
There’s a bit of a wrinkle here though: cert-manager stopped including the built-out CRDs in their manifests directory… So now I either have to build the CRDs with Bazel or commit to Helm. I’m going to do neither and just see if upgrading still works… Not sure why installing the CRDs isn’t part of the operator… My only other option is to just stay on v0.14.0 for a while and ask whether they have plans to put the compiled CRDs in the releases again (or make sure they’re installed by the operator).
Despite basically ignoring the release notes, it looks like the v0.15.0 installation has gone well: the same errors are present and there wasn’t anything catastrophic. The big worry I’ve introduced here is CRD drift. If I don’t start compiling their templates (with Bazel I guess, since I’m not touching Helm with an 8 ft pole), the CRD the controller expects to see and the one that it actually sees are going to diverge, which could lead to crazy problems, never mind if they add any new CRDs.
I’m going to ignore these issues for now and wait until I test out the end to end flow to make sure that everything still works to resolve them.
At this point I’m barely reading the release notes. The v0.16.0 release doesn’t seem to have any huge features in it: a new API version (but not required yet), an issue with old versions of kubectl/helm, and more information being surfaced, but nothing world-shattering for me it seems.
The logs look good outside of the usual errors, so let’s take some time and fix that error.
The errors we’ve been seeing since a while back look like this:
E0818 02:18:51.210020 1 sync.go:108] cert-manager/controller/orders "msg"="Failed to determine the list of Challenge resources needed for the Order" "error"="no configured challenge solvers can be used for this challenge" "resource_kind"="Order" "resource_name"="vadosware-blog-fathom-tls-835763303-941936022" "resource_namespace"="vadosware-blog"
I0818 02:18:51.210072 1 controller.go:162] cert-manager/controller/orders "msg"="finished processing work item" "key"="vadosware-blog/vadosware-blog-fathom-tls-835763303-941936022"
They’re not fatal errors, but clearly they represent a big issue with how the controller needs to work: there are no solvers to resolve the ACME challenges, and that’s basically the most important part! Luckily for me there’s a page describing how to configure the solvers, which I ignored until now, so let’s update the issuers I have set. Here’s what the YAML looks like for the letsencrypt-prod issuer:
---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: "<email address>"
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik
I use traefik for my ingress, and its ingressClass name is traefik, so it’s pretty easy to configure.
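A staging counterpart would presumably differ only in its name and ACME server URL. Here is a minimal sketch, assuming the issuer and its secret are both called letsencrypt-staging (those names are my assumption, not taken from the cluster above); the server URL is Let’s Encrypt’s public staging endpoint:

```yaml
---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging # assumed name
spec:
  acme:
    # Let's Encrypt staging endpoint: issues untrusted certs, with much looser rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: "<email address>"
    privateKeySecretRef:
      name: letsencrypt-staging # assumed name; stores the ACME account key
    solvers:
      # Same HTTP01 solver as production, routed through the traefik ingress class
      - http01:
          ingress:
            class: traefik
```

Testing against staging first is usually the safer choice, since repeated failures there don’t count against the production rate limits.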
After updating my issuers, the logs exploded with content (I was tailing them via kubectl logs .... -f) and it looks like everything is fine now:
I0818 02:28:40.917802 1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="mailtrain.vadosware.io" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-6fqc9" "related_resource_namespace"="vadosware-blog" "resource_kind"="Challenge" "resource_name"="vadosware-mailtrain-tls-653075268-969403002-942001616" "resource_namespace"="vadosware-blog" "type"="http-01"
I0818 02:28:40.917890 1 ingress.go:91] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="mailtrain.vadosware.io" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-pgfbk" "related_resource_namespace"="vadosware-blog" "resource_kind"="Challenge" "resource_name"="vadosware-mailtrain-tls-653075268-969403002-942001616" "resource_namespace"="vadosware-blog" "type"="http-01"
... lots more output ...
I0818 02:29:25.623677 1 controller.go:152] cert-manager/controller/orders "msg"="syncing item" "key"="totejo/totejo-tls-3677999627-3448051846"
I0818 02:29:25.623734 1 sync.go:99] cert-manager/controller/orders "msg"="Order has already been completed, cleaning up any owned Challenge resources" "resource_kind"="Order" "resource_name"="totejo-tls-3677999627-3448051846" "resource_namespace"="totejo"
I0818 02:29:25.624334 1 controller.go:162] cert-manager/controller/orders "msg"="finished processing work item" "key"="totejo/totejo-tls-3677999627-3448051846"
Looks like the work orders have been processed and sites are getting their certs so I think I’m all good!
Making sure cert-manager still works
Updating the versions and not seeing any errors in the log is great and all, but I’ve made so many changes and cut so many corners (most prominently when doing the v0.10.0 migration) that it would be a terrible idea to not at least test the certificate flow for a new domain! So I’ll:
- Set up a new domain (cert-manager-test.vadosware.io)
- Create an Ingress for that domain
Ensuring that the flow works for new domains is crucial – I haven’t been as diligent as I could have been, so a sanity check to make sure this system still works is a good idea.
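A test Ingress for this kind of check might look roughly like the sketch below. The backend service name and port are hypothetical, and the apiVersion assumes a cluster that still serves networking.k8s.io/v1beta1 (newer clusters use networking.k8s.io/v1 with a different backend schema). The Ingress name, namespace and secret name are inferred from the log output in this post:

```yaml
---
apiVersion: networking.k8s.io/v1beta1 # assumption: depends on cluster version
kind: Ingress
metadata:
  name: test
  namespace: vadosware-blog
  annotations:
    # Ask cert-manager's ingress-shim to create a Certificate via this ClusterIssuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: traefik
spec:
  tls:
    - hosts:
        - cert-manager-test.vadosware.io
      # cert-manager should create a Certificate and a kubernetes.io/tls Secret with this name
      secretName: testproject-tls
  rules:
    - host: cert-manager-test.vadosware.io
      http:
        paths:
          - path: /
            backend:
              serviceName: testproject # hypothetical Service
              servicePort: 80 # hypothetical port
```

If the flow is healthy, a Certificate and a kubernetes.io/tls Secret named after secretName should show up in the same namespace shortly after applying it.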
After making a cert, everything seemed to go well, and the work tasks got resolved by the cert-manager controller, but the actual domain still had a self-signed cert. First place to check is the Ingress – let’s see if there is any output worth noting in the Events section:
Normal CreateCertificate 5m54s cert-manager Successfully created Certificate "testproject-tls"
Well that looks good… a Certificate was supposedly created, and I see a secret, but the secret has the wrong name:
$ k get secrets -n vadosware-blog
NAME                                     TYPE                                  DATA   AGE
default-token-dgs2j                      kubernetes.io/service-account-token   3      617d
mailtrain-mailtrain-secrets-fk729cfghk   Opaque                                1      190d
mailtrain-mariadb-secrets-ftt9hfdggc     Opaque                                3      190d
mailtrain-mariadb-secrets-mm5bc8c577     Opaque                                2      190d
subscribe-sidecar-secrets-c26tgm5bch     Opaque                                2      177d
testproject-tls-2nqnd                    Opaque                                1      9m56s
vadosware-blog-blog-tls                  kubernetes.io/tls                     3      617d
vadosware-blog-fathom-tls                kubernetes.io/tls                     3      617d
vadosware-infra-gitlab-registry          kubernetes.io/dockercfg               1      617d
vadosware-mailtrain-tls                  kubernetes.io/tls                     3      190d
Also, if I list the certificates, I don’t see the right one:
$ k get certificates -n vadosware-blog
NAME                        READY   SECRET                      AGE
vadosware-blog-blog-tls     True    vadosware-blog-blog-tls     70d
vadosware-blog-fathom-tls   True    vadosware-blog-fathom-tls   582d
vadosware-mailtrain-tls     True    vadosware-mailtrain-tls     174d
Welp, looks like something is super wrong – it’s not immediately clear why I’m getting a weirdly named secret to begin with…
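For comparison, the Certificate I would expect to find in that listing looks something like the sketch below. This is my reconstruction of what the ingress annotation should produce, not output copied from the cluster; the host name and issuer are taken from earlier in this post:

```yaml
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: testproject-tls # matches the name in the CreateCertificate event
  namespace: vadosware-blog
spec:
  # The resulting kubernetes.io/tls Secret should carry exactly this name,
  # not a suffixed one like testproject-tls-2nqnd
  secretName: testproject-tls
  dnsNames:
    - cert-manager-test.vadosware.io
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

The Opaque, suffixed secret in the listing may well be an intermediate secret the controller creates while issuing rather than the final TLS secret, but that is a guess on my part.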
After uninstalling the Ingress, here are the errors I get (which make sense, since there is no Certificate):
E0818 02:57:03.239917 1 controller.go:156] ingress 'vadosware-blog/test' in work queue no longer exists
E0818 02:57:03.797614 1 requestmanager_controller.go:127] cert-manager/controller/CertificateRequestManager "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797631 1 readiness_controller.go:130] cert-manager/controller/CertificateReadiness "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797663 1 issuing_controller.go:152] cert-manager/controller/CertificateIssuing "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797736 1 keymanager_controller.go:137] cert-manager/controller/CertificateKeyManager "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
E0818 02:57:03.797643 1 trigger_controller.go:142] cert-manager/controller/CertificateTrigger "msg"="certificate not found for key" "error"="certificate.cert-manager.io \"testproject-tls\" not found" "key"="vadosware-blog/testproject-tls"
Now if I install the Ingress again, these are the errors that come out:
E0818 02:58:35.044655 1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"testproject-tls\": the object has been modified; please apply your changes to the latest version and try again" "key"="vadosware-blog/testproject-tls"
E0818 02:58:35.298214 1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"testproject-tls\": the object has been modified; please apply your changes to the latest version and try again" "key"="vadosware-blog/testproject-tls"
So it looks like the Certificate CRD is not working properly at all… Considering the controller thinks it succeeded, I’m thinking I ran into the known bug with upgrading from v0.15 to v0.16. This post is long enough – I don’t think I’ll undertake upgrading my cluster and kubectl in the middle of this. So for now I’ll downgrade to v0.15.
Well, this was a grueling process, but it was made much easier by the fantastic work of Jetstack (now Venafi) and the maintainers and contributors behind cert-manager – I can continue not worrying about creating and managing TLS certs for the websites deployed to my Kubernetes cluster.