tl;dr - You’re going to want to use a Hetzner-external resolver (1.1.1.1 / 8.8.8.8 / 9.9.9.9) if you’re going to attempt to resolve your own server’s address(es) from your own server. This is a variation of an issue that has already been discussed some on reddit, but it’s not about Hetzner DNS being down so much as recursive queries not working (and finding this out via cert-manager running on a Kubernetes cluster).
I’m pretty sure this isn’t the first time I’ve made a “DNS was the problem” post, but today’s is somewhat Hetzner related, and it was perplexing to me as it was happening, so I figured I’d write about it.
cert-manager HTTP01 challenges failing

While I was working on deploying SurplusCI, “all of a sudden” cert-manager stopped being able to complete HTTP01 Challenges (I’d later find out that maybe they were never working at all). I have both DNS01 Challenges and HTTP01 Challenges set up on my cluster, generally for the same domains, so my cluster can use either. Not being able to create new Certificates is a bit of a problem, so I was pretty surprised that a new site I had stood up couldn’t get the proper certificates to serve HTTPS traffic.
cert-manager is Custom Resource Definition driven, and one of the great things about it is that you can just kubectl describe the relevant resources to find out what’s going on (as opposed to going straight to the controller logs). The involved CRDs look something like this:
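(A sketch of listing the whole chain for a hypothetical website namespace; behind every Certificate, cert-manager creates a CertificateRequest, an Order, and one or more Challenges, and the exact resource names will differ in your cluster.)

# Certificates are user-created; the rest are managed by cert-manager
$ kubectl get certificates,certificaterequests,orders,challenges -n website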
How to set up cert-manager is outside the scope of this post, but they’ve got excellent docs, so check them out. Generally you make the Certificate, which looks something like this:
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: website-tls
  namespace: website
spec:
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  secretName: website-tls
  commonName: www.example.com
  dnsNames:
    - example.com
    - www.example.com
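Once that’s applied, you can watch the Certificate become ready (a sketch; website-tls.yaml is a hypothetical filename for the manifest above):

$ kubectl apply -f website-tls.yaml
# -w watches until the READY column flips to True
$ kubectl get certificate website-tls -n website -w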
After creating this Certificate CRD (you can also integrate with Ingress resources and stuff, but I prefer to use the explicit Certificate CRD), the rest of the CRDs are generally created and managed by cert-manager behind the scenes (generally representing the work that cert-manager is doing for you) and you don’t have to think about them. As the Certificate and CertificateRequest were failing, I needed to kubectl describe Challenge <name of challenge>:
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for HTTP-01 challenge propagation: failed to perform self check GET request 'http://www.staging.surplusci.com/.well-known/acme-challenge/gZYhs0WQEvRWsG8UvwIWe10a07t_u4qRCElr3Xgc5X8': Get "http://www.staging.surplusci.com/.well-known/acme-challenge/gZYhs0WQEvRWsG8UvwIWe10a07t_u4qRCElr3Xgc5X8": dial tcp: lookup www.staging.surplusci.com on 10.96.0.10:53: server misbehaving
  State:       pending
Well that’s weird – why wouldn’t that URL be working from inside the cluster? What’s also interesting is that this URL works perfectly fine from outside the cluster. Another curious thing is that the DNS server being used is the cluster-local coredns @ 10.96.0.10 (services on my cluster reside in the CIDR range 10.96.0.0/16). What’s even more weird is that the error is server misbehaving – that’s not the same thing as an NXDOMAIN (non-existent domain) or various other DNS issues; this is something quite interesting.
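One quick way to reproduce the lookup failure from inside the cluster is a throwaway pod (a sketch using busybox; the image tag and pod name are just examples):

# the pod inherits the cluster DNS config (nameserver 10.96.0.10),
# so this exercises the same path cert-manager's self check uses
$ kubectl run -it --rm dns-debug --image=busybox:1.36 --restart=Never -- \
    nslookup www.staging.surplusci.com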
The server is misbehaving and I know which one it is, so let’s take a look:
$ k logs -f deploy/coredns -n kube-system
Found 2 pods, using pod/coredns-64dccffd-v47hs
.:53
[INFO] plugin/reload: Running configuration MD5 = eedde2eae6990b4e5b773e9f79e29392
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:46036->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:33700->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 api.staging.surplusci.com. A: read udp 10.244.192.126:36809->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:38170->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mta-sts.staging.surplusci.com. AAAA: read udp 10.244.192.126:38204->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. A: read udp 10.244.192.126:33409->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. AAAA: read udp 10.244.192.126:37914->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. AAAA: read udp 10.244.192.126:48253->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 www.staging.surplusci.com. A: read udp 10.244.192.126:37449->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 www.staging.surplusci.com. A: read udp 10.244.192.126:40508->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 api.staging.surplusci.com. A: read udp 10.244.192.126:59647->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. A: read udp 10.244.192.126:58354->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:56757->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. AAAA: read udp 10.244.192.126:52187->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mta-sts.staging.surplusci.com. AAAA: read udp 10.244.192.126:33684->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. A: read udp 10.244.192.126:44368->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 web.staging.surplusci.com. AAAA: read udp 10.244.192.126:52036->213.133.99.99:53: i/o timeout
[ERROR] plugin/errors: 2 mta-sts.staging.surplusci.com. AAAA: read udp 10.244.192.126:51881->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:37481->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 api.staging.surplusci.com. A: read udp 10.244.192.126:39595->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:58556->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 staging.surplusci.com. AAAA: read udp 10.244.192.126:52940->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 mail.staging.surplusci.com. A: read udp 10.244.192.126:54522->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 api.staging.surplusci.com. AAAA: read udp 10.244.192.126:60241->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 www.staging.surplusci.com. AAAA: read udp 10.244.192.126:57825->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 www.staging.surplusci.com. A: read udp 10.244.192.126:46173->213.133.100.100:53: i/o timeout
[ERROR] plugin/errors: 2 web.staging.surplusci.com. AAAA: read udp 10.244.192.126:50428->213.133.98.98:53: i/o timeout
[ERROR] plugin/errors: 2 web.staging.surplusci.com. A: read udp 10.244.192.126:39549->213.133.98.98:53: i/o timeout
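As an aside, those 213.133.x.x upstreams aren’t configured in CoreDNS directly: by default (in kubeadm-style installs at least) the Corefile forwards anything outside the cluster domain to whatever nameservers are in the node’s /etc/resolv.conf, which on Hetzner is where those resolvers come from. You can confirm with something like:

# look for the forward directive in the Corefile,
# which typically reads: forward . /etc/resolv.conf
$ kubectl -n kube-system get configmap coredns -o yaml | grep -B1 -A1 forward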
Hmnnn weird – CoreDNS is timing out forwarding the queries to the resolvers @ 213.133.98.98 / 213.133.99.99 / 213.133.100.100 (Hetzner’s DNS servers). It looks like CoreDNS is doing what it’s supposed to do – so what is the problem? Let’s see if I can re-create the response that CoreDNS is getting with dig, using another project of mine, RagtimeCloud:
$ dig ragtime.cloud @213.133.100.100
; <<>> DiG 9.16.18 <<>> ragtime.cloud @213.133.100.100
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 51580
;; flags: qr rd ad; QUERY: 0, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; Query time: 356 msec
;; SERVER: 213.133.100.100#53(213.133.100.100)
;; WHEN: Tue Jun 29 23:31:07 JST 2021
;; MSG SIZE rcvd: 12
For some reason there are no results, along with a warning that “recursion requested but not available”. Does this mean that Hetzner actually can’t resolve DNS queries from its nameservers when the answers point back to Hetzner itself? I know the site is up. What do we get if we use a different resolver?
$ dig a ragtime.cloud @8.8.8.8
; <<>> DiG 9.16.18 <<>> a ragtime.cloud @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17564
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;ragtime.cloud. IN A
;; ANSWER SECTION:
ragtime.cloud. 3599 IN A 88.198.25.107
;; Query time: 56 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 30 10:44:36 JST 2021
;; MSG SIZE rcvd: 58
OK, so at this point we’ve got a good idea what the issue is – the cert-manager self check (where it tries to fetch the challenge URL itself before asking Let’s Encrypt to validate) is failing because Hetzner’s DNS servers can’t properly resolve addresses that point to Hetzner-hosted machines from inside Hetzner, since recursive resolving is not supported.
Great, so what’s the fix? If I think about this pretty simply, using a different resolver is probably the easiest way to solve this issue. It turns out the way resolving is managed isn’t as simple as it used to be back in the just-edit-/etc/resolv.conf days. Here’s what it looked like for me figuring out what to do:
1. Edit /etc/resolv.conf…. Whoops, yeah, systemd makes this a little different
2. Use systemctl for this? Maybe resolvectl?…. Whoops x2, I have no idea how to fucking use this thing and none of the examples show you how to set it properly. Setting ONE value is easy (resolvectl dns enp9s0 <ip>), but I want to set all three of the entries! (see the resolvectl sketch after this list)
3. Read the systemd-resolved docs! I’ll just drop some files in where they say I should. (resolvectl status enp0s3 does not show any changes afterwards, though the global settings do)
4. Fix the /etc/resolv.conf symlink
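For what it’s worth, resolvectl does take multiple servers in one go; this is roughly what I was fumbling toward (a sketch; the interface name enp9s0 and the server list are just examples):

# set several per-link DNS servers at once
$ sudo resolvectl dns enp9s0 8.8.8.8 8.8.4.4 213.133.98.98
# route all lookups (the "~." routing domain) through this link
$ sudo resolvectl domain enp9s0 '~.'
$ resolvectl status enp9s0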
Weirdly enough, after the changes resolvectl status <interface> didn’t say the right thing, but the coredns pods sure do:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning DNSConfigForming 43m (x819 over 17h) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 213.133.98.98 213.133.100.100 213.133.99.99
Warning DNSConfigForming 37m (x3 over 40m) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 213.133.98.98 2a01:4f8:0:1::add:1010
Warning DNSConfigForming 13m (x18 over 35m) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 2001:4860:4860::8888 8.8.4.4
Warning DNSConfigForming 3m57s (x6 over 10m) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 8.8.4.4 213.133.98.98
If the in-cluster CoreDNS is picking up my changes to /etc/systemd/resolved.conf (which are now properly mirrored into /etc/resolv.conf because of the symlink fix), then I’m good with that. The Nameserver limits were exceeded warnings are just kubelet capping the nameserver list at three entries, which is fine here since 8.8.8.8 leads the list. The machine seems to be resolving properly, so I’ll ignore the resolvectl output and other stuff for now.
After all this, I recreated the Certificate and cert-manager did its job like normal! Easy peasy lemon squeezy.
Of course, you’d want this fix to be permanent, so I’ve added a bit to my ansible setup:
#
# Play for configuring DNS
#
---
- name: configure dns
  hosts: "{{ ansible_limit | default(omit) }}"
  remote_user: root
  gather_facts: yes
  tasks:
    - name: Install /etc/systemd/resolved.conf
      template:
        src: ../../templates/resolved.conf.j2
        dest: /etc/systemd/resolved.conf
        owner: root
        group: root
        mode: 0644

    # https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1774632
    # (stub-resolv.conf is *NOT* the way)
    - name: Link /run/systemd/resolve/resolv.conf -> /etc/resolv.conf
      ansible.builtin.file:
        src: /run/systemd/resolve/resolv.conf
        dest: /etc/resolv.conf
        state: link
        force: yes  # /etc/resolv.conf usually already exists (often as a stub symlink)

    - name: Restart systemd-resolved
      become: yes
      systemd:
        name: systemd-resolved
        daemon_reload: yes
        state: restarted
        enabled: yes

    - name: Restart systemd-networkd
      become: yes
      systemd:
        name: systemd-networkd
        daemon_reload: yes
        state: restarted
        enabled: yes
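Running it looks something like this (a sketch; the inventory path, playbook filename, and host name are whatever your setup uses):

# ansible_limit in the play pairs with --limit here,
# so the play only touches the host you name
$ ansible-playbook -i inventory/hosts.yml playbooks/dns.yml --limit my-hetzner-node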
And here are the contents of /etc/systemd/resolved.conf:
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details
[Resolve]
DNS=8.8.8.8 8.8.4.4 213.133.98.98 2001:4860:4860::8888 2001:4860:4860::8844 2a01:4f8:0:1::add:1010 213.133.99.99 2a01:4f8:0:1::add:9999 213.133.100.100 2a01:4f8:0:1::add:9898
Domains=~.
#DNS=
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#DNSOverTLS=no
#Cache=no-negative
#DNSStubListener=yes
#ReadEtcHosts=yes
Pretty basic – I haven’t used any of the advanced options of resolved there, but I figure I’ll come back to them if I need to.
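After a restart of systemd-resolved you can sanity-check that lookups now go through the new servers (a sketch; ragtime.cloud just reuses the domain from earlier):

$ sudo systemctl restart systemd-resolved
$ resolvectl query ragtime.cloud   # resolves through systemd-resolved itself
$ dig ragtime.cloud                # uses /etc/resolv.conf via the fixed symlink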
Well, that was a fun adventure! I’m a little worried that HTTP01 challenges have been broken this whole time, but I guess it was fine since DNS01 was also set up. I figured I’d write this up at least so people could see how this problem can affect those using tools like cert-manager on Hetzner (or any other provider with this issue) specifically.