
Infra Fun: Prometheus Broke My Haraka Installation


tl;dr - My cluster-wide prometheus scraping config was picking up port 25 as a good place to send GET /metrics requests, and broke my surprisingly fragile email setup (Haraka -> Maddy), simply by inundating it with invalid HTTP requests to a port meant for SMTP.

My mailer setup makes use of two pretty good F/OSS tools that I rarely have problems with:

  • Haraka – a Node.js SMTP server, used here as the inbound SMTP proxy
  • Maddy – an all-in-one mail server, one instance per project

They work great, and I don’t spend much time thinking about them – I often set up a new maddy instance with separate backups and what-not for every project (for example NoMorePizzaParties.com – if you send an email to me at admin@nomorepizzaparties.com, it goes through this setup). It’s a multiplexed setup, so architecture-wise it looks like this:

Inbound SMTP Proxy (credit: https://serverfault.com/questions/1076025/inbound-smtp-relay-with-haraka)

All was well and good until a couple of days ago, when I realized that I wasn’t actually getting the mail I expected – my inbox had been uncharacteristically empty (even the spam stopped coming!). I remembered making some relatively recent DNS changes, so my mind immediately jumped there, but all was fine on that front (this time!).

Some alerts were also flying on my GMail account, which I use to test from time to time – emails were not getting through. Outbound emails were working (ex. admin@myproject.com -> someone@example.com) but inbound emails (someone@example.com -> admin@myproject.com) were not.

What followed was a 1.5 day hair-pulling exercise trying to figure out what was wrong, and this post is an attempt to gather the notes I made into some sort of coherent narrative. You really might want to read the tl;dr instead of any of what comes next. The things at play that make this more difficult than normal (all this complexity is self-inflicted of course!):

  • Haraka running as a DaemonSet, answering port 25 on every node
  • A separate maddy instance per project, multiplexed behind Haraka
  • Traefik in front of everything, with an IngressRouteTCP catching SMTP traffic
  • Calico networking (VXLAN + WireGuard)
  • Round robin DNS across the nodes
  • A hand-rolled Prometheus setup scraping the cluster

As you can see, there are quite a few moving pieces (more than there strictly need to be), and I paid the price for my complexity with longer debugging time. Anyway, let’s get into it so hopefully some gems fall out that someone else can use in the future.

Symptom: Email is broken

To be more precise – the process of receiving messages fashioned in accordance with SMTP over port 25 was not functioning properly.

GMail was one of the first to alert me, when I received the following after sending a routine email to myself:

The recipient server did not accept our requests to connect. Learn more at https://support.google.com/mail/answer/7720 [mail.nomorepizzaparties.com. XXX.XXX.XXX.XXX: unable to read banner] [mail.nomorepizzaparties.com. XXX.XXX.XXX.XXX: unable to read banner] [mail.nomorepizzaparties.com. XXX.XXX.XXX.XXX: timed out] [mail.nomorepizzaparties.com. XXX.XXX.XXX.XXX: timed out]

OK, so it looks like we’ve got our first clue, and it’s not a particularly straightforward one! We’re suffering from at least two issues:

  • Requests are timing out
  • Requests that don’t time out are sending invalid banners (hint: or not sending them at all)

These problems are weird for two obvious reasons:

  • Haraka is running as a DaemonSet so every server should return roughly the same result (or fail in a similar way)!
  • All of the servers should… be working!

Well, the good news is that round robin DNS is working at the very least – the problem is not DNS.

Investigation: What is Haraka doing?

The first obvious place to look is Haraka – I don’t change the setup much outside of adding/removing top-level domains (I need to turn this into an operator at some point), but for some reason domains just won’t receive mail despite the setup not changing in that respect. As you might imagine, I took a look at the Haraka logs:
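For reference, pulling those logs is nothing fancy – plain kubectl against the DaemonSet pods, something like the following (using the namespace and pod name that show up later in this post):

$ kubectl get pods -n email -o wide | grep haraka    # find the Haraka DaemonSet pods
$ kubectl logs -n email haraka-frccz --tail=100 -f   # follow one of them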

📜️ Haraka output
2022-01-15T11:03:21.560Z [DEBUG] [-] [core] addOCSP: not available
2022-01-15T11:03:21.560Z [NOTICE] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] connect ip=10.244.192.65 port=36020 local_ip=:: local_port=25
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running connect_init hooks
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running connect_init_respond
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running lookup_rdns hooks
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] [early_talker] state=4 esmtp=false line="GET /metrics HTTP/1.1"
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running connect hooks
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running connect hook in toobusy plugin
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core]  hook=connect plugin=toobusy function=check_busy params="" retval=CONT msg=""
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running connect hook in dnsbl plugin
2022-01-15T11:03:21.561Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [dnsbl] skip private: 10.244.192.65
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core]  hook=connect plugin=dnsbl function=connect_first params="" retval=CONT msg=""
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.562Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running unrecognized_command hooks
2022-01-15T11:03:21.563Z [INFO] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] client [10.244.192.65] half closed connection
2022-01-15T11:03:21.563Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] client has disconnected
2022-01-15T11:03:21.563Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] running disconnect hooks
2022-01-15T11:03:21.563Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] client has disconnected
2022-01-15T11:03:21.564Z [DEBUG] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] client has disconnected
2022-01-15T11:03:21.564Z [NOTICE] [6B99EE3C-95F6-4EDA-B135-BFF7CE0455EC] [core] disconnect ip=10.244.192.65 rdns=NXDOMAIN helo="" relay=N early=Y esmtp=N tls=N pipe=N errors=7 txns=0 rcpts=0/0/0 msgs=0/0/0 bytes=0 lr="500 Unrecognized command" time=0.004

Well that’s extra weird! Something is attempting to retrieve metrics off of the port expressly designated for SMTP traffic. Why in the world would something do that? Sounds like some bad automation is afoot. This is a trail I have to run down at some point… why not now?

Diversion: What is trying to get Prometheus metrics (GET /metrics) from Haraka?

At first I tried to find the IP, which is k8s internal (we know this because my pod CIDR happens to be 10.244.xxx.xxx), but it seemed to have vanished. I searched the simplest way I know how:

$ kubectl get pods --all-namespaces -o=wide | grep "10.244.192.65"

I got nothing from that, and got even more confused. Time to break out tooling that’s a bit more honest/informed. I chose tcpdump, which I’m not an expert with, but I know enough about it to know that it’s a good introspection tool and probably the right one for tracking down packets that shouldn’t be flying but are. So I picked a machine in the cluster, got on, and pointed tcpdump at port 25; here’s the output:
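The invocation was roughly the following – exact interface/flags are an approximation, but the idea is to watch everything on port 25 and print payloads so lines like GET /metrics show up:

$ tcpdump -i any -A 'tcp port 25'   # -i any: all interfaces, -A: print packet payloads as ASCII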

📜️ tcpdump output
12:25:13.855068 IP 10.244.136.193.41930 > 10.244.206.186.smtp: Flags [.], ack 227, win 506, options [nop,nop,TS val 4143184933 ecr 1782800009], length 0
12:25:13.855089 IP 10.244.136.193.41930 > 10.244.206.186.smtp: Flags [F.], seq 248, ack 227, win 506, options [nop,nop,TS val 4143184933 ecr 1782800009], length 0
12:25:13.856047 IP 10.244.206.186.smtp > 10.244.136.193.41930: Flags [F.], seq 227, ack 249, win 501, options [nop,nop,TS val 1782800010 ecr 4143184933], length 0
12:25:13.861182 IP 10.244.136.193.41930 > 10.244.206.186.smtp: Flags [.], ack 228, win 506, options [nop,nop,TS val 4143184939 ecr 1782800010], length 0
12:25:21.559282 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [S], seq 1254649318, win 64860, options [mss 1410,sackOK,TS val 2486956687 ecr 0,nop,wscale 7], length 0
12:25:21.559341 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [S.], seq 1228986960, ack 1254649319, win 64308, options [mss 1410,sackOK,TS val 2908237046 ecr 2486956687,nop,wscale 7], length 0
12:25:21.559696 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [.], ack 1, win 507, options [nop,nop,TS val 2486956688 ecr 2908237046], length 0
12:25:21.559923 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [P.], seq 1:246, ack 1, win 507, options [nop,nop,TS val 2486956688 ecr 2908237046], length 245: SMTP: GET /metrics HTTP/1.1
12:25:21.559952 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [.], ack 246, win 501, options [nop,nop,TS val 2908237046 ecr 2486956688], length 0
12:25:21.562254 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [P.], seq 1:45, ack 246, win 501, options [nop,nop,TS val 2908237049 ecr 2486956688], length 44: SMTP: 220 haraka-frccz ESMTP Haraka/2.8.27 ready
12:25:21.562591 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [.], ack 45, win 507, options [nop,nop,TS val 2486956691 ecr 2908237049], length 0
12:25:21.562619 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [P.], seq 45:175, ack 246, win 501, options [nop,nop,TS val 2908237049 ecr 2486956691], length 130: SMTP: 500 Unrecognized command
12:25:21.562921 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [F.], seq 246, ack 45, win 507, options [nop,nop,TS val 2486956691 ecr 2908237049], length 0
12:25:21.562942 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [.], ack 175, win 506, options [nop,nop,TS val 2486956691 ecr 2908237049], length 0
12:25:21.562971 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [P.], seq 175:227, ack 247, win 501, options [nop,nop,TS val 2908237049 ecr 2486956691], length 52: SMTP: 500 Unrecognized command
12:25:21.563258 IP 10.244.206.186.smtp > 10.244.192.65.48078: Flags [F.], seq 227, ack 247, win 501, options [nop,nop,TS val 2908237050 ecr 2486956691], length 0
12:25:21.563291 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [.], ack 227, win 506, options [nop,nop,TS val 2486956691 ecr 2908237049], length 0
12:25:21.563569 IP 10.244.192.65.48078 > 10.244.206.186.smtp: Flags [.], ack 228, win 506, options [nop,nop,TS val 2486956692 ecr 2908237050], length 0

OK, a little bit more information here – in particular we can see 10.244.192.65 sending that GET /metrics request (glad it’s HTTP/1.1 so it’s nice and easy to parse/read) to 10.244.206.186, which responds with 500 Unrecognized command because something is trying to speak HTTP to an SMTP server.

There are a few more IPs showing up here, so let’s try to grab one and see if there’s a matching pod that we can catch…

$ k get pods -o=wide --all-namespaces | grep "10.244.206.186"
email                     haraka-frccz                                1/1     Running     0          3h23m   10.244.206.186    node-6.eu-central-1   <none>           <none>
totejo                    totejo-backups-27368640-cmsqs               0/1     Completed   0          35h     10.244.206.186    node-6.eu-central-1   <none>           <none>

Things get weirder, as a conflict with something I thought was pretty much an invariant property of Kubernetes is staring me in the face – two pods (granted, one is Completed, the result of a [Cron]Job) have the same IP address! Turns out this is a thing that happens.

Running into this seeming bug at the k8s level is off-putting, but it brings about a theory – what if something that normally queries workloads like regular pods and jobs is mistaking a Haraka DaemonSet instance for a previously existing pod/job, scanning its ports/entrypoints and trying to send a GET /metrics to the first one that comes up? Who would have set up such a mindless automation? (Narrator: the author did in fact set up such a mindless automation.)

Continued Investigation: What in the world is running @ 10.244.192.65?

Well, I can’t find the workload that’s running @ 10.244.192.65, but can I at least find what machine it’s running on? Enter ip route (from iproute2):

root@node-6 ~ # ip route
# .... a bunch of stuff ...
10.244.192.64/26 via 10.244.192.64 dev vxlan.calico onlink
10.244.192.128/26 via 10.244.192.64 dev vxlan.calico onlink
10.244.206.187 dev cali45e584c50d7 scope link
10.244.206.188 dev cali42f4f242e57 scope link
10.244.206.189 dev cali5424baea702 scope link
10.244.206.190 dev cali3fd212bb5e7 scope link
# .... a bunch of entries of stuff ...

OK, so the bulk of the entries fell into the above two kinds – gateways with next hops (indicated by via) and a bunch of direct routes. It’s all managed by calico, so we see vxlan.calico and caliXXXXXXXXXXX entries. We’re getting a peek at the VXLAN-based setup here: the machine I’m on (node-6) is maintaining VXLAN routes and a bunch of intra-node devices as well. It looks like this machine knows about the /26 that I’d expect 10.244.192.65 to come from, and in fact .65 is one above the bottom of that block.

Just to make sure I was doing my CIDR math right (I mostly can only read basic ones like XXX.XXX.YYY.YYY/16), I did check with a handy site, cidr.xyz.

screenshot of CIDR range calculation for 10.244.192.65
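You can do the same sanity check from a shell with Python’s ipaddress module – nothing cluster-specific here, just the arithmetic:

$ python3 -c "import ipaddress; n = ipaddress.ip_network('10.244.192.64/26'); print(n[0], '-', n[-1])"
10.244.192.64 - 10.244.192.127
$ python3 -c "import ipaddress; print(ipaddress.ip_address('10.244.192.65') in ipaddress.ip_network('10.244.192.64/26'))"
True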

Anyway, back to that ip route output – since we see that there’s a next hop for 10.244.192.64/26, we know that the IP we’re looking for (10.244.192.65) is not on this machine. Turns out those IPs are on node-2! Here’s what the output of ip route on node-2 looks like:

root@node-2 ~ # ip route
default via 162.55.94.193 dev enp9s0 proto static onlink
10.244.121.0/26 via 10.244.121.0 dev vxlan.calico onlink
10.244.121.64/26 via 10.244.121.0 dev vxlan.calico onlink
10.244.121.128/26 via 10.244.121.0 dev vxlan.calico onlink
10.244.121.192/26 via 10.244.121.0 dev vxlan.calico onlink
10.244.136.192/26 via 10.244.136.192 dev vxlan.calico onlink
10.244.137.0/26 via 10.244.136.192 dev vxlan.calico onlink
10.244.192.130 dev cali63a741b956a scope link
10.244.192.137 dev calid69c6d15b51 scope link
# ... lots more of individual links
10.244.192.188 dev calief5bb776497 scope link
10.244.192.190 dev cali56c4d6b858e scope link
10.244.206.128/26 via 10.244.206.128 dev vxlan.calico onlink
10.244.206.192/26 via 10.244.206.128 dev vxlan.calico onlink

Great! We’ve found our set of routes! Not so fast though – we found the 10.244.192.XXX range, but there is no 10.244.192.65 in the listing. Is this a case of calico just having bad entries stuck in a routing table somewhere? Restarting calico and calling it a day would be very unsatisfying! Well, I tried it and it didn’t work, so we must soldier on.

More Investigation: ip route-ing 10.244.192.65 directly

Interestingly enough, if I run ip route get on the address directly I get some more interesting output that might be worth looking into:

root@node-6 ~ # ip route get 10.244.192.65
10.244.192.65 dev wireguard.cali table 1 src 10.244.206.130 uid 0
    cache

So calico’s wireguard setup has this entry cached… I don’t really understand the output, but before I go look that up, let’s chase that other IP in there (10.244.206.130), since it’s doing something with the IP we care about:

root@node-6 ~ # ip route get 10.244.206.130
local 10.244.206.130 dev lo table local src 10.244.206.130 uid 0
    cache <local>

Well, that new IP (10.244.206.130) is on the same node I was on before! But it looks to be a local address (the route goes out dev lo)… I wonder what this address (and correspondingly, what workload) is tied to? Well, I can try to get an idea by using netstat:

root@node-6 ~ # netstat | grep "10.244.206.130"
tcp        0      0 10.244.206.130:42692    10.244.136.218:http     TIME_WAIT
tcp        0      0 10.244.206.130:60000    10.244.137.3:imaps      ESTABLISHED
tcp        0      0 10.244.206.130:39396    10.244.137.3:imaps      ESTABLISHED
tcp        0      0 10.244.206.130:42690    10.244.136.218:http     TIME_WAIT
tcp        0      0 10.244.206.130:60616    10.244.121.167:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:40360    10.244.121.167:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:42686    10.244.136.218:http     TIME_WAIT
tcp        0      0 10.244.206.130:40228    10.244.121.167:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:50210    10.244.192.137:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:50138    10.244.192.137:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:42434    10.244.192.137:imaps    ESTABLISHED
tcp        0      0 10.244.206.130:40280    10.244.121.167:imaps    ESTABLISHED

OK, that’s definitely traefik talking to maddy instances – the reason I know this is because I have a bunch of IngressRouteTCPs tied to the main IMAP/S ports, and it makes sense that Traefik would be establishing connections with them. Before running with that theory, let’s first ENHANCE the output of netstat to include the program’s name!
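On Linux that just means adding -p (plus -n to keep the ports numeric), run as root so the process names resolve – roughly:

root@node-6 ~ # netstat -tnp | grep "10.244.206.130"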

tcp        0      0 10.244.206.130:60000    10.244.137.3:993        ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:34040    10.244.121.132:80       ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:39396    10.244.137.3:993        ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:60616    10.244.121.167:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:40360    10.244.121.167:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:45916    10.244.136.218:80       ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:45370    10.244.121.176:3000     ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:40228    10.244.121.167:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:50210    10.244.192.137:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:50138    10.244.192.137:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:42434    10.244.192.137:993      ESTABLISHED 11326/traefik
tcp        0      0 10.244.206.130:40280    10.244.121.167:993      ESTABLISHED 11326/traefik

OK! So we may know the thing that is making connections and gumming up the works now! Traefik!?

A comment from the future: It’s not Traefik.

I took a look at the Traefik logs to see if anything would give it away, but nothing did. I also went back over the IngressRouteTCP I had set up to push everything on port 25 at Haraka:

---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: haraka-smtp
spec:
  entryPoints:
    - smtp
  routes:
    - kind: Rule
      match: HostSNI(`*`)
      services:
        - name: haraka
          port: 25

  • For some reason I thought it was necessary to have every single TCP connection coming in on 25 go to Haraka (which isn’t unreasonable if you think about it), since I wanted to make sure Traefik had the port
  • This is a bad setup because it rests on a misunderstanding of how Traefik manages ports – I don’t need to match on every possible SNI to capture everything that comes by
    • I think I set this up a long time ago – as long as Traefik has the entrypoint (in this case smtp) defined on the actual Traefik processes, there is no need to catch every single thing!

Here’s what that SNI looks like: traefik host SNI screenshot
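For context, the entrypoint itself lives in Traefik’s static configuration, not in the IngressRouteTCP – a minimal sketch of the idea (file name and layout are illustrative; only the smtp/:25 pair comes from this setup):

# Traefik static configuration (e.g. traefik.yml)
entryPoints:
  smtp:
    address: ":25"   # defining the entrypoint is what actually binds port 25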

Step Back: A simple check of SMTP with telnet

Let’s take a step back (and fall into another, smaller pit). Why not try to make a connection to the mail server via trusty telnet? From my home computer (hint) I get some worrying results:

$ telnet mail.nimbusws.com 25
Trying XXX.XXX.XXX.XXX...
Connection failed: No route to host
Trying YYY.YYY.YYY.YYY...
Connection failed: No route to host
Trying ZZZ.ZZZ.ZZZ.ZZZ...
Connection failed: No route to host
Trying AAA.AAA.AAA.AAA...
telnet: Unable to connect to remote host: Connection timed out

Well, that’s not good. Are the servers just completely unreachable? That can’t be the case, because my monitoring hasn’t said anything of the sort, and the sites I run are all still up and happily chugging along! Every once in a while I also get a timeout rather than a no route to host:

$ telnet mail.nimbusws.com 25
Trying XXX.XXX.XXX.XXX...
Connection failed: No route to host
Trying YYY.YYY.YYY.YYY...
Connection failed: Connection timed out
Trying ZZZ.ZZZ.ZZZ.ZZZ...

Very spooky behavior. Maybe maddy is the problem? I can’t imagine the instances are being crushed under load – the machines definitely seem fine and not under heavy load at all, so I can rule that out.

Resolution: Sometimes your ISP blocks connections on port 25

Turns out my home ISP here in Tokyo blocks outbound port 25. This is why I could do things like telnet mail.nimbusws.com 587 and it would work perfectly, yet a surprising no route to host appears for port 25. Using my cellphone network and tethering worked fine, as did having the actual nodes talk to each other.
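A quick way to tell ISP filtering apart from a genuinely dead server is to compare ports from the same machine – for example with netcat (the host is just the one from above, the timeout is arbitrary):

$ nc -vz -w 5 mail.nimbusws.com 587   # submission port: connects fine
$ nc -vz -w 5 mail.nimbusws.com 25    # SMTP: blocked/filtered by the ISP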

I also noticed that MXToolbox’s SMTP checking tool was timing out (but, crucially, resolving the route successfully), so it’s clear that the connection failures from my machine at home were down to my ISP. Speaking of MXToolbox, it was incredibly helpful in the process…

Step Back: Getting an outside opinion from MXToolbox

Here’s what the MXToolbox SMTP checking tool gave me as feedback:

MXToolbox SMTP check screenshot

If you click the “More Info” link about the SMTP transaction time that went wrong:

 More Information About Smtp Transaction Time
We were able to connect to your mail server on port 25, but the diagnostic session took longer than expected. This could indicate that your email server is under heavy load.

Here are the levels for which we issue warnings and alerts:

    Over 5 Seconds: Warning
    Over 8 Seconds: Alert

Having this warning on your lookup does not necessarily mean that you will have a problem receiving mail. It is simply a warning that things are less than optimal. This could end up causing users to experience delays with inbound mail.
Additional Information
Tarpitting

It is also possible your server is “Tar pitting.” Tar pitting is a technique used by some email servers to slow down spammers. The idea is that legitimate senders will wait longer to establish a connection than spammers will.

Here is a link to the TechNet article with more information about Tarpitting

Well that helps a lot – so it looks like some things are going right, but others are very much not. The 15.179-second transaction time makes me feel like the problem is DNS, but a quick check in the relevant pods (kubectl exec -it ....) was enough to confirm that lookups were not failing (ex. between Traefik and Haraka, or Haraka and Maddy). While I was there I ran some traceroutes with the fancy ICMP packets that everyone likes (and that are sometimes required). Anyway, reachability isn’t the problem. Here’s a grab bag of other things I checked and tried:

 - Connected via telnet from a different server (SSH in to server2, telnet to server1) so I could confirm reachability
 - Attempted removing the IngressRouteTCP that I had set up (I *do* need it, the only alternative would be to do a bunch of per-project ingress routes that lead to haraka)
 - Checked my IPs for spam/blocklist listings @ zen.spamhaus https://www.spamhaus.org/zen/
 - Tried tons of shit while keeping tcpdump open
   - I know for sure now that traefik is the thing doing the /metrics checks (https://github.com/traefik/traefik/issues/8096)
   - I could turn off the prometheus integration (might be a good project to make it actually a plugin instead?)
 - Tried a completely manual email send through telnet, which works (the email went through haraka to the maddy instance it was supposed to hit, and it was denied because of missing SPF etc., which is correct) – a bare-bones version of that transaction is sketched after this list
   - https://www.ndchost.com/wiki/mail/test-smtp-auth-telnet
   - https://practical365.com/how-to-send-email-via-telnet/
   - https://docs.microsoft.com/en-us/Exchange/mail-flow/test-smtp-with-telnet?view=exchserver-2019
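For reference, the manual send is just the standard SMTP conversation typed by hand. The lines below are only what the client types (addresses are ones already mentioned in this post); after the 220 banner each command should come back with a 250, DATA answers with 354, and the final "." gets the message queued:

$ telnet mail.nimbusws.com 25
HELO example.com
MAIL FROM:<someone@example.com>
RCPT TO:<admin@nomorepizzaparties.com>
DATA
Subject: manual test

hello from telnet
.
QUIT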

False solution #1: iptables is misconfigured?

My googling led me (astray) to a thread on CPanel’s forums. I started wondering if maybe my iptables was somehow misconfigured (which would be calico’s fault) – I’d already checked ufw and it was letting the right traffic through. Running iptables -L -n and looking for SMTP gave me this output:

root@node-6 ~ # iptables -L -n | grep ":25"
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:25

Nothing seemed wrong to me initially, but I thought maybe I might be missing a rule.
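A more thorough way to look than eyeballing iptables -L is to dump the full rule set (including all of the calico and kube-proxy chains) and grep for the port:

root@node-6 ~ # iptables-save | grep "dport 25"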

A comment from the future: I wasn’t missing a rule

The person in the thread who solved their issue was nice enough to state what the fix was:

This issue is solved.

iptables -D INPUT -p tcp --dport 25 -j DROP
iptables -D OUTPUT -p tcp --dport 25 -j DROP
service iptables save
service iptables restart

I don’t need the service stuff, but seeing as I was already accepting on 25, maybe I was missing an outgoing rule? Was it possible that SMTP connections could come in but my server couldn’t speak back?

There’s certainly no NetworkPolicy (administered by calico) set up to prevent outgoing SMTP on the k8s side, and the traffic is allowed in ufw, so I was skeptical. Anyway, after figuring out the equivalent commands (and making sure I wasn’t about to make things worse) I ran them, to no avail.

False Solution #2: It’s the fault of the overlapping IP addresses

After telneting to the Haraka on one of the nodes from another node (ex. telnet from a shell on node-2 to the haraka instance on node-6), I noticed that the normal server-side greeting WAS NOT coming back (surprise surprise – what these checks have all been telling me was right all along):

Looking at the Traefik logs:

time="2022-01-16T03:52:50Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:41098->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:10052: read: connection reset by peer"

OK, perfect – so Traefik has at least told me where the request tried to go:

$ k get pods -o=wide --all-namespaces | grep "10.244.192.183"
email                     haraka-8rgq8                                1/1     Running     0          17m     10.244.192.183    node-2.eu-central-1   <none>           <none>
totejo                    fathom-backups-27370080-rn8xs               0/1     Completed   0          27h     10.244.192.183    node-2.eu-central-1   <none>           <none>

ANOTHER overlapping IP? Could this be the reason it’s happening at all? Unfortunately that’s not the case, because not all nodes have this IP overlap, yet all of the nodes’ haraka instances are broken.
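Checking for these overlaps cluster-wide is easy enough – list every pod IP and look for duplicates (stock kubectl/jsonpath, nothing cluster-specific):

$ kubectl get pods -A -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | sort | uniq -d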

I need to stop using CronJobs and accelerate my move to sidecar containers (I’ve already started doing this a lot more with newer workloads – totejo is a very old project)… Anyway, after checking for more overlapping IPs from jobs, I looked at the output of Traefik:

$ k logs -f traefik-kqrqs -n ingress | grep error
time="2022-01-16T03:46:35Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:55552->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:52515: read: connection reset by peer"
time="2022-01-16T03:46:50Z" level=error msg="Error during connection: readfrom tcp 10.244.192.65:58256->10.244.206.185:25: read tcp AAA.AAA.AAA.AAA:25->YYY.YYY.YYY.YYY:34887: read: connection reset by peer"
time="2022-01-16T03:48:50Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:60514->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:39903: read: connection reset by peer"
time="2022-01-16T03:49:50Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:34476->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:22273: read: connection reset by peer"
time="2022-01-16T03:50:20Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:35566->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:24527: read: connection reset by peer"
time="2022-01-16T03:50:50Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:36666->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:42992: read: connection reset by peer"
time="2022-01-16T03:52:50Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:41098->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:10052: read: connection reset by peer"
time="2022-01-16T03:59:35Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:56172->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->YYY.YYY.YYY.YYY:43713: read: connection reset by peer"
time="2022-01-16T04:01:35Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:60560->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:39892: read: connection reset by peer"
time="2022-01-16T04:02:20Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:33960->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:20185: read: connection reset by peer"
time="2022-01-16T04:02:50Z" level=error msg="Error during connection: write tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:32566: write: connection reset by peer"
time="2022-01-16T04:03:05Z" level=error msg="Error during connection: readfrom tcp AAA.AAA.AAA.AAA:35628->10.244.192.183:25: read tcp AAA.AAA.AAA.AAA:25->XXX.XXX.XXX.XXX:22512: read: connection reset by peer"

This helped, I think… I do get the greeting, but it’s not fast, and I have to press <enter> or submit some input before I see the output from Haraka. It’s almost like Haraka is either intensely delayed or not flushing properly.

False Solution #3: It’s a Haraka bug that only triggers when there are multiple servers

Reading through some of the issues in the Haraka GitHub repo, I thought I’d found an issue that might explain the problem I was facing. The theory, prompted by this comment, was that Haraka was actually delaying the server greeting.

As I start to run out of ideas I get desperate and come up with some wild theories. What if:

  • Traefik is doing the /metrics requests (which seem to happen automatically if you set up an IngressRouteTCP)
  • Haraka is applying its early_talker_delay of 1000ms, increasing it every time
  • The requests come in from the same address every time (Traefik), so the delay just keeps getting larger and larger?!?
  • The wait gets so long that the connection gets reset??

Turns out this isn’t the right diagnosis or solution either – even setting the configurables related to punishing bad clients (ex. early_talker_delay=0) didn’t help.

Diversion: Pod QoS?

Another thing that bothered me was that I didn’t have any resource management in place (requests/limits) for the Haraka DaemonSet. Maybe I was getting hit by containers slowing down due to low QoS?

A quick look at the node and putting in some nice healthy limits also dispelled this notion pretty quickly.
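For reference, “nice healthy limits” here just means the usual requests/limits block on the DaemonSet’s container spec – the numbers below are illustrative, not what the cluster actually needed:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi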

Diversion: PROXY Protocol?

Maybe it’s an issue with PROXY protocol support? I hadn’t been using it at all up until now, so if Traefik was expecting it to be turned on for Haraka, it wasn’t. It was possible that I had updated either one and messed up the configuration. After digging into the changelogs just in case, I started reading up on the PROXY protocol v1 and v2 (there is not enough literature out there about the difference).

Turns out this wasn’t the solution either, but I did figure out how to set up PROXY protocol support for Haraka which is a good thing.

Solution: Prometheus is pinging Haraka because of bad configuration

Turns out Prometheus was misconfigured to actually try and scrape Haraka – it wasn’t the Traefik instance’s bundled Prometheus integration (I tried disabling it) at all! Here’s the bad Prometheus configuration (in a ConfigMap):

📜️ Prometheus config with the bad stuff commented out
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s

      external_labels:
        monitor: prm-monitor

    scrape_configs:
      - job_name: prometheus
        ## PROBLEM - this kubernetes_sd_config will try and hit endpoints on port 25 that it sees.
        # kubernetes_sd_configs:
        #   - role: endpoints
        # relabel_configs:
        #   - source_labels: [__meta_kubernetes_service_label_app]
        #     separator: ;
        #     regex: prometheus
        #     replacement: $1
        #     action: keep
        static_configs:
          - targets: ["localhost:9090"]

      - job_name: node_exporter
        static_configs:
          - targets: ["node-exporter:9100"]
        ## PROBLEM - this kubernetes_sd_config will try and hit endpoints on port 25 that it sees.
        # kubernetes_sd_configs:
        #   - role: endpoints
        #     namespaces:
        #       names:
        #         - monitoring
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoints_name]
            regex: 'node-exporter'
            action: keep
          - source_labels: [__meta_kubernetes_service_label_app]
            separator: ;
            regex: node-exporter
            replacement: $1
            action: keep
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
          - source_labels: [__meta_kubernetes_node_name]
            action: replace
            target_label: host
          - source_labels: [__meta_kubernetes_service_name]
            separator: ;
            regex: (.*)
            target_label: job
            replacement: ${1}
            action: replace

      - job_name: kube-dns
        static_configs:
          - targets: ["kube-dns.kube-system.svc.cluster.local:9153"]

      - job_name: node-local-dns
        static_configs:
          - targets: ["node-local-dns.kube-system.svc.cluster.local:9253"]
        ## PROBLEM - this kubernetes_sd_config will try and hit endpoints on port 25 that it sees.
        # kubernetes_sd_configs:
        #   - role: endpoints
        #     namespaces:
        #       names:
        #         - kube-system
        # relabel_configs:
        #   - source_labels: [__meta_kubernetes_service_label_k8s_app]
        #     regex: node-local-dns
        #     action: keep

      - job_name: cert-manager
        static_configs:
          - targets: ["cert-manager.cert-manager.svc.cluster.local:9402"]

      - job_name: kube-state-metrics
        static_configs:
          - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]

      - job_name: kubevirt
        static_configs:
          - targets: ["kubevirt-operator.kubevirt.svc.cluster.local:8443"]

      - job_name: traefik
        static_configs:
          - targets: ["traefik.ingress.svc.cluster.local:8080"]
        # kubernetes_sd_configs:
        #   - role: endpoints
        #     namespaces:
        #       names:
        #         - ingress
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app]
            regex: traefik
            action: keep

      # - job_name: linkerd_exporter
      #   scrape_interval: 15s
      #   metrics_path: /admin/metrics/prometheus
      #   static_configs:
      #     - targets: ["l5d.linkerd.svc.cluster.local:9990"]    

Once I commented out the stuff above (mostly falling back to static_configs, which aren’t the greatest but do work), Haraka could breathe again.
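If I ever want kubernetes_sd_configs back, the usual pattern is to make scraping opt-in instead of hitting every endpoint port that service discovery can see – here’s a sketch using the common prometheus.io/scrape annotation convention (the job name and annotation are the community convention, not something already in this cluster):

- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # keep only endpoints whose Service is annotated prometheus.io/scrape: "true",
    # so a bare SMTP port never becomes a scrape target by accident
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"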

So this whole issue was caused by Haraka just being overwhelmed by invalid requests from an overzealous Prometheus instance. Absolutely bonkers – someone give me my ~10 hours back.

Wrapup

Well, this wasn’t a fun way to spend a bunch of hours of my life, but I did learn a lot – I’ll probably need to do more to harden Haraka before using it in any kind of serious production environment; it got overwhelmed by just a few wrong requests far too easily and stopped delivering mail. Looks like it’s not very DDoS-safe, but I guess mail server DDoS is a bit less common.

Anyway, hopefully this investigation and some of the methods/different things I checked helps someone in the future. If you run into the exact same problem as me, definitely send me an email to commiserate.