Thanks to https://gerrit.wikimedia.org/r/c/operations/puppet/+/476393 we now have the DNS recursors in the TCP retransmits panel:
https://grafana.wikimedia.org/dashboard/db/network-performances-global?panelId=18&fullscreen&edit&tab=alert&orgId=1&from=now-30m&to=now
They all show a ~10% retransmit rate, which is pretty high.
Digging a bit more (on dns2001), it seems like the following pattern keeps repeating itself for every single TCP handshake on port 53:
```
No. Delta time Source Destination Src port Dst port Protocol Info
9 0.000084 208.80.153.69 208.80.153.77 36196 53 TCP 36196 → 53 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2560829592 TSecr=0 WS=512
10 0.000028 208.80.153.77 208.80.153.69 53 36196 TCP 53 → 36196 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1759664423 TSecr=2560829592 WS=512
11 0.000067 208.80.153.69 208.80.153.77 36196 53 TCP 36196 → 53 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=2560829592 TSecr=1759664423
13 0.989525 208.80.153.77 208.80.153.69 53 36196 TCP [TCP Retransmission] 53 → 36196 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1759664680 TSecr=2560829592 WS=512
14 0.000070 208.80.153.69 208.80.153.77 36196 53 TCP [TCP Dup ACK 11#1] 36196 → 53 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=2560829848 TSecr=1759664423
53 0.345973 208.80.153.77 208.80.153.69 53 36196 TCP 53 → 36196 [FIN, ACK] Seq=1 Ack=1 Win=29184 Len=0 TSval=1759665222 TSecr=2560829848
54 0.000232 208.80.153.69 208.80.153.77 36196 53 TCP 36196 → 53 [FIN, ACK] Seq=1 Ack=2 Win=29696 Len=0 TSval=2560830391 TSecr=1759665222
55 0.000035 208.80.153.77 208.80.153.69 53 36196 TCP 53 → 36196 [ACK] Seq=2 Ack=2 Win=29184 Len=0 TSval=1759665222 TSecr=2560830391
```
In the capture above it seems like:
# the LVS (208.80.153.69) starts the handshake (SYN) - No 9
# dns2001 replies with a SYN-ACK - No 10
# LVS sends the expected final ACK, the server receives it (as the capture is done on the dns2001 side), but never registers that ACK - No 11
# and thus, ~1s later, it sends another SYN-ACK (a retransmit) - No 13
# The LVS gets the 2nd SYN-ACK and replies with a (dup) ACK - No 14
# 1/3s later, I believe because no new packets have arrived, dns2001 ACKs the dup ACK while asking the LVS to close the session (ack# similar to the dup ACK) - No 53
# TCP session gets closed properly - No 54/55
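The sequence in the list above can also be checked mechanically. The following is a minimal Python sketch, not a tool used for this investigation: the `Packet` tuple and the `find_ignored_acks` helper are hypothetical names, and the packet list is transcribed by hand from the capture (flags only, no timing). It flags every flow where the server re-sends a SYN-ACK even though the client's ACK had already been captured.

```python
from collections import namedtuple

# Hypothetical record type; fields mirror the columns of the capture.
Packet = namedtuple("Packet", "no src dst sport dport flags")

LVS, DNS = "208.80.153.69", "208.80.153.77"

# Hand-transcribed from the capture above.
CAPTURE = [
    Packet(9, LVS, DNS, 36196, 53, "SYN"),
    Packet(10, DNS, LVS, 53, 36196, "SYN,ACK"),
    Packet(11, LVS, DNS, 36196, 53, "ACK"),
    Packet(13, DNS, LVS, 53, 36196, "SYN,ACK"),  # retransmission
    Packet(14, LVS, DNS, 36196, 53, "ACK"),      # dup ACK
    Packet(53, DNS, LVS, 53, 36196, "FIN,ACK"),
    Packet(54, LVS, DNS, 36196, 53, "FIN,ACK"),
    Packet(55, DNS, LVS, 53, 36196, "ACK"),
]


def find_ignored_acks(packets, server_port=53):
    """Return (ack_no, retransmit_no) pairs where the server sent a
    SYN-ACK again after the client's handshake ACK was already seen."""
    flows = {}  # (client ip, client port, server ip, server port) -> state
    hits = []
    for p in packets:
        if p.flags == "SYN,ACK" and p.sport == server_port:
            key = (p.dst, p.dport, p.src, p.sport)
            state = flows.setdefault(key, {"synack_seen": False, "ack_no": None})
            if state["ack_no"] is not None:
                # The ACK was captured, yet the SYN-ACK is sent again.
                hits.append((state["ack_no"], p.no))
            state["synack_seen"] = True
        elif p.flags == "ACK" and p.dport == server_port:
            key = (p.src, p.sport, p.dst, p.dport)
            state = flows.get(key)
            if state and state["synack_seen"] and state["ack_no"] is None:
                state["ack_no"] = p.no  # first ACK completing the handshake
    return hits


print(find_ignored_acks(CAPTURE))
```

Run against a longer export of the same capture, the number of hits per flow would give a rough count of how many handshakes follow this pattern.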
Note that a 15s capture doesn't show any real DNS traffic on port 53, only TCP handshakes, most likely health checks.
This doesn't seem to impact the health checks (no alarms).
If there is TCP DNS traffic, this could cause a delay of X seconds, where X >= 1, depending on how often the retransmits happen.
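To give a feel for how X could grow, here is a rough Python sketch. It assumes the SYN-ACK retransmission timer starts at 1 s (consistent with the ~1 s gap seen in the capture) and doubles on every retry; the doubling is an assumption, as the capture only shows a single retransmit. `cumulative_delay` is a hypothetical helper name.

```python
def cumulative_delay(retransmits, initial_rto=1.0, backoff=2.0):
    """Seconds elapsed before the n-th SYN-ACK retransmit is sent,
    assuming the timeout starts at initial_rto and is multiplied by
    backoff after each retransmit (both values are assumptions)."""
    delay, rto = 0.0, initial_rto
    for _ in range(retransmits):
        delay += rto
        rto *= backoff
    return delay


for n in range(1, 5):
    print(n, cumulative_delay(n))
```

Under those assumptions a single retransmit costs 1 s, and the delay grows quickly if the handshake ACK keeps being ignored.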
The question is why the server doesn't register the original ACK (No 11) that completes the handshake.