Thanks to https://gerrit.wikimedia.org/r/c/operations/puppet/+/476393 we now have the DNS recursors in the TCP retransmits panel.
They all show up with a ~10% retransmit rate, which is pretty high.
Digging a bit more (on dns2001), it seems like the following pattern keeps repeating itself for every single TCP handshake on port 53:
No.  Rel time  Source Mac         Source         Destination    Src port  Dest Mac           Dst port  Protocol  Info
9    0.000084  40:a8:f0:2c:66:e8  208.80.153.69  208.80.153.77  36196     d0:94:66:5f:6a:40  53        TCP       36196 → 53 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2560829592 TSecr=0 WS=512
10   0.000028  d0:94:66:5f:6a:40  208.80.153.77  208.80.153.69  53        40:a8:f0:2c:66:e8  36196     TCP       53 → 36196 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1759664423 TSecr=2560829592 WS=512
11   0.000067  40:a8:f0:2c:66:e8  208.80.153.69  208.80.153.77  36196     d0:94:66:5f:6a:40  53        TCP       36196 → 53 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=2560829592 TSecr=1759664423
13   0.989525  d0:94:66:5f:6a:40  208.80.153.77  208.80.153.69  53        40:a8:f0:2c:66:e8  36196     TCP       [TCP Retransmission] 53 → 36196 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1759664680 TSecr=2560829592 WS=512
14   0.000070  40:a8:f0:2c:66:e8  208.80.153.69  208.80.153.77  36196     d0:94:66:5f:6a:40  53        TCP       [TCP Dup ACK 11#1] 36196 → 53 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=2560829848 TSecr=1759664423
53   0.345973  d0:94:66:5f:6a:40  208.80.153.77  208.80.153.69  53        40:a8:f0:2c:66:e8  36196     TCP       53 → 36196 [FIN, ACK] Seq=1 Ack=1 Win=29184 Len=0 TSval=1759665222 TSecr=2560829848
54   0.000232  40:a8:f0:2c:66:e8  208.80.153.69  208.80.153.77  36196     d0:94:66:5f:6a:40  53        TCP       36196 → 53 [FIN, ACK] Seq=1 Ack=2 Win=29696 Len=0 TSval=2560830391 TSecr=1759665222
55   0.000035  d0:94:66:5f:6a:40  208.80.153.77  208.80.153.69  53        40:a8:f0:2c:66:e8  36196     TCP       53 → 36196 [ACK] Seq=2 Ack=2 Win=29184 Len=0 TSval=1759665222 TSecr=2560830391
In the capture above it seems like:
- the LVS (208.80.153.69) starts the handshake (SYN) - No 9
- dns2001 replies with a SYN-ACK - No 10
- the LVS sends the expected final ACK, which reaches the server (the capture was taken on the dns2001 side), but the server never registers that ACK - No 11
- and thus ~1s later, dns2001 sends another SYN-ACK (a retransmit) - No 13
- the LVS gets the second SYN-ACK and replies with a (dup) ACK - No 14
- ~1/3s later, I believe because no new packets have arrived, dns2001 acks the dup ACK while asking the LVS to close the session (FIN-ACK, with the same ack# as the dup ACK) - No 53
- the TCP session then gets closed properly - No 54/55
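The pattern described in the list above can be checked mechanically. A minimal sketch, assuming the eight packets of the flow are transcribed by hand from the capture into tuples (`Packet` and `find_ignored_acks` are hypothetical names, not part of any existing tool): it flags a flow in which the server repeats its SYN-ACK after the client's handshake ACK has already been captured, and reports the TSval difference between the two SYN-ACKs in timestamp ticks. The tick length is not stated in the capture, so no conversion to seconds is attempted.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    no: int      # packet number from the capture
    src: str     # source IP
    flags: str   # TCP flags as shown in the Info column
    tsval: int   # TCP timestamp value (TSval)

LVS = "208.80.153.69"
DNS = "208.80.153.77"

# Transcribed from the capture (flow 36196 -> 53).
CAPTURE = [
    Packet(9, LVS, "SYN", 2560829592),
    Packet(10, DNS, "SYN,ACK", 1759664423),
    Packet(11, LVS, "ACK", 2560829592),
    Packet(13, DNS, "SYN,ACK", 1759664680),
    Packet(14, LVS, "ACK", 2560829848),
    Packet(53, DNS, "FIN,ACK", 1759665222),
    Packet(54, LVS, "FIN,ACK", 2560830391),
    Packet(55, DNS, "ACK", 1759665222),
]

def find_ignored_acks(packets, server):
    """Return (first_synack, ack, repeated_synack) triples where the server
    repeats its SYN-ACK after the client's ACK was already captured."""
    hits = []
    first_synack = None
    ack = None
    for p in packets:
        if p.src == server and p.flags == "SYN,ACK":
            if first_synack is None:
                first_synack = p
            elif ack is not None:
                hits.append((first_synack, ack, p))
        elif (p.src != server and p.flags == "ACK"
              and first_synack is not None and ack is None):
            # First bare ACK from the client after the SYN-ACK.
            ack = p
    return hits

for synack, ack, repeat in find_ignored_acks(CAPTURE, DNS):
    ticks = repeat.tsval - synack.tsval
    print(f"SYN-ACK No {synack.no} repeated as No {repeat.no} "
          f"after ACK No {ack.no} ({ticks} TSval ticks later)")
```

The same check could be run over every flow in a longer capture to see whether all handshakes follow this pattern or only a fraction of them.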
Note that a 15s capture doesn't show any real DNS traffic on port 53, only TCP handshakes, most likely health checks.
This doesn't seem to impact the health checks (no alarms).
If there is TCP DNS traffic, this could cause a delay of X seconds, where X >= 1, depending on how often the retransmits happen.
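To put rough numbers on X, here is a sketch of the cumulative delay, assuming the retransmission timer starts at 1s (consistent with the ~1s gap seen in the capture) and doubles on each retry, as in the RFC 6298 initial RTO and backoff. The actual timer settings on dns2001 were not checked, and `INITIAL_RTO` and `cumulative_delay` are illustrative names.

```python
INITIAL_RTO = 1.0  # seconds; assumed initial retransmission timeout

def cumulative_delay(retransmits: int, initial_rto: float = INITIAL_RTO) -> float:
    """Total wait until the Nth retransmission fires, with the timeout
    doubling after every retry (exponential backoff)."""
    return sum(initial_rto * 2 ** i for i in range(retransmits))

for n in range(1, 4):
    print(f"{n} retransmit(s): {cumulative_delay(n)}s")
```

Under these assumptions, one retransmit costs 1s, two cost 3s and three cost 7s.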
The question is: why doesn't the server register the original ACK (the final step of the three-way handshake)?
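One possible place to look is the kernel's TCP counters on dns2001. A minimal sketch, assuming a Linux host where `/proc/net/netstat` lists each section as a line of counter names followed by a line of values: `parse_netstat` is a hypothetical helper, the sample text is made up for illustration, and the counter names shown (`ListenOverflows`, `ListenDrops`, `TCPDeferAcceptDrop`) are only examples of listen-socket counters that may or may not be relevant here.

```python
def parse_netstat(text: str, section: str = "TcpExt") -> dict:
    """Parse the name/value line pairs of one /proc/net/netstat section."""
    prefix = section + ":"
    rows = [line.split() for line in text.splitlines()
            if line.startswith(prefix)]
    counters = {}
    # Lines come in pairs: first the counter names, then their values.
    for names, values in zip(rows[0::2], rows[1::2]):
        counters.update(zip(names[1:], (int(v) for v in values[1:])))
    return counters

# Hypothetical sample in the /proc/net/netstat layout; values are made up.
SAMPLE = (
    "TcpExt: ListenOverflows ListenDrops TCPDeferAcceptDrop\n"
    "TcpExt: 1 2 3\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 10 20\n"
)

counters = parse_netstat(SAMPLE)
for name in ("ListenOverflows", "ListenDrops", "TCPDeferAcceptDrop"):
    # .get() because not every kernel exposes every counter.
    print(name, counters.get(name))
```

On the host itself, the text would come from reading `/proc/net/netstat`. Comparing two readings taken a few seconds apart would show which counters, if any, move in step with the health-check handshakes.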