==== Context ====
prometheus1005 / prometheus1006 are the same hardware and run the same software, relevant to this case prometheus-blackbox-exporter to send out probes to the nextwork and export metrics about the results.
==== Issue ====
I noticed icmp probes, sent out every 15s, for inference.svc.eqiad.wmnet / 10.2.2.63 are successful from prometheus1005 but not prometheus1006. Note the issue seems to be present for other services too (see below)
prometheus1005
```
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=19198 id=11532
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Found matching reply packet"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe succeeded" duration_seconds=0.000960318
```
prometheus1006
```
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=29518 id=1033
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets"
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Cannot get TTL from the received packet. 'probe_icmp_reply_hop_limit' will be missing."
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500485762
```
tcpdump confirms prometheus1006 isn't getting any echo reply from prometheus1006
```
root@prometheus1006:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:20:57.243480 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30661, length 36
10:21:12.244217 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30746, length 36
10:21:27.244022 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30835, length 36
10:21:42.243385 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30926, length 36
10:21:57.243338 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 31013, length 36
5 packets captured
8 packets received by filter
0 packets dropped by kernel
```
But echo replies are working as expected on prometheus1005
```
root@prometheus1005:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:21:04.930382 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21310, length 36
10:21:04.930461 eno1 In IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21310, length 36
10:21:19.930291 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21400, length 36
10:21:19.930391 eno1 In IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21400, length 36
10:21:34.929949 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21488, length 36
5 packets captured
19 packets received by filter
0 packets dropped by kernel
```
Fair enough, the bizarre thing for me is that `ping` from both hosts works as expected
```
root@prometheus1006:~# ping -c3 inference.svc.eqiad.wmnet
PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data.
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.300 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.119 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.167 ms
--- inference.svc.eqiad.wmnet ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2023ms
rtt min/avg/max/mdev = 0.119/0.195/0.300/0.076 ms
```
```
root@prometheus1005:~# ping -c3 inference.svc.eqiad.wmnet
PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data.
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.095 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.166 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.105 ms
--- inference.svc.eqiad.wmnet ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.095/0.122/0.166/0.031 ms
```
cc #netops since I've been scratching my head on this.
Looking for icmp errors it seems other services are affected too (all from prometheus1006, prometheus1005 doesn't see errors AFAICT)
```
root@prometheus1006:~# journalctl -u prometheus-blackbox-exporter.service -f | grep -i 'icmp_.*probe failed' | cut -d' ' -f6-
ts=2022-02-22T10:44:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500162682
ts=2022-02-22T10:44:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500169632
ts=2022-02-22T10:44:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50001914
ts=2022-02-22T10:44:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5001338889999998
ts=2022-02-22T10:44:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500326283
ts=2022-02-22T10:44:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50023924
ts=2022-02-22T10:44:30.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500238693
ts=2022-02-22T10:44:30.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500353869
ts=2022-02-22T10:44:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500226456
ts=2022-02-22T10:44:44.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500232199
ts=2022-02-22T10:44:45.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50027115
ts=2022-02-22T10:44:45.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500333055
ts=2022-02-22T10:44:59.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500364557
ts=2022-02-22T10:44:59.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50027006
ts=2022-02-22T10:45:00.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500361923
ts=2022-02-22T10:45:00.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5002428930000002
ts=2022-02-22T10:45:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500324432
ts=2022-02-22T10:45:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500188895
ts=2022-02-22T10:45:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500318214
ts=2022-02-22T10:45:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500392832
ts=2022-02-22T10:45:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.5002876990000003
ts=2022-02-22T10:45:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500184135
ts=2022-02-22T10:45:30.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.5000965
```
And tcpdump confirms, no replies (also tegola-vector-tiles doesn't seem to have a reverse DNS entry)
```
root@prometheus1006:~# tcpdump -i any 'host (10.2.2.63 or 10.2.2.60 or 10.2.2.12) and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:52:42.243798 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41838, length 36
10:52:42.253280 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41839, length 36
10:52:43.175895 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41846, length 36
10:52:57.243273 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41925, length 36
10:52:57.252748 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41926, length 36
10:52:58.175672 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41933, length 36
10:53:12.243682 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42010, length 36
10:53:12.253333 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42011, length 36
10:53:13.175590 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42018, length 36
10:53:27.243694 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42099, length 36
10:53:27.253190 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42100, length 36
10:53:28.176160 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42107, length 36
```