Context
prometheus1005 / prometheus1006 are the same hardware and run the same software, relevant to this case prometheus-blackbox-exporter to send out probes to the network and export metrics about the results. All services in service::catalog are probed for icmp and http replies.
Issue
I noticed icmp probes, sent out every 15s, for inference.svc.eqiad.wmnet / 10.2.2.63 (and a few others, see at the bottom of the description) are successful from prometheus1005 but not prometheus1006. Note the issue seems to be present for other (but not all) services too (see below)
prometheus1005
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5 ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4 ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63 ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket" ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied" ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=19198 id=11532 ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet" ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets" ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Found matching reply packet" ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe succeeded" duration_seconds=0.000960318
prometheus1006
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5 ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4 ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63 ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket" ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied" ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=29518 id=1033 ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet" ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets" ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Cannot get TTL from the received packet. 'probe_icmp_reply_hop_limit' will be missing." ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500485762
tcpdump confirms prometheus1006 isn't getting any echo reply from 10.2.2.63 / inference.svc.eqiad.wmnet
root@prometheus1006:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp' tcpdump: data link type LINUX_SLL2 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes 10:20:57.243480 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30661, length 36 10:21:12.244217 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30746, length 36 10:21:27.244022 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30835, length 36 10:21:42.243385 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30926, length 36 10:21:57.243338 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 31013, length 36 5 packets captured 8 packets received by filter 0 packets dropped by kernel
But echo replies are working as expected on prometheus1005
root@prometheus1005:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp' tcpdump: data link type LINUX_SLL2 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes 10:21:04.930382 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21310, length 36 10:21:04.930461 eno1 In IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21310, length 36 10:21:19.930291 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21400, length 36 10:21:19.930391 eno1 In IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21400, length 36 10:21:34.929949 eno1 Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21488, length 36 5 packets captured 19 packets received by filter 0 packets dropped by kernel
Fair enough, the bizarre thing for me is that ping from both hosts works as expected
root@prometheus1006:~# ping -c3 inference.svc.eqiad.wmnet PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data. 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.300 ms 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.119 ms 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.167 ms --- inference.svc.eqiad.wmnet ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2023ms rtt min/avg/max/mdev = 0.119/0.195/0.300/0.076 ms
root@prometheus1005:~# ping -c3 inference.svc.eqiad.wmnet PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data. 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.095 ms 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.166 ms 64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.105 ms --- inference.svc.eqiad.wmnet ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2002ms rtt min/avg/max/mdev = 0.095/0.122/0.166/0.031 ms
cc netops since I've been scratching my head on this.
Looking for icmp errors it seems other services are affected too (all from prometheus1006, prometheus1005 doesn't see errors AFAICT)
root@prometheus1006:~# journalctl -u prometheus-blackbox-exporter.service -f | grep -i 'icmp_.*probe failed' | cut -d' ' -f6- ts=2022-02-22T10:44:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500162682 ts=2022-02-22T10:44:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500169632 ts=2022-02-22T10:44:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50001914 ts=2022-02-22T10:44:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5001338889999998 ts=2022-02-22T10:44:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500326283 ts=2022-02-22T10:44:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50023924 ts=2022-02-22T10:44:30.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500238693 ts=2022-02-22T10:44:30.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500353869 ts=2022-02-22T10:44:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500226456 ts=2022-02-22T10:44:44.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500232199 ts=2022-02-22T10:44:45.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50027115 ts=2022-02-22T10:44:45.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500333055 ts=2022-02-22T10:44:59.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500364557 ts=2022-02-22T10:44:59.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50027006 ts=2022-02-22T10:45:00.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500361923 ts=2022-02-22T10:45:00.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5002428930000002 ts=2022-02-22T10:45:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500324432 ts=2022-02-22T10:45:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500188895 ts=2022-02-22T10:45:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500318214 ts=2022-02-22T10:45:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500392832 ts=2022-02-22T10:45:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.5002876990000003 ts=2022-02-22T10:45:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500184135 ts=2022-02-22T10:45:30.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.5000965
And tcpdump confirms, no replies (also tegola-vector-tiles doesn't seem to have a reverse DNS entry)
root@prometheus1006:~# tcpdump -i any 'host (10.2.2.63 or 10.2.2.60 or 10.2.2.12) and icmp' tcpdump: data link type LINUX_SLL2 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes 10:52:42.243798 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41838, length 36 10:52:42.253280 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41839, length 36 10:52:43.175895 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41846, length 36 10:52:57.243273 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41925, length 36 10:52:57.252748 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41926, length 36 10:52:58.175672 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41933, length 36 10:53:12.243682 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42010, length 36 10:53:12.253333 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42011, length 36 10:53:13.175590 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42018, length 36 10:53:27.243694 eno1 Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42099, length 36 10:53:27.253190 eno1 Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42100, length 36 10:53:28.176160 eno1 Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42107, length 36