
blackbox-exporter no icmp replies on prometheus1006 for a few services
Closed, Resolved · Public

Description

Context

prometheus1005 / prometheus1006 are the same hardware and run the same software; relevant to this case is prometheus-blackbox-exporter, which sends probes out to the network and exports metrics about the results. All services in service::catalog are probed for icmp and http replies.
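
For context, the probe behaviour (module name, prober, timeout, IP protocol) is defined in the exporter's blackbox.yml modules. The icmp modules used here presumably look roughly like the sketch below; this is not the exact puppet-generated config, only the module name and the 2.5s timeout are taken from the logs further down.

modules:
  icmp_inference_ip4:
    prober: icmp
    timeout: 2.5s
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false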

Issue

I noticed icmp probes, sent out every 15s, for inference.svc.eqiad.wmnet / 10.2.2.63 (and a few others, see the bottom of the description) are successful from prometheus1005 but not from prometheus1006. Note the issue seems to be present for other (but not all) services too (see below).

prometheus1005

ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63
ts=2022-02-22T10:15:04.929Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=19198 id=11532
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Found matching reply packet"
ts=2022-02-22T10:15:04.930Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe succeeded" duration_seconds=0.000960318

prometheus1006

ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Beginning probe" probe=icmp timeout_seconds=2.5
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolving target address" ip_protocol=ip4
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Resolved target address" ip=10.2.2.63
ts=2022-02-22T10:17:42.242Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating socket"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Creating ICMP packet" seq=29518 id=1033
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Writing out packet"
ts=2022-02-22T10:17:42.243Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Waiting for reply packets"
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Cannot get TTL from the received packet. 'probe_icmp_reply_hop_limit' will be missing."
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"
ts=2022-02-22T10:17:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500485762

tcpdump confirms prometheus1006 isn't getting any echo reply from 10.2.2.63 / inference.svc.eqiad.wmnet

root@prometheus1006:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:20:57.243480 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30661, length 36
10:21:12.244217 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30746, length 36
10:21:27.244022 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30835, length 36
10:21:42.243385 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 30926, length 36
10:21:57.243338 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 31013, length 36
5 packets captured
8 packets received by filter
0 packets dropped by kernel

But echo replies are working as expected on prometheus1005

root@prometheus1005:~# tcpdump -c 5 -i any 'host 10.2.2.63 and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:21:04.930382 eno1  Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21310, length 36
10:21:04.930461 eno1  In  IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21310, length 36
10:21:19.930291 eno1  Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21400, length 36
10:21:19.930391 eno1  In  IP inference.svc.eqiad.wmnet > prometheus1005.eqiad.wmnet: ICMP echo reply, id 11532, seq 21400, length 36
10:21:34.929949 eno1  Out IP prometheus1005.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 11532, seq 21488, length 36
5 packets captured
19 packets received by filter
0 packets dropped by kernel

Fair enough; the bizarre thing for me is that ping from both hosts works as expected:

root@prometheus1006:~# ping -c3 inference.svc.eqiad.wmnet
PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data.
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.300 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.119 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.167 ms

--- inference.svc.eqiad.wmnet ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2023ms
rtt min/avg/max/mdev = 0.119/0.195/0.300/0.076 ms
root@prometheus1005:~# ping -c3 inference.svc.eqiad.wmnet
PING inference.svc.eqiad.wmnet (10.2.2.63) 56(84) bytes of data.
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=1 ttl=64 time=0.095 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=2 ttl=64 time=0.166 ms
64 bytes from inference.svc.eqiad.wmnet (10.2.2.63): icmp_seq=3 ttl=64 time=0.105 ms

--- inference.svc.eqiad.wmnet ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.095/0.122/0.166/0.031 ms

cc netops since I've been scratching my head on this.

Looking for icmp errors, it seems other services are affected too (all from prometheus1006; prometheus1005 doesn't see errors AFAICT):

root@prometheus1006:~# journalctl -u prometheus-blackbox-exporter.service -f  | grep -i 'icmp_.*probe failed' | cut -d' ' -f6-
ts=2022-02-22T10:44:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500162682
ts=2022-02-22T10:44:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500169632
ts=2022-02-22T10:44:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50001914
ts=2022-02-22T10:44:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5001338889999998
ts=2022-02-22T10:44:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500326283
ts=2022-02-22T10:44:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50023924
ts=2022-02-22T10:44:30.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500238693
ts=2022-02-22T10:44:30.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500353869
ts=2022-02-22T10:44:44.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500226456
ts=2022-02-22T10:44:44.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500232199
ts=2022-02-22T10:44:45.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.50027115
ts=2022-02-22T10:44:45.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500333055
ts=2022-02-22T10:44:59.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500364557
ts=2022-02-22T10:44:59.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.50027006
ts=2022-02-22T10:45:00.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500361923
ts=2022-02-22T10:45:00.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5002428930000002
ts=2022-02-22T10:45:14.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.500324432
ts=2022-02-22T10:45:14.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500188895
ts=2022-02-22T10:45:15.486Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.500318214
ts=2022-02-22T10:45:15.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500392832
ts=2022-02-22T10:45:29.743Z caller=main.go:169 module=icmp_inference_ip4 target=10.2.2.63 level=debug msg="Probe failed" duration_seconds=2.5002876990000003
ts=2022-02-22T10:45:29.752Z caller=main.go:169 module=icmp_aqs_ip4 target=10.2.2.12 level=debug msg="Probe failed" duration_seconds=2.500184135
ts=2022-02-22T10:45:30.485Z caller=main.go:169 module=icmp_wdqs-ssl_ip4 target=10.2.2.32 level=debug msg="Probe failed" duration_seconds=2.5000965

And tcpdump confirms: no replies (also, tegola-vector-tiles doesn't seem to have a reverse DNS entry):

root@prometheus1006:~# tcpdump  -i any 'host (10.2.2.63 or 10.2.2.60 or 10.2.2.12) and icmp'
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10:52:42.243798 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41838, length 36
10:52:42.253280 eno1  Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41839, length 36
10:52:43.175895 eno1  Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41846, length 36
10:52:57.243273 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41925, length 36
10:52:57.252748 eno1  Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 41926, length 36
10:52:58.175672 eno1  Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 41933, length 36
10:53:12.243682 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42010, length 36
10:53:12.253333 eno1  Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42011, length 36
10:53:13.175590 eno1  Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42018, length 36
10:53:27.243694 eno1  Out IP prometheus1006.eqiad.wmnet > inference.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42099, length 36
10:53:27.253190 eno1  Out IP prometheus1006.eqiad.wmnet > aqs.svc.eqiad.wmnet: ICMP echo request, id 1033, seq 42100, length 36
10:53:28.176160 eno1  Out IP prometheus1006.eqiad.wmnet > 10.2.2.60: ICMP echo request, id 1033, seq 42107, length 36

Event Timeline

fgiunchedi renamed this task from "blackbox-exporter no icmp replies on prometheus1006 from inference.svc.eqiad.wmnet" to "blackbox-exporter no icmp replies on prometheus1006 for a few services". Feb 22 2022, 10:47 AM
fgiunchedi updated the task description.

@fgiunchedi Hey, also struggling somewhat with this.

That IP is currently being handled by lvs1019, which is directly connected to the private networks prometheus1005/1006 are on:

cmooney@lvs1019:~$ ip route get 10.64.0.82
10.64.0.82 dev eno2np1.1017 src 10.64.1.19 uid 31721
cmooney@lvs1019:~$ ip route get 10.64.16.62
10.64.16.62 dev ens1f0np0.1018 src 10.64.17.19 uid 31721

The LVS has ARP/ND entries for both:

cmooney@lvs1019:~$ ip neigh show | egrep 10.64.0.82\|10.64.16.62
10.64.16.62 dev ens1f0np0.1018 lladdr b0:4f:13:b6:9d:46 REACHABLE
10.64.0.82 dev eno2np1.1017 lladdr b0:4f:13:b6:9d:b6 REACHABLE
2620:0:861:102:10:64:16:62 dev ens1f0np0.1018 lladdr b0:4f:13:b6:9d:46 REACHABLE
2620:0:861:101:10:64:0:82 dev eno2np1.1017 lladdr b0:4f:13:b6:9d:b6 STALE

The two prometheus hosts are pingable from the LVS:

cmooney@lvs1019:~$ ping -c 2 -I 10.2.2.63 10.64.0.82
PING 10.64.0.82 (10.64.0.82) from 10.2.2.63 : 56(84) bytes of data.
64 bytes from 10.64.0.82: icmp_seq=1 ttl=63 time=0.121 ms
64 bytes from 10.64.0.82: icmp_seq=2 ttl=63 time=0.098 ms
cmooney@lvs1019:~$ ping -c 2 -I 10.2.2.63 10.64.16.62
PING 10.64.16.62 (10.64.16.62) from 10.2.2.63 : 56(84) bytes of data.
64 bytes from 10.64.16.62: icmp_seq=1 ttl=63 time=0.189 ms
64 bytes from 10.64.16.62: icmp_seq=2 ttl=63 time=0.186 ms

In both cases the traffic is asymmetric. As the destination IP is outside the local subnet, the Prometheus hosts forward it via the active VRRP gateway on their Vlan (eqiad core routers), which sends it to lvs1019 on its primary interface (where it announces the 10.2.2.63 VIP to the CRs via BGP). As the LVS is directly connected to the private subnets the source traffic comes from, the return traffic goes directly back to the Prometheus hosts on those other interfaces, and not via the CR like the request. Nevertheless this is common to both Prometheus servers, and I don't see any iptables rules or sysctl parameters (rp_filter) on them that might prevent it from working.
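
(For reference, the kind of checks I mean are along these lines; illustrative commands only, nothing suspicious stood out:)

root@prometheus1006:~# sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eno1.rp_filter
root@prometheus1006:~# iptables -S | grep -i icmp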

Looking closely at a PCAP on lvs1019 I can't really see much difference between the packets it does respond to (manual pings) and those it doesn't (Prometheus blackbox exporter pings), i.e. the 1st request below vs the 2nd:

cmooney@lvs1019:~$ sudo tcpdump -i any -e -l -nn host 10.64.16.62 and icmp and host 10.2.2.63
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
11:51:42.244188  In 5c:5e:ab:3d:87:c3 ethertype IPv4 (0x0800), length 72: 10.64.16.62 > 10.2.2.63: ICMP echo request, id 1033, seq 62606, length 36
11:51:45.823453  In 5c:5e:ab:3d:87:c3 ethertype IPv4 (0x0800), length 100: 10.64.16.62 > 10.2.2.63: ICMP echo request, id 56410, seq 1, length 64
11:51:45.823463 Out e4:3d:1a:79:15:70 ethertype IPv4 (0x0800), length 100: 10.2.2.63 > 10.64.16.62: ICMP echo reply, id 56410, seq 1, length 64

I do notice the ICMP IDs (BE/LE) are always the same in the packets the blackbox exporter creates, whereas they are different each time when using the ping utility. I wonder whether lvs1019 could have added some sort of dynamic block matching these fields? It doesn't seem likely and I can't see where/how it would do that, but I really am scratching my head here. At the network level I can't see any difference between the manual and the Prometheus-generated echo requests, yet lvs1019 only responds to the manual one.

Capture also containing requests from Prometheus1005 (10.64.0.82) which do get responded to:

Mentioned in SAL (#wikimedia-operations) [2022-02-22T12:47:28Z] <godog> bounce prometheus-blackbox-exporter on prometheus1006 - T302265

Capture of requests directly on primary interface, to get full Ethernet header:

Can't really see any difference between these packets other than the addresses and the IDs (which we expect to be different). No idea why lvs1019 is responding to one and not the other.

Mentioned in SAL (#wikimedia-operations) [2022-02-22T13:32:31Z] <godog> bounce prometheus-blackbox-exporter on prometheus1005 - T302265

I've tried reducing the workload on 1006 to test the theory that icmp rate limits are somehow involved (why 1005 doesn't seem to be affected would still need an explanation). Indeed, after roughly halving the probes on 1006 (but leaving the problematic ones in) I'm now getting failures for tegola-vector-tiles only (the graphite failures are expected; they show up on 1005 too, and graphite isn't behind LVS):

ts=2022-02-22T15:59:30.676Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500169357
ts=2022-02-22T15:59:45.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500225818
ts=2022-02-22T16:00:00.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500337676
ts=2022-02-22T16:00:14.650Z caller=main.go:169 module=icmp_graphite_ip4 target=10.64.16.149 level=debug msg="Probe failed" duration_seconds=2.500357998
ts=2022-02-22T16:00:15.676Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.5003188830000003
ts=2022-02-22T16:00:30.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500210864
ts=2022-02-22T16:00:45.676Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500325337
ts=2022-02-22T16:01:00.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500380624
ts=2022-02-22T16:01:14.651Z caller=main.go:169 module=icmp_graphite_ip4 target=10.64.16.149 level=debug msg="Probe failed" duration_seconds=2.500463647
ts=2022-02-22T16:01:15.676Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500409827
ts=2022-02-22T16:01:30.675Z caller=main.go:169 module=icmp_tegola-vector-tiles_ip4 target=10.2.2.60 level=debug msg="Probe failed" duration_seconds=2.500418802

Following the "something to do with icmp rate limits" lead (the scrape-interval knob involved is sketched after the list) I have:

  • temporarily decreased the frequency of probes from 1/15s to 1/30s on 1006 and the failures stopped
  • restored the default frequency of 1/15s on 1006, stopped blackbox on 1005 altogether and the failures still show up
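
(The knob being changed above is the scrape interval of the blackbox icmp job on the Prometheus side. A minimal sketch of what such a job looks like follows; the job name and target list are illustrative, not the exact puppet-managed config:)

scrape_configs:
  - job_name: probes_icmp
    metrics_path: /probe
    params:
      module: [icmp_inference_ip4]
    scrape_interval: 15s   # temporarily raised to 30s for this test
    static_configs:
      - targets: ['10.2.2.63']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # local blackbox exporter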

I don't have time to dive too deep, but consider that there's also a ping-offload service (a netops thing that re-routes some common "ping en.wikipedia.org" traffic away from LVS to a dedicated ping-offload response box), and I think the ping-offload destination is a VM which may be in the same ganeti cluster as the prometheus instances being compared. So it's possible there's a VM- or ganeti-host-level icmp ratelimit in effect, and the prometheus node in question could be sharing that limit with the ping-offload VM?

@BBlack thanks for the input. We've validated that our ping-offload is not involved here; the requests are sent to lvs1019.

For whatever reason lvs1019 is not responding to certain ICMP echos. Why, we're not sure; we've guessed it may be the kernel ICMP rate limiting on it, but that's pure speculation.

What I mean is looking at a different layer of the ping-offload angle: the co-location of the ping-offload and prometheus VMs within eqiad ganeti may cause a lot more ICMP traffic to move through some ganeti10xx hosts than others, which may explain the observed behaviour difference between prometheus1005 and prometheus1006 (e.g. based on kernel icmp ratelimiting at the ganeti-host layer?).

But yes, the lvses also have those ratelimiters, which is probably also a factor!

Stepping out another layer: *why* is prom100x pinging a service IP, and to what end? Are we doing this for all service IPs?

@BBlack ok I understand where you're coming from.

We didn't see any of the prometheus vm transmitted pings fail to make it to lvs1019, so for now I think we can focus there.

To the wider question, Observability would have the full details, but from what I understand from earlier the answer is "yes": the Prometheus VMs are configured to ping all the service IPs regularly.

The relevant settings on the LVSes are in modules/lvs/manifests/kernel_config.pp:

# Add Echo Reply, Timestamp Reply, Info Reply, Address Mask Reply
# to the default rate limit bitmask. For the definition of the
# bitmask, see:
# https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
'net.ipv4.icmp_ratemask'          => 350233, # 1010101100000011001
# Lower rate limit, as the default of 1000ms is way too large
'net.ipv4.icmp_ratelimit'         => 200,
# Bump the maximal number of ICMP packets sent per second from this
# host from 1000 to 3000. Some of our load balancers receive more
# than 1000 ICMP echo requests per second as documented in:
# https://phabricator.wikimedia.org/T163312#3193182
'net.ipv4.icmp_msgs_per_sec'      => 3000,

The ratemask determines which message types are affected. The comment mentions the 4 types we've added, and those are in addition to the 4 that are enabled by default (Dest Unreach, Source Quench, Time Exceeded, and Parameter Problem).
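
(For reference, a breakdown of how that mask value decomposes, using the bit positions listed in the ip-sysctl documentation; worth double-checking against the kernel docs for the running version:)

default mask       = bits 3,4,11,12   (Dest Unreach, Source Quench, Time Exceeded, Parameter Problem)
                   = 8 + 16 + 2048 + 4096 = 6168
added reply types  = bits 0,14,16,18  (Echo Reply, Timestamp Reply, Info Reply, Address Mask Reply)
                   = 1 + 16384 + 65536 + 262144 = 344065
total              = 6168 + 344065 = 350233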

The msgs_per_sec param of 3000 is a global send rate limit: the whole LVS box can't generate more than this rate of outbound ICMP of all of these types, total.

The ratelimit sounds similar, but the difference is that it's per-target. The 200ms value here means at most 5 packets per second, per-target. A single prometheus box would qualify as a target for this purpose, and this is likely the limit you're running into.

Stepping back out to the broader question again though: I get why we normally have things like icinga "ping" checks for hosts - they let you know whether it's a host-level problem or something smaller with a particular service or ferm rule. In the case of virtual service IPs that are L4-routed through LVS, though, these virtual service IPs are not hosts, don't live in any particular place, and don't really offer/guarantee "unlimited pings" as an intentionally-supported service either. I question whether a "ping" check on them has any actual utility (vs sticking to just L4 checks on the actual service port/protocol).

The ratelimit sounds similar, but the difference is that it's per-target. The 200ms value here means at most 5 packets per second, per-target. A single prometheus box would qualify as a target for this purpose, and this is likely the limit you're running into.

Really good info, Brandon, thanks.

I question whether a "ping" check on them has any actual utility (vs sticking to just L4 checks on the actual service port/protocol).

Fair point, we can discuss this also.

Prometheus doesn't run on VMs in eqiad/codfw (not sure if this fact was discovered already)

Stepping back out to the broader question again though: I get why we normally have things like icinga "ping" checks for hosts - they let you know whether it's a host-level problem or something smaller with a particular service or ferm rule. In the case of virtual service IPs that are L4-routed through LVS, though, these virtual service IPs are not hosts, don't live in any particular place, and don't really offer/guarantee "unlimited pings" as an intentionally-supported service either. I question whether a "ping" check on them has any actual utility (vs sticking to just L4 checks on the actual service port/protocol).

The current pings from the prometheus hosts are the network probes I'm working on (and that you and I discussed); they mimic what Icinga does, although at a greater frequency (i.e. we have service IPs as hosts in Icinga and they get pinged, but only every 30s effectively by Icinga (once a minute by each of two hosts) versus every ~7s effectively by Prometheus (two hosts, every 15s each)).

I'm happy to discuss (and tend to agree on) whether checking L7/L4 is enough in this case, though IMHO not in this task, to avoid scope creep (in this task's parent or via other means). I'm not sure what you are suggesting with ""unlimited pings" as an intentionally-supported service", since AFAIK that is no one's expectation nor plan.

As far as this task goes, to me it still remains a mystery why prometheus1005 doesn't seem to experience rate limits (as seen by the blackbox exporter) whereas prometheus1006 does.

As far as this task goes, to me it still remains a mystery why prometheus1005 doesn't seem to experience rate limits (as seen by the blackbox exporter) whereas prometheus1006 does.

This is the real mystery, I guess. The kernel ICMP rate limits are done with a separate token bucket for each destination address (so in our case the source of the echo requests, i.e. the two Prometheus hosts). I dug into it a bit but I haven't seen any mechanism to display the kernel's current table of IPs/bucket sizes.

But yeah, it's a bit mysterious that one source would have this problem and the other doesn't (assuming they are sending the same number of pings/sec). It's also strange that manually-initiated pings from prometheus1006 are responded to, while pings generated in the same time window by the blackbox exporter process are ignored.

The current pings from the prometheus hosts are the network probes I'm working on (and that you and I discussed); they mimic what Icinga does, although at a greater frequency (i.e. we have service IPs as hosts in Icinga and they get pinged, but only every 30s effectively by Icinga (once a minute by each of two hosts) versus every ~7s effectively by Prometheus (two hosts, every 15s each)).

I would guess that quadrupling the per-source-host frequency of the pings is what's tripping this versus the icinga setup in general (60s down to 15s between pings, per source host and service IP). We could deep-dive on the kernel ratelimiter implementation details, but the bottom line is that there's a known ratelimiter on ICMP echo responses at the LVS layer, set per-target to a rate of ~5 icmp/sec, and it's probably very approximate, using 200ms timing-gap enforcement to accomplish it. Checking just the rate of outbound icmp from prom1005 or prom1006 to all of 10.2.2.0/24, I see an average rate of ~4/sec coming out of each, which is just barely under the LVS 5/sec limiter (and maybe I'm missing some public-subnet service IPs in that quick check, which would add to it!). Most likely prom's outbound rate is currently self-limiting due to the intermittent delays and/or losses on the LVS end in both cases?
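
(One way to approximate that measurement from the prometheus side, an illustrative command: count outbound echo requests towards the service network for a minute and divide by 60.)

root@prometheus1006:~# timeout 60 tcpdump -l -nn -i any 'dst net 10.2.2.0/24 and icmp[icmptype] == icmp-echo' 2>/dev/null | wc -l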

As far as this task goes, to me it still remains a mystery why prometheus1005 doesn't seem to experience rate limits (as seen by the blackbox exporter) whereas prometheus1006 does.

There could be a lot of random-seeming factors in why one of the two hosts is showing failures more consistently (even down to how hash buckets collide somewhere in the LVS host's kernel), but both can be made to show manual ping failures at sufficient rate settings (playing with ping's -i and -W params, etc.) against the LVS service IPs. I'm not sure there's much value in trying to understand it - either way the existing rate out of both prom hosts is too close to the limits, so either prometheus has to slow down, or we have to look at increasing ratelimits on the LVSes (which has public-facing consequences, too).
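
(For example, something along these lines, with illustrative interval/timeout values, should start showing drops once the per-target ~5/sec budget is exhausted, since it sends 10 requests/sec:)

root@prometheus1006:~# ping -i 0.1 -W 0.2 -c 100 inference.svc.eqiad.wmnet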

Thank you @cmooney and @BBlack for the explanations and for digging into this. We're close enough to the limit that I'm now convinced digging further might not be worth it (except perhaps for "intellectual closure"), especially considering that removing the icmp probes altogether is the better™ thing to do (i.e. https://gerrit.wikimedia.org/r/c/operations/puppet/+/765548).

fgiunchedi claimed this task.

With the icmp probes gone I don't think there's much point in keeping this task open; tentatively resolving.