Why does the eqiad Prometheus cluster have a higher than normal % of TCP retransmits?
It seems like Prometheus is trying to query endpoints such as analytics1049.eqiad.wmnet:51010, where the hostname resolves to both a v4 and a v6 address.
But the services listening on those ports are only listening on IPv4
eg. tcp 0 0 10.64.21.108:51010 0.0.0.0:* LISTEN 8303/java
Prometheus tries to establish a TCP session over IPv6 first, retries a couple of times before giving up, and then succeeds over IPv4.
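To confirm this per endpoint, a minimal sketch along these lines can try a TCP connect to every address a hostname resolves to and report which address family works. The function name `probe_families` and the timeout value are illustrative assumptions, not part of any existing tooling.

```python
import socket

def probe_families(host, port, timeout=2.0):
    """Try a TCP connect to each address host resolves to.

    Returns a list of (family_label, address, connected) tuples.
    probe_families is a hypothetical helper name used for illustration.
    """
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    results = []
    # getaddrinfo returns one entry per resolved address (A and AAAA records)
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        label = labels.get(family, str(family))
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            connected = True
        except OSError:
            # covers refused connections, timeouts and unreachable networks
            connected = False
        results.append((label, sockaddr[0], connected))
    return results

# Example call (hostname and port taken from the report above):
# for label, addr, ok in probe_families("analytics1049.eqiad.wmnet", 51010):
#     print(label, addr, "connects" if ok else "fails")
```

If the diagnosis above is right, the IPv6 entry would fail while the IPv4 entry connects.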
Seems like the same goes for other services such as:
Another curious one is for example on analytics1047, which is setting up a tcp6 socket but binding to a v4 IP:
tcp6 0 0 10.64.21.106:8141 :::* LISTEN 1667/java
Unrelated, the (some?) cloudvirt hosts have the Prometheus rsyslog exporter listening on port 9105.
tcp6 0 0 :::9105 :::* LISTEN 33652/prometheus-rs
but it can't be queried from prometheus1004; the following command hangs:
prometheus1004:~$ curl -v cloudvirt1015.eqiad.wmnet:9105/metrics
While the other exporter listening on 9100 replies fine.
As Prometheus is configured to query that endpoint, it tries, retries, and fails.
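A hanging endpoint can be told apart from a refused or otherwise failing one by fetching it once with a hard timeout. The sketch below is only an illustration of that check, using the Python standard library; `scrape_status` and the timeout value are assumed names and values, and it does not reproduce how Prometheus itself scrapes.

```python
import socket
import urllib.error
import urllib.request

def scrape_status(url, timeout=5.0):
    """Fetch url once and classify the outcome as 'ok', 'timeout' or 'error: ...'.

    scrape_status is a hypothetical helper name used for illustration.
    """
    # Ignore proxy settings from the environment so the request goes
    # straight to the target host.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except socket.timeout:
        # the connection was made but no response arrived in time
        return "timeout"
    except urllib.error.URLError as exc:
        # a timeout during connect is wrapped in URLError
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return f"error: {exc.reason}"
    except OSError as exc:
        return f"error: {exc}"

# Example calls (hosts and ports taken from the report above):
# scrape_status("http://cloudvirt1015.eqiad.wmnet:9105/metrics")
# scrape_status("http://cloudvirt1015.eqiad.wmnet:9100/metrics")
```

Run from prometheus1004, the first call would be expected to report a timeout if the behaviour matches the curl hang described above, while the second should report ok.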