We have alerts like this flapping:
`
FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
`
The alert comes from checking https://admin.toolforge.org/healthz, which should return OK but returns 503 from time to time for unknown reasons.
The 503 is captured by fourohfour:
tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz [..] <h1>No webservice</h1> <div class="content-text"> <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p> [...] tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz OK tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz [..] <h1>No webservice</h1> <div class="content-text"> <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p> [...]
We have checked nginx logs, CPU/memory limits, but did not see anything weird.
Pods in the admin tool are also healthy, apparently.