
toolforge: admin tool /healthz returns 503 from time to time
Closed, Resolved (Public)

Description

We have alerts like this flapping:

`
FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
`

The alert comes from checking https://admin.toolforge.org/healthz, which should return OK but returns 503 from time to time for unknown reasons.

The 503 is intercepted by fourohfour, which serves its "No webservice" page instead:

tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
[..]
<h1>No webservice</h1>

<div class="content-text">
  
  <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p>
[...]

tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
OK
tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
[..]
<h1>No webservice</h1>

<div class="content-text">
  
  <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p>
[...]
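To put a number on the flapping shown above, repeated requests can be tallied by status code. This is only a minimal sketch: the helper names (`fetch_status`, `probe`), the attempt count and the delay are assumptions made for illustration, not part of the actual alerting setup.

```python
import collections
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as the 503 seen here) are raised as HTTPError.
        return err.code
    except OSError:
        # Connection errors and timeouts (URLError is an OSError subclass).
        return 0


def probe(url, attempts=20, delay=0.5, fetch=fetch_status):
    """Request url `attempts` times and count how often each status occurs."""
    counts = collections.Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
        time.sleep(delay)
    return dict(counts)
```

Calling `probe("https://admin.toolforge.org/healthz")` from a bastion would return a mapping of status code to count. A steady mix of 200 and 503 would suggest that only some backends are failing rather than the whole service.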

We checked the nginx logs and the CPU/memory limits, but found nothing unusual.

https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-tools&var-namespace=ingress-nginx-gen2

The pods in the admin tool also appear to be healthy.

Event Timeline

@taavi found that one of the pods in the admin tool is misbehaving, even though Kubernetes reports it as Running:

tools.admin@tools-bastion-12:~$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
admin-75b8b6668-hw5dg   1/1     Running   0          64d
admin-75b8b6668-jllch   1/1     Running   0          64d
tools.admin@tools-bastion-12:~$ kubectl port-forward admin-75b8b6668-hw5dg 8000 &
[1] 59515
tools.admin@tools-bastion-12:~$ Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
tools.admin@tools-bastion-12:~$ curl localhost:8000/healthz
Handling connection for 8000
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
 <title>503 Service Unavailable</title>
 </head>
 <body>
 <h1>503 Service Unavailable</h1>
 </body>
</html>
tools.admin@tools-bastion-12:~$ kill %1
tools.admin@tools-bastion-12:~$
[1]+ Terminated /usr/bin/kubectl port-forward admin-75b8b6668-hw5dg 8000
tools.admin@tools-bastion-12:~$ kubectl port-forward admin-75b8b6668-jllch 8000 &
[1] 59528
tools.admin@tools-bastion-12:~$ Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
tools.admin@tools-bastion-12:~$ curl localhost:8000/healthz
Handling connection for 8000
OK
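The manual per-pod check above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names, the local port 18000 and the fixed two-second wait for the tunnel are all hypothetical choices. Only `kubectl port-forward POD LOCAL:REMOTE` and the `/healthz` path on port 8000 come from the session above.

```python
import subprocess
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0


def check_pod(pod, remote_port=8000, local_port=18000, path="/healthz"):
    """Port-forward to a single pod and return the status its health endpoint serves."""
    proc = subprocess.Popen(
        ["kubectl", "port-forward", pod, f"{local_port}:{remote_port}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        time.sleep(2)  # assumption: crude fixed wait for the tunnel to come up
        return http_status(f"http://localhost:{local_port}{path}")
    finally:
        # Always tear the tunnel down, mirroring `kill %1` in the manual session.
        proc.terminate()
        proc.wait()


def unhealthy_pods(pods, check=check_pod):
    """Return {pod: status} for every pod whose health check is not 200."""
    results = {pod: check(pod) for pod in pods}
    return {pod: code for pod, code in results.items() if code != 200}
```

Bypassing the Service and querying each pod directly matters here: both pods report `1/1 Running`, so only a per-pod request can show which one is answering with 503.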

Mentioned in SAL (#wikimedia-cloud) [2024-05-22T08:36:52Z] <taavi> configure health-check-path: /healthz in service.template T365562

taavi claimed this task.