toolforge: admin tool /healthz returns 503 from time to time
Closed, Resolved · Public

Assigned To

Authored By

	aborrero
	May 22 2024, 8:33 AM

Description

We have alerts like this flapping:

`
FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
`

The alert comes from checking https://admin.toolforge.org/healthz, which should return OK but returns 503 from time to time for unknown reasons.

The 503 is caught by fourohfour, which serves its "No webservice" page in place of the expected OK:

tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
[..]
<h1>No webservice</h1>

<div class="content-text">
  
  <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p>
[...]

tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
OK
tools.admin@tools-bastion-12:~$ curl https://admin.toolforge.org/healthz
[..]
<h1>No webservice</h1>

<div class="content-text">
  
  <p>The tool responsible for the URL you have requested, <code>https://admin.toolforge.org/healthz</code>, is not currently responding.</p>
[...]
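To put a number on how often the 503 shows up, the repeated curl calls above can be sketched as a small sampler. This is only an illustration: the function names `fetch_status` and `sample` and the default of 20 attempts are assumptions made for this sketch, not part of any existing tooling.

```python
import collections
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def sample(url, attempts=20, fetch=fetch_status):
    """Request url `attempts` times and count how often each status code is seen."""
    return collections.Counter(fetch(url) for _ in range(attempts))


if __name__ == "__main__":
    counts = sample("https://admin.toolforge.org/healthz")
    for code, seen in sorted(counts.items(), key=lambda item: str(item[0])):
        print(f"{code}: {seen}")
```

A roughly even split between 200 and 503 would suggest that requests are being balanced across backends and that one of them is consistently failing, rather than all of them failing occasionally.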

We checked the nginx logs and the CPU/memory limits, but did not see anything unusual.

https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-tools&var-namespace=ingress-nginx-gen2

The pods in the admin tool also appear to be healthy.

Related Objects

Mentioned Here: P62847 (An Untitled Masterwork)

Event Timeline

aborrero created this task. May 22 2024, 8:33 AM


@taavi found that one of the two pods in the admin tool is silently misbehaving: Kubernetes reports it as Running and 1/1 Ready, but it answers /healthz with a 503:

P62847 (An Untitled Masterwork)

tools.admin@tools-bastion-12:~$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
admin-75b8b6668-hw5dg   1/1     Running   0          64d
admin-75b8b6668-jllch   1/1     Running   0          64d
tools.admin@tools-bastion-12:~$ kubectl port-forward admin-75b8b6668-hw5dg 8000 &
[1] 59515
tools.admin@tools-bastion-12:~$ Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
tools.admin@tools-bastion-12:~$ curl localhost:8000/healthz
Handling connection for 8000
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>503 Service Unavailable</title>
</head>
<body>
<h1>503 Service Unavailable</h1>
</body>
</html>
tools.admin@tools-bastion-12:~$ kill %1
tools.admin@tools-bastion-12:~$
[1]+ Terminated /usr/bin/kubectl port-forward admin-75b8b6668-hw5dg 8000
tools.admin@tools-bastion-12:~$ kubectl port-forward admin-75b8b6668-jllch 8000 &
[1] 59528
tools.admin@tools-bastion-12:~$ Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
tools.admin@tools-bastion-12:~$ curl localhost:8000/healthz
Handling connection for 8000
OK
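The manual per-pod check in the paste (port-forward to one pod, request /healthz, tear the forward down, repeat) could be scripted roughly as follows. This is a sketch under assumptions: the helper names `http_status`, `probe_pod` and `unhealthy_pods` are made up for illustration, the two-second sleep is a crude stand-in for waiting until the forward is ready, and it assumes `kubectl` is on the PATH with access to the tool's namespace.

```python
import subprocess
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as HTTPError
    except (urllib.error.URLError, OSError):
        return None


def probe_pod(pod, port=8000, path="/healthz"):
    """Port-forward to a single pod, request path, and return the status code."""
    forward = subprocess.Popen(
        ["kubectl", "port-forward", pod, str(port)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        time.sleep(2)  # assumption: long enough for the forward to come up
        return http_status(f"http://localhost:{port}{path}")
    finally:
        # Always tear the forward down so the port is free for the next pod.
        forward.terminate()
        forward.wait()


def unhealthy_pods(pods, probe=probe_pod):
    """Return {pod: status} for every pod whose probe did not return 200."""
    results = {pod: probe(pod) for pod in pods}
    return {pod: status for pod, status in results.items() if status != 200}
```

Called with the two pod names from the paste, `unhealthy_pods(["admin-75b8b6668-hw5dg", "admin-75b8b6668-jllch"])` would be expected to flag only `admin-75b8b6668-hw5dg`, assuming the pods were still in the state shown in the session. Probing pods directly bypasses the ingress and the load balancing between pods, which is what made the failure look intermittent from the outside.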

Mentioned in SAL (#wikimedia-cloud) [2024-05-22T08:36:52Z] <taavi> configure health-check-path: /healthz in service.template T365562

taavi closed this task as Resolved. May 22 2024, 8:52 AM

taavi claimed this task.
