
2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services
Closed, Resolved · Public · BUG REPORT

Assigned To: fnegri
Authored By: fnegri, Sep 28 2025, 7:50 AM
Referenced Files
F66709539: image.png (Sep 29 2025, 8:47 AM)
F66709527: image.png (Sep 29 2025, 8:44 AM)
F66709523: image.png (Sep 29 2025, 8:44 AM)
F66706671: Screenshot 2025-09-28 at 10.01.53.png (Sep 28 2025, 8:05 AM)
F66706651: Screenshot 2025-09-28 at 09.27.07.png (Sep 28 2025, 7:50 AM)
F66706650: Screenshot 2025-09-28 at 09.28.12.png (Sep 28 2025, 7:50 AM)

Description

This alert paged this morning at 6:58 UTC.

FIRING: ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeWebHighErrorRate - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy?var-frontend=k8s-ingress-https - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeWebHighErrorRate

Alert source: https://prometheus.svc.toolforge.org/tools/graph?g0.expr=sum%28irate%28haproxy_frontend_http_responses_total%7Bcode%3D%225xx%22%2Cfrontend%3D%22k8s-ingress-https%22%7D%5B5m%5D%29%29+%2F+sum%28irate%28haproxy_frontend_http_responses_total%7Bfrontend%3D%22k8s-ingress-https%22%7D%5B5m%5D%29%29+%3E+0.25&g0.tab=1
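Decoded, the alert expression is the ratio of 5xx responses to all responses at the haproxy frontend over a 5-minute window, firing when it exceeds 25%:

sum(irate(haproxy_frontend_http_responses_total{code="5xx",frontend="k8s-ingress-https"}[5m]))
  / sum(irate(haproxy_frontend_http_responses_total{frontend="k8s-ingress-https"}[5m])) > 0.25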

[07:05:42] * dhinus paged ToolforgeWebHighErrorRate: High 5xx rate on Toolforge 
[07:07:45] <dhinus>	 the metric peaked at 45% and is going down, now it's around 30%
[07:14:25] <dhinus>	 the grafana haproxy dashboard shows a spike in sessions, but it's well below the 2k limit
[07:29:55] <dhinus>	 interesting: the frontend shows a decrease in 2xx and a corresponding increase in 5xx, whereas the backend shows only a _decrease_ in 2xx responses, but not a corresponding increase in 5xx.

Screenshot 2025-09-28 at 09.28.12.png (frontend responses)

Screenshot 2025-09-28 at 09.27.07.png (backend responses)
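To re-check the frontend-vs-backend observation from the IRC log above, a pair of queries along these lines puts the two 5xx series side by side (assuming haproxy_backend_http_responses_total carries the same code label, which I have not verified against this setup):

sum(irate(haproxy_frontend_http_responses_total{code="5xx",frontend="k8s-ingress-https"}[5m]))
sum(irate(haproxy_backend_http_responses_total{code="5xx"}[5m]))

If the frontend series rises while the backend one stays flat, the errors are being generated by haproxy itself rather than by the tools behind it.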

Event Timeline

fnegri changed the task status from Open to In Progress. Sep 28 2025, 7:51 AM
fnegri claimed this task.
fnegri triaged this task as Unbreak Now! priority.

The tool geohack shows a big drop in requests at the ingress layer:

Screenshot 2025-09-28 at 10.01.53.png (geohack requests at the ingress layer)

My hypothesis is that these requests are getting terminated with 5xx at the haproxy layer.

This seems to match the haproxy logs, which contain lots of k8s-ingress-https~ k8s-ingress-https/<NOSRV> entries for geohack.php (<NOSRV> means no backend server was ever selected, and the PR-- termination state indicates the request was blocked by the proxy itself):

2025-09-28T08:01:14.756996+00:00 tools-k8s-haproxy-5 haproxy[2394470]: 172.16.18.101:38738 [28/Sep/2025:08:01:14.755] k8s-ingress-https~ k8s-ingress-https/<NOSRV> 0/-1/-1/-1/0 503 1940 - - PR-- 1490/1454/0/0/0 0/0 "GET /geohack.php?pagename=Biathlon-Weltcup_2025/26&language=de&params=59.916666666667_N_10.75_E_region:NO-03_type:landmark&title=Oslo HTTP/1.1" 0/0000000000000000/0/0/1 -/TLSv1.2/ECDHE-ECDSA-AES256-GCM-SHA384

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:10:46Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-1 (T405850)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:12:53Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-1 (T405850)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:13:29Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9 (T405850)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:15:12Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-9 (T405850)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:35:38Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-1, tools-k8s-worker-nfs-19, tools-k8s-worker-nfs-23, tools-k8s-worker-nfs-67, tools-k8s-worker-nfs-7, tools-k8s-worker-nfs-9 (T405850)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-28T08:54:42Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-1, tools-k8s-worker-nfs-19, tools-k8s-worker-nfs-23, tools-k8s-worker-nfs-67, tools-k8s-worker-nfs-7, tools-k8s-worker-nfs-9 (T405850)

fnegri lowered the priority of this task from Unbreak Now! to High.

Lowering to high as most tools seem to be working fine now, but some investigation is still underway.

Unassigning myself as I have to go offline shortly.

Current hypothesis is that this was a combination of:

  • haproxy correctly applying rate limiting to the geohack tool
  • NFS issues causing some other tools to return errors

Something was hammering geohack and has now stopped. The way the alert is set up, even when rate limiting is working as intended, it can cause a big increase in 5xx responses at the frontend layer of haproxy and hence trigger the ToolforgeWebHighErrorRate alert. @taavi has an idea for modifying the alert so that it does not trigger in this case:

10:44 <@taavi> I guess one option would be to raise https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#75 to match the frontend maxconn limit, then we could change the alert to be based on haproxy_backend_http_responses_total (instead of the frontend
10:44 <@taavi> equivalent) which would ignore the per-tool rate limiting
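A minimal sketch of what the backend-based alert expression could look like (the exact label set on haproxy_backend_http_responses_total is an assumption here, not verified against the Toolforge config):

sum(irate(haproxy_backend_http_responses_total{code="5xx"}[5m]))
  / sum(irate(haproxy_backend_http_responses_total[5m])) > 0.25

Requests denied by the per-tool rate limiting never reach a backend, so they would not count towards this ratio.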

A bunch of tools started returning 500s right before the first bump of denied responses, or right before the second one:

Some of the tools:

image.png (tools returning 500s)

The timeline with the double bump in denied responses (the blue series in the graph):

image.png (double bump in denied responses)
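The denied-responses series presumably comes from a counter like haproxy_frontend_requests_denied_total; a query along these lines (metric name assumed, not verified) would reproduce it:

sum(irate(haproxy_frontend_requests_denied_total{frontend="k8s-ingress-https"}[5m]))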

My guess is that NFS had one or more hiccups that got some tools stuck; a crawler then started hitting those tools, which were not replying (stuck on NFS), and sessions piled up until they hit the per-tool session limit, at which point haproxy started denying requests. Things seemed to improve right around the moment the workers hosting the stuck tools were rebooted (might just be a coincidence, but the timing matches).

When stacking the graph of processes in D state, there's a correlation with the Sunday morning bump, but not with the earlier one:

image.png (stacked processes in D state)
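If that stacked graph is built on node_exporter data (an assumption on my part), it corresponds to something like:

sum by (instance) (node_procs_blocked)

node_procs_blocked counts processes blocked waiting for I/O to complete, which is where processes stuck on an unresponsive NFS mount typically end up.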

fnegri claimed this task.
fnegri moved this task from Next Up to Done on the Toolforge (Toolforge iteration 24) board.

I will mark this task as Resolved, as the alert has not fired since. I created T406010: Rate limiting errors should not trigger ToolforgeWebHighErrorRate to discuss how to avoid unnecessary pages.