Wikimedia Toolforge Error: Failed to load resource: the server responded with a status of 500
Closed, Resolved · Public · Bug Report

Assigned To
Authored By
Iniquity
Sep 12 2025, 7:10 PM
Referenced Files
F66018640: image.png
Sep 12 2025, 7:33 PM
F66018643: image.png
Sep 12 2025, 7:28 PM
F66018633: image.png
Sep 12 2025, 7:20 PM
F66018602: image.png
Sep 12 2025, 7:10 PM

Description

Steps to replicate the issue (include links if applicable):

What happens?:

image.png (421×1 px, 57 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Iniquity triaged this task as Unbreak Now! priority. Sep 12 2025, 7:11 PM
Iniquity changed the subtype of this task from "Bug Report" to "Production Error". Sep 12 2025, 7:14 PM
JJMC89 changed the subtype of this task from "Production Error" to "Bug Report". Sep 12 2025, 7:15 PM
JJMC89 subscribed.

Toolforge is not part of prod.

I didn't know, thanks!

I cannot reproduce. The tool's home page successfully loads and so does the example linked there.

Try refreshing several times. Or you can try opening https://admin.toolforge.org/

There's an active alert on one of the HAProxy hosts, probably flapping. Looking into it.

image.png (327×640 px, 76 KB)

It seems it hasn't logged anything since the 8th of September:

root@tools-k8s-haproxy-5:~# journalctl -f -u haproxy.service 
Sep 08 08:31:42 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-8.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 2ms. 2 active and 0 backup servers left. 93 sessions active, 0 requeued, 0 remaining in queue.
Sep 08 08:32:23 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-8.tools.eqiad1.wikimedia.cloud is UP, reason: Layer7 check passed, code: 200, check duration: 7ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 08 08:33:04 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-8.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 2 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Sep 08 08:33:37 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-8.tools.eqiad1.wikimedia.cloud is UP, reason: Layer7 check passed, code: 200, check duration: 20ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 08 08:34:37 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-8.tools.eqiad1.wikimedia.cloud is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 08 08:39:43 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-9.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 2 active and 0 backup servers left. 139 sessions active, 0 requeued, 0 remaining in queue.
Sep 08 08:40:29 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-9.tools.eqiad1.wikimedia.cloud is UP, reason: Layer7 check passed, code: 200, check duration: 13ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 08 08:41:19 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-9.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Sep 08 08:41:50 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-9.tools.eqiad1.wikimedia.cloud is UP, reason: Layer7 check passed, code: 200, check duration: 14ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 08 08:42:50 tools-k8s-haproxy-5 haproxy[1720669]: [WARNING]  (1720669) : Server k8s-api/tools-k8s-control-9.tools.eqiad1.wikimedia.cloud is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
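The log above shows backend servers alternating between DOWN and UP. As a rough illustration of how such flapping could be summarized, here is a minimal Python sketch that counts state changes per server. The regular expression is inferred only from the lines quoted above, and the function name is hypothetical; other HAProxy versions or log formats may need a different pattern.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this task (an assumption);
# it captures the server name and its reported state.
STATE_RE = re.compile(r"Server (\S+) is (UP|DOWN)\b")


def count_state_changes(lines):
    """Return {server: number of UP<->DOWN transitions} seen in the lines."""
    last_state = {}
    changes = Counter()
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue  # not a server state line
        server, state = match.groups()
        # Count only real transitions, not repeated reports of the same state.
        if server in last_state and last_state[server] != state:
            changes[server] += 1
        last_state[server] = state
    return dict(changes)
```

A server that is reported UP twice in a row counts as one state, so only genuine DOWN/UP flips are tallied.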

Just restarted it, and it seems to be coming up. I'll let it start up for a bit.

Both HAProxy hosts have been failing the health checks for the last 3 hours or so:

image.png (658×1 px, 21 KB)

The haproxy service logs on tools-k8s-haproxy-6 also stopped on Sep 8th. I just restarted haproxy there as well.

It seems like we are hitting the HAProxy session limit for ingresses:

image.png (388×794 px, 57 KB)

Looking at the Nginx (front proxy) logs, it seems that one tool is getting more traffic than it can handle. That slows the tool down, which makes requests for it pile up further, until they hit the Toolforge-wide limit and requests start returning 500s. I'm going to send a patch that limits active connections per tool, so the damage stays confined to that tool instead of taking everything down with it.
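To illustrate the idea of a per-tool cap on in-flight connections, here is a minimal Python sketch. It is not the actual patch, which lives in operations/puppet and whose contents are not shown in this task; the class name, method names, and cap value are all hypothetical.

```python
import threading
from collections import defaultdict


class PerToolLimiter:
    """Caps concurrent (in-flight) requests per tool (illustrative only)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = defaultdict(int)  # tool name -> active requests
        self._lock = threading.Lock()

    def try_acquire(self, tool):
        """Return True and count the request if the tool is under its cap."""
        with self._lock:
            if self._in_flight[tool] >= self.max_in_flight:
                return False  # caller would reject, e.g. with an error page
            self._in_flight[tool] += 1
            return True

    def release(self, tool):
        """Mark one request for the tool as finished."""
        with self._lock:
            if self._in_flight[tool] > 0:
                self._in_flight[tool] -= 1
```

Because each tool has its own counter, one overloaded tool exhausts only its own budget, and requests for other tools are still accepted.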

Change #1187892 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::proxy: Limit in-flight connections per tool

https://gerrit.wikimedia.org/r/1187892

I think it might be geohack getting most of the connections:

root@tools-proxy-9:~# tail -n 10000 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -h | tail
    130 hub.toolforge.org
    140 freebase.toolforge.org
    198 tabernacle.toolforge.org
    253 glamtools.toolforge.org
    336 sigma.toolforge.org
    341 reasonator.toolforge.org
    359 wikimap.toolforge.org
    426 guc.toolforge.org
    604 panoviewer.toolforge.org
   4603 geohack.toolforge.org
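The same per-host tally can be sketched in Python, for cases where the counts need further processing. This assumes, as the `awk '{print $1}'` above implies, that the hostname is the first whitespace-separated field of each access log line; the function name is hypothetical.

```python
from collections import Counter


def top_hosts(lines, n=10):
    """Return the n most frequent hostnames as (host, count) pairs.

    Assumes the hostname is the first whitespace-separated field of
    each line, matching the awk '{print $1}' step in the shell pipeline.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Example use (path taken from the shell command above):
# with open("/var/log/nginx/access.log") as log:
#     for host, count in top_hosts(log):
#         print(f"{count:7d} {host}")
```

Unlike the shell pipeline, which lists the busiest host last and only looks at the final 10000 lines, this returns the busiest host first and reads whatever lines it is given.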

Change #1187892 merged by Majavah:

[operations/puppet@production] P:toolforge::proxy: Limit in-flight connections per tool

https://gerrit.wikimedia.org/r/1187892

bd808 lowered the priority of this task from Unbreak Now! to High. Sep 12 2025, 7:50 PM
bd808 subscribed.

Lowering from UBN! to High.

@taavi's patch is working as expected: geohack is getting throttled, and the rest of the tools have started working again.

Change #1187894 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::proxy: Allow throttled tools to load error page assets

https://gerrit.wikimedia.org/r/1187894

dcaro claimed this task.

Closing now. I'll think about follow-ups, if any, next week. Thanks everyone!

Change #1187894 merged by Majavah:

[operations/puppet@production] P:toolforge::proxy: Allow throttled tools to load error page assets

https://gerrit.wikimedia.org/r/1187894