
[infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs
Closed, ResolvedPublicBUG REPORT

Assigned To
Authored By
DamianZaremba
Sep 22 2025, 9:40 PM
Referenced Files
F66700585: Screenshot 2025-09-26 at 10.50.20.png
Sep 26 2025, 8:51 AM
F66700583: Screenshot 2025-09-26 at 10.49.28.png
Sep 26 2025, 8:51 AM
F66463906: image.png
Sep 24 2025, 8:45 AM
F66463901: image.png
Sep 24 2025, 8:45 AM
F66239515: image.png
Sep 23 2025, 9:30 AM
F66239361: image.png
Sep 23 2025, 9:30 AM
F66227854: image.png
Sep 23 2025, 8:05 AM

Description

Multiple tools are returning 503s, yet their services/ingress are healthy:

m00m00:lima-kilo damian$ curl -i https://cluebotng-review.toolforge.org/admin/
HTTP/2 503 
server: nginx/1.22.1
date: Mon, 22 Sep 2025 21:37:30 GMT
content-type: text/html
content-length: 107
cache-control: no-cache
strict-transport-security: max-age=31622400
x-clacks-overhead: GNU Terry Pratchett
permissions-policy: browsing-topics=()
report-to: {"group": "wm_nel", "max_age": 604800, "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0"}]}
nel: {"report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: *.toolforge.org wikibooks.org *.wikibooks.org wikidata.org *.wikidata.org wikimedia.org *.wikimedia.org wikinews.org *.wikinews.org wikipedia.org *.wikipedia.org wikiquote.org *.wikiquote.org wikisource.org *.wikisource.org wikiversity.org *.wikiversity.org wikivoyage.org *.wikivoyage.org wiktionary.org *.wiktionary.org *.wmcloud.org *.wmflabs.org wikimediafoundation.org mediawiki.org *.mediawiki.org wss://cluebotng-review.toolforge.org; report-uri https://csp-report.toolforge.org/collect;

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
m00m00:lima-kilo damian$ curl -i https://cluebotng.toolforge.org
HTTP/2 503 
server: nginx/1.22.1
date: Mon, 22 Sep 2025 21:37:58 GMT
content-type: text/html
content-length: 107
cache-control: no-cache
strict-transport-security: max-age=31622400
x-clacks-overhead: GNU Terry Pratchett
permissions-policy: browsing-topics=()
report-to: {"group": "wm_nel", "max_age": 604800, "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0"}]}
nel: {"report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: *.toolforge.org wikibooks.org *.wikibooks.org wikidata.org *.wikidata.org wikimedia.org *.wikimedia.org wikinews.org *.wikinews.org wikipedia.org *.wikipedia.org wikiquote.org *.wikiquote.org wikisource.org *.wikisource.org wikiversity.org *.wikiversity.org wikivoyage.org *.wikivoyage.org wiktionary.org *.wiktionary.org *.wmcloud.org *.wmflabs.org wikimediafoundation.org mediawiki.org *.mediawiki.org wss://cluebotng.toolforge.org; report-uri https://csp-report.toolforge.org/collect;

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
m00m00:lima-kilo damian$ curl -i https://admin.toolforge.org/
HTTP/2 503 
server: nginx/1.22.1
date: Mon, 22 Sep 2025 21:39:14 GMT
content-type: text/html
content-length: 107
cache-control: no-cache
strict-transport-security: max-age=31622400
x-clacks-overhead: GNU Terry Pratchett
permissions-policy: browsing-topics=()
report-to: {"group": "wm_nel", "max_age": 604800, "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0"}]}
nel: {"report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: *.toolforge.org wikibooks.org *.wikibooks.org wikidata.org *.wikidata.org wikimedia.org *.wikimedia.org wikinews.org *.wikinews.org wikipedia.org *.wikipedia.org wikiquote.org *.wikiquote.org wikisource.org *.wikisource.org wikiversity.org *.wikiversity.org wikivoyage.org *.wikivoyage.org wiktionary.org *.wiktionary.org *.wmcloud.org *.wmflabs.org wikimediafoundation.org mediawiki.org *.mediawiki.org wss://admin.toolforge.org; report-uri https://csp-report.toolforge.org/collect;

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcaro triaged this task as High priority.Sep 23 2025, 8:01 AM
dcaro edited projects, added Toolforge (Toolforge iteration 24); removed Toolforge.

We hit the limit of open sessions on the tools ingress:

image.png (456×1 px, 249 KB)

Change #1190580 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge.haproxy: increase the backend connections

https://gerrit.wikimedia.org/r/1190580

Change #1190580 merged by David Caro:

[operations/puppet@production] toolforge.haproxy: increase the backend connections

https://gerrit.wikimedia.org/r/1190580

Change #1190583 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge.haproxy: update also the maxconn per-backend

https://gerrit.wikimedia.org/r/1190583

Change #1190583 merged by David Caro:

[operations/puppet@production] toolforge.haproxy: update also the maxconn per-backend

https://gerrit.wikimedia.org/r/1190583

I have increased the connection limit configured on the haproxies, as they seem to be able to handle more than that.

The traffic seems to be stabilizing too, though below the levels seen before the config changes, so I'm not sure whether the change helped right now, or whether the reload plus external traffic slowing down was enough.

I don't see any specific tool having any peaks of traffic:

image.png (1×2 px, 635 KB)

image.png (1×2 px, 1 MB)

I stored a sample of traffic logs from the ingress nodes on a control node; I'll try to extract another sample if/when we have a new peak to compare against, but right now things seem to be calming down. I'll give some thought to what to investigate next.

root@tools-k8s-control-9:~# cat nginx_logs.2025-09-23_09\:11.log  | grep -o '\[tool-[^]]*' | sort | uniq -c | sort -h | tail -n 10
    376 [tool-copyvios-copyvios-8000
    411 [tool-hub-hub-8000
    440 [tool-scholia-scholia-8000
    460 [tool-templatecount-templatecount-8000
    488 [tool-iw-unused-8000
    616 [tool-armake-armake-8000
    654 [tool-panoviewer-panoviewer-8000
    715 [tool-ftl-ftl-8000
    743 [tool-csp-report-csp-report-8000
   6229 [tool-geohack-geohack-8000
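
For reference, the counting pipeline above can be sanity-checked against a synthetic sample. The log lines here are made up (not the real nginx format); only the `[tool-...` upstream marker matters:

```shell
# Build a tiny synthetic log containing the "[tool-<name>-<port>" upstream
# marker (lines are illustrative, not the real ingress log format).
printf '%s\n' \
  'GET / 200 [tool-geohack-geohack-8000]' \
  'GET / 200 [tool-geohack-geohack-8000]' \
  'GET / 503 [tool-hub-hub-8000]' > /tmp/sample_nginx.log

# Same pipeline as above: top upstreams by request count.
grep -o '\[tool-[^]]*' /tmp/sample_nginx.log | sort | uniq -c | sort -h | tail -n 10
```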
dcaro renamed this task from Web services appear to be down to [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs.Sep 23 2025, 9:32 AM
dcaro changed the task status from Open to In Progress.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 24) board.

Last night we did pass 1k sessions per backend, and the haproxies were able to keep up:

image.png (483×1 px, 259 KB)

There is, though, a suspicious dip in requests to one of the tools:

image.png (1×2 px, 1 MB)

fnegri subscribed.

Reopening as last night @DamianZaremba reported that the ingress maxed out again at 2k sessions.

I can see the rate-limiting in this graph (haproxy_server_current_sessions):

Screenshot 2025-09-26 at 10.49.28.png (681×913 px, 138 KB)

But I'm confused by the fact that this other graph (haproxy_backend_current_sessions) has a similar shape but different values, and no rate-limiting:

Screenshot 2025-09-26 at 10.50.20.png (758×918 px, 132 KB)

fnegri lowered the priority of this task from High to Medium.Sep 26 2025, 8:52 AM

Lowering priority to medium as after the spike last night things seem to have stabilized.

I would imagine it's because each backend /server/ has a 2k (session) limit, while the backend as a whole has a 2k * nodes (connection) limit, thus haproxy_server_current_sessions maxes out at 2k and haproxy_backend_current_sessions maxes out at 6k (3 nodes).

For me there are 2 questions here:

  1. What is the actual limit of the ingress nodes - HAProxy should just pass traffic through up to that limit
  2. What traffic is driving these spikes - is it more requests, or sessions being held open (ingress should close those out, so this goes back to 1)

This has been happening on and off for the last few days, so while it's currently stable, it does need a resolution.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#75 = 2000 * 3 = 6000

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#79 = 2000

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/k8s/haproxy.pp#13 -> where 2k comes from
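
Putting those together, the relevant shape of the config is roughly the following. This is a sketch, not the actual template (server names and addresses are made up); the values come from the links above:

```
# Sketch of the haproxy ingress backend (hypothetical names/addresses;
# the real template is the k8s-ingress.cfg.erb linked above).
backend k8s-ingress
    # Effective backend capacity = per-server maxconn * number of servers
    # = 2000 * 3 = 6000 sessions.
    server ingress-1 192.0.2.11:30000 check maxconn 2000
    server ingress-2 192.0.2.12:30000 check maxconn 2000
    server ingress-3 192.0.2.13:30000 check maxconn 2000
```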

https://www.haproxy.com/documentation/haproxy-configuration-tutorials/alerts-and-monitoring/prometheus/

haproxy_server_current_sessions
gauge

Number of current sessions on the frontend, backend or server. Labels: proxy=backend name, server=server name.

Here one would assume it's /server/

haproxy_backend_current_sessions
gauge

Number of current sessions on the frontend, backend, or server. Label: proxy=backend name.

Here one would assume it is /backend/

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#75 = 2000 * 3 = 6000

The graph above shows it goes slightly above 6000 though, while the haproxy_server_current_sessions graph is flatlining exactly at 2000.

This has been happening on and off for the last few days, so while it's currently stable, it does need a resolution.

Yes, we'll keep working on this. So far only one spike after we raised the limit to 2k, but I'm also expecting more spikes in the coming days.

The graph above shows it goes slightly above 6000 though, while the haproxy_server_current_sessions graph is flatlining exactly at 2000.

I'm not exactly sure how haproxy decides whether it's a connection or a session, but there is a queue value around 700-something, which might explain that.
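
That queue counter is the qcur column in haproxy's "show stat" CSV. A sketch of pulling it out, using a made-up sample row since this needs a live admin socket (the socket path and all values here are assumptions):

```shell
# Hypothetical "show stat" CSV: header plus one server row (values made up;
# the leading field order pxname,svname,qcur,qmax,scur,smax,slim,stot is
# haproxy's documented stats layout).
stat_csv='# pxname,svname,qcur,qmax,scur,smax,slim,stot
k8s-ingress,ingress-1,12,700,2000,2000,2000,123456'

# On a live haproxy one would feed this from the admin socket instead, e.g.
#   echo "show stat" | socat stdio /run/haproxy/admin.sock
# (socket path is an assumption). Here we parse the sample:
printf '%s\n' "$stat_csv" | cut -d, -f1,2,3,5,7   # pxname,svname,qcur,scur,slim
```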

Yes, we'll keep working on this. So far only one spike after we raised the limit to 2k, but I'm also expecting more spikes in the coming days.

Thanks. Unfortunately we have monitoring, so I generally notice all of these events.