
"best-of" tool Unexpectedly Returning 500s
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue:

What happens?:
Application returns a 500
The Node application does not appear to notice, so the failure seems to occur somewhere between the app and the user.

What should have happened instead?:

200 is always expected.

best-of is a fairly vanilla Vite/Node service. I can run it locally and hammer this endpoint with no errors.

I wrote a small script to repeatedly hammer the application, and instrumented the application to log when it thinks it got a request and when it returned from it. My script saw a 500 at https://best-of.toolforge.org/api/category/random?foo=7 but my tool is convinced that it served that request: "Response sent: GET /api/category/random ?foo=7 - 200 (80ms)"

I've confirmed the app isn't randomly crashing/restarting. The latency in serving is never very high (<300ms). The grafana dashboard for the service never shows any 500s (https://grafana.wmcloud.org/d/TJuKfnt4z/tool-dashboard?orgId=1&var-cluster=P8433460076D33992&var-namespace=tool-best-of&from=now-6h&to=now&timezone=utc).

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 renamed this task from Toolforge Tool Unexpectedly Returning 500s to "best-of" tool Unexpectedly Returning 500s. Nov 18 2025, 5:14 AM
bd808 added a project: Tools.

I captured a 500 response at the haproxy outer edge of best-of.toolforge.org:

2025-11-18T00:43:37.753944+00:00 tools-k8s-haproxy-8 haproxy[766]: $MY_IPV6 [18/Nov/2025:00:43:37.660] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 0/0/0/92/92 500 18294 - - PH-- 1662/1393/330/115/0 0/0 "GET https://best-of.toolforge.org/api/category/random HTTP/2.0" 0/0000000000000000/0/0/0 best-of.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"best-of.toolforge.org"

I think this log shows a 500 being returned from the upstream server which is the nginx-ingress Pod running on the tools-k8s-ingress-9 instance. The haproxy config returns the same HTML body no matter what upstream 500 error body it sees so I think we need to trap a failure at the next upstream, the Kubernetes ingress.
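For anyone else digging through these, the relevant fields can be pulled out programmatically. A minimal sketch (not a full HAProxy log parser; the regex only targets the fields visible in these particular lines) that extracts the status code and the termination state, so the edge logs can be filtered for PH-- 500s specifically:

```python
import re

# Match the timer field (e.g. "0/0/0/92/92"), then the status code, bytes
# read, captured headers placeholders, and the 4-char termination state.
LOG_RE = re.compile(
    r"\s(\d+/\d+/\d+/[\d-]+/\d+)\s+(?P<status>\d{3})\s+\d+\s+\S+\s+\S+\s+(?P<state>[\w-]{4})\s"
)

def parse(line: str):
    """Return (status, termination_state) from an HAProxy HTTP log line, or None."""
    m = LOG_RE.search(line)
    return (int(m.group("status")), m.group("state")) if m else None

sample = (
    'haproxy[766]: $MY_IPV6 [18/Nov/2025:00:43:37.660] k8s-ingress-https~ '
    'k8s-ingress-http/tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud '
    '0/0/0/92/92 500 18294 - - PH-- 1662/1393/330/115/0 0/0 '
    '"GET https://best-of.toolforge.org/api/category/random HTTP/2.0"'
)
print(parse(sample))  # -> (500, 'PH--')
```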

I tried swapping from polka to express as my server and that had no effect. I'm really unsure how this could be on my side, but given that no other services are impacted, it seems unlikely that it isn't.

I'm doing a quick check, running a while loop requesting that URL:

In [6]: while response.status_code == 200:
   ...:     response = requests.get("https://best-of.toolforge.org/api/category/random?foo=7")
   ...:     time.sleep(1)
   ...:     print(".")
   ...:

while watching the k8s ingress logs + webservice logs, to see if I can catch it when it returns a 500.

That did not take long:

In [10]: print(response.text)
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<link rel="shortcut icon" href="https://tools-static.wmflabs.org/admin/errors/favicon.ico">
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; display: flex; flex-direction: row; flex-wrap: wrap; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { flex: 1; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
summary { font-weight: bold; cursor: pointer; }
details[open] { background: #970302; color: #dfdedd; }
.text-muted { color: #777; }
@media (prefers-color-scheme: dark) {
  a { color: #9e9eff; }
  body { background: transparent; color: #ddd; }
  .footer { border-top: 1px solid #444; background: #060606; }
  #logo { filter: invert(1) hue-rotate(180deg); }
  .text-muted { color: #888; }
}
</style>
<meta name="color-scheme" content="light dark">
<div class="content" role="main">
<a href="https://wikitech.wikimedia.org/wiki/Portal:Toolforge"><img id="logo" src="https://tools-static.wmflabs.org/admin/errors/toolforge-logo.png" srcset="https://tools-static.wmflabs.org/admin/errors/toolforge-logo-2x.png 2x" alt="Wikimedia Toolforge" width="120" height="120">
</a>
<div class="content-text">
<h1>Wikimedia Toolforge Error</h1>

<p>Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again later.</p>
</div>
</div>
<div clas

No traces of the log on k8s ingress logs:

root@tools-k8s-control-9:~# grep "python-requests/2.32.4" *18.logs
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.36.64 - - [18/Nov/2025:08:13:38 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 96651 0.456 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.012 200 02af52d76bc339240f6dd4da4ce31832
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.254.192 - - [18/Nov/2025:08:14:09 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 96651 0.463 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.021 200 c998b7e3357784541366b722178c5370
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.36.64 - - [18/Nov/2025:08:14:33 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 96651 0.449 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.009 200 b7a861d9cfff6b939ccd3224fca0f80d
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.254.192 - - [18/Nov/2025:08:15:17 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 96651 0.468 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.019 200 39625441b6cbcee85fb412dc9fc30ea4
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:10 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 56422 0.660 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.008 200 ed46184355ce4f5052c4117cf329ac37
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:14 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 86057 "-" "python-requests/2.32.4" 291 1.350 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 86057 1.351 200 b998752badb0d51a62c275357573ed4e
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:16 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 75572 "-" "python-requests/2.32.4" 291 0.943 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 75572 0.944 200 e76b3392d54f5d5965b5574d665ecbe7
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:17 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 96651 0.680 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.009 200 00308101ef9c922ee6e511ead9d7a549
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:20 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 71701 "-" "python-requests/2.32.4" 291 0.137 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 71701 0.137 200 01b23606ea2f4a82deee7a648e77e819
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:26 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 65749 "-" "python-requests/2.32.4" 291 0.938 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 65749 0.938 200 a2b124c78d2d449e4c523405d025ee2b
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.254.192 - - [18/Nov/2025:08:17:33 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 71678 "-" "python-requests/2.32.4" 291 1.138 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 71678 1.137 200 1f50b528cbfd67f5c5816aa1bf41920e
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.166.0 - - [18/Nov/2025:08:17:39 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 56422 0.445 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.008 200 7377fe08f5171d846067d8ce732e330a
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.254.192 - - [18/Nov/2025:08:17:40 +0000] "GET /api/category/random?foo=7 HTTP/1.1" 200 114390 "-" "python-requests/2.32.4" 291 1.183 [tool-best-of-best-of-8000] [] 192.168.62.55:8000 114390 1.183 200 e30f99a823a5ab66173f454b3b9bd8c3
ingress-nginx-gen2-controller-5c65947957-t7rp5.2025-11-18.logs:192.168.36.64 - - [18/Nov/2025:08:18:06 +0000] "POST / HTTP/1.1" 200 2 "-" "libup-runner python-requests/2.32.4" 56422 0.455 [tool-php-security-checker-php-security-checker-8000] [] 192.168.171.100:8000 2 0.010 200 64790b4d06d53bc9a98172554210f87e

On the haproxy logs I see 500s from two tools-k8s-ingress nodes:

2025-11-18T08:36:58.988078+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24685 [18/Nov/2025:08:36:58.896] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud 0/0/1/89/90 500 18286 - - PH-- 5885/5616/221/80/0 0/0 "GET /api/category/random?foo=7 HTTP/1.1" 0/0000000000000000/0/0/0 best-of.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"best-of.toolforge.org"
...
2025-11-18T08:38:34.923862+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24817 [18/Nov/2025:08:38:34.840] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 0/0/4/78/82 500 18286 - - PH-- 5967/5689/332/110/0 0/0 "GET /api/category/random?foo=7 HTTP/1.1" 0/0000000000000000/0/0/0 best-of.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"best-of.toolforge.org"

Will try to capture some traffic or something on those nodes to figure out where the request goes if it's not logged.

Might not be related, but it certainly does not help that our ingress pods are being scheduled somewhere other than the ingress workers, due to (I think) memory constraints: the ingress workers have ~8G of memory, and ~3G is already used by the minimal setup:

root@tools-k8s-control-9:~# kubectl top nodes | grep ingress
tools-k8s-ingress-7       89m          2%     2473Mi          31%       
tools-k8s-ingress-8       370m         9%     4275Mi          54%       
tools-k8s-ingress-9       159m         3%     2074Mi          26%

While we already request 5G for the ingress containers:

root@tools-k8s-control-9:~# kubectl get deployment -n ingress-nginx-gen2 ingress-nginx-gen2-controller -o json | jq '.spec.template.spec.containers[].resources'
{
  "limits": {
    "cpu": "3",
    "memory": "6G"
  },
  "requests": {
    "cpu": "2",
    "memory": "5G"
  }
}

So requests have to jump from haproxy -> ingress-nginx -> whichever worker the pod is running on.

We were also having some crawlers hitting the haproxy:

image.png (348×2 px, 104 KB)

But they stopped and the errors are still there, so at least it's nice to see the throttling working as expected.

The PH-- session state translates to:

PH   The proxy blocked the server's response, because it was invalid,
     incomplete, dangerous (cache control), or matched a security filter.
     In any case, an HTTP 502 error is sent to the client. One possible
     cause for this error is an invalid syntax in an HTTP header name
     containing unauthorized characters. It is also possible but quite
     rare, that the proxy blocked a chunked-encoding request from the
     client due to an invalid syntax, before the server responded. In this
     case, an HTTP 400 error is sent to the client and reported in the
     logs. Finally, it may be due to an HTTP header rewrite failure on the
     response. In this case, an HTTP 500 error is sent (see
     "tune.maxrewrite" and "http-response strict-mode" for more
     information).

I wonder if this is hitting some HAProxy-level timeout?

Isn't the PH part of the other set of flags?

  - On the first character, a code reporting the first event which caused the
    session to terminate :
...
        P : the session was prematurely aborted by the proxy, because of a
            connection limit enforcement, because a DENY filter was matched,
            because of a security check which detected and blocked a dangerous
            error in server response which might have caused information leak
            (e.g. cacheable cookie).
...
  - on the second character, the TCP or HTTP session state when it was closed :
...
        H : the proxy was waiting for complete, valid response HEADERS from the
            server (HTTP only).
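
The two excerpts above can be combined into a quick lookup. This is a hypothetical helper covering only a handful of codes (the haproxy docs define the full set); it just makes the "PH--" reading explicit:

```python
# Decode the first two characters of an HAProxy termination state
# (e.g. "PH--"). Partial tables only; see the haproxy docs for all codes.
FIRST_EVENT = {
    "P": "session prematurely aborted by the proxy (limit, DENY filter, or security check)",
    "C": "session unexpectedly aborted by the client",
    "S": "session unexpectedly aborted by the server",
    "-": "normal session completion",
}
SESSION_STATE = {
    "H": "proxy was waiting for complete, valid response headers from the server",
    "R": "proxy was waiting for a complete, valid request from the client",
    "-": "normal session completion",
}

def decode(state: str) -> tuple[str, str]:
    """Return human-readable meanings for the first two state characters."""
    return (
        FIRST_EVENT.get(state[0], "unknown"),
        SESSION_STATE.get(state[1], "unknown"),
    )

for part in decode("PH--"):
    print(part)
```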

/me always gets confused by haproxy logs

I have also been testing another tool (tool-db-usage), and it seems I'm getting the 500s too; looking for the logs.

Looks the same to me:

2025-11-18T10:22:37.611073+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24591 [18/Nov/2025:10:22:37.577] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud 0/0/3/28/31 500 18300 - - PH-- 1983/1716/489/183/0 0/0 "GET / HTTP/1.1" 0/0000000000000000/0/0/0 tool-db-usage.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"tool-db-usage.toolforge.org"

Is there a way to see the limit-per-tool sticky table content or something?

https://github.com/haproxy/haproxy/issues/1597 suggests increasing tune.maxrewrite. It also references a metric that's not yet in Prometheus (due to T343885), but manually connecting to the built-in exporter port seems to support that theory:

# HELP haproxy_frontend_failed_header_rewriting_total Total number of failed HTTP header rewrites since the worker process started
# TYPE haproxy_frontend_failed_header_rewriting_total counter
haproxy_frontend_failed_header_rewriting_total{proxy="k8s-api"} 0
haproxy_frontend_failed_header_rewriting_total{proxy="k8s-ingress-api-gateway-tcp"} 0
haproxy_frontend_failed_header_rewriting_total{proxy="k8s-ingress-http"} 0
haproxy_frontend_failed_header_rewriting_total{proxy="k8s-ingress-https"} 2299649
haproxy_frontend_failed_header_rewriting_total{proxy="stats"} 0

I will send a patch bumping the maxrewrite value.
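
For context, tune.maxrewrite is a global tuning option in haproxy.cfg that reserves buffer headroom for header rewrites on responses. A sketch of the kind of change (the value here is purely illustrative; the actual merged patch may differ):

```
global
    # Reserve more buffer space for response header rewrites, so large
    # responses don't trip "http-response strict-mode" rewrite failures.
    # Value illustrative; size it against tune.bufsize and observed headers.
    tune.maxrewrite 4096
```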

Is there a way to see the limit-per-tool sticky table content or something?

The above suggests this issue is not caused by that, but you can do echo "show table limit-per-tool" | sudo socat unix-connect:/run/haproxy/admin.sock stdio (from here).

Change #1206834 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: haproxy: Increase tune.maxrewrite

https://gerrit.wikimedia.org/r/1206834

Change #1206835 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge:haproxy: added limit rates to the logs

https://gerrit.wikimedia.org/r/1206835

Change #1206834 merged by Majavah:

[operations/puppet@production] P:toolforge: haproxy: Increase tune.maxrewrite

https://gerrit.wikimedia.org/r/1206834

taavi claimed this task.

I'm being optimistic and resolving, as we no longer seem to be serving any PH-- 5xx responses which is a noticeable improvement.

Change #1206835 merged by David Caro:

[operations/puppet@production] toolforge:haproxy: added limit rates to the logs

https://gerrit.wikimedia.org/r/1206835