Elevated 503 backend fetch failed reported by users
Closed, Declined
Public

Assigned To
None
Authored By
Ladsgroup
May 12 2024, 2:49 PM
Referenced Files
F57468088: image.png
Sep 6 2024, 11:14 AM
F57468067: image.png
Sep 6 2024, 11:14 AM
F53225493: grafik.png
May 14 2024, 8:38 PM
F52905505: image.png
May 13 2024, 1:20 PM
F52700201: grafik.png
May 12 2024, 6:58 PM
F52629237: grafik.png
May 12 2024, 2:49 PM
F52629165: grafik.png
May 12 2024, 2:49 PM

Description

Many users have notified me that they have been getting errors like this on basically every other pageview or edit over the past couple of days:

grafik.png (795×346 px, 138 KB)

grafik.png (793×359 px, 145 KB)

I can't connect them to any visible outage or issues. Appservers don't seem to be saturated or refusing connections. It can't be deploys (and pods shutting down, etc.) since this is from a Sunday.

This seems to be esams only? At least, the users reporting to me seem to hit issues only when going through esams.

Event Timeline

KOfori renamed this task from "Elavated 503 backend fetch failed reported by users" to "Elevated 503 backend fetch failed reported by users". May 12 2024, 6:34 PM

Third report from third user:

grafik.png (562×747 px, 125 KB)

Different cp hosts (all on the lower end of hosts deployed?), but all are esams.

we had a big spike of 503s on eqiad/drmrs/esams yesterday during EU morning: https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1:

image.png (699×1 px, 151 KB)

I saw that but the timing doesn't match. What I'm getting from users is a constant 10% or something of all pageviews being like this for days now.

anonymous requests too or just logged-in users?

Haven't checked logged out; what I get is from logged-in users.

i.e. I can't check logged-out ones, but these reports are from logged-in users.

Fourth one from fourth user:

grafik.png (587×949 px, 114 KB)

I got a fifth one too, I'll send them over tomorrow.

It is still happening: a user reported that he has been constantly getting 503s for two days. Here is an example I have: Backend fetch failed. Timestamp: 2024-09-06-06:01:47 UTC. via cp3069. Varnish XID 178259095
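
For anyone triaging reports like the one above, the three fields in the error footer (timestamp, cache host, Varnish XID) are what is needed to look for the matching log entry. A minimal Python sketch for pulling them out, assuming the footer always has the shape quoted above; `parse_report` and its regular expression are hypothetical helpers for illustration, not part of any existing tooling:

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern, modelled only on the single report quoted in this task.
REPORT_RE = re.compile(
    r"Timestamp:\s*(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})\s*UTC"
    r".*?via\s+(?P<host>cp\d+)"
    r".*?Varnish XID\s+(?P<xid>\d+)",
    re.DOTALL,
)


def parse_report(text):
    """Return the timestamp, cache host and XID from a pasted error footer.

    Returns None when the text does not look like such a footer.
    """
    match = REPORT_RE.search(text)
    if match is None:
        return None
    when = datetime.strptime(match.group("ts"), "%Y-%m-%d-%H:%M:%S")
    return {
        "time": when.replace(tzinfo=timezone.utc),  # the footer states UTC
        "host": match.group("host"),
        "xid": int(match.group("xid")),
    }


report = parse_report(
    "Backend fetch failed. Timestamp: 2024-09-06-06:01:47 UTC. "
    "via cp3069. Varnish XID 178259095"
)
```

With the host and the second-resolution timestamp in hand, the corresponding fetch-error log line on that host can be searched for directly.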

esams seems to be as healthy as usual per https://grafana.wikimedia.org/goto/ix3gNVeSR?orgId=1:

image.png (1×1 px, 275 KB)

ATS metrics match HAProxy data https://grafana.wikimedia.org/goto/8WWHHV6IR?orgId=1:

image.png (1×1 px, 305 KB)

I'll try to give it a deeper look

that specific request triggered a timeout while trying to read the POST request body:

Sep  6 06:01:47 cp3069 varnish-frontend-fetcherr[1010101]: @cee: {"time": "2024-09-06T06:01:47.736362", "message": "req.body read error: 11 (Resource temporarily unavailable) [omitted output]

in those cases varnish returns a generic 503 but it's caused by a timeout fetching data from the client
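
To make the distinction above concrete, here is a minimal Python sketch that flags fetch-error messages of this kind as client-side. It only inspects the `message` text as quoted in the log line above; `classify_fetch_error` and the "client"/"unknown" labels are hypothetical names for illustration, and anything that does not match is left as "unknown" rather than assumed to be a backend fault:

```python
import re

# Matches messages such as:
#   req.body read error: 11 (Resource temporarily unavailable)
BODY_READ_ERR = re.compile(r"req\.body read error: (\d+) \(([^)]*)\)")


def classify_fetch_error(message):
    """Classify a fetch-error message by where the failure originated.

    A "req.body read error" means the cache could not read the request body
    from the client, so the resulting 503 is not a backend failure.
    """
    match = BODY_READ_ERR.search(message)
    if match:
        return {
            "side": "client",
            "errno": int(match.group(1)),
            "reason": match.group(2),
        }
    # Assumption: this sketch knows no other patterns, so it does not guess.
    return {"side": "unknown", "errno": None, "reason": message}


result = classify_fetch_error(
    "req.body read error: 11 (Resource temporarily unavailable)"
)
```

Counting the "client" results over a time window would give a rough idea of how many of the reported 503s fall into this category, under the assumption that the message format stays as shown.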

It's client side :(

If it is client side then why is it a 5xx error and not a 4xx error such as HTTP 408?