
504 responses (gateway timeout) for /api/rest_v1/feed/featured
Open, Medium, Public

Description

This evening, it was reported internally that https://zh.wikipedia.org/api/rest_v1/feed/featured/2025/07/24 is intermittently timing out with a 504 response (at the 15s REST gateway timeout for Wikifeeds), in part correlated with client IP.

Specifically, it would seem that IPs mapped to CDN sites preferring eqiad as a backend are generally succeeding, whereas the remainder (i.e., preferring codfw) are not (as can be confirmed by, e.g., using curl --connect-to to select text-lb.$DC.wikimedia.org).
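The per-datacenter behavior can be reproduced with curl's `--connect-to` option, which pins the TCP connection to a specific text-lb edge while preserving the original Host header and TLS SNI. A minimal sketch (the URL and date are from the report above; the 20s `--max-time` is just a bit above the 15s gateway timeout):

```shell
# Pin the connection to each CDN edge in turn and compare response codes.
# A 504 here with codfw but not eqiad matches the reported pattern.
for dc in eqiad codfw; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 20 \
    --connect-to "zh.wikipedia.org:443:text-lb.${dc}.wikimedia.org:443" \
    "https://zh.wikipedia.org/api/rest_v1/feed/featured/2025/07/24")
  echo "${dc}: HTTP ${code}"
done
```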

While this would suggest an issue specific to codfw, that's not the case: From an internal host, I consistently observe timeouts on https://rest-gateway.discovery.wmnet:4113/zh.wikipedia.org/v1/feed/featured/2025/07/24 in either eqiad or codfw, with a slightly higher chance of success in the former (i.e., the externally visible asymmetry is likely due to caching being slightly more effective at papering over the problem for eqiad-routed traffic).

Looking at REST gateway upstream request timeouts in codfw, there's a clear increase at ~ 18:10 UTC today for wikifeeds. The same is visible in eqiad, though to a lesser extent.

This is consistent with what I'm seeing for internal calls to the equivalent upstream (https://wikifeeds.discovery.wmnet:4101/zh.wikipedia.org/v1/aggregated/featured/2025/07/23), which again time out at ~ 60s (i.e., the local cluster request timeout). Oddly, I can't reproduce this for any other wiki I've spot checked.
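The internal check above can be made quantitative by timing the wikifeeds upstream directly (requires a host with access to the internal service mesh; the `-k` flag is an assumption in case the internal cert isn't in the local trust store):

```shell
# Time the wikifeeds upstream directly. A total time of ~60s together
# with an error status indicates the local cluster request timeout
# firing, rather than a slow-but-successful response.
curl -sk -o /dev/null \
  -w 'http=%{http_code} total=%{time_total}s\n' --max-time 120 \
  "https://wikifeeds.discovery.wmnet:4101/zh.wikipedia.org/v1/aggregated/featured/2025/07/23"
```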

Looking at the wikifeeds application logs, I can't find any significant change in log messages bracketing ~ 18:10 UTC.

I'm out of ideas at the moment as to what might have caused this. The only event at ~ 18:10 UTC in the SAL is the train finishing rolling to group2, which does indeed include zhwiki, but also includes plenty of other wikis that don't exhibit this issue (e.g., enwiki).

tl;dr - Something has made the /v1/aggregated/featured wikifeeds endpoint very slow, thus far only observed for zhwiki, starting at ~ 18:10 UTC on 2025-07-24. In eqiad, response duration seems to be bimodal, in that the request will either complete promptly or after ~ 4m (T400425#11033423).

Event Timeline

One additional data point:

To find out exactly how slow these requests are (if indeed they succeed) I nsenter'd the netns of a wikifeeds pod in eqiad, so I could hit /zh.wikipedia.org/v1/aggregated/featured/2025/07/23 on the app directly with curl - i.e., without being subject to envoy's 1m local timeout.

Requests that do not respond promptly (1-2s) consistently succeed after 4m + a bit (0-2s), which strongly suggests something is timing out at that point and unblocking the response. Alas, I can't seem to figure out what, and it seems wikifeeds does not produce log messages that suggest what may have timed out (just the mix of 404 and 403 errors it consistently produces).
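The timing distribution above can be captured by entering the pod's network namespace and hitting the app port directly, which bypasses envoy's 1m local timeout. This is a sketch, not the exact commands used: the way of locating the pod's process and the app's listen port (`$PORT`) are assumptions, since neither is stated in the task.

```shell
# Find a PID inside the wikifeeds pod (hypothetical; any process in the
# pod's netns will do), then repeat the request and record total time.
# Expected pattern per the observation above: either ~1-2s or ~4m + 0-2s.
PID=$(pgrep -f wikifeeds | head -1)
PORT=8080   # placeholder: substitute the app's actual listen port
for i in 1 2 3 4 5; do
  nsenter -t "$PID" -n -- curl -s -o /dev/null \
    -w "run=${i} http=%{http_code} total=%{time_total}s\n" --max-time 600 \
    "http://localhost:${PORT}/zh.wikipedia.org/v1/aggregated/featured/2025/07/23"
done
```

The ~4m-plus-jitter clustering is what suggests an internal timeout unblocking the response, rather than the work itself taking that long.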

Joe triaged this task as Medium priority.Jul 25 2025, 5:36 AM
Joe removed projects: SRE, serviceops-deprecated.
Joe added subscribers: Jgiannelos, Joe.

I can't reproduce this anymore, from either core datacenter. Given that a 4-minute timeout is nowhere to be found in our mesh network, I would assume what @Scott_French observed is an internal timeout of wikifeeds.

I therefore think this is a bug in the software itself. In the absence of reproducibility now, though, I am triaging the task as "medium" severity and removing the SRE-related tags, as we can't really debug this further ourselves.

@Jgiannelos does that 4-minute timeout sound like something we could see in node/wikifeeds?

Thanks for investigating, Joe. It looks like the situation started improving considerably around 02:20 UTC based on REST gateway -> wikifeeds upstream request timeouts, which then trailed off over the next 4h or so.