
504 from /api/rest_v1/page/random/summary
Closed, ResolvedPublic

Description

Looks like this endpoint yields 504s (and /page/summary too sometimes) from Varnish's point of view. 504 counts by endpoint for the last 7 days:

GET https://en.wikipedia.org/api/rest_v1/page/random/summary	27,503
GET https://de.wikipedia.org/api/rest_v1/page/random/summary	21,792
GET https://ru.wikipedia.org/api/rest_v1/page/random/summary	9,649
GET https://en.wikipedia.org/api/rest_v1/page/summary/	6,494
GET https://it.wikipedia.org/api/rest_v1/page/random/summary	5,671
GET https://fr.wikipedia.org/api/rest_v1/page/random/summary	4,946
GET https://es.wikipedia.org/api/rest_v1/page/random/summary	3,949
GET https://nl.wikipedia.org/api/rest_v1/page/random/summary	2,722
GET https://zh.wikipedia.org/api/rest_v1/page/random/summary	2,582
GET https://he.wikipedia.org/api/rest_v1/page/random/summary	2,352

See also: https://logstash.wikimedia.org/goto/b0e43fe267a84f6734fecf6a6dfdbd09

Event Timeline

mobrovac edited subscribers, added: Pchelolo, Mholloway, bearND, mobrovac; removed: Aklapper.

According to RESTBase logs, MCS does not respond in time and the socket is closed:

Error: socket hang up
    at createHangUpError (_http_client.js:253:15)
    at Socket.socketOnEnd (_http_client.js:345:23)
    at emitNone (events.js:91:20)
    at Socket.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:80:11)
    at process._tickCallback (internal/process/next_tick.js:104:9)
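
For reference, a "socket hang up" is what the Node.js HTTP client reports when the upstream closes the connection before sending any response. A minimal sketch (not MCS/RESTBase code) that reproduces the same error:

'use strict';
// Sketch only: a fake upstream that accepts the connection but closes the
// socket without responding, mimicking a service that times out internally;
// the client then sees "socket hang up".
const http = require('http');

const upstream = http.createServer((req) => {
  // Hang up after a short delay without ever writing a response.
  setTimeout(() => req.socket.end(), 100);
});

upstream.listen(0, () => {
  const { port } = upstream.address();
  http.get({ host: '127.0.0.1', port, path: '/' })
    .on('error', (err) => {
      console.log(err.message); // "socket hang up"
      upstream.close();
    });
});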

The problem seems to have started at 2018-12-03T00:00Z. This happens only in eqiad, i.e. for live requests only. I checked the SCB graphs, and the nodes are neither overloaded nor showing any problems. The only thing I can see in SAL around that time is:

00:34 legoktm@deploy1001: Synchronized php-1.33.0-wmf.6/includes/Title.php: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477182 (duration: 00m 52s)

But that is not related. There were no RB or MCS deploys around that date either.

Actually, if you zoom in, the rate jumped drastically around 14:30. However, the only thing I see in the deploy log that could be related is the Node.js security updates installed on scb around that time.

Please note that this is not the regular 'summary' endpoint; it's the random-article endpoint hydrated with a summary, so what is actually timing out in MCS is the /v1/page/random/title endpoint. That one is a very thin wrapper over a MediaWiki API call, so it's not quite clear how this could time out unless the MW API itself times out.

Not sure what's going on there. +1 to @Pchelolo's statement about /page/random/title using the MW API. It uses the random generator to request 12 random pages and picks the most interesting page title of the dozen results. We did not deploy any new code around this timeframe. During the last week or two we deployed code which should actually reduce pressure on the event loop, and we are still continuing to do so this week.
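
For context, requesting a dozen random page titles from the MediaWiki Action API looks roughly like the sketch below; the list=random module and its parameters are the public ones, but MCS's actual query (and the extra props it requests) may well differ.

'use strict';
// Sketch only: fetch 12 random, main-namespace, non-redirect titles via the
// public MediaWiki Action API. MCS's real request likely asks for additional
// properties (extracts, page images, descriptions) as well.
const https = require('https');
const { URLSearchParams } = require('url');

const params = new URLSearchParams({
  action: 'query',
  format: 'json',
  list: 'random',
  rnnamespace: '0',
  rnfilterredir: 'nonredirects',
  rnlimit: '12'
});

https.get(`https://en.wikipedia.org/w/api.php?${params}`, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const titles = JSON.parse(body).query.random.map((p) => p.title);
    console.log(titles);
  });
});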

Is there a good dashboard showing response times for the MW API random generator?

> Is there a good dashboard showing response times for the MW API random generator?

I do not think so. We're using the query module for random article requests, so the metrics are spread over all the query submodules, and the random submodule is not the most popular one.

I suppose parsing text extracts for 12 pages might not be a particularly fast operation for MediaWiki.

I've been looking at this endpoint lately anyway in connection with the performance work, since I had noticed the response times rising as well. I'm sure there's a better way to build an API to get a random title for a high-quality article. For example, if T63840: Add equivalent to Special:RandomInCategory to the API (either in the existing list=random or its own module) were implemented, we could just wrap an MW API call for RandomInCategory:Wikipedia_good_articles. (The downside is that an equivalent category probably doesn't exist in every language.)
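
Purely as an illustration of that idea: if T63840 were implemented, the wrapper might look roughly like the query below. The list=randomincategory module and its parameters do not exist today (that is exactly what T63840 asks for); they are hypothetical.

// Hypothetical sketch: 'list=randomincategory' is NOT a real Action API
// module; it is what T63840 proposes. Parameter names are invented here
// purely for illustration.
const { URLSearchParams } = require('url');

const params = new URLSearchParams({
  action: 'query',
  format: 'json',
  list: 'randomincategory',               // hypothetical module
  riccategory: 'Wikipedia_good_articles', // hypothetical parameter
  riclimit: '1'                           // hypothetical parameter
});
// The wrapper would GET https://en.wikipedia.org/w/api.php?<params>
// and return the single title from the response.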

So why exactly are we building the full summary response (actually, the wrong summary, since it's based not on MCS but on TextExtracts) if we eventually only use the 'title' property in RESTBase and either redirect to the proper RESTBase endpoint with that title or hydrate the response in RESTBase for the feed?

> So why exactly are we building the full summary response (actually, the wrong summary, since it's based not on MCS but on TextExtracts) if we eventually only use the 'title' property in RESTBase and either redirect to the proper RESTBase endpoint with that title or hydrate the response in RESTBase for the feed?

The heuristic for determining the quality of an article is based on the length of its TextExtract, the presence of a PageImage and title description, and its not being a disambiguation page or list article.
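
To illustrate, a rough sketch of how that "best of the dozen" pick could be scored; the field names and weights here are assumptions for illustration, not MCS's actual code.

// Sketch only: score candidate pages and keep the "most interesting" one.
// Field names and weights are illustrative assumptions, not MCS code.
function score(page) {
  if (page.isDisambiguation || page.isListArticle) {
    return 0; // the heuristic excludes disambiguation and list pages
  }
  let s = (page.extract || '').length; // longer TextExtract => likely higher quality
  if (page.pageimage) { s += 1000; }   // has a PageImage
  if (page.description) { s += 500; }  // has a title description
  return s;
}

function pickBest(pages) {
  return pages.reduce((best, p) => (score(p) > score(best) ? p : best));
}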

Aha, OK. So improving how this API works might make sense, but the main question is still why these 504s suddenly appeared. They were not present at all before, and now we suddenly get huge numbers of them. Suspicious.

I would suspect the internal promisification stuff, but that wasn't deployed until 12/6.

The kernel and Node.js got updated this morning and the SCB nodes in eqiad were restarted. After this, things seem to have calmed down. Let's keep the ticket open and monitor to see if it starts happening again.

When looking at the RESTBase dashboard, the 504s seem pretty clearly to result from the 12/3 security update.

10:52 moritzm: rolling upgrade of scb in codfw to nodejs security update

11:00: (screenshot: Screen Shot 2018-12-13 at 9.28.34 AM.png)

I'm glad it's calmed down, but I still think /page/random and other endpoints need performance work to avoid getting pushed over the edge by things like this.

> When looking at the RESTBase dashboard, the 504s seem pretty clearly to result from the 12/3 security update.

On 12/3 the upgrade was done on SCB in codfw; however, the 504s in this task are for external requests, which are served from eqiad. I'm failing to understand the possible connection between MCS in codfw and eqiad.

Interestingly, after the 12/13 upgrade in eqiad and the rolling reboot of the nodes, the 504s completely disappeared.

Pchelolo claimed this task.

Seems like things have calmed down. Even though we were not able to figure out the root cause, it seems by now we never will. Closing the ticket.