Page MenuHomePhabricator

API appserver CPU exhaustion probably due to DiscussionTools
Closed, ResolvedPublic

Description

For approximately 24 minutes, API App servers on eqiad run out of threads (all busy) This caused sending errors to some clients using the action api and parsoid in the primary datacenter.

While saturation got resolved, latencies are still high and not fully recovering

Event Timeline

Bugreporter triaged this task as Unbreak Now! priority.Dec 18 2022, 6:03 PM
RhinosF1 subscribed.

SRE are aware and working on it. Likely traffic from a sports event.

Unable to login through the API, the website works though.

@RPI2026F1 do you still have login issues? We still have high latencies, but api calls should be going through.

@jcrespo I can login, just slower than usual

jcrespo lowered the priority of this task from Unbreak Now! to High.EditedDec 18 2022, 7:38 PM

We believe issues are gone. Latencies may vary for clients, but no breakage (there was high traffic load at the end of the World cup).

Mentioned in SAL (#wikimedia-operations) [2022-12-18T19:40:47Z] <sukhe> ran sudo cumin -b 1 -s 30 'A:mw-api and A:eqiad' 'restart-php7.4-fpm' [at 18:55 UTC]: T325477

Latencies will continue to be higher than usual due to load. This will require further debugging to understand the root cause. The immediate cause (trigger) was the traffic spike due to World Cup final ending, and while we have some possible culprit, we will require more research tomorrow to understand and communicate why errors happened. A document about the incident will be written soon.

For SREs: This is the temporary doc with notes: https://docs.google.com/document/d/1zhFUcPxxp7C5uHERgKFH6hvVhUy1g4S0jM2lGOXgwSM/edit#

tstarling added subscribers: ppelberg, DLynch, tstarling.

Since 2022-12-08 22:20 we have had high traffic to action=discussiontoolspageinfo , with a daily peak of around 2k req/s. So that is T321961: [Config Change] Start mobile DiscussionTools A/B test.

The outage was more a latency spike than a request rate spike. So there may have been some backend issue. But it seems prudent to roll back the DiscussionTools A/B test.

I confirmed using the API log that most requests to action=discussiontoolspageinfo are for the A/B test wikis listed in that task. Presumably the high rate of API queries is unintentional.

Change 868544 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Revert "Start mobile DiscussionTools A/B test"

https://gerrit.wikimedia.org/r/868544

discussiontoolspageinfo request rate

dtpi reqs.png (337×637 px, 38 KB)

A bit noisy starting around 14:50, but other days show that kind of pattern near the daily peak. It gets slowly worse. Latency creeps up from ~55ms to ~70ms. Then there's a step change at 17:57 seen in the number of idle PHP-FPM workers:

workers.png (337×638 px, 51 KB)

The recovery at 18:18 was due to a restart of PHP-FPM. So any PHP requests running at that time would not have reported their running time to the API dashboard. So bear that in mind when we look at the API load breakdown:

api load.png (340×638 px, 49 KB)

There was a 503 spike, the API appservers serving about 4.6 k req/s of 503s. I think this was a response to the lack of idle workers, not a demand spike.

503.png (302×639 px, 31 KB)

The hourly excimer flame graph doesn't show any backend requests which would provide an explanation. 28% discussiontoolspageinfo, which is mostly CPU-heavy, evenly split between Parsoid and a Remex parse of the Parsoid output.

So I guess we just ran out of CPU:

cpu.png (264×638 px, 34 KB)

Peak CPU usage was 92%.

Change 868544 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "Start mobile DiscussionTools A/B test"

https://gerrit.wikimedia.org/r/868544

Mentioned in SAL (#wikimedia-operations) [2022-12-19T03:08:51Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: disable wgDiscussionToolsABTest T325477 T321961 (duration: 15m 23s)

@tstarling can those charts get more granular? I'd be very interested to know whether it was the transcludedfrom or threaditemshtml prop being requested from discussiontoolspageinfo.

The stack trace for the network request shows controller.js init() calling getPageData():

	// TODO: Isn't this too early to load it? We will only need it if the user tries replying...
	getPageData(
		mw.config.get( 'wgRelevantPageName' ),
		mw.config.get( 'wgCurRevisionId' )
	);

The TODO comment may be relevant here.

Yeah, the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading side-effects. We should either pull out the test-maintenance code into a completely separate bundle (or inline it into the page somewhere), or resolve that TODO.

(Under non-test circumstances the fetch-data-on-load behavior is less concerning because the JS is included more selectively. I think it's only causing a load issue here because it's running relatively complicated things on the main namespace, which very-few wikis would be causing normally.)

Does the response have any private data in it? I think if ApiDiscussionToolsPageInfo::execute() called $this->getMain()->setCacheMode( 'public' ) and you set the query string parameters maxage=86400 and smaxage=86400, then the response would have an Expires header and would be cached in the CDN. You have the oldid in there already for invalidation.

tstarling renamed this task from large number of 503 errors to API appserver CPU exhaustion probably due to DiscussionTools.Dec 19 2022, 5:09 AM

It shouldn't. That particular call is just asking whether anything on the page is inside a transclusion, to work out whether it can actually be used for comment-leaving purposes.

Maybe only set the cache header if it's on the oldid path -- it is valid to call it without that if you just want the latest revision, in which case caching would be a bad idea, and I'm not certain offhand whether we depend on that case.

I might be missing something obvious but https://es.m.wikipedia.org/wiki/Sede_de_la_Organizaci%C3%B3n_de_las_Naciones_Unidas is an article. Why DT is being loaded in articles? Isn't is supposed to load in discussion namespaces only (Talk + wgExtraSignatureNamespaces)?

See:

the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading side-effects. We should either pull out the test-maintenance code into a completely separate bundle (or inline it into the page somewhere), or resolve that TODO.

Change 869051 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@master] Emergency: discussiontoolspageinfo return empty response in non-talk ns

https://gerrit.wikimedia.org/r/869051

Change 868867 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@wmf/1.40.0-wmf.14] Emergency: discussiontoolspageinfo return empty response in non-talk ns

https://gerrit.wikimedia.org/r/868867

Change 869163 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/DiscussionTools@master] In discussiontoolspageinfo, don't parse non-talk pages

https://gerrit.wikimedia.org/r/869163

Change 869051 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Emergency: discussiontoolspageinfo return empty response in non-talk ns

https://gerrit.wikimedia.org/r/869051

Change 868867 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.40.0-wmf.14] Emergency: discussiontoolspageinfo return empty response in non-talk ns

https://gerrit.wikimedia.org/r/868867

Change 869169 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/DiscussionTools@master] Only preload getPageData if there's thread data for the page

https://gerrit.wikimedia.org/r/869169

Mentioned in SAL (#wikimedia-operations) [2022-12-19T09:05:47Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]]

Mentioned in SAL (#wikimedia-operations) [2022-12-19T09:07:35Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-12-19T09:15:12Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] (duration: 09m 24s)

Change 869169 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Only preload getPageData if there's thread data for the page

https://gerrit.wikimedia.org/r/869169

Change 868869 had a related patch set uploaded (by Majavah; author: DLynch):

[mediawiki/extensions/DiscussionTools@wmf/1.40.0-wmf.14] Only preload getPageData if there's thread data for the page

https://gerrit.wikimedia.org/r/868869

Change 868869 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.40.0-wmf.14] Only preload getPageData if there's thread data for the page

https://gerrit.wikimedia.org/r/868869

Mentioned in SAL (#wikimedia-operations) [2022-12-19T10:20:08Z] <taavi@deploy1002> Started scap: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]]

Mentioned in SAL (#wikimedia-operations) [2022-12-19T10:21:50Z] <taavi@deploy1002> taavi and taavi: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-12-19T10:28:07Z] <taavi@deploy1002> Finished scap: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]] (duration: 07m 58s)

I also need to write an incident report- I will probably require some of your help.

Change 869163 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Clean up ApiDiscussionToolsPageInfo hack

https://gerrit.wikimedia.org/r/869163

A draft has been posted at https://wikitech.wikimedia.org/wiki/Incidents/2022-12-18_World_Cup Could you help filling it in? While I was able to state some of the operational bits (from a limited perspective), I am mostly ignorant about the code details/deployments/fixes leading to the outage and its resolution- as both Discussion Tool developers and Service Ops SRE will have more know-how to document that. Please feel free to amend my possible misconceptions.

A draft has been posted at https://wikitech.wikimedia.org/wiki/Incidents/2022-12-18_World_Cup Could you help filling it in? While I was able to state some of the operational bits (from a limited perspective), I am mostly ignorant about the code details/deployments/fixes leading to the outage and its resolution- as both Discussion Tool developers and Service Ops SRE will have more know-how to document that. Please feel free to amend my possible misconceptions.

@jcrespo Thank you for working on this!