
HHVM CPU usage when deploying MediaWiki
Closed, Duplicate · Public · PRODUCTION ERROR

Description

Rolling forward 1.33.0-wmf.2 caused a wave of 60-second timeouts that persisted for quite a while.

The HHVM processes of appservers were using all CPU allocated to them.

Rolling back to 1.33.0-wmf.1 didn't seem to solve the issue.

Event Timeline

thcipriani renamed this task from "HHVM CPU usage with 1.33.0-wmf.2" to "HHVM CPU usage when deploying MediaWiki". Nov 1 2018, 9:36 PM
thcipriani triaged this task as Unbreak Now! priority.

What I briefly observed last night during the outage was:

  • On ill-behaving HHVMs, threads kept getting into deadlocks and burning 100% of a core while waiting on said locks
  • Given our 60-second request timeout, servers seemed to recover from the ill condition, but they fell back into deadlocks as soon as traffic returned to them
  • I didn't check the size of the bytecode cache databases to see if that exploded. If that's the case, I might want to purge it today. (A sketch for checking both points follows this list.)
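
A minimal sketch of how one could check both of the above from a shell on an affected appserver; the bytecode cache path and the exact commands are assumptions for illustration, not what was actually run:

# Per-thread CPU usage of the HHVM worker; deadlocked threads pinned at ~100% should stand out.
top -b -H -n 1 -p "$(pgrep -o hhvm)" | head -n 40

# Size of the bytecode cache databases (path is an assumption; check Repo.Central.Path
# in the HHVM config for the real location).
du -sh /var/cache/hhvm/*.hhbc* 2>/dev/null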

So, we tried to redeploy the code on the mwdebug servers today, and I load tested both servers with ab before and after the deployment of the code.

Overall:

  • no CPU spikes were noticed apart from normal request-related ones
  • no notable performance difference was noticed when parsing a mildly complex page (enwiki's Australia page) at medium (-c 5) and high (-c 10) concurrency (remember the mwdebug servers are 1-vcpu VMs); a sketch of the ab invocation follows this list
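
For reference, a hedged sketch of roughly what such an ab run could look like; the request count, URL, and the exact way the Host header was set are assumptions, not the invocation actually used:

# Medium concurrency (-c 5) against a mildly complex enwiki page on mwdebug1001.
ab -n 100 -c 5 -H 'Host: en.wikipedia.org' 'http://mwdebug1001.eqiad.wmnet/wiki/Australia'

# High concurrency (-c 10), same page, to compare latency before and after the deploy.
ab -n 100 -c 10 -H 'Host: en.wikipedia.org' 'http://mwdebug1001.eqiad.wmnet/wiki/Australia'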

This, together with the behaviour from yesterday and the perf data collected, seems to suggest that something went wrong on a portion of the HHVM hosts because of a particular state they were in. This is not a satisfactory answer, but it is frankly the only explanation I have, given that I can't seem to reproduce the high CPU usage with timeouts.

I don't see this as a blocker to try to redeploy the train on Monday.

> I don't see this as a blocker to try to redeploy the train on Monday.

The current plan is to roll forward the mwdebug servers again at 18:00 UTC Monday. If, after some prodding, those servers do not exhibit the behavior described here, we'll roll 1.33.0-wmf.2 out to all wikis on all servers shortly thereafter.

A bunch of new messages are getting tidied and going through remex now ( https://gerrit.wikimedia.org/r/#/q/topic:deprecate-wgtidy ) -- although that *shouldn't* cause problems, here are a couple of wild theories:

  1. MessageCache is missing somehow ( 4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 ) and tying up CPU (but I'm pretty sure the misses would show in logs). Rolling back would cause misses on the updated messages, so it would "not solve the issue".
  2. ParserCache is missing somehow ( 58abac2d1489cdfaaf2ffdf2f9e1214509760b31 ), but I'm pretty sure that would show up in analytics. Same thing: rolling back could cause misses of the previously-missed entries again.
  3. Some system message is triggering some weird infinite-loop bug in Remex. Of course, Remex has been used to tidy all article content for months now, but I'm always ready to be surprised. If there's a specific URL which triggers the CPU-hog behavior, that would be a big clue. But that wouldn't explain why rolling back didn't help.
  4. It's not my fault at all. (I like this one the best, but it doesn't help you fine folks in ops any.) ;)

AFAIK none of the parser/tidy related code has any locking so I'm guessing I'm off the hook for this problem, which seems to be a deadlock. But I figured I'd still offer some wild ideas just in case they are helpful by chance.

> I didn't check the size of the bytecode cache databases to see if that exploded. If that's the case, I might want to purge it today.

Did you end up purging the bytecode cache? If so, was it before you redeployed to mwdebug?

> A bunch of new messages are getting tidied and going through remex now ( https://gerrit.wikimedia.org/r/#/q/topic:deprecate-wgtidy ) -- although that *shouldn't* cause problems, here are a couple of wild theories:
>
>   1. MessageCache is missing somehow ( 4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 ) and tying up CPU (but I'm pretty sure the misses would show in logs). Rolling back would cause misses on the updated messages, so it would "not solve the issue".

Seems to be there:

thcipriani@mwdebug1001:~$ grep -i -A1 'for the same reason' /srv/mediawiki/php-1.33.0-wmf.2/includes/cache/MessageCache.php
                        // For the same reason, tidy the output!
                        $this->mParserOptions->setTidy( true );
>   2. ParserCache is missing somehow ( 58abac2d1489cdfaaf2ffdf2f9e1214509760b31 ), but I'm pretty sure that would show up in analytics. Same thing: rolling back could cause misses of the previously-missed entries again.

Also seems like it's there:

thcipriani@mwdebug1001:~$ grep -A3 'if any of this changes as' /srv/mediawiki/php-1.33.0-wmf.2/includes/parser/ParserOptions.php                      
                        // *UPDATE* ParserOptions::matches() if any of this changes as needed
                        self::$defaults = [
                                'dateformat' => null,
                                'tidy' => true,
>   3. Some system message is triggering some weird infinite-loop bug in Remex. Of course, Remex has been used to tidy all article content for months now, but I'm always ready to be surprised. If there's a specific URL which triggers the CPU-hog behavior, that would be a big clue. But that wouldn't explain why rolling back didn't help.

We haven't been able to find any URL that triggers it as yet, unfortunately :(
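
One hedged way to hunt for such a URL is to time candidate pages directly against a single appserver and watch for anything creeping toward the 60-second timeout; the host and page list below are assumptions for illustration, not something that was actually run:

# Time a few candidate pages against one appserver; anything unusually slow is a suspect.
for page in Australia Barack_Obama Special:RecentChanges; do
    curl -s -o /dev/null -H 'Host: en.wikipedia.org' \
        -w "${page}: %{time_total}s (HTTP %{http_code})\n" \
        "http://mwdebug1001.eqiad.wmnet/wiki/${page}"
done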

>   4. It's not my fault at all. (I like this one the best, but it doesn't help you fine folks in ops any.) ;)

> AFAIK none of the parser/tidy related code has any locking so I'm guessing I'm off the hook for this problem, which seems to be a deadlock. But I figured I'd still offer some wild ideas just in case they are helpful by chance.

thcipriani added a subscriber: akosiaris.

@akosiaris helped to roll wmf.2 out today. It required us to use cumin to depool each appserver and restart hhvm as part of deployment:

sudo cumin -b 3 -m async -p 99 'C:profile::mediawiki::webserver and *.eqiad.wmnet' 'depool' 'sudo -u mwdeploy scap pull' 'sudo -u mwdeploy scap wikiversions-compile' 'systemctl restart hhvm.service' 'pool'
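
For readers who don't parse cumin one-liners easily, the per-host steps it runs are roughly the following (the commands are taken from the invocation above; the comments are my own gloss, and running them by hand is only an illustration):

depool                                      # take the appserver out of the load-balancer pools
sudo -u mwdeploy scap pull                  # sync the MediaWiki code from the deployment server
sudo -u mwdeploy scap wikiversions-compile  # rebuild the wikiversions lookup mapping wikis to branches
systemctl restart hhvm.service              # restart HHVM so it comes back with a clean in-process state
pool                                        # put the appserver back into rotation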

Closing this task as resolved since we were able to roll forward with the MediaWiki deployment.

thcipriani reopened this task as Open. Feb 13 2019, 8:29 PM (edited)
thcipriani lowered the priority of this task from Unbreak Now! to High.

I see this still happening. Last night it happened when deploying l10n for an extension.

Happened again today when rolling forward wmf.17.

Makes gaining any confidence in a new deployment very difficult.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM