
HHVM CPU usage when deploying MediaWiki
Closed, Duplicate · Public · PRODUCTION ERROR

Description

Rolling forward 1.33.0-wmf.2 caused a wave of 60-second timeouts that persisted for quite a while.

The HHVM processes of appservers were using all CPU allocated to them.

Rolling back to 1.33.0-wmf.1 didn't seem to solve the issue.

Event Timeline

thcipriani renamed this task from "HHVM CPU usage with 1.33.0-wmf.2" to "HHVM CPU usage when deploying MediaWiki". Nov 1 2018, 9:36 PM
thcipriani triaged this task as Unbreak Now! priority.

What I briefly observed last night during the outage was:

  • On ill-behaving HHVMs, threads kept getting into deadlocks and burning 100% of a core while waiting on said locks
  • Given our 60-second request timeout, servers seemed to recover from the ill condition, but they fell back into deadlocks as soon as traffic returned to them
  • I didn't check the size of the bytecode cache databases to see if that exploded. If that's the case, I might want to purge it today. (A sketch for checking both points follows this list.)
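
A minimal sketch of how one could check both of the above from a shell on an affected appserver; the bytecode cache path and the exact commands are assumptions for illustration, not what was actually run:

# Per-thread CPU usage of the HHVM worker; deadlocked threads pinned at ~100% should stand out.
top -b -H -n 1 -p "$(pgrep -o hhvm)" | head -n 40

# Size of the bytecode cache databases (path is an assumption; check Repo.Central.Path
# in the HHVM config for the real location).
du -sh /var/cache/hhvm/*.hhbc* 2>/dev/null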

So, we tried to redeploy the code on the mwdebug servers today, and I load tested both servers with ab before and after the deployment of the code.

Overall:

  • no CPU spikes were noticed apart from normal request-related ones
  • no notable performance difference was noticed when parsing a mildly complex page (enwiki's Australia page) at medium (-c 5) and high (-c 10) concurrency (remember the mwdebug servers are 1-vcpu VMs); a sketch of the ab invocation follows this list
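
For reference, a hedged sketch of roughly what such an ab run could look like; the request count, URL, and the exact way the Host header was set are assumptions, not the invocation actually used:

# Medium concurrency (-c 5) against a mildly complex enwiki page on mwdebug1001.
ab -n 100 -c 5 -H 'Host: en.wikipedia.org' 'http://mwdebug1001.eqiad.wmnet/wiki/Australia'

# High concurrency (-c 10), same page, to compare latency before and after the deploy.
ab -n 100 -c 10 -H 'Host: en.wikipedia.org' 'http://mwdebug1001.eqiad.wmnet/wiki/Australia'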

This, together with the behaviour from yesterday and the perf data collected, seems to suggest that something went wrong on a portion of the HHVM hosts because of a particular state they were in. This is not a satisfactory answer, but it is frankly the only explanation I have, given that I can't seem to reproduce the high CPU usage with timeouts.

I don't see this as a blocker to try to redeploy the train on Monday.

> I don't see this as a blocker to try to redeploy the train on Monday.

The current plan is to roll forward the mwdebug servers again at 18:00 UTC Monday. If, after some prodding, those servers do not exhibit the behavior described here, we'll roll 1.33.0-wmf.2 out to all wikis on all servers shortly thereafter.

A bunch of new messages are getting tidied and going through remex now ( https://gerrit.wikimedia.org/r/#/q/topic:deprecate-wgtidy ) -- although that *shouldn't* cause problems, here are a couple of wild theories:

  1. MessageCache is missing somehow ( 4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 ) and tying up CPU (but I'm pretty sure the misses would show in logs). Rolling back would cause misses on the updated messages, so it would "not solve the issue".
  2. ParserCache is missing somehow ( 58abac2d1489cdfaaf2ffdf2f9e1214509760b31 ), but I'm pretty sure that would show up in analytics. Same thing: rolling back could cause misses of the previously-missed entries again.
  3. Some system message is triggering some weird infinite-loop bug in Remex. Of course, Remex has been used to tidy all article content for months now, but I'm always ready to be surprised. If there's a specific URL which triggers the CPU-hog behavior, that would be a big clue. But that wouldn't explain why rolling back didn't help.
  4. It's not my fault at all. (I like this one the best, but it doesn't help you fine folks in ops any.) ;)

AFAIK none of the parser/tidy related code has any locking so I'm guessing I'm off the hook for this problem, which seems to be a deadlock. But I figured I'd still offer some wild ideas just in case they are helpful by chance.

> I didn't check the size of the bytecode cache databases to see if that exploded. If that's the case, I might want to purge it today.

Did you end up purging the bytecode cache? If so, was it before you redeployed to mwdebug?

> A bunch of new messages are getting tidied and going through remex now ( https://gerrit.wikimedia.org/r/#/q/topic:deprecate-wgtidy ) -- although that *shouldn't* cause problems, here are a couple of wild theories:
>
>   1. MessageCache is missing somehow ( 4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 ) and tying up CPU (but I'm pretty sure the misses would show in logs). Rolling back would cause misses on the updated messages, so it would "not solve the issue".

Seems to be there:

thcipriani@mwdebug1001:~$ grep -i -A1 'for the same reason' /srv/mediawiki/php-1.33.0-wmf.2/includes/cache/MessageCache.php
                        // For the same reason, tidy the output!
                        $this->mParserOptions->setTidy( true );
>   2. ParserCache is missing somehow ( 58abac2d1489cdfaaf2ffdf2f9e1214509760b31 ), but I'm pretty sure that would show up in analytics. Same thing: rolling back could cause misses of the previously-missed entries again.

Also seems like it's there:

thcipriani@mwdebug1001:~$ grep -A3 'if any of this changes as' /srv/mediawiki/php-1.33.0-wmf.2/includes/parser/ParserOptions.php                      
                        // *UPDATE* ParserOptions::matches() if any of this changes as needed
                        self::$defaults = [
                                'dateformat' => null,
                                'tidy' => true,
>   3. Some system message is triggering some weird infinite-loop bug in Remex. Of course, Remex has been used to tidy all article content for months now, but I'm always ready to be surprised. If there's a specific URL which triggers the CPU-hog behavior, that would be a big clue. But that wouldn't explain why rolling back didn't help.

We haven't been able to find any URL that triggers it as yet, unfortunately :(
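
One hedged way to hunt for such a URL is to time candidate pages directly against a single appserver and watch for anything creeping toward the 60-second timeout; the host and page list below are assumptions for illustration, not something that was actually run:

# Time a few candidate pages against one appserver; anything unusually slow is a suspect.
for page in Australia Barack_Obama Special:RecentChanges; do
    curl -s -o /dev/null -H 'Host: en.wikipedia.org' \
        -w "${page}: %{time_total}s (HTTP %{http_code})\n" \
        "http://mwdebug1001.eqiad.wmnet/wiki/${page}"
done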

>   4. It's not my fault at all. (I like this one the best, but it doesn't help you fine folks in ops any.) ;)

> AFAIK none of the parser/tidy related code has any locking so I'm guessing I'm off the hook for this problem, which seems to be a deadlock. But I figured I'd still offer some wild ideas just in case they are helpful by chance.

thcipriani added a subscriber: akosiaris.

@akosiaris helped to roll wmf.2 out today. It required us to use cumin to depool each appserver and restart hhvm as part of deployment:

sudo cumin -b 3 -m async -p 99 'C:profile::mediawiki::webserver and *.eqiad.wmnet' 'depool' 'sudo -u mwdeploy scap pull' 'sudo -u mwdeploy scap wikiversions-compile' 'systemctl restart hhvm.service' 'pool'
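
For readers who don't parse cumin one-liners easily, the per-host steps it runs are roughly the following (the commands are taken from the invocation above; the comments are my own gloss, and running them by hand is only an illustration):

depool                                      # take the appserver out of the load-balancer pools
sudo -u mwdeploy scap pull                  # sync the MediaWiki code from the deployment server
sudo -u mwdeploy scap wikiversions-compile  # rebuild the wikiversions lookup mapping wikis to branches
systemctl restart hhvm.service              # restart HHVM so it comes back with a clean in-process state
pool                                        # put the appserver back into rotation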

Closing this task as resolved since we were able to roll forward with the MediaWiki deployment.

thcipriani reopened this task as Open. Feb 13 2019, 8:29 PM (edited)
thcipriani lowered the priority of this task from Unbreak Now! to High.

I see this still happening. Last night it happened when deploying l10n for an extension.

Happened again today when rolling forward wmf.17.

Makes gaining any confidence in a new deployment very difficult.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM