Rolling forward 1.33.0-wmf.2 caused a wave of 60 second timeouts that sustained for quite a while.
The HHVM processes of appservers were using all CPU allocated to them.
Rolling back to 1.33.0-wmf.1 didn't seem to solve the issue.
What I briefly observed last night during the outage was:
So we tried to redeploy the code on the mwdebug servers today, and I load-tested both servers with ab before and after the deployment of the code.
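For reference, a before/after comparison of that kind can be sketched roughly as below; the URL, request count, and concurrency are illustrative assumptions, not the exact values used, and the command is printed as a dry run rather than executed:

```shell
#!/bin/bash
# Dry-run sketch of a before/after load test with ApacheBench (ab).
# URL, request count (-n), and concurrency (-c) are illustrative
# assumptions, not the values actually used during this incident.
run_ab() {
  local url="$1" requests="$2" concurrency="$3"
  # The real invocation would be:  ab -n "$requests" -c "$concurrency" "$url"
  echo "ab -n ${requests} -c ${concurrency} ${url}"
}

# e.g. run once before deploying wmf.2 to mwdebug, once after,
# then compare the latency distributions ab reports:
run_ab 'http://mwdebug1001.eqiad.wmnet/wiki/Special:BlankPage' 500 10
```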
Overall:
This, together with yesterday's behaviour and the perf data collected, suggests that something went wrong on a portion of the HHVM hosts because of a particular state they were in. That is not a satisfactory answer, but it is frankly the only explanation I have, given that I can't reproduce the high CPU usage with timeouts.
I don't see this as a blocker to try to redeploy the train on monday.
Current plan is to roll forward the mwdebug servers again at 18:00 UTC Monday. If, after some prodding, those servers do not exhibit the behavior described here, we will roll 1.33.0-wmf.2 out to all wikis on all servers shortly thereafter.
A bunch of new messages are getting tidied and going through remex now ( https://gerrit.wikimedia.org/r/#/q/topic:deprecate-wgtidy ) -- although that *shouldn't* cause problems, here are a couple of wild theories:
AFAIK none of the parser/tidy related code has any locking so I'm guessing I'm off the hook for this problem, which seems to be a deadlock. But I figured I'd still offer some wild ideas just in case they are helpful by chance.
Did you end up purging the bytecode cache? If so, was it before you redeployed to mwdebug?
Seems to be there:
thcipriani@mwdebug1001:~$ grep -i -A1 'for the same reason' /srv/mediawiki/php-1.33.0-wmf.2/includes/cache/MessageCache.php
        // For the same reason, tidy the output!
        $this->mParserOptions->setTidy( true );
- ParserCache is missing somehow ( 58abac2d1489cdfaaf2ffdf2f9e1214509760b31 ) but I'm pretty sure that would show up in analytics. Same thing: rolling back could cause misses on top of the previous misses.
Also seems like it's there:
thcipriani@mwdebug1001:~$ grep -A3 'if any of this changes as' /srv/mediawiki/php-1.33.0-wmf.2/includes/parser/ParserOptions.php
    // *UPDATE* ParserOptions::matches() if any of this changes as needed
    self::$defaults = [
        'dateformat' => null,
        'tidy' => true,
- Some system message is triggering some weird infinite-loop bug in Remex. Of course, Remex has been used to tidy all article content for months now, but I'm always ready to be surprised. If there's a specific URL which triggers the CPU-hog behavior, that would be a big clue. But that wouldn't explain why rolling back didn't help.
We haven't been able to find any URL that triggers it yet, unfortunately :(
- It's not my fault at all. (I like this one the best, but it doesn't help you fine folks in ops any.) ;)
@akosiaris helped to roll wmf.2 out today. It required us to use cumin to depool each appserver and restart HHVM as part of the deployment:
sudo cumin -b 3 -m async -p 99 'C:profile::mediawiki::webserver and *.eqiad.wmnet' 'depool' 'sudo -u mwdeploy scap pull' 'sudo -u mwdeploy scap wikiversions-compile' 'systemctl restart hhvm.service' 'pool'
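Spelled out, the per-host sequence that the cumin one-liner batches (3 hosts at a time) amounts to the loop below. The host names are illustrative, the real run targeted 'C:profile::mediawiki::webserver and *.eqiad.wmnet', and the commands are printed as a dry run rather than executed:

```shell
#!/bin/bash
# Sketch of the per-host redeploy sequence cumin batched across the
# fleet. Host names are illustrative; commands are echoed, not run.
redeploy_host() {
  local host="$1"
  echo "== ${host} =="
  echo "ssh ${host} depool"
  echo "ssh ${host} sudo -u mwdeploy scap pull"
  echo "ssh ${host} sudo -u mwdeploy scap wikiversions-compile"
  echo "ssh ${host} sudo systemctl restart hhvm.service"
  echo "ssh ${host} pool"
}

for host in mw1261 mw1262 mw1263; do  # illustrative appserver names
  redeploy_host "$host"
done
```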
Closing this task as resolved since we were able to roll forward with the MediaWiki deployment.
I see this still happening. Last night it happened when deploying l10n for an extension.
Happened again today when rolling forward wmf.17.
Makes gaining any confidence in a new deployment very difficult.