
wmf/1.30.0-wmf.2 performance issue for Wikipedias
Closed, ResolvedPublic

Description

Timeline:

  • 19:09 wmf.2 was rolled forward to group2 (wikipedias)
  • 19:21 wikipedias main_page rendering becomes slow, alerts flood -operations
  • 20:25 wikipedias all rolled back to wmf.1 and alerts cease

Event Timeline

debt added a subscriber: debt.May 26 2017, 7:25 AM
phuedx added a subscriber: phuedx.May 26 2017, 7:44 AM
Volans added a subscriber: Volans.May 26 2017, 8:48 AM

I believe this would merit an incident report to find out exactly what happened?

Agreed. Created one this morning: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170525-MediaWiki

tl;dr is that there were many confounding factors happening at the same time, but I should have rolled back the train sooner.

The start of the incident didn't coincide with the train; there was database lag, and Redis instances were out of memory. In retrospect, that is exactly when it is especially important to roll back the train: when there are many possible causes for problems.

There was a wikitext preprocessor patch in wmf.2 that I'm responsible for: T146304. I'm assuming that the fact that the start of the incident didn't coincide immediately with the deploy points the finger more at caching or database issues than the wikitext parser, but I'm watching this thread in case my assumption is incorrect. (Also, the preprocessor change has to do with -{...}- markup, which is more likely to exist on zhwiki and friends than the enwiki main page.)

To recap where this issue left off yesterday:

Attempted to recreate the issue with @Legoktm and @Volans: moved wikipedias on mwdebug1002 over to 1.30.0-wmf.2 and tried to recreate the slowdown both logged out and logged in, but had no success. We also moved a single appserver's wikipedias (mw1170) to wmf.2 to see if there was any effect from additional traffic levels, and were unable to recreate. On the appserver we tested with curl and ab and have not been able to recreate the slowdown.

Complete list of changes in this version is https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.2 if that's helpful.

greg added a comment.May 26 2017, 5:37 PM

Status update from RelEng:

We don't have any plans to roll out wmf.2 any further today (Friday the 26th). We still don't have a root cause and we're going into a 3 day weekend.

We have confirmation from the Performance Team that they (Krinkle) will begin investigating this from their side in a couple of hours' time (around 19:30 UTC).

Ping us (probably thcipriani on IRC) if you want any other production app/debug servers to have wmf.2 put on them for testing. We'll/he'll be around and available for that.

Nirmos added a subscriber: Nirmos.May 26 2017, 7:17 PM

There are no xhprof runs in xhgui for the initial deployment window when the performance regression happened. With the attempts to reproduce on single machines failing, I think we'll have to plan a wider re-deployment to be able to profile this as the issue is happening.

> There are no xhprof runs in xhgui for the initial deployment window when the performance regression happened. With the attempts to reproduce on single machines failing, I think we'll have to plan a wider re-deployment to be able to profile this as the issue is happening.

I'd like to avoid a redeployment, but it seems there is very little information to go on without one. We can tentatively schedule it for the regular train window today if someone from Performance-Team will be around to gather the needed info, so we can keep the redeployment window as small as possible.

Would it make sense to bisect wmf.2 at least before the redeploy? That would give a ~50% possibility of no issue occurring w/ the deploy, while still teaching us something.

Gilles added a subscriber: Krinkle.May 30 2017, 3:48 PM

@Krinkle are you available to do that? 19:00 UTC today

@cscott my understanding is that the issue couldn't be reproduced at all outside of the complete deployment. We can git bisect the actual redeployment, though.

> Would it make sense to bisect wmf.2 at least before the redeploy? That would give a ~50% possibility of no issue occurring w/ the deploy, while still teaching us something.

> @cscott my understanding is that the issue couldn't be reproduced at all outside of the complete deployment. We can git bisect the actual redeployment, though.

git bisect for the redeployment is not necessarily safe, because it assumes that all commits on master are deployable. Is that true? For example, there might be patches that were reverted based on testing and that shouldn't be deployed. So, some care is required there.

cscott added a comment.EditedMay 30 2017, 3:55 PM

@Gilles Yes, that is what I meant. It's a little complicated of course, but (for instance) we could start by deploying just the changes to extensions. From https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.2 that looks like it shouldn't have any bad effects (we should test beforehand, of course) and would rule out more than half the changelist.

EDIT: yes, as @ssastry points out, bisection would have to be done w/ care. It may or may not be less risky than deploying a "known bad" wmf.2. I'm not saying it should definitely be done, just suggesting it be considered.

Joe added a comment.May 30 2017, 4:43 PM

Since the problem presented itself only ~15 minutes after the deploy, it could be that something that we were able to cache in WANCache before is now somehow uncacheable and thus very expensive to compute.

That would explain why with only one machine on the new version, we cannot see the regression: since the other N machines are able to save the content in cache, the cache is populated and the symptom never reappears.

One test in this direction would be: quickly install memcached on mwdebug1001, reconfigure nutcracker to connect just to the local instance, and repeat the load test on wmf2 there.
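Joe's isolation test could look roughly like the following nutcracker (twemproxy) pool definition. This is a hedged sketch only: the pool name, ports, and hashing options here are illustrative, and the real WMF nutcracker configuration (managed in puppet) differs.

```yaml
# Hypothetical pool for the test above: route all memcached traffic on
# mwdebug1001 to a local instance only, so cache entries produced by the
# rest of the cluster can no longer mask a caching regression.
# Names, ports, and options are illustrative, not WMF's actual config.
memcached-local-test:
  listen: 127.0.0.1:11212
  hash: md5
  distribution: ketama
  timeout: 250
  auto_eject_hosts: false
  servers:
    - 127.0.0.1:11211:1
```

With MediaWiki on that host pointed at 127.0.0.1:11212, a cold local memcached would then have to be populated entirely by the machine under test.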

Sounds worth a try, that would also explain why group0 and group1 are fine at the moment. I'm dealing with a sick child, though, so I'd rather not venture into that kind of live hacking myself in the next 3 hours if I can avoid it. Hopefully @aaron or @Krinkle appear before then.

> Sounds worth a try, that would also explain why group0 and group1 are fine at the moment. I'm dealing with a sick child, though, so I'd rather not venture into that kind of live hacking myself in the next 3 hours if I can avoid it. Hopefully @aaron or @Krinkle appear before then.

I'm available to move mwdebug100{1,2} to wmf.2 when there are folks that can help pinpoint the root issue.

It seems that memcached is already running on those hosts; the nutcracker config would need to be edited for testing.

We're trying all of that as we speak with @aaron, on mwdebug1001

We did what @Joe suggested on mwdebug1001 and we couldn't reproduce the issue.

@thcipriani @greg We unfortunately no longer have hourly-precision debug profiles from production samples (flame graphs expire after 24 hours; T166624 deals with the currently buggy retention of those, and with extending it to at least a week going forward).

We also didn't run manual xhprof runs during the incident, so we simply don't have enough data. Together with the large changelog (see https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.2) I don't think we'll be able to pinpoint this.

We tried re-enabling wmf.2 on a single app server, but were unable to reproduce or otherwise observe the issue.

@Gilles and I agree that we should go ahead with re-enabling it in production and monitor the situation from that point onward. If it happens again, we'll capture the before/after profiles and go from there. (Possibly re-reverting until we've found the problem in the captured data.)

greg added a comment.May 30 2017, 9:56 PM

> @Gilles and I agree that we should go ahead with re-enabling it in production and monitor the situation from that point onward. If it happens again, we'll capture the before/after profiles and go from there. (Possibly re-reverting until we've found the problem in the captured data.)

Thanks for trying :/

Alright, so, what's the timeline for that since I presume (at least) one of you needs to be around for that? I know both of you are EU hours and it's getting very late there.

The best time for me would be Thursday EU morning, i.e. from 07:00am GMT onwards. Maybe @Krinkle has sooner availability to offer tomorrow.

greg added a comment.May 31 2017, 4:01 PM

So.... do we have anyone else watching this ticket who can do what @Krinkle describes above today, Wednesday? I'd rather not wait until the Thursday after we halted the train to do this investigation. This is going to make the next time we do the train that much worse (we'll have two weeks of code changes again, instead of one).

Is there any documentation that we/RelEng can follow to get the same information?

@greg I'm working late this week (evening instead of morning), I could monitor a deployment in the described way today between 11 PM and midnight (CEST), which is in 4 hours. Let me know.

greg added a comment.May 31 2017, 5:12 PM

Thanks @Krinkle. Opsen following along: can someone be around at that time to have Ops coverage?

Krinkle added a comment.EditedMay 31 2017, 7:24 PM

(Continued investigation using the data we did capture in SAL and Logstash)

  • Alerts for HHVM response times taking 10+ seconds only happened in the inactive data center (codfw).
  • Various DB maintenance was happening in codfw in the hours before the wmf.2 deployment to group2 wikis.
  • Slow response entries found in eqiad app server logs directly (T166345#3293878) may be a red herring as those do happen otherwise as well, but possibly more frequently during the wmf.2 deployment.

As such, it may be possible that nothing went wrong in this branch and that only codfw slowed down due to unrelated maintenance during that time. In the hopes of confirming this I looked in the DBQuery logs.

https://logstash.wikimedia.org/goto/175c3180ef057dd6bcfaf6f46a900f4a

There I find:

  • A doubling of entries between 19:20 and 19:45, with another spike 20:15.
  • Most warnings (~1600) had server=127.0.0.1 (e.g. Job runners).
  • Most warnings (~1300) had wiki=enwiki.
  • Most common warning messages:
count | message
1000  | Wikimedia\Rdbms\DatabaseMysqlBase::lock failed to acquire lock 'LinksUpdate:job:pageid:15580374'
457   | Wikimedia\Rdbms\LoadBalancer::{closure}: found writes/callbacks pending.

Example entry for the most common warning:

mwversion            1.30.0-wmf.2
normalized_message   Wikimedia\Rdbms\DatabaseMysqlBase::lock failed to acquire lock 'LinksUpdate:job:pageid:15580374'
server               127.0.0.1
shard                s1
url                  /rpc/RunJobs.php?wiki=enwiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M

Looking at the original alerts in codfw from https://wikitech.wikimedia.org/wiki/Incident_documentation/20170525-MediaWiki

19:22:56 icinga-wm PROBLEM - HHVM rendering on mw2112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[..]

Logstash for host:mw2112 during this time:
https://logstash.wikimedia.org/goto/15c23690ca636858c6b2f2b5722c973b

2017-05-25T19:05:44	mediawiki	WARNING	enwiki	mw2112	Server db2069 has 10.999843120575 seconds of lag (>= 6)
...
2017-05-25T19:06:07	mediawiki	WARNING	enwiki	mw2112	Server db2069 has 16.433945894241 seconds of lag (>= 6)
...
2017-05-25T19:20:59	mediawiki	INFO	enwiki	mw2112	Pool key 'enwiki:pcache:idhash:15580374-0!*!0!!*!4!*:revid:777395573' (ArticleView): Timeout waiting for the lock
2017-05-25T19:22:52	scap	 - 	 - 	mw2112	Touching /srv/mediawiki/wmf-config/InitialiseSettings.php
2017-05-25T19:22:52	scap	 - 	 - 	mw2112	Copying to mw2112.codfw.wmnet from mw2117.codfw.wmnet
2017-05-25T19:22:57	mediawiki	INFO	enwiki	mw2112	Pool key 'enwiki:pcache:idhash:15580374-0!*!0!!*!4!*:revid:777395573' (ArticleView): Timeout waiting for the lock
...
2017-05-25T19:23:38	scap	 - 	 - 	mw2112	Touching /srv/mediawiki/wmf-config/InitialiseSettings.php
2017-05-25T19:23:38	scap	 - 	 - 	mw2112	Copying to mw2112.codfw.wmnet from mw2117.codfw.wmnet
2017-05-25T19:23:40	mediawiki	INFO	enwiki	mw2112	Pool key 'enwiki:pcache:idhash:15580374-0!*!0!!*!4!*:revid:777395573' (ArticleView): Timeout waiting for the lock
...
2017-05-25T19:24:55	mediawiki	INFO	enwiki	mw2112	Pool key 'enwiki:pcache:idhash:15580374-0!*!0!!*!4!*:revid:777395573' (ArticleView): Timeout waiting for the lock
...
2017-05-25T19:28:00	mediawiki	WARNING	enwiki	mw2112	Server db2065 has 5475.878898859 seconds of lag (>= 6)
...
2017-05-25T19:28:04	mediawiki	INFO	enwiki	mw2112	9.72 Main_Page
2017-05-25T19:28:05	mediawiki	INFO	enwiki	mw2112	Could not insert 1 enqueue job(s).
2017-05-25T19:28:05	mediawiki	ERROR	enwiki	mw2112	Lua script error on server "rdb2003.codfw.wmnet:6380": ERR Error running script (call to f_2aac70021ade78213b297e3e8316fa24f82d2897): @user_script:30: @user_script: 30: -OOM command not allowed when used memory > 'maxmemory'.
Krinkle moved this task from Inbox to Doing on the Performance-Team board.May 31 2017, 7:51 PM
Krinkle updated the task description. (Show Details)May 31 2017, 9:48 PM
Krinkle added a comment.EditedMay 31 2017, 9:51 PM

Profiles captured just now, moments before the wmf.1>wmf.2 promotion on Wikipedia wikis (group2).

First capture after the promotion:

Capture 8 minutes after the promotion, after a no-op file touch of wmf-config/InitialiseSettings.php

After purging Main Page via on-wiki interface (action=purge)

DannyH added a subscriber: DannyH.May 31 2017, 11:14 PM
Joe added a comment.Jun 1 2017, 6:28 AM

> (Continued investigation using the data we did capture in SAL and Logstash)
>
>   • Alerts for HHVM response times taking 10+ seconds only happened in the inactive data center (codfw).

This is incorrect.

I checked all appservers in both eqiad and codfw and all showed a large slowness in rendering the enwiki main page at the time.

Gilles added a comment.EditedJun 1 2017, 7:27 AM

Based on last night's short redeploy, it's extremely likely that the issue has to do with parsing. @cscott's change will be the first one to look at, imho. Basically, once the branch was deployed, nothing happened until we realized this probably had to do with parsing the enwiki Main Page. So we purged it manually and, sure enough, the issue recurred immediately.

This means that the 12ish minutes that elapsed last time were pure chance. The parser cache for the Main Page just happened to expire at that point or a user purged the page and it triggered the issue.

The symptoms suggest that either the page can't be parsed, is taking an extremely long time to do so (possibly minutes) or the parsed content can't be saved in the parser cache. Most requests to the main page hit pool counter, wait a while and give up, resulting in visible PHP timeouts in monitoring.

It's possible that other pages than the enwiki Main Page would be affected and just haven't had their parser cache deleted/expired in either deployment timeframes. It's through pure chance that this showed up last time.

The reason why the tests on single machines all failed, including the last attempt pointing nutcracker to localhost, is that the parser cache entry in the database could always be found.

My plan to reproduce this is to figure out what code to run in eval.php to parse the enwiki main page without looking in the parser cache, then run that on mwdebug1001 against wmf.1 and wmf.2. That should confirm the theory, and once we have it reproduced in isolation we can try reverting the parser change, git bisecting the branch if that commit wasn't the culprit, etc.

Gilles added a comment.Jun 1 2017, 7:45 AM

I guess this API call might do?

https://en.wikipedia.org/w/api.php?action=parse&page=Main_Page
Gilles added a comment.Jun 1 2017, 8:08 AM

The API seems to fetch something cached, won't do. I tried running the following code:

// Parse the enwiki Main Page directly (run via eval.php), without
// consulting ParserCache: Content::getParserOutput() always parses.
$title = Title::newFromText( 'Main_Page' );
$page = WikiPage::factory( $title );
$content = $page->getContent( Revision::RAW );
$popts = $page->makeParserOptions( 'canonical' );
// Final `false` is $generateHtml: skip HTML generation, we only want timings.
$p_result = $content->getParserOutput( $title, $page->getLatest(), $popts, false );
$report = $p_result->getLimitReportData();
echo $report['limitreport-cputime'];

But parsing worked and took the same amount of time on wmf.1 and wmf.2.

I'm not familiar with parsing at all, though, so I'm not sure those are the right parameters to simulate the kind of parsing that happens on purge.

Gilles added a comment.EditedJun 1 2017, 8:18 AM

With mwdebug1001 pointed to wmf.2, ?action=purge on the main page works and the pp report shows that the page was parsed by mwdebug1001 (and parsing was fast). Furthermore, the parser cache entry rendered on mwdebug1001 sticks on subsequent loggedin pageviews.

Gilles added a comment.Jun 1 2017, 8:45 AM

Last night after the purge the main page was showing up several times in the slow-parse log. First instance: https://logstash.wikimedia.org/goto/2a150e06fb29611f84984e93a279baf6

And this measures the time this takes:

		$this->parserOutput = $content->getParserOutput(
			$this->page->getTitle(),
			$this->revid,
			$this->parserOptions
		);

Should be the same as what I'm trying on mwdebug1001... unless parserOptions are different in that context.

Gilles added a comment.EditedJun 1 2017, 9:09 AM

@Krinkle browsing through xhgui I don't see your hit on action=purge.

I don't see anything fundamentally different between an xhprof run following a purge right now and the one you've linked to. But none of the runs from last night in xhgui would meet the slow-parse criteria (actual parsing taking longer than 3s).

All main page slow parses during the redeploy: https://logstash.wikimedia.org/goto/62e3c45e1922b46d1f3f60481bb83853

Something noteworthy here is that the slow parses are mostly on codfw. The high frequency there suggests that PoolCounter isn't doing its job in codfw. It should be impossible for so many parse runs to get through the PoolCounter gates, so to speak.

Someone should correct me if I'm wrong, but the wiki on codfw is set to read-only (that's what I see when I browse ?action=purge on a codfw app server). Which means that PoolWorkArticleView might consider the run uncacheable and use acquireForMe instead of acquireForAnyone, meaning that when the slow parse hits, codfw has no DC-wide PoolCounter coordination and keeps trying to parse like crazy.
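The acquireForMe/acquireForAnyone distinction can be sketched as follows. This is a simplified illustration, not the actual PoolWorkArticleView source: the method names exist on MediaWiki's PoolCounter classes, but the surrounding logic here is pseudocode.

```php
// Simplified sketch (not the real PoolWorkArticleView code).
// Cacheable result: one worker parses on behalf of everyone; other
// clients wait and then read the freshly cached output.
// Uncacheable result (as the read-only codfw wiki may be treated):
// every client acquires a slot for itself, so there is no DC-wide
// dogpile protection and many clients parse concurrently.
if ( $work->isCacheable() ) {
    $status = $poolCounter->acquireForAnyone(); // shared work, then cache
} else {
    $status = $poolCounter->acquireForMe();     // each client works alone
}
```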

If we look at the logstash link above, we see that for eqiad (except for the first 2 that are really close to each other) the span of time between slow parses makes more sense in regards to PoolCounter.

So that might explain why codfw is disproportionately affected. It doesn't explain however why the page keeps getting parsed, as on mwdebug1001 when running wmf.2 it saves to and loads from the parser cache fine.

Gilles added a comment.EditedJun 1 2017, 10:38 AM

I tried pointing nutcracker to localhost again on mwdebug1001 and purged the main page. On subsequent loggedin loads I see the parser cache hit for the render done on mwdebug1001, like before, and it seems like it was fast:

<!-- 
NewPP limit report
Parsed by mwdebug1001
Cached time: 20170601102757
Cache expiry: 3600
Dynamic content: true
CPU time usage: 0.988 seconds
Real time usage: 1.147 seconds
Preprocessor visited node count: 3079/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 108174/2097152 bytes
Template argument size: 5868/2097152 bytes
Highest expansion depth: 14/40
Expensive parser function count: 7/500
Lua time usage: 0.117/10.000 seconds
Lua memory usage: 4.33 MB/50 MB
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%  717.140      1 -total
 35.16%  252.162      1 Wikipedia:Main_Page/Tomorrow
 27.43%  196.728      1 Wikipedia:Today's_featured_article/June_1,_2017
 26.79%  192.144      2 Template:Wikipedia_languages
 25.25%  181.060      7 Template:Main_page_image
 21.21%  152.135     88 Template:Wikipedia_languages/core
 10.01%   71.792      7 Template:Str_number/trim
  9.16%   65.679      1 Template:Did_you_know/Queue/1
  8.84%   63.383      1 Template:Lang-cy
  8.52%   61.130      2 Template:In_the_news
-->
</div>
<!-- Saved in parser cache with key enwiki:pcache:idhash:15580374-0!*!0!!*!7!* and timestamp 20170601102755 and revision id 777395573
 -->

However on mwdebug1001 if I curl locally, I get the following:

<!-- 
NewPP limit report
Parsed by mw2125
Cached time: 20170601102441
Cache expiry: 3600
Dynamic content: true
CPU time usage: 7.452 seconds
Real time usage: 7.598 seconds
Preprocessor visited node count: 3079/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 108192/2097152 bytes
Template argument size: 5868/2097152 bytes
Highest expansion depth: 14/40
Expensive parser function count: 7/500
Lua time usage: 0.434/10.000 seconds
Lua memory usage: 4.33 MB/50 MB
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 5371.807      1 -total
 40.65% 2183.573      1 Wikipedia:Main_Page/Tomorrow
 33.54% 1801.895      2 Template:Wikipedia_languages
 31.62% 1698.721      1 Wikipedia:Today's_featured_article/June_1,_2017
 31.46% 1689.859      2 Template:Start_date
 26.55% 1426.366      7 Template:Main_page_image
 20.28% 1089.569      7 Template:Str_number/trim
 10.83%  581.601      2 Template:Main_Page_banner
  4.68%  251.574      1 Template:Did_you_know/Queue/1
  4.18%  224.516     21 Template:If_empty
-->

<!-- Saved in parser cache with key enwiki:pcache:idhash:15580374-0!*!0!!*!4!* and timestamp 20170601102433 and revision id 777395573
 -->

I presume the parser cache key is different because this is loggedin versus loggedout. We see codfw show up here again, taking a long time to parse, despite running wmf.1.

Also, with its nutcracker pointing locally, why is mwdebug1001 picking up parser cache written by codfw? It's probably fetching it from DB, but why doesn't it save it in memcache, then?

OK, I think it's still the issue of the parser cache being backed by the DB, which mwdebug1001 picks up for the loggedout case, because another server running wmf.1 will hit the loggedout page before I do on mwdebug1001. Even if the local memcache entry is missing, mwdebug1001 will pick it up from the DB. Which means that even when purging, mwdebug1001 never truly does the parsing + saving work for loggedout.

I'll try to disable the DB section of the parser cache on mwdebug1001.
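On the assumption that the parser cache is a multi-tier store (memcached plus the pcXXXX MySQL backend mentioned later in this thread), disabling the DB section could look roughly like this hypothetical wmf-config fragment. The cache key names and factory arguments here are illustrative, not the real WMF configuration.

```php
// Hypothetical sketch only: drop the SQL tier from the parser cache on
// mwdebug1001, so logged-out renders can no longer be satisfied from
// DB-backed entries written by wmf.1 app servers elsewhere.
// Real WMF cache definitions live in wmf-config and use different keys.
$wgObjectCaches['parsercache-test'] = [
    'class' => MultiWriteBagOStuff::class,
    'caches' => [
        [ 'factory' => 'ObjectCache::getInstance', 'args' => [ 'memcached-pecl' ] ],
        // SQL tier ('mysql-multiwrite' here) commented out for the test:
        // [ 'factory' => 'ObjectCache::getInstance', 'args' => [ 'mysql-multiwrite' ] ],
    ],
];
$wgParserCacheType = 'parsercache-test';
```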

Gilles added a comment.EditedJun 1 2017, 11:04 AM

That did what I expected, now I get a parser run from mwdebug1001 for the loggedout case as well. Unfortunately both loggedin and loggedout are fast on mwdebug1001...

Loggedin:

<!-- 
NewPP limit report
Parsed by mwdebug1001
Cached time: 20170601105727
Cache expiry: 3600
Dynamic content: true
CPU time usage: 1.296 seconds
Real time usage: 1.703 seconds
Preprocessor visited node count: 3079/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 108174/2097152 bytes
Template argument size: 5868/2097152 bytes
Highest expansion depth: 14/40
Expensive parser function count: 7/500
Lua time usage: 0.150/10.000 seconds
Lua memory usage: 4.33 MB/50 MB
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 1235.778      1 -total
 34.42%  425.318      1 Wikipedia:Today's_featured_article/June_1,_2017
 32.90%  406.613      1 Wikipedia:Main_Page/Tomorrow
 21.56%  266.381      7 Template:Main_page_image
 17.79%  219.883      2 Template:Wikipedia_languages
 13.12%  162.168     88 Template:Wikipedia_languages/core
 12.17%  150.402      1 Template:Lang-cy
 11.02%  136.129      1 Template:Language_with_name
 10.81%  133.610      1 Template:Did_you_know/Queue/1
  9.79%  120.926      1 Template:Lang
-->
</div>
<!-- Saved in parser cache with key enwiki:pcache:idhash:15580374-0!*!0!!*!7!* and timestamp 20170601105726 and revision id 777395573
 -->

Loggedout:

<!-- 
NewPP limit report
Parsed by mwdebug1001
Cached time: 20170601105734
Cache expiry: 3600
Dynamic content: true
CPU time usage: 0.348 seconds
Real time usage: 0.466 seconds
Preprocessor visited node count: 3079/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 108174/2097152 bytes
Template argument size: 5868/2097152 bytes
Highest expansion depth: 14/40
Expensive parser function count: 7/500
Lua time usage: 0.079/10.000 seconds
Lua memory usage: 4.33 MB/50 MB
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%  341.087      1 -total
 36.64%  124.970      1 Wikipedia:Today's_featured_article/June_1,_2017
 30.14%  102.806      1 Wikipedia:Main_Page/Tomorrow
 27.76%   94.698      7 Template:Main_page_image
 18.50%   63.098      2 Template:Wikipedia_languages
 14.42%   49.191     88 Template:Wikipedia_languages/core
 12.99%   44.319      7 Template:Str_number/trim
 10.88%   37.116      1 Template:Lang-cy
 10.24%   34.918      1 Template:Language_with_name
  9.75%   33.250      1 Template:Did_you_know/Queue/1
-->
</div>
<!-- Saved in parser cache with key enwiki:pcache:idhash:15580374-0!*!0!!*!4!* and timestamp 20170601105733 and revision id 777395573
 -->

However, while the loggedin parser cache "sticks", the loggedout one doesn't! I think I might have reproduced the bug on mwdebug1001 at last. It's not about how long the page takes to parse; it's that the loggedout entry doesn't stick in the parser cache as it should. Which means that every loggedout hit on the page in HHVM re-parses the page.

Looking at logstash I see this suspicious message for the loggedout case:

ParserOutput key valid, but rejected by RejectParserCacheValue hook handler.

The hook mentioned is found in TimedMediaHandler, and in wmf.2 there is this commit to this very hook:

https://gerrit.wikimedia.org/r/#/c/354564/5/TimedMediaHandler.hooks.php

I think we have a winner...

> why is mwdebug1001 picking up parser cache written by codfw

Note on the architecture of the parser cache: the parsercache disk backend (pcXXXX MySQLs) is active-active throughout all the datacenters, which means that something inserted on eqiad gets shown (with around 40ms of delay) on codfw, and vice versa. That is what was requested of me at the time, to avoid cold caches on datacenter failover, although I do not know how that affects memcached. Of course, that can be changed at any time if performance needs another model.

I'd like to figure out a few things about codfw here that I spotted during this investigation and am not sure about:

  • Is MediaWiki itself on codfw supposed to be configured as read-only? In fact it still displays the switchover message on pages, with the read-only information.
  • Is codfw supposed to parse things and save the output? If so, where is the traffic that triggers it coming from?
  • The parser cache PoolCounter logic doesn't seem to do what it's supposed to on codfw, and as a result parsing dogpiles on codfw for slow-parsing articles that get a lot of traffic.

New plot twist... TMH was wrongly accused. I did more digging in mwrepl and found out that there is another hook defined in config... and that's the one actually responsible for rejecting the parser cache output:

https://gerrit.wikimedia.org/r/#/c/353596/2/wmf-config/CommonSettings.php
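Given the later finding that wmf.2's parser output carries the mw-parser-output wrapper, the config hook would work roughly like this. This is a hedged sketch of the linked wmf-config change, not its exact code:

```php
// Sketch of the wmf-config RejectParserCacheValue handler (illustrative,
// not the exact code from the linked Gerrit change). Returning false
// rejects the cached value and forces a re-parse. On wmf.2 every fresh
// parse emits the mw-parser-output wrapper, so every newly cached
// logged-out entry is rejected again on the next hit, in a loop.
$wgHooks['RejectParserCacheValue'][] = function ( $value, $wikiPage, $popts ) {
    return strpos( $value->getText(), 'mw-parser-output' ) === false;
};
```

This would explain the observed log message "ParserOutput key valid, but rejected by RejectParserCacheValue hook handler" on every logged-out pageview.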

On mwdebug1001, enwiki Main Page rendered locally:

hphpd> $value = $property->getValue( ParserCache::singleton() )->get( 'enwiki:pcache:idhash:15580374-0!*!0!!*!4!*', $casToken, BagOStuff::READ_VERIFIED );
hphpd> echo strpos( $value->getText(), 'mw-parser-output' );
12
TheDJ added a subscriber: TheDJ.Jun 1 2017, 12:36 PM

@Gilles Krinkle and I pored over that TMH change during the hackathon. The intent was to make sure the beta feature didn't break the parser cache and, at the same time, that when we switched the site-wide default we would purge pages that were no longer valid. Testing on Vagrant seemed to confirm the right behaviour back then, but we might have missed something, of course.

@TheDJ I'm now pretty sure that the change to TMH is fine and it's this mw-parser-output business.

@Gilles the mw-parser-output change in CommonSettings was created as part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20170512-API and can/should be removed now.

Gilles added a comment.EditedJun 1 2017, 12:53 PM

Is it expected that the parser output in wmf.2 has the mw-parser-output wrapper for loggedout? That seems to be the regression. It's absent in wmf.1. The following:

$title = Title::newFromText( 'Main_Page' );
$page = WikiPage::factory( $title );
$content = $page->getContent( Revision::RAW );
$popts = $page->makeParserOptions( 'canonical' );
$p_result = $content->getParserOutput( $title, $page->getLatest(), $popts );
echo strpos( $p_result->getText(), 'mw-parser-output');

echoes "12" on wmf.2, nothing on wmf.1

elukey added a subscriber: elukey.Jun 1 2017, 12:54 PM

Removing the config hook would probably make the slowdown issue go away, but parser cache for loggedout would get "polluted" by mw-parser-output again. I don't know if the latter is a big deal.

> Removing the config hook would probably make the slowdown issue go away, but parser cache for loggedout would get "polluted" by mw-parser-output again. I don't know if the latter is a big deal.

I believe the issues with mw-parser-output are resolved. Adding in @Legoktm @Anomie is that correct?

Gilles added a comment.Jun 1 2017, 1:06 PM

mw-parser-output making it into the loggedout parser cache is reproducible on VM. The history of that issue is too complicated for me to figure out if that's intentional in wmf.2 (in which case the config hook needs to go) or accidental (in which case the mw-parser-output leak into the loggedout parser cache needs to be fixed).

> Is MediaWiki itself on codfw supposed to be configured as read-only? In fact it still displays the switchover message on pages with the readonly information

I do not know what it is supposed to be, but it does. The message is mostly irrelevant, as no production traffic should read it. It needs to be in read-only mode because, if edits arrive by accident, it should try not to write to the local datacenter. Some extensions do not respect the read-only mode, though, so it is also enforced by setting all databases to read-only mode.

> Is codfw supposed to parse things and save the output? If so, where is the traffic that triggers it coming from?

Metrics and monitoring, like icinga checks and load-balancing checks, regularly query MediaWiki pages even though codfw is only exposed internally. There is also a maintenance script (like the one written by Krinkle, IIRC) that writes to any DC's parser cache. If a passive slave is not supposed to have effects on the other datacenter, it can write to the parser cache with SET SESSION sql_log_bin=0. However, the active-active storage for parsing was a big deal for the Performance Team, and it was set up like that at the request of both Perf (Aaron specifically) and Operations. As I said, other models can be set up if needed; that is not a problem. But I understood it is probably necessary if active-active is the intended solution in the near future.

> mw-parser-output making it into the loggedout parser cache is reproducible on VM. The history of that issue is too complicated for me to figure out if that's intentional in wmf.2 (in which case the config hook needs to go) or accidental (in which case the mw-parser-output leak into the loggedout parser cache needs to be fixed).

I believe that it is the former and the breakage that this caused further downstream (in api output) has been fixed.

BTW, I think many of those monitoring checks specifically parse the Main_Page.

Gilles added a comment.Jun 1 2017, 1:17 PM

That makes sense, and it explains the slowdown: folks forgot about the config hook in production, while testing locally on wmf.2 must have worked with extensions, APIs, etc. group0 and group1 are probably affected right now too, but it's not noticeable there. Essentially, right now on a Varnish miss for loggedout you always parse and never benefit from the parser cache.

It looks fairly safe to me to remove the config hook and redeploy wmf.2.
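The cycle described above can be modeled in a few lines. This is a hypothetical sketch, not MediaWiki's actual RejectParserCacheValue hook API: a reject hook added during the wmf.1 rollback discards any cached parse containing the mw-parser-output wrapper, but wmf.2 always emits that wrapper, so every loggedout request on a Varnish miss triggers a fresh parse.

```python
def reject_wrapped(value: str) -> bool:
    """Model of the rollback-era hook: discard any cached parse that
    contains the wmf.2 wrapper div (hypothetical simplification)."""
    return 'mw-parser-output' in value

def fetch_parse(cache: dict, key: str, parse_count: list) -> str:
    """Cache lookup that applies the reject hook on read, reparsing on miss."""
    cached = cache.get(key)
    if cached is not None and not reject_wrapped(cached):
        return cached  # genuine cache hit
    # wmf.2 always emits the wrapper, so the hook rejects it next time too.
    parse_count[0] += 1
    html = '<div class="mw-parser-output">page html</div>'
    cache[key] = html
    return html

cache, parses = {}, [0]
for _ in range(3):
    fetch_parse(cache, 'my_wiki:pcache:idhash:1-0', parses)
assert parses[0] == 3  # every request reparses: the cache is effectively disabled
```

With the hook removed, the second and third requests would be served from the cache, which is why dropping the hook looked like the safe fix.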

Gilles added a comment.Jun 1 2017, 1:18 PM

BTW, I think many of those monitoring checks specifically parse the Main_Page.

Indeed, that's the case. I think that's why the issue became visible here when the deployments to group0 and group1 didn't surface it. We might have noticed the problem sooner if we had the same monitoring checks for group0 and group1 wikis.

What about the passive->active DC replication, should we do something about that, even if looking at it on a separate ticket?

Gilles added a comment.Jun 1 2017, 1:21 PM

I've just double checked and group0/group1 wikis definitely look affected. A loggedout self-hit on mwdebug1001 for mediawiki.org shows the same symptom where we get a fresh parse on every pageload.

Gilles added a comment.Jun 1 2017, 1:25 PM

What about the passive->active DC replication, should we do something about that, even if looking at it on a separate ticket?

I think it's fine; what looked slightly worrying to me was PoolCounter potentially not working right under stress. My hunch would be that it's a side effect of the wiki being read-only, which means that when we go read-only (including in eqiad), we might be subject to more dogpiling when re-parsing pages, something PoolCounter should normally protect us against. Or maybe PoolCounter is just not running on codfw at the moment?
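The dogpiling concern can be illustrated with a toy version of what PoolCounter does. This is a minimal single-process sketch using a semaphore, not the real PoolCounter client/daemon protocol; all names here are made up:

```python
import threading

class PoolCounterLite:
    """Toy stand-in for PoolCounter: allow at most `workers` concurrent
    parses per key; late arrivals reuse the cached result instead of
    piling onto the expensive reparse."""
    def __init__(self, workers: int = 1):
        self.sem = threading.Semaphore(workers)
        self.cache = {}
        self.parses = 0
        self.lock = threading.Lock()

    def get(self, key: str) -> str:
        with self.lock:
            if key in self.cache:
                return self.cache[key]
        with self.sem:                      # gate the expensive reparse
            with self.lock:
                if key in self.cache:       # someone parsed while we waited
                    return self.cache[key]
            result = f'parsed:{key}'        # pretend this is an expensive parse
            with self.lock:
                self.parses += 1
                self.cache[key] = result
            return result

pc = PoolCounterLite(workers=1)
threads = [threading.Thread(target=pc.get, args=('Main_Page',)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert pc.parses == 1   # the dogpile collapsed into a single parse
```

Without the semaphore (or with the daemon unreachable), all ten threads would reparse the page, which is the stampede behavior this comment worries about.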

Change 356584 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] Change the read only message for something generic

https://gerrit.wikimedia.org/r/356584

^I have at least changed the message so it doesn't refer to the switchover anymore.

Anomie added a comment.EditedJun 1 2017, 1:52 PM

Removing the config hook would probably make the slowdown issue go away, but parser cache for loggedout would get "polluted" by mw-parser-output again. I don't know if the latter is a big deal.

I believe the issues with mw-parser-output are resolved. Adding in @Legoktm and @Anomie: is that correct?

Yes, all the known issues are resolved.

mw-parser-output making it into the loggedout parser cache is reproducible on VM. The history of that issue is too complicated for me to figure out if that's intentional in wmf.2 (in which case the config hook needs to go) or accidental (in which case the mw-parser-output leak into the loggedout parser cache needs to be fixed).

TL;DR: Yes, it's intentional in wmf.2.

The original bug was not "mw-parser-output making it into the loggedout parser cache", that was always intended. The bug was that parser cache reads would ignore whether the ParserOptions said to include the wrapper or not, it all depended on what parse happened to populate the cache. The major upshot of this was that things like TextExtracts and MobileFrontend that were requesting a no-wrapper parse were getting the wrapper anyway from the parser cache. When things were rolled back to wmf.1, the hook that checks for mw-parser-output was added to clear existing parser cache entries in accordance with the rollback to a version of MW that doesn't generate that wrapper.

rMWecb4c0e3fe11: ParserOptions: Include wrapping class in options hash now fragments the parser cache on the wrapper. For example, a normal parse with the wrapper might have a cache key like "my_wiki:pcache:idhash:2379-0!*!0!!*!5!*" while a parse without it (as used by TextExtracts and such) would use "my_wiki:pcache:idhash:2379-0!*!0!!*!5!*!wrapclass=" instead.
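The fragmentation described above can be sketched as follows. The key strings mirror the examples in this comment, but the builder function itself is a hypothetical simplification of MediaWiki's real key construction:

```python
from typing import Optional

def pcache_key(page_id: int, opts_hash: str,
               wrapclass: Optional[str] = None) -> str:
    """Build a parser cache key; a non-default option is appended as a
    '!name=value' suffix, so parses with different options land under
    different keys (hypothetical sketch of MediaWiki's key builder)."""
    key = f'my_wiki:pcache:idhash:{page_id}-{opts_hash}'
    if wrapclass is not None:   # non-default: a no-wrapper parse
        key += f'!wrapclass={wrapclass}'
    return key

# A default (wrapped) parse and a no-wrapper parse now live under distinct
# keys, so TextExtracts-style requests can no longer be served a wrapped entry.
wrapped = pcache_key(2379, '0!*!0!!*!5!*')
unwrapped = pcache_key(2379, '0!*!0!!*!5!*', wrapclass='')
assert wrapped == 'my_wiki:pcache:idhash:2379-0!*!0!!*!5!*'
assert unwrapped == wrapped + '!wrapclass='
```

This is what fixes the original bug: reads can no longer pick up an entry written with incompatible ParserOptions, because those entries never share a key.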

See also T110269: ParserOptions validation to avoid pollution on ParserCache save, which in part will make the main entry point for accessing the parser cache (WikiPage::getParserOutput()) throw an exception if incompatible ParserOptions are passed.

Change 356589 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/mediawiki-config@master] Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation"

https://gerrit.wikimedia.org/r/356589

Change 356589 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation"

https://gerrit.wikimedia.org/r/356589

Mentioned in SAL (#wikimedia-operations) [2017-06-01T19:08:37Z] <thcipriani@tin> Synchronized wmf-config/CommonSettings.php: [[gerrit:356589|Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation"]] T166345 (duration: 00m 43s)

Gilles closed this task as Resolved.Jun 1 2017, 8:54 PM
Gilles claimed this task.
ori awarded a token.Jun 2 2017, 4:37 AM

Change 356584 abandoned by Jcrespo:
db-readonly: Change the read only message for something generic

Reason:
Abandoning for 371444

https://gerrit.wikimedia.org/r/356584

phuedx removed a subscriber: phuedx.Aug 11 2017, 9:07 AM