Page MenuHomePhabricator

mw-1.28.0-wmf.18 load-time regression
Closed, ResolvedPublic

Description

Since the deploy of wmf.18 on 2016-09-08

19:02 logmsgbot: demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18

load times increased by approx 600ms.

https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=7&fullscreen&from=now-7d

This increase coincided with a decrease for loadEventEnd. (See T146099#2654871)

https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=4&fullscreen&from=now-7d

Breakdown:

Details

Related Gerrit Patches:

Event Timeline

Matanya created this task.Sep 19 2016, 10:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 19 2016, 10:23 PM
ori triaged this task as High priority.Sep 19 2016, 10:49 PM
ori added a project: Performance-Team.
Matanya renamed this task from wmf.18 load-time regression to mw-1.28.0-wmf.18 load-time regression.Sep 19 2016, 11:16 PM
hashar added a subscriber: hashar.Sep 20 2016, 8:36 AM

That should alarm / page somehow. I have filled T146125 for that

Krinkle updated the task description. (Show Details)Sep 20 2016, 5:01 PM
Gilles added a subscriber: Gilles.Sep 20 2016, 5:02 PM

I'll help look into this when I've put my son to bed, a couple of hours from now.

From @Krinkle on IRC:

There's also a bug in Chrome with native requestIdleCallback that Paul discovered and e-mailed us about. Which is that sometimes rIC can stall for 30 seconds.
So I'll probably change cache-eval to remain async but using our polyfill with setTimeout instead of native requestIdleCallback.
It has to do with whether or not Chrome things the page is currently active.
So people opening pages in new tabs or not with the window focussed (e.g. click on desktop). may end up with their JS delayed by 30 seconds, which means NavTiming initiates later and thus observes "End" later.
Could be that.

I'm currently looking at the data to see if this theory's symptoms are confirmed.

Looking at a plain AVG since the problem started:

SELECT AVG(event_mediaWikiLoadComplete) as avg, STD(event_mediaWikiLoadComplete) as std, userAgent, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160918000000 AND timestamp < 20160919000000 AND event_mediaWikiLoadComplete IS NOT NULL GROUP BY userAgent HAVING cnt > 1000 ORDER BY avg DESC

All the top extreme averages are Chrome. The insanely high standard deviations suggest that something is afoul with extreme values:

460697.7203	10255929.9664	"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	5717
214454.7932	5763051.6623	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	3559
163857.8435	5025434.7872	"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	1489
143423.5504	2699176.7342	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	1646
138937.4988	2932483.8404	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	1698
105353.4230	2760907.2888	"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"	1234
81159.5683	1888114.1405	"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"	5034
80947.2719	1794421.3821	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"	6278
69250.2042	2510049.6262	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"	13450
65351.4484	1809021.0610	"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"	3051

Now looking at the same averages before the problem started:

SELECT AVG(event_mediaWikiLoadComplete) as avg, STD(event_mediaWikiLoadComplete) as std, userAgent, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160905000000 AND timestamp < 20160906000000 AND event_mediaWikiLoadComplete IS NOT NULL GROUP BY userAgent HAVING cnt > 1000 ORDER BY avg DESC

The top averages don't even come close to the current ones and we see other browsers in the top 10:

24045.4865	1943293.9732	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	7429
16759.5909	544325.1345	"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	1474
7932.1559	475101.1398	"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	7554
4647.9182	471019.1853	"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"	23844
3635.5737	99102.6511	"Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_3 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G34 Safari/601.1"	2294
3616.7612	188925.0715	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586"	5625
3253.8592	10593.5998	"Mozilla/5.0 (Windows NT 5.1; rv:48.0) Gecko/20100101 Firefox/48.0"	1406
3113.6116	4287.3466	"Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53"	1089
2986.9382	9794.6501	"Mozilla/5.0 (Windows NT 6.1; rv:48.0) Gecko/20100101 Firefox/48.0"	2055
2948.4920	161536.8614	"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"	11757

I've also checked at samples for the worst offending UA and the pattern is clear: most values are normal and sometimes we get a really high one. I don't see that pattern when looking at old data, where some values are "high" but an order of magnitude smaller. So far this seems to confirm @Krinkle 's theory.

Change 311855 had a related patch set uploaded (by Gilles):
Always use the requestIdleCallback polyfill for batchEval

https://gerrit.wikimedia.org/r/311855

Change 311855 merged by jenkins-bot:
Always use the requestIdleCallback polyfill for batchEval

https://gerrit.wikimedia.org/r/311855

Mentioned in SAL (#wikimedia-operations) [2016-09-20T23:35:40Z] <catrope@tin> Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.js: Always use requestIdleCallback polyfill for batchEval (T146099) (duration: 00m 46s)

Looks like the metric has come down about 100-200ms since the above was deployed.

https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=7&fullscreen


However, it's not even close to where it was (total regression was 400-600ms; last night's improvement about 100ms). It's been more than long enough for the deployment to affect all page views. I'll check back tomorrow, but I don't expect it to lower more. We should start looking at other possible causes.

I added to the task description that the regression in "MediaWiki Load" coincided with an improvement to "loadEventEnd" – which did not get increased after last nights fix.

The suspect the improved loadEventEnd time is due to async execution of module cache hits. This is merely a re-ordering of events, not a slow down. The loadEvent not happens before module execution, instead of after it. Execution should still start around the same time (relative to navigationStart) and take the same amount of time.

Krinkle updated the task description. (Show Details)Sep 21 2016, 3:28 AM

The same drop happened last week, I don't think the perf improvement this change could have brought is clear at this point. Looking at data since your deployment the figures look just as bad as before:

SELECT AVG(event_mediaWikiLoadComplete) as avg, STD(event_mediaWikiLoadComplete) as std, userAgent, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160912020000 AND event_mediaWikiLoadComplete IS NOT NULL GROUP BY userAgent HAVING cnt > 1000 ORDER BY avg DESC
2015980.3872	89437943.0799	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"	2309
673278.1948	15744995.1717	"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36"	1319
333664.2540	9877580.5730	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/537.86.6"	1177
330068.0290	10127919.1886	"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36"	1034
254624.2606	7136742.7536	"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36"	4179
196423.7838	4476084.5873	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 OPR/39.0.2256.71"	7220
181469.4345	7859687.4866	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"	2214
178802.1851	4551783.5632	"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 OPR/39.0.2256.71"	5312
174388.0594	7711731.3388	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2"	2341
164228.6484	2925783.2537	"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36"	1092

Looking back at when the problem started, I see that there was s 45 minute delay between the deployment and the regression. The regression took about 8 minutes to reach its peak.

Looking at the SAL, it is indeed the wmf.18 deployment to group2. Is it expected that there would be such a long delay?

Now, if wmf.18 is responsible, we should be seeing the same regression before and after the group1 deployment for the relevant wikis. Or if that's not the case, it could be a backport that happened at some point between the group 1 and the group 2 deploy. Either way, if there's enough data in NavigationTiming, we should be able to tell. I'll so some digging to find out.

Gilles added a comment.EditedSep 21 2016, 11:40 AM

Data for group1 between the group1 and group2 deployment:

SELECT AVG(event_mediaWikiLoadComplete) as avg, EXP(AVG(LOG(event_mediaWikiLoadComplete))) as geomean, STD(event_mediaWikiLoadComplete) as std, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160907200000 AND timestamp < 20160908180000 AND event_mediaWikiLoadComplete > 0 AND wiki NOT IN ('aawiki', 'abwiki', 'acewiki', 'adywiki', 'afwiki', 'akwiki', 'alswiki', 'amwiki', 'angwiki', 'anwiki', 'arbcom_dewiki', 'arbcom_enwiki', 'arbcom_fiwiki', 'arbcom_nlwiki', 'arcwiki', 'arwiki', 'arzwiki', 'astwiki', 'aswiki', 'avwiki', 'aywiki', 'azwiki', 'azbwiki', 'barwiki', 'bat_smgwiki', 'bawiki', 'bclwiki', 'be_x_oldwiki', 'bewiki', 'bgwiki', 'bhwiki', 'biwiki', 'bjnwiki', 'bmwiki', 'bnwiki', 'bowiki', 'bpywiki', 'brwiki', 'bswiki', 'bugwiki', 'bxrwiki', 'cawiki', 'cbk_zamwiki', 'cdowiki', 'cebwiki', 'cewiki', 'chowiki', 'chrwiki', 'chwiki', 'chywiki', 'ckbwiki', 'cowiki', 'crhwiki', 'crwiki', 'csbwiki', 'cswiki', 'cuwiki', 'cvwiki', 'cywiki', 'dawiki', 'dewiki', 'diqwiki', 'dsbwiki', 'dvwiki', 'dzwiki', 'eewiki', 'elwiki', 'emlwiki', 'enwiki', 'eowiki', 'eswiki', 'etwiki', 'euwiki', 'extwiki', 'fawiki', 'ffwiki', 'fiu_vrowiki', 'fiwiki', 'fjwiki', 'fowiki', 'frpwiki', 'frrwiki', 'frwiki', 'furwiki', 'fywiki', 'gagwiki', 'ganwiki', 'gawiki', 'gdwiki', 'glkwiki', 'glwiki', 'gnwiki', 'gomwiki', 'gotwiki', 'guwiki', 'gvwiki', 'hakwiki', 'hawiki', 'hawwiki', 'hewiki', 'hifwiki', 'hiwiki', 'howiki', 'hrwiki', 'hsbwiki', 'htwiki', 'huwiki', 'hywiki', 'hzwiki', 'iawiki', 'idwiki', 'iewiki', 'igwiki', 'iiwiki', 'ikwiki', 'ilowiki', 'iowiki', 'iswiki', 'itwiki', 'iuwiki', 'jawiki', 'jamwiki', 'jbowiki', 'jvwiki', 'kaawiki', 'kabwiki', 'kawiki', 'kbdwiki', 'kgwiki', 'kiwiki', 'kjwiki', 'kkwiki', 'klwiki', 'kmwiki', 'knwiki', 'koiwiki', 'kowiki', 'krcwiki', 'krwiki', 'kshwiki', 'kswiki', 'kuwiki', 'kvwiki', 'kwwiki', 'kywiki', 'ladwiki', 'lawiki', 'lbewiki', 'lbwiki', 'lezwiki', 'lgwiki', 'lijwiki', 'liwiki', 'lmowiki', 'lnwiki', 'lowiki', 'lrcwiki', 'ltgwiki', 'ltwiki', 'lvwiki', 'maiwiki', 'map_bmswiki', 'mdfwiki', 'mgwiki', 'mhrwiki', 'mhwiki', 'minwiki', 'miwiki', 'mkwiki', 'mlwiki', 'mnwiki', 'mowiki', 'mrjwiki', 'mrwiki', 'mswiki', 'mtwiki', 'muswiki', 'mwlwiki', 'myvwiki', 'mywiki', 'mznwiki', 'nahwiki', 'napwiki', 'nawiki', 'nds_nlwiki', 'ndswiki', 'newiki', 'newwiki', 'ngwiki', 'nlwiki', 'nnwiki', 'novwiki', 'nowiki', 'nrmwiki', 'nsowiki', 'nvwiki', 'nywiki', 'ocwiki', 'omwiki', 'orwiki', 'oswiki', 'pagwiki', 'pamwiki', 'papwiki', 'pawiki', 'pcdwiki', 'pdcwiki', 'pflwiki', 'pihwiki', 'piwiki', 'plwiki', 'pmswiki', 'pnbwiki', 'pntwiki', 'pswiki', 'ptwiki', 'quwiki', 'rmwiki', 'rmywiki', 'rnwiki', 'roa_rupwiki', 'roa_tarawiki', 'rowiki', 'ruewiki', 'ruwiki', 'rwwiki', 'sahwiki', 'sawiki', 'scnwiki', 'scowiki', 'scwiki', 'sdwiki', 'sewiki', 'sgwiki', 'shwiki', 'simplewiki', 'siwiki', 'skwiki', 'slwiki', 'smwiki', 'snwiki', 'sowiki', 'sqwiki', 'srnwiki', 'srwiki', 'sswiki', 'stqwiki', 'stwiki', 'suwiki', 'svwiki', 'swwiki', 'szlwiki', 'tawiki', 'tcywiki', 'tenwiki', 'testwiki', 'test2wiki', 'tetwiki', 'tewiki', 'tgwiki', 'thwiki', 'tiwiki', 'tkwiki', 'tlwiki', 'tnwiki', 'towiki', 'tpiwiki', 'trwiki', 'tswiki', 'ttwiki', 'tumwiki', 'twwiki', 'tyvwiki', 'tywiki', 'udmwiki', 'ugwiki', 'ukwiki', 'urwiki', 'uzwiki', 'vecwiki', 'vepwiki', 'vewiki', 'viwiki', 'vlswiki', 'vowiki', 'warwiki', 'wawiki', 'wg_enwiki', 'wowiki', 'wuuwiki', 'xalwiki', 'xhwiki', 'xmfwiki', 'yiwiki', 'yowiki', 'zawiki', 'zeawiki', 'zh_classicalwiki', 'zh_min_nanwiki', 'zh_yuewiki', 'zhwiki', 'zuwiki', 'mediawikiwiki', 'test2wiki', 'testwiki', 'testwikidatawiki', 'zerowiki')

avg 11522.7838 geomean 1352.8591451537454 std 241177.9260 cnt 6360

Data for group1 for a duration of the same length 24 hours before the group1 deployment:

SELECT AVG(event_mediaWikiLoadComplete) as avg, EXP(AVG(LOG(event_mediaWikiLoadComplete))) as geomean, STD(event_mediaWikiLoadComplete) as std, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160906200000 AND timestamp < 20160907180000 AND event_mediaWikiLoadComplete > 0 AND wiki NOT IN ('aawiki', 'abwiki', 'acewiki', 'adywiki', 'afwiki', 'akwiki', 'alswiki', 'amwiki', 'angwiki', 'anwiki', 'arbcom_dewiki', 'arbcom_enwiki', 'arbcom_fiwiki', 'arbcom_nlwiki', 'arcwiki', 'arwiki', 'arzwiki', 'astwiki', 'aswiki', 'avwiki', 'aywiki', 'azwiki', 'azbwiki', 'barwiki', 'bat_smgwiki', 'bawiki', 'bclwiki', 'be_x_oldwiki', 'bewiki', 'bgwiki', 'bhwiki', 'biwiki', 'bjnwiki', 'bmwiki', 'bnwiki', 'bowiki', 'bpywiki', 'brwiki', 'bswiki', 'bugwiki', 'bxrwiki', 'cawiki', 'cbk_zamwiki', 'cdowiki', 'cebwiki', 'cewiki', 'chowiki', 'chrwiki', 'chwiki', 'chywiki', 'ckbwiki', 'cowiki', 'crhwiki', 'crwiki', 'csbwiki', 'cswiki', 'cuwiki', 'cvwiki', 'cywiki', 'dawiki', 'dewiki', 'diqwiki', 'dsbwiki', 'dvwiki', 'dzwiki', 'eewiki', 'elwiki', 'emlwiki', 'enwiki', 'eowiki', 'eswiki', 'etwiki', 'euwiki', 'extwiki', 'fawiki', 'ffwiki', 'fiu_vrowiki', 'fiwiki', 'fjwiki', 'fowiki', 'frpwiki', 'frrwiki', 'frwiki', 'furwiki', 'fywiki', 'gagwiki', 'ganwiki', 'gawiki', 'gdwiki', 'glkwiki', 'glwiki', 'gnwiki', 'gomwiki', 'gotwiki', 'guwiki', 'gvwiki', 'hakwiki', 'hawiki', 'hawwiki', 'hewiki', 'hifwiki', 'hiwiki', 'howiki', 'hrwiki', 'hsbwiki', 'htwiki', 'huwiki', 'hywiki', 'hzwiki', 'iawiki', 'idwiki', 'iewiki', 'igwiki', 'iiwiki', 'ikwiki', 'ilowiki', 'iowiki', 'iswiki', 'itwiki', 'iuwiki', 'jawiki', 'jamwiki', 'jbowiki', 'jvwiki', 'kaawiki', 'kabwiki', 'kawiki', 'kbdwiki', 'kgwiki', 'kiwiki', 'kjwiki', 'kkwiki', 'klwiki', 'kmwiki', 'knwiki', 'koiwiki', 'kowiki', 'krcwiki', 'krwiki', 'kshwiki', 'kswiki', 'kuwiki', 'kvwiki', 'kwwiki', 'kywiki', 'ladwiki', 'lawiki', 'lbewiki', 'lbwiki', 'lezwiki', 'lgwiki', 'lijwiki', 'liwiki', 'lmowiki', 'lnwiki', 'lowiki', 'lrcwiki', 'ltgwiki', 'ltwiki', 'lvwiki', 'maiwiki', 'map_bmswiki', 'mdfwiki', 'mgwiki', 'mhrwiki', 'mhwiki', 'minwiki', 'miwiki', 'mkwiki', 'mlwiki', 'mnwiki', 'mowiki', 'mrjwiki', 'mrwiki', 'mswiki', 'mtwiki', 'muswiki', 'mwlwiki', 'myvwiki', 'mywiki', 'mznwiki', 'nahwiki', 'napwiki', 'nawiki', 'nds_nlwiki', 'ndswiki', 'newiki', 'newwiki', 'ngwiki', 'nlwiki', 'nnwiki', 'novwiki', 'nowiki', 'nrmwiki', 'nsowiki', 'nvwiki', 'nywiki', 'ocwiki', 'omwiki', 'orwiki', 'oswiki', 'pagwiki', 'pamwiki', 'papwiki', 'pawiki', 'pcdwiki', 'pdcwiki', 'pflwiki', 'pihwiki', 'piwiki', 'plwiki', 'pmswiki', 'pnbwiki', 'pntwiki', 'pswiki', 'ptwiki', 'quwiki', 'rmwiki', 'rmywiki', 'rnwiki', 'roa_rupwiki', 'roa_tarawiki', 'rowiki', 'ruewiki', 'ruwiki', 'rwwiki', 'sahwiki', 'sawiki', 'scnwiki', 'scowiki', 'scwiki', 'sdwiki', 'sewiki', 'sgwiki', 'shwiki', 'simplewiki', 'siwiki', 'skwiki', 'slwiki', 'smwiki', 'snwiki', 'sowiki', 'sqwiki', 'srnwiki', 'srwiki', 'sswiki', 'stqwiki', 'stwiki', 'suwiki', 'svwiki', 'swwiki', 'szlwiki', 'tawiki', 'tcywiki', 'tenwiki', 'testwiki', 'test2wiki', 'tetwiki', 'tewiki', 'tgwiki', 'thwiki', 'tiwiki', 'tkwiki', 'tlwiki', 'tnwiki', 'towiki', 'tpiwiki', 'trwiki', 'tswiki', 'ttwiki', 'tumwiki', 'twwiki', 'tyvwiki', 'tywiki', 'udmwiki', 'ugwiki', 'ukwiki', 'urwiki', 'uzwiki', 'vecwiki', 'vepwiki', 'vewiki', 'viwiki', 'vlswiki', 'vowiki', 'warwiki', 'wawiki', 'wg_enwiki', 'wowiki', 'wuuwiki', 'xalwiki', 'xhwiki', 'xmfwiki', 'yiwiki', 'yowiki', 'zawiki', 'zeawiki', 'zh_classicalwiki', 'zh_min_nanwiki', 'zh_yuewiki', 'zhwiki', 'zuwiki', 'mediawikiwiki', 'test2wiki', 'testwiki', 'testwikidatawiki', 'zerowiki')

avg 3652.2706 geomean 968.3781677649234 std 83272.6353 cnt 6664

This seems to confirm that the issue is with wmf.18 with the same 400ms impact on the geometric mean as what we saw on the group2 deployment.

We're not making the most of the group1 deployment in terms of detecting performance regressions. We should track wiki groups separately for those metrics.

Double checking the same values on group2 wikis only before and after the group2 deployment.

After:

SELECT AVG(event_mediaWikiLoadComplete) as avg, EXP(AVG(LOG(event_mediaWikiLoadComplete))) as geomean, STD(event_mediaWikiLoadComplete) as std, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160908200000 AND timestamp < 20160909180000 AND event_mediaWikiLoadComplete > 0 AND wiki IN ('aawiki', 'abwiki', 'acewiki', 'adywiki', 'afwiki', 'akwiki', 'alswiki', 'amwiki', 'angwiki', 'anwiki', 'arbcom_dewiki', 'arbcom_enwiki', 'arbcom_fiwiki', 'arbcom_nlwiki', 'arcwiki', 'arwiki', 'arzwiki', 'astwiki', 'aswiki', 'avwiki', 'aywiki', 'azwiki', 'azbwiki', 'barwiki', 'bat_smgwiki', 'bawiki', 'bclwiki', 'be_x_oldwiki', 'bewiki', 'bgwiki', 'bhwiki', 'biwiki', 'bjnwiki', 'bmwiki', 'bnwiki', 'bowiki', 'bpywiki', 'brwiki', 'bswiki', 'bugwiki', 'bxrwiki', 'cawiki', 'cbk_zamwiki', 'cdowiki', 'cebwiki', 'cewiki', 'chowiki', 'chrwiki', 'chwiki', 'chywiki', 'ckbwiki', 'cowiki', 'crhwiki', 'crwiki', 'csbwiki', 'cswiki', 'cuwiki', 'cvwiki', 'cywiki', 'dawiki', 'dewiki', 'diqwiki', 'dsbwiki', 'dvwiki', 'dzwiki', 'eewiki', 'elwiki', 'emlwiki', 'enwiki', 'eowiki', 'eswiki', 'etwiki', 'euwiki', 'extwiki', 'fawiki', 'ffwiki', 'fiu_vrowiki', 'fiwiki', 'fjwiki', 'fowiki', 'frpwiki', 'frrwiki', 'frwiki', 'furwiki', 'fywiki', 'gagwiki', 'ganwiki', 'gawiki', 'gdwiki', 'glkwiki', 'glwiki', 'gnwiki', 'gomwiki', 'gotwiki', 'guwiki', 'gvwiki', 'hakwiki', 'hawiki', 'hawwiki', 'hewiki', 'hifwiki', 'hiwiki', 'howiki', 'hrwiki', 'hsbwiki', 'htwiki', 'huwiki', 'hywiki', 'hzwiki', 'iawiki', 'idwiki', 'iewiki', 'igwiki', 'iiwiki', 'ikwiki', 'ilowiki', 'iowiki', 'iswiki', 'itwiki', 'iuwiki', 'jawiki', 'jamwiki', 'jbowiki', 'jvwiki', 'kaawiki', 'kabwiki', 'kawiki', 'kbdwiki', 'kgwiki', 'kiwiki', 'kjwiki', 'kkwiki', 'klwiki', 'kmwiki', 'knwiki', 'koiwiki', 'kowiki', 'krcwiki', 'krwiki', 'kshwiki', 'kswiki', 'kuwiki', 'kvwiki', 'kwwiki', 'kywiki', 'ladwiki', 'lawiki', 'lbewiki', 'lbwiki', 'lezwiki', 'lgwiki', 'lijwiki', 'liwiki', 'lmowiki', 'lnwiki', 'lowiki', 'lrcwiki', 'ltgwiki', 'ltwiki', 'lvwiki', 'maiwiki', 'map_bmswiki', 'mdfwiki', 'mgwiki', 'mhrwiki', 'mhwiki', 'minwiki', 'miwiki', 'mkwiki', 'mlwiki', 'mnwiki', 'mowiki', 'mrjwiki', 'mrwiki', 'mswiki', 'mtwiki', 'muswiki', 'mwlwiki', 'myvwiki', 'mywiki', 'mznwiki', 'nahwiki', 'napwiki', 'nawiki', 'nds_nlwiki', 'ndswiki', 'newiki', 'newwiki', 'ngwiki', 'nlwiki', 'nnwiki', 'novwiki', 'nowiki', 'nrmwiki', 'nsowiki', 'nvwiki', 'nywiki', 'ocwiki', 'omwiki', 'orwiki', 'oswiki', 'pagwiki', 'pamwiki', 'papwiki', 'pawiki', 'pcdwiki', 'pdcwiki', 'pflwiki', 'pihwiki', 'piwiki', 'plwiki', 'pmswiki', 'pnbwiki', 'pntwiki', 'pswiki', 'ptwiki', 'quwiki', 'rmwiki', 'rmywiki', 'rnwiki', 'roa_rupwiki', 'roa_tarawiki', 'rowiki', 'ruewiki', 'ruwiki', 'rwwiki', 'sahwiki', 'sawiki', 'scnwiki', 'scowiki', 'scwiki', 'sdwiki', 'sewiki', 'sgwiki', 'shwiki', 'simplewiki', 'siwiki', 'skwiki', 'slwiki', 'smwiki', 'snwiki', 'sowiki', 'sqwiki', 'srnwiki', 'srwiki', 'sswiki', 'stqwiki', 'stwiki', 'suwiki', 'svwiki', 'swwiki', 'szlwiki', 'tawiki', 'tcywiki', 'tenwiki', 'testwiki', 'test2wiki', 'tetwiki', 'tewiki', 'tgwiki', 'thwiki', 'tiwiki', 'tkwiki', 'tlwiki', 'tnwiki', 'towiki', 'tpiwiki', 'trwiki', 'tswiki', 'ttwiki', 'tumwiki', 'twwiki', 'tyvwiki', 'tywiki', 'udmwiki', 'ugwiki', 'ukwiki', 'urwiki', 'uzwiki', 'vecwiki', 'vepwiki', 'vewiki', 'viwiki', 'vlswiki', 'vowiki', 'warwiki', 'wawiki', 'wg_enwiki', 'wowiki', 'wuuwiki', 'xalwiki', 'xhwiki', 'xmfwiki', 'yiwiki', 'yowiki', 'zawiki', 'zeawiki', 'zh_classicalwiki', 'zh_min_nanwiki', 'zh_yuewiki', 'zhwiki', 'zuwiki')

avg 13360.0931 geomean 1284.90090754117 std 403370.2035 cnt 321377

Before:

SELECT AVG(event_mediaWikiLoadComplete) as avg, EXP(AVG(LOG(event_mediaWikiLoadComplete))) as geomean, STD(event_mediaWikiLoadComplete) as std, COUNT(*) as cnt FROM NavigationTiming_15485142 WHERE timestamp > 20160907200000 AND timestamp < 20160908180000 AND event_mediaWikiLoadComplete > 0 AND wiki IN ('aawiki', 'abwiki', 'acewiki', 'adywiki', 'afwiki', 'akwiki', 'alswiki', 'amwiki', 'angwiki', 'anwiki', 'arbcom_dewiki', 'arbcom_enwiki', 'arbcom_fiwiki', 'arbcom_nlwiki', 'arcwiki', 'arwiki', 'arzwiki', 'astwiki', 'aswiki', 'avwiki', 'aywiki', 'azwiki', 'azbwiki', 'barwiki', 'bat_smgwiki', 'bawiki', 'bclwiki', 'be_x_oldwiki', 'bewiki', 'bgwiki', 'bhwiki', 'biwiki', 'bjnwiki', 'bmwiki', 'bnwiki', 'bowiki', 'bpywiki', 'brwiki', 'bswiki', 'bugwiki', 'bxrwiki', 'cawiki', 'cbk_zamwiki', 'cdowiki', 'cebwiki', 'cewiki', 'chowiki', 'chrwiki', 'chwiki', 'chywiki', 'ckbwiki', 'cowiki', 'crhwiki', 'crwiki', 'csbwiki', 'cswiki', 'cuwiki', 'cvwiki', 'cywiki', 'dawiki', 'dewiki', 'diqwiki', 'dsbwiki', 'dvwiki', 'dzwiki', 'eewiki', 'elwiki', 'emlwiki', 'enwiki', 'eowiki', 'eswiki', 'etwiki', 'euwiki', 'extwiki', 'fawiki', 'ffwiki', 'fiu_vrowiki', 'fiwiki', 'fjwiki', 'fowiki', 'frpwiki', 'frrwiki', 'frwiki', 'furwiki', 'fywiki', 'gagwiki', 'ganwiki', 'gawiki', 'gdwiki', 'glkwiki', 'glwiki', 'gnwiki', 'gomwiki', 'gotwiki', 'guwiki', 'gvwiki', 'hakwiki', 'hawiki', 'hawwiki', 'hewiki', 'hifwiki', 'hiwiki', 'howiki', 'hrwiki', 'hsbwiki', 'htwiki', 'huwiki', 'hywiki', 'hzwiki', 'iawiki', 'idwiki', 'iewiki', 'igwiki', 'iiwiki', 'ikwiki', 'ilowiki', 'iowiki', 'iswiki', 'itwiki', 'iuwiki', 'jawiki', 'jamwiki', 'jbowiki', 'jvwiki', 'kaawiki', 'kabwiki', 'kawiki', 'kbdwiki', 'kgwiki', 'kiwiki', 'kjwiki', 'kkwiki', 'klwiki', 'kmwiki', 'knwiki', 'koiwiki', 'kowiki', 'krcwiki', 'krwiki', 'kshwiki', 'kswiki', 'kuwiki', 'kvwiki', 'kwwiki', 'kywiki', 'ladwiki', 'lawiki', 'lbewiki', 'lbwiki', 'lezwiki', 'lgwiki', 'lijwiki', 'liwiki', 'lmowiki', 'lnwiki', 'lowiki', 'lrcwiki', 'ltgwiki', 'ltwiki', 'lvwiki', 'maiwiki', 'map_bmswiki', 'mdfwiki', 'mgwiki', 'mhrwiki', 'mhwiki', 'minwiki', 'miwiki', 'mkwiki', 'mlwiki', 'mnwiki', 'mowiki', 'mrjwiki', 'mrwiki', 'mswiki', 'mtwiki', 'muswiki', 'mwlwiki', 'myvwiki', 'mywiki', 'mznwiki', 'nahwiki', 'napwiki', 'nawiki', 'nds_nlwiki', 'ndswiki', 'newiki', 'newwiki', 'ngwiki', 'nlwiki', 'nnwiki', 'novwiki', 'nowiki', 'nrmwiki', 'nsowiki', 'nvwiki', 'nywiki', 'ocwiki', 'omwiki', 'orwiki', 'oswiki', 'pagwiki', 'pamwiki', 'papwiki', 'pawiki', 'pcdwiki', 'pdcwiki', 'pflwiki', 'pihwiki', 'piwiki', 'plwiki', 'pmswiki', 'pnbwiki', 'pntwiki', 'pswiki', 'ptwiki', 'quwiki', 'rmwiki', 'rmywiki', 'rnwiki', 'roa_rupwiki', 'roa_tarawiki', 'rowiki', 'ruewiki', 'ruwiki', 'rwwiki', 'sahwiki', 'sawiki', 'scnwiki', 'scowiki', 'scwiki', 'sdwiki', 'sewiki', 'sgwiki', 'shwiki', 'simplewiki', 'siwiki', 'skwiki', 'slwiki', 'smwiki', 'snwiki', 'sowiki', 'sqwiki', 'srnwiki', 'srwiki', 'sswiki', 'stqwiki', 'stwiki', 'suwiki', 'svwiki', 'swwiki', 'szlwiki', 'tawiki', 'tcywiki', 'tenwiki', 'testwiki', 'test2wiki', 'tetwiki', 'tewiki', 'tgwiki', 'thwiki', 'tiwiki', 'tkwiki', 'tlwiki', 'tnwiki', 'towiki', 'tpiwiki', 'trwiki', 'tswiki', 'ttwiki', 'tumwiki', 'twwiki', 'tyvwiki', 'tywiki', 'udmwiki', 'ugwiki', 'ukwiki', 'urwiki', 'uzwiki', 'vecwiki', 'vepwiki', 'vewiki', 'viwiki', 'vlswiki', 'vowiki', 'warwiki', 'wawiki', 'wg_enwiki', 'wowiki', 'wuuwiki', 'xalwiki', 'xhwiki', 'xmfwiki', 'yiwiki', 'yowiki', 'zawiki', 'zeawiki', 'zh_classicalwiki', 'zh_min_nanwiki', 'zh_yuewiki', 'zhwiki', 'zuwiki')

avg 4233.2972 geomean 877.9545590176596 std 287848.4865 cnt 341530

Alright, this was all very verbose but important if I made a mistake in how I looked at the figures. But so far this confirms that this is an issue with wmf.18 and not something funny with the data like browser distribution or freak values. The patterns is clear: deploy wmf.18 and get a 400ms regression on mediaWikiLoadComplete.

@Krinkle would you be ok with reverting https://gerrit.wikimedia.org/r/#/c/304158 entirely on wmf.18 as an experiment to rule out ResourceLoader as the cause?

Even with setTimeout, we're introducing one where there was none and we're calling globalEval on smaller chunks.

According to this https://developer.mozilla.org/en-US/docs/Web/API/WindowTimers/setTimeout#Reasons_for_delays_longer_than_specified setTimeouts are clamped to at least a second on inactive tabs. Tabs might still explain this phenomenon even with the polyfill active for everyone.

Change 312065 had a related patch set uploaded (by Krinkle):
Revert "resourceloader: Make cache-eval in mw.loader.work asynchronous"

https://gerrit.wikimedia.org/r/312065

Change 312065 merged by jenkins-bot:
Revert "resourceloader: Make cache-eval in mw.loader.work asynchronous"

https://gerrit.wikimedia.org/r/312065

Mentioned in SAL (#wikimedia-operations) [2016-09-21T19:24:06Z] <krinkle@tin> Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.js: T146099 (duration: 01m 41s)

Looks like that did the trick:

Now we know the source of the apparent regression and we can investigate if it's a net gain or net loss. @Krinkle I'll let you deal with what's next.

I think it's safest to revert on all the branches and study it better/add better monitoring before a re-rollout, but it's your call.

  • Affected metrics
  • RUM Median story
  • RUM p95 story
  • WebPageTest story

Affected metrics

NavigationTiming metrics around September 8th:

(Links to Grafana viewing from 20160801 to 20160918)

MetricDifferenceGraph
responseStartunaffected-
domInteractive (From navigationStart to document received, parsed, and blocking resources ready)Median: -50ms (from 870ms to 820ms)-
p95: -6% (4.5s to 4.2s)-
firstPaintMedian: unaffected (~1000ms)
p95: +441% (from 4.9s to 21.6s)
domCompleteMedian:  -20% (from 1900ms to 1500ms)
p95: -8% (from 9.4s to 8.7s)-
loadEventEndMedian:  -25% (from 2000ms to 1600ms)
p95:  -10% (from 10s to 9s)-
mediaWikiLoadComplete (Time startup.js loaded execution and NavTiming module being loaded)Median:  +400ms (from 900ms to 1300ms)
p95: +100% (from 5s to 10s)

RUM Median story

Before:

0s..870ms1000ms..1900ms2000ms
Navigation TimingnavigationStartdomInteractivefirstPaintdomCompleteloadEvent
MediaWikimediaWikiLoadStart..(+900ms)mediaWikiLoadEnd

After:

0s..820ms1000ms..1500ms1600ms
Navigation TimingnavigationStartdomInteractivefirstPaintdomCompleteloadEvent
MediaWikimediaWikiLoadStart..(+1300ms)mediaWikiLoadEnd

For the median use case it seems like events mostly re-ordered, in a way that ultimately actually makes the page finish loading quicker. Assuming that this is caused by "resourceloader: Make cache-eval asynchronous ", it happened by letting the browser prioritise getting the page ready, on-screen and interactive; in favour of spreading out module execution.

RUM p95

Before:

0s..4.5s4.9s..9.4s10s
Navigation TimingnavigationStartdomInteractivefirstPaintdomCompleteloadEvent
MediaWikimediaWikiLoadStart..(+5s)mediaWikiLoadEnd

After:

0s..4.2s8.7s9s..21.6s
Navigation TimingnavigationStartdomInteractivedomCompleteloadEventfirstPaint
MediaWikimediaWikiLoadStart..(+10s)mediaWikiLoadEnd

Here too overall page load is faster. But JS execution is spread out even more. This could be due to various reasons. The most likely reasons is a combination of the following three reasons.

  • requestIdleCallback in chrome causing a delay upto 30s (fixed in master)
  • setTimeout being deprioritized in background tabs (probably acceptable and desirable; not sure if a significant amount of page views initialise in a background tab and/or pre-render).
  • Cache-eval goes in chunks of 50ms. On slower devices (e.g. mobile) will require a lot of chunks, compared to the previous approach of executing everything in one go. However the downside is that this will render the user's browser unresponsive for a long time - which is arguably a much worse user experience. To be confirmed.

WebPageTest

https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown
(user: Anonymous, project: enwiki, page: Facebook)

Desktop first view:

  • SpeedIndex: Unaffected.
  • Start render:
    • us-east1/Firefox: Consistently hitting the low-end more frequently (range unchanged).
    • us-east1/Chrome: Range is consistently 100ms lower.
    • us-east1/IE: Unchanged.
  • Fully loaded:
    • us-east1/Firefox: Unchanged.
    • us-east1/Chrome: Consistently 100ms lower.
    • us-east1/IE: Unchanged.
  • mwLoadStart/mwLoadEnd
    • Firefox: Unchanged.
    • Chrome: Consistently ends 100ms earlier.
    • IE: No clear change.

Desktop repeat view (Chrome only):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?var-view=repeatView

  • SpeedIndex: 1.2s -> 800ms.
  • Start render: 790ms -> 570ms.
  • Fully loaded: 2.6s -> 2.3s.

Real Mobile (Motorola G, 3G-Fast):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=22&fullscreen

  • SpeedIndex: About 200 lower on average
  • fullyLoaded: No clear change.
  • mwLoadStart/End: No clear change.
  • render: No clear change.

Second page view (Chrome, emulateMobile):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=26&fullscreen

  • SpeedIndex: -100 (2014 -> 1930)
  • render: -100 (1990 -> 1891)
  • fullyLoaded: +300ms (2.1s -> 2.4s)
  • mwLoadStart: +/-0 (210ms)
  • mwLoadEnd: +200ms (2.4s -> 2.6s)
greg added a comment.Sep 21 2016, 8:48 PM

File follow-up tasks and resolve this one?

ori added a comment.Sep 22 2016, 4:14 PM

@Krinkle, @Gilles: are you satisfied that the change represents an improvement over the previous status quo, or do you still think we need to back it out and study it further?

ori assigned this task to Krinkle.Sep 22 2016, 4:29 PM

OK, so we have a plausible explanation for why mwLoadComplete taking longer is fine, but not one for first paint, which also spiked and flattened with the roll-out and revert:

In light of that, I support the recommendation @Gilles made earlier -- to back this out for now, by reverting it in the latest branch and in master. @Krinkle, can you take care of that?

Mentioned in SAL (#wikimedia-operations) [2016-09-22T21:04:19Z] <krinkle@tin> Synchronized php-1.28.0-wmf.20/resources/src/mediawiki/mediawiki.js: T146099 (duration: 00m 48s)

Krinkle closed this task as Resolved.Sep 23 2016, 6:34 PM

File follow-up tasks and resolve this one?

Filed T146510 for investigating and mitigating it before we try async cache-eval again.