
Significant mobile web performance regression observed at deployment of 1.34.0-wmf.22
Closed, Resolved · Public

Description

https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.22

RUM

Capture d'écran 2019-09-17 12.31.00.png (394×634 px, 72 KB)

Affected RUM metrics (non-exhaustive list):

  • domInteractive
  • domComplete
  • firstPaint
  • loadEventEnd
  • processing

Latency and request/response times are unaffected, ruling out a networking issue.

Looking at the platform dashboard, it's very clear that this is a mobile web-only regression.

Synthetic

Only visible for the Sweden article on WPR. Not for the other articles on WPR, nor on WPT:

Capture d'écran 2019-09-17 17.10.16.png (731×796 px, 85 KB)

Human

Survey satisfaction ratios on mobile:

Day | eswiki | ruwiki
Sep 14 | 88.01% | 89.10%
Sep 15 | 88.07% | 88.37%
Sep 16 | 86.24% | 86.31%
Sep 17 (partial) | 84.75% | 87%

Event Timeline


Looking at a HAR comparison, ruwiki mobile runs before and after the change:

https://compare.sitespeed.io/?har2=https://webpagereplay-wikimedia.s3.us-east-1.amazonaws.com/ruwiki/mobile/chrome/50/Wikipedia/2019-09-17-09-29/browsertime.har.gz&compare=1&har1=https://webpagereplay-wikimedia.s3.us-east-1.amazonaws.com/ruwiki/mobile/chrome/50/Wikipedia/2019-09-16-08-10/browsertime.har.gz&compare=1

It seems like the main cause happens before DOMContentLoaded, which is significantly delayed despite the waterfall of downloads preceding it being identical between the two runs. This suggests that the main thread is busy for longer before reaching DOMContentLoaded. However, when looking at the CPU section, it seems like the more recent, slower run is actually spending less time on CPU overall:

Capture d'écran 2019-09-17 13.27.40.png (581×1 px, 82 KB)
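For reference, the page timings behind that comparison can be read straight out of the two HAR files. A minimal Node.js sketch, assuming the two browsertime.har.gz files from the compare link above have been downloaded and gunzipped (the local file names are placeholders):

const fs = require('fs');

// HAR 1.2 stores per-page timings in log.pages[].pageTimings:
// onContentLoad = ms until DOMContentLoaded, onLoad = ms until load.
function pageTimings(path) {
    const har = JSON.parse(fs.readFileSync(path, 'utf8'));
    return har.log.pages.map((p) => ({
        page: p.title,
        onContentLoad: p.pageTimings.onContentLoad,
        onLoad: p.pageTimings.onLoad
    }));
}

console.log('before (2019-09-16 run):', pageTimings('browsertime-2019-09-16.har'));
console.log('after  (2019-09-17 run):', pageTimings('browsertime-2019-09-17.har'));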

I've looked at two Chrome traces, before and after: there is a forced reflow before first paint that on the test machine increased from 66 ms to 120 ms.

This was article-specific and not due to the release.
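As an illustration of what a forced reflow costs, here's a minimal browser-console sketch of the general pattern (a style write followed by a layout read); this is not the actual regressing code, just the mechanism:

var t0 = performance.now();
document.body.style.paddingTop = '1px';  // style write invalidates layout
var h = document.body.offsetHeight;      // layout read forces a synchronous reflow
var t1 = performance.now();
console.log('forced reflow took ' + (t1 - t0).toFixed(1) + ' ms (height ' + h + 'px)');
document.body.style.paddingTop = '';     // undo the test change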

@Krinkle I think the quickest course of action is to deploy a revert of the startup module optimisations on its own to production. I see these as the most likely outcomes:

  • It's actually what caused this sharp regression, and we've greatly narrowed down the investigation
  • These startup module changes were a performance improvement, masked by a regression somewhere else in '22, and we get to find out by how much this startup module size reduction exactly improves things. And we've ruled out a potential cause.

Considering the waterfalls seen in the WPR runs I linked to earlier, I think it's pretty clear that it's something coming from early requests (since everything is downloading at exactly the same time), not modules whose load is triggered at the load event and which wouldn't be cached in the testing environment.

HTML is ruled out because it didn't change, and so are images. That leaves only two things that changed (see the measurement sketch after this list):

  • startup JS (whose size decreased, as we know)
  • startup CSS (whose size increased)
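A quick way to measure the render-blocking CSS weight on a given page view is the Resource Timing API. A minimal browser-console sketch (transferSize is 0 for cache hits, so use a cold load):

// List the render-blocking stylesheet requests (load.php with only=styles)
// and how many bytes each transferred over the network.
performance.getEntriesByType('resource')
    .filter(function (entry) {
        return entry.initiatorType === 'link' && entry.name.indexOf('only=styles') !== -1;
    })
    .forEach(function (entry) {
        console.log(entry.name, entry.transferSize + ' bytes transferred');
    });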

@Jdlrobson https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/533359/ looks like the change that increased startup CSS size for the mobile site on 1.34.0-wmf.22. Is that correct?

This is interesting in the context of this investigation, since the regression is mobile-only.

At this point, since this change only affected mobile, increased the startup CSS size (which blocks rendering) and increased the amount of time parsing/applying CSS prior to the first render, this becomes the new prime suspect, IMHO. I would suggest reverting the change to see if we get back to our original performance on mobile.

If it did it was only temporary. That code was already being loaded by Minerva and this code simply moved repos from one to the other (see https://gerrit.wikimedia.org/r/#/q/I1815a730fc12f68c287d58f2de6f92e05db3abd2). Reverting would break toggling on the mobile site.

In terms of CSS - most of this CSS is loaded via JS. Only the new https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/533359/7/resources/mobile.init.styles.less should be render blocking

@Krinkle does T229245: Reduce size of module version hash mean that all RL modules' cache was invalidated in clients at once (localStorage/browser cache), instead of, as I imagine typically happens, only the modules whose code was touched since the last deployment?

Yes, that's right. We've done this several times throughout these months, including two weeks in a row this month, without visible lasting impact. For example, change 528300 invalidated all localStorage cache following 1.34.0-wmf.21.
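For anyone reproducing this, a minimal browser-console sketch to see how much module code is currently cached in the ResourceLoader localStorage store (the "MediaWikiModuleStore" key prefix is an assumption based on current MediaWiki; adjust if it differs):

Object.keys(localStorage)
    .filter(function (key) { return key.indexOf('MediaWikiModuleStore') === 0; })
    .forEach(function (key) {
        var size = localStorage.getItem(key).length;
        console.log(key + ': ~' + (size / 1024).toFixed(1) + ' K characters of cached module code');
    });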

Regarding other points where we shrunk the startup module, there was an even bigger drop the week of 1 Aug 2019 (1.34.0-wmf.16) which is when the Wikidata change went out. That shrunk startup by 2.6KB (compressed). At the time Amir and I observed a small improvement in the synthetic data with high-confidence, see T203696#5387672. The change in August happened to get reverted for a few hours by an unrelated train blocker and then re-deployed, which luckily allowed us to see with high-confidence that the change was correlated.

This week's changes shrunk startup by an additional 2.4KB (compressed).

From the task description:

On mobile webpagereplay, however, the spike of First Visual Change and Speed Index right at the deployment time is very clear.

Barack Obama

Capture d'écran 2019-09-17 12.55.52.png (519×1 px, 80 KB)

Link is WebPageReplay drilldown.

These graphs, however, have a 24h moving average applied. Here's the same without it:

24h moving avg | 2h median
WPR/mobile/Obama
Screenshot 2019-09-17 at 15.34.30 1.png (1×2 px, 112 KB)
Screenshot 2019-09-17 at 15.34.56.png (1×2 px, 105 KB)
WPR/mobile/Facebook
Screenshot 2019-09-17 at 15.43.09.png (1×2 px, 92 KB)
Screenshot 2019-09-17 at 15.43.04.png (1×2 px, 109 KB)

In both cases, it looks like the measures are now back to their pre-deploy values.

> If it did it was only temporary. That code was already being loaded by Minerva and this code simply moved repos from one to the other (see https://gerrit.wikimedia.org/r/#/q/I1815a730fc12f68c287d58f2de6f92e05db3abd2). Reverting would break toggling on the mobile site.

But the 1.34.0-wmf.22 deployment is a net increase of startup CSS size, because the removal patch hasn't been deployed yet. It should be deployed ASAP.

Additionally, are the old and new code identical, or did some CSS selectors change?

@Jdlrobson follow-up question. If the code is indeed identical (see my previous comment), does that mean that the same CSS rules are being "loaded twice" on top of each other in the current interim situation? It shouldn't matter in theory, but you never know, could be part of the problem.

In synthetic WebPageReplay, we see the change is mobile-only and only continues to affect the Sweden article (not Obama or Facebook).

In real-user metrics, we see the change is mobile-only but generally continues to affect users. This suggests that whatever is affecting the Sweden article is likely affecting many other articles as well. We don't have Paint Timing from Safari, unfortunately (after 7 years, Apple still hasn't implemented it). But from Chrome Mobile we see it affects Chrome 76 and Chrome 77 equally (this upgrade is currently 50/50).

In synthetic WebPageTest, I'm trying to see what it is that has affected the Sweden article, and in particular what has stopped affecting the other two articles, or never affected them in the first place.

For WPT/enwiki-mobile/Sweden (dashboard):

  • First visible change: +1 second (5.5s -> 6.5s)
  • Last visible change: +3 seconds (9s -> 12s)
  • DOM elements: reduced (8351 -> 8192)
  • Transferred CSS: increased +0.2KB (7.1KB -> 7.3KB)
  • Transferred JS: increased +1.5KB (182.1KB -> 183.6KB)
  • CPU time in Layout: +3 seconds (4.0s -> 7.2s)
  • CPU time in Painting: +0.4 seconds (150ms -> 550ms)
  • CPU time in Scripting: +0.5 second (1.1s -> 1.6s)
  • CPU time in Loading, Transferred HTML, Transferred images, HTTP request count: unchanged.

For WPT/enwiki-mobile/Obama:

  • First visible change, Last visible change: unchanged
  • Transferred CSS: increased +0.2KB (7.1KB -> 7.3KB) [same as Sweden]
  • Transferred JS: increased +1.3KB (182.1KB -> 183.4KB) [almost same as Sweden]
  • CPU time in Layout: unchanged [different from Sweden]
  • CPU time in Painting: unchanged [different from Sweden]
  • CPU time in Scripting: +0.5 second (1.1s -> 1.6s) [same as Sweden]
  • CPU time in Loading, Transferred images, HTTP request count: unchanged.

This suggests the added JS/CSS size by itself didn't cause the regression, as they are downloaded on the unaffected page as well. Given the impact on Layout only, this suggests to me there is some added JS code that is causing a layout change or other style recalculation on some pages but not others. Or some CSS code that is expensive to render on some pages, and unused on others.
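One cheap way to probe the "expensive on some pages, unused on others" theory is to count, per article, how many style rules match nothing. A minimal browser-console sketch (a rough heuristic only; cross-origin stylesheets are skipped), to be compared between Sweden and an unaffected article:

var total = 0, unmatched = 0;
Array.prototype.forEach.call(document.styleSheets, function (sheet) {
    var rules;
    try { rules = sheet.cssRules; } catch (e) { return; } // cross-origin sheet
    Array.prototype.forEach.call(rules || [], function (rule) {
        if (rule.type !== CSSRule.STYLE_RULE) { return; }
        total++;
        try {
            if (!document.querySelector(rule.selectorText)) { unmatched++; }
        } catch (e) { /* selector syntax querySelector can't handle */ }
    });
});
console.log(unmatched + ' of ' + total + ' style rules match nothing on this page');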

> This suggests the added JS/CSS size by itself didn't cause the regression, as they are downloaded on the unaffected page as well.

That's an overly eager conclusion. Synthetic testing runs in very specific conditions that are disconnected from reality. There is also variance between runs. The fact that only Sweden is affected in synthetic testing could be a red herring in terms of identifying the cause. It doesn't rule out JS/CSS as the cause. It might just be chance whether the very real issues affecting real users manifest themselves in synthetic testing runs, particularly since the synthetic tests are running an emulated mobile browser, not the real thing.

What I think is most useful from the WPT before/after runs is to get the diff. The differences seen might affect real users but not synthetic tests.

> If it did it was only temporary. That code was already being loaded by Minerva and this code simply moved repos from one to the other (see https://gerrit.wikimedia.org/r/#/q/I1815a730fc12f68c287d58f2de6f92e05db3abd2). Reverting would break toggling on the mobile site.

> But the 1.34.0-wmf.22 deployment is a net increase of startup CSS size, because the removal patch hasn't been deployed yet. It should be deployed ASAP.

The patches should have been deployed at the same time, so I am a little confused.
When I look at the Sweden article, the old module is not loaded and the new one is:

mw.loader.getState('skins.minerva.toggling') === 'registered' && mw.loader.getState('mobile.init') === 'ready'

and skins.minerva.toggling is an empty module: https://en.m.wikipedia.org/w/load.php?modules=skins.minerva.toggling

> Additionally, are the old and new code identical, or did some CSS selectors change?

The CSS should be the same but per above this shouldn't be a problem.

> @Jdlrobson follow-up question. If the code is indeed identical (see my previous comment), does that mean that the same CSS rules are being "loaded twice" on top of each other in the current interim situation? It shouldn't matter in theory, but you never know, could be part of the problem.

There should be no CSS being loaded twice, unless for some reason an old cached version lives in localStorage or in the local browser cache on the machine. The code was added and removed in the same deploy (at least that was my understanding and that's what I'm seeing).
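For the "loaded twice" question, a minimal browser-console sketch that lists selectors appearing in more than one accessible style rule (only a heuristic, since legitimate CSS repeats selectors too, but a large block of exact duplicates would stand out):

var seen = Object.create(null);
Array.prototype.forEach.call(document.styleSheets, function (sheet) {
    var rules;
    try { rules = sheet.cssRules; } catch (e) { return; } // cross-origin sheet
    Array.prototype.forEach.call(rules || [], function (rule) {
        if (rule.selectorText) {
            seen[rule.selectorText] = (seen[rule.selectorText] || 0) + 1;
        }
    });
});
Object.keys(seen)
    .filter(function (selector) { return seen[selector] > 1; })
    .slice(0, 20)
    .forEach(function (selector) { console.log(seen[selector] + 'x: ' + selector); });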

> There should be no CSS being loaded twice, unless for some reason an old cached version lives in localStorage or in the local browser cache on the machine.

Both of these have hard version validation (localStorage) or short cache expiry without grace period (HTTP).

> When I look at the Sweden article, the old module is not loaded and the new one is:

Yep, confirmed, it's not even attempted to be loaded. Looking at the related changes in Gerrit (A and B), I also don't see any change in size between them that would explain the difference (the JS got a tiny bit smaller due to not needing the desktop check, the CSS got a tiny bit bigger due to needing to set up the same selector twice, and it added two declarations that weren't there before).

Yet, we do see a +1.5KB increase in JavaScript, which is after the ~2KB decrease from the startup module. Were there any other notable changes in the 1.5 weeks before last Tuesday that may have increased or altered the JS payload on (most) mobile page views?

> Yet, we do see a +1.5KB increase in JavaScript, which is after the ~2KB decrease from the startup module. Were there any other notable changes in the 1.5 weeks before last Tuesday that may have increased or altered the JS payload on (most) mobile page views?

I'm not seeing anything obvious in the beta cluster graph for JS https://grafana.wikimedia.org/d/000000205/mobile-2g?panelId=69&fullscreen&orgId=1&from=now-30d&to=now - we've been mostly removing code recently. The most notable change we made was using mediawiki core's table of contents - previously this was rendered by JS and now is in the HTML and hidden via CSS. Could that have anything to do with it?

Scanning through the JS, the only potentially new modules I can see requested are:

  • oojs throttle debounce
  • banner code
  • skins.minerva.share.options (that shouldn't be loaded to stable users - not sure if this is a new regression or an old one)
  • QuickSurveys init modules - did we turn on any surveys recently?

Other than that, icons may be to blame. We've been making lots of icon changes recently and they impact CSS and JS. All icons should be compressed by default, but it's possible one of the icon packs introduced more icons than intended due to how variants work (a theory is that someone thought they were adding 1 icon and actually added 3). It's probably worth comparing before and after URLs and looking closely at any packs containing icons to see how their contents may have changed.
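One way to do that comparison is the built-in ResourceLoader inspector, which (if memory serves) can print a per-module size report. A minimal browser-console sketch; run it on a wmf.21 wiki and a wmf.22 wiki and diff the output:

// Load the inspector module explicitly, then print loaded modules by size.
mw.loader.using('mediawiki.inspect').then(function () {
    mw.inspect('size');
});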

Diffing beautified production head CSS as served on the 16th before the deployment and as served today:

https://github.com/gi11es/T233095/commit/b50cb627f64464fa4e72f60de788bdcb80207789

I stand by my assessment that your change introduces extra weight in the head CSS.

Maybe it's some LESS compilation optimisations that can't be done anymore, but the startup CSS got heavier, and as you can see there the bulk of new additional stuff not found in deleted code is the TOC thing.
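For reproducibility, a minimal Node.js sketch of how such a snapshot can be produced: fetch the render-blocking styles request (copy the exact load.php URL from the page's <link rel="stylesheet">; the URL below is a placeholder) and add line breaks so a plain diff works rule by rule:

const https = require('https');

// Placeholder: paste the real load.php?...&only=styles&skin=minerva URL here.
const url = 'https://en.m.wikipedia.org/w/load.php?...&only=styles&skin=minerva';

https.get(url, (res) => {
    res.setEncoding('utf8');
    let css = '';
    res.on('data', (chunk) => { css += chunk; });
    res.on('end', () => {
        // Naive "beautify": one rule/declaration per line so diffs are stable.
        process.stdout.write(css.replace(/}/g, '}\n').replace(/;/g, ';\n'));
    });
});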

> Diffing beautified production startup CSS as served on the 16th before the deployment and as served today:

> https://github.com/gi11es/T233095/commit/b50cb627f64464fa4e72f60de788bdcb80207789

> I stand by my assessment that your change introduces extra weight in the startup CSS.

The extra weight appears to be from the table of contents change (I720e62a578f0c7a14f4b5a698004471c85e54bc8) and the userAvatar change (Ie956ac82869ea5ff91d5d39d6b679a4803e46458), not the toggling change you pointed at before. Is there a way to revert those patches in a controlled non-production environment to see if the problem goes away and determine which (or whether both) is the cause?

My guess is table of contents is more problematic as that is loaded via link tag and is near the top of the page. I noticed this while looking at the code:
https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/537494 Drop empty toc.less file and import

Much of this code only works with client-js - is it possible that the client-js class being added to the body is causing the issue?

> Much of this code only works with client-js - is it possible that the client-js class being added to the body is causing the issue?

The body class is set before any CSS is applied. Should be fine.

> Looking at the [toggling] changes in Gerrit (A and B), I also don't see any change in size between them that would explain the difference […]

> The extra weight appears to be from the table of contents change (I720e62a578f0c7a14f4b5a698004471c85e54bc8) and the userAvatar change (Ie956ac82869ea5ff91d5d39d6b679a4803e46458), not the toggling change […]

Yeah, the toggling code seems fine. My guess is it'll be either the added CSS or JS code. Gilles is prepping a diff now.

I'm seeing a lot of additions in phpCharToUpper.json due to https://gerrit.wikimedia.org/r/507793

Jon's question: why is that loading on a wiki that's not Georgian language?
Jon's answer to own question: because it's mw.Title library. It would be interesting to use packageFiles to make this JSON language specific.

James's better answer: All wikis can have Georgian content, including in titles. This isn't content-language-specific code.

And the behemoth diff of the before and after "regular" JS:

https://github.com/gi11es/T233095/commit/c9eb328eaf67ed368b1e6f4cc8ed90d3a2234575

  • CentralNotice: Some minor changes from their routine branch pull-through (c26ef03e7aad1ce1). Roughly the same size as before; nothing stands out as expensive to compute or render.
  • MW core / oojs: New version release (cefb64f9d88d706). Added a new OO.EventEmitter method, which is +86 bytes after gzip.
  • MW core / mediawiki.Title: Increase in charToUpper.json data (416895821fdb68). Could explain about ~1KB of the size increase. Not layout or computational, though.
  • MobileFrontend: Moved a few things around. Removed "skins.minerva.scripts/toc.js", which is nice. About ~0.4KB less JS after gzip.

Change 537505 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/VisualEditor@master] Remove cache group setting from styles modules

https://gerrit.wikimedia.org/r/537505

Change 537514 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537514

Change 537505 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Remove cache group setting from styles modules

https://gerrit.wikimedia.org/r/537505

Change 537523 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/VisualEditor@wmf/1.34.0-wmf.22] Remove cache group setting from styles modules

https://gerrit.wikimedia.org/r/537523

Change 537514 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537514

Change 537530 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.34.0-wmf.22] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537530

Change 537531 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.34.0-wmf.23] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537531

Change 537530 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.22] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537530

Change 537531 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.23] mediawiki.Title: Reduce transfer size of phpCharToUpper.json

https://gerrit.wikimedia.org/r/537531

Change 537523 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.34.0-wmf.22] Remove cache group setting from styles modules

https://gerrit.wikimedia.org/r/537523

I'm no longer able to see any regression in the synthetic tests. Most of them were only raised for 1 or 2 runs around the branch deployment on 16 Sept from 14:00 to 16:00. These appeared prolonged in the older version of some graphs due to a moving average. The exception was the WebPageReplay/enwiki-mobile/chrome/Sweden article, which did not recover until 17 Sept 16:00. Most of our investigation took place after both had recovered, including the above backports.

On RUM, the mobile regression is still visible. Note that it did not affect desktop at any point. And within mobile, looking at the browser breakdown it appears to affect Chrome Mobile (Android) but not Mobile Safari.

I initially thought the reason it wasn't visible on Mobile Safari was the lack of Paint Timing API support. However, the regression affects loadEventEnd and firstPaint equally for mobile overall (dashboard). They are both elevated by ~200ms at the 75th percentile.

Looking at the by-browser breakdown (dashboard) we can see it clearly visible at the median as well as the 75th percentile. However, loadEventEnd (at both median and p75) does not appear to be elevated in Mobile Safari.

I'm not sure what it could be at this point. Looking back at the Sitespeed waterfall charts from Peter's comment, I do see the payload for MobileFrontend icons correlating with the delayed first visual change. Looking at Gilles' allthethings-js diff, I see there is a new "mobile.ooui.icons" module loaded with JavaScript with several embedded icons. On the other hand, I also see on the left-hand side of that same diff that several icon packs were removed or shrunk, so I'm not sure whether that has increased weight or not. One thing to look at could be to disable embedding of those icons in favour of regular URL-based loading, which would likely happen soon regardless (per T121730). Parsing of data URIs is known to affect mobile devices more than desktop, and Android is on average probably less powerful than iOS given it covers a wider spectrum of device classes. Backporting a no-embed change, or applying it to wmf.22 only as an experiment, could help rule this out.
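To put a rough number on the embedded-icon suspicion, a minimal browser-console sketch that counts data: URIs across the accessible stylesheets and how much CSS text they take up:

var count = 0, chars = 0;
Array.prototype.forEach.call(document.styleSheets, function (sheet) {
    var rules;
    try { rules = sheet.cssRules; } catch (e) { return; } // cross-origin sheet
    Array.prototype.forEach.call(rules || [], function (rule) {
        var matches = (rule.cssText || '').match(/url\((["']?)data:[^)]*\)/g) || [];
        count += matches.length;
        matches.forEach(function (m) { chars += m.length; });
    });
});
console.log(count + ' embedded data: URIs, ~' + (chars / 1024).toFixed(1) + ' K characters of CSS text');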

There were lots of reports today of the mobile site blanking out - just mentioning this in case there's any relation to this issue (I don't know what could have caused this):
https://twitter.com/AdaMcVean/status/1174133105818570752
https://twitter.com/finalfantasy420/status/1173956372968247298
https://twitter.com/VulpesOkamis/status/1174042412190195712

Interesting. Two of those articles have their wmf.21-generated HTML cached still (maybe all three at time of reporting). E.g. https://en.m.wikipedia.org/wiki/Shin_Megami_Tensei:_Digital_Devil_Saga contains:

<script>
RLPAGEMODULES=["mediawiki.page.startup","skins.minerva.watchstar","skins.minerva.options", "skins.minerva.scripts","skins.minerva.toggling","ext.gadget.switcher","ext.centralauth.centralautologin","ext.visualEditor.targetLoader","mobile.site","mobile.init","ext.relatedArticles.readMore.bootstrap","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","mw.externalguidance.init","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp"]</script>
<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cskins.minerva.base.styles%7Cskins.minerva.content.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.images%2Cwikimedia&amp;only=styles&amp;skin=minerva"/><meta name="generator" content="MediaWiki 1.34.0-wmf.21"/>

And after a purge:

<script>
RLPAGEMODULES=["mediawiki.page.startup","skins.minerva.watchstar","skins.minerva.options", "skins.minerva.scripts","ext.gadget.switcher","ext.centralauth.centralautologin","ext.visualEditor.targetLoader","mobile.site","mobile.init","ext.relatedArticles.readMore.bootstrap","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","mw.externalguidance.init","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp"];</script>
<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cmobile.init.styles%7Cskins.minerva.base.styles%7Cskins.minerva.content.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.images%2Cwikimedia&amp;only=styles&amp;skin=minerva"/><meta name="generator" content="MediaWiki 1.34.0-wmf.22"/>

This is normal in so far that HTML remains cached for 4-7 days unless it is edited or purged. But the above list of JS and CSS modules might help narrow down a possible side-effect.
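A quick check anyone can run in the browser console to tell whether the page they are looking at is still the cached wmf.21 render, based on the generator meta tag shown above:

var gen = document.querySelector('meta[name="generator"]');
console.log(gen ? gen.content : 'no generator meta found');
// e.g. "MediaWiki 1.34.0-wmf.21" means cached pre-deploy HTML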

I could see this on a couple of URLs in the WebPageReplay tests:
https://de.m.wikipedia.org/wiki/Facebook
https://es.m.wikipedia.org/wiki/Dolly_Parton
https://es.m.wikipedia.org/wiki/Estocolmo

All having

<meta name="generator" content="MediaWiki 1.34.0-wmf.21"/>

What I could see with the new setup is that the long task before first paint changed:

Screen Shot 2019-09-18 at 8.39.22 AM.png (978×686 px, 87 KB)

And I can see that we spend +40ms in layout before first paint for https://es.m.wikipedia.org/wiki/Estocolmo (attaching devtools log), but I cannot see why; maybe we should look into increasing the log level of events.

I could also verify that JS executing time is smaller. And there is no more time spent in layout overall; it's more that it moved to happen before first paint.

At this point I think it's worth picking a wiki where we can see the effect on RUM clearly, then purging all its articles and seeing if it helps.

If purging the article cache is connected to the synthetic regression, it could be that RUM is also only affected by articles which haven't had their cache purged yet. I've diffed some CSS and JS, but not HTML. It's possible that a new JS feature came out alongside an HTML change, assuming that the modified HTML would be there, and somehow misbehaves performance-wise when faced with cached HTML. This is a very plausible scenario where whoever developed that new feature didn't think of the cached-HTML scenario, always looking at new HTML in their development environment.

Looking at the eswiki mobile site:

day | median loadEventEnd (ms) | median firstPaint (ms)
1 | 1310 | 962
2 | 1383 | 1012
3 | 1447 | 1044
4 | 1450 | 1031
5 | 1453 | 1016
6 | 1512 | 1045
7 | 1372 | 977
8 | 1335 | 964
9 | 1403 | 1007
10 | 1453 | 1031
11 | 1434 | 1029
12 | 1451 | 1052
13 | 1375 | 1025
14 | 1336 | 1013
15 | 1272 | 983
16 | 1516 | 1153
17 | 1611 | 1219
18 | 1605 | 1204

Seems like the bump is indeed visible on a per-wiki basis. I'm going to purge eswiki.

Mentioned in SAL (#wikimedia-operations) [2019-09-18T19:42:48Z] <gilles> T233095 Purging all pages on eswiki

Change 537752 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/537752

(only purged the eswiki desktop site after all, because purgeList.php needs this fix)

Change 537760 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.34.0-wmf.22] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/537760

Change 537760 merged by Gilles:
[mediawiki/core@wmf/1.34.0-wmf.22] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/537760

Mentioned in SAL (#wikimedia-operations) [2019-09-18T21:53:28Z] <gilles@deploy1001> Synchronized php-1.34.0-wmf.22/maintenance/purgeList.php: T233095 Make purgeList.php use getCdnUrls() (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-09-18T21:54:52Z] <gilles> T233095 Purging all eswiki articles (both desktop and mobile this time)

Change 537752 merged by Gilles:
[mediawiki/core@master] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/537752

Let's check RUM data for eswiki again in 24 hours, to see if the mass HTML purge had an effect.

Seems like it worked, here's mobile eswiki:

day | median loadEventEnd (ms) | median firstPaint (ms)
13 | 1375 | 1025
14 | 1336 | 1013
15 | 1272 | 983
16 | 1516 | 1153
17 | 1611 | 1219
18 | 1591 | 1192
19 | 1478 | 1076
20 | 1459 | 1059

Meanwhile on mobile ruwiki, which didn't get mass-purged:

day | median loadEventEnd (ms) | median firstPaint (ms)
13 | 1180 | 899
14 | 1120 | 864
15 | 1103 | 863
16 | 1288 | 996
17 | 1348 | 1045
18 | 1312 | 1011
19 | 1281 | 985
20 | 1334 | 1003

Next step should be to purge HTML of all wikis, IMHO.

Capture d'écran 2019-09-23 09.54.37.png (652×880 px, 48 KB)

It seems like the impact is decreasing over time, which is what should be expected if the old cached HTML theory is true.

Purging everything would speed up that process.

The main question at this point is: is it worth spending more time putting the changes through a comb to figure out the root cause? Or is it not a good investment in light of cache renewal probably making the regression go away?

Change 538680 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.34.0-wmf.23] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/538680

Change 538680 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.23] Make purgeList.php use getCdnUrls()

https://gerrit.wikimedia.org/r/538680

Mentioned in SAL (#wikimedia-operations) [2019-09-23T17:59:51Z] <gilles@deploy1001> Synchronized php-1.34.0-wmf.23/maintenance/purgeList.php: T233095 Make purgeList.php use getCdnUrls() (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-09-23T18:03:04Z] <gilles> T233095 Purge articles for all wikis: foreachwiki maintenance/purgeList.php --all --verbose

Next week I will put together a list of emails of everyone who merged code in that release and contact them describing the symptoms. With some luck, someone might realise that their change caused this regression when faced with cached HTML.

Gilles lowered the priority of this task from Unbreak Now! to High. Sep 23 2019, 6:16 PM

This bug, T233373, may also be relevant - a potential hypothesis is that if sections are not collapsed, it causes time-to-first-interactive issues. It's not possible to see in the runs whether section collapsing is enabled or not, but the timing would fit.

@Jdlrobson nice find! It certainly sounds consistent in terms of symptoms. I looked back at the screenshots for the affected webpagereplay synthetic test runs, and it's entirely possible that the sections were expanded, because the viewport in our tests is too small to see the sections...
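A minimal browser-console sketch for checking that directly in a test run; the selectors are assumptions about MobileFrontend's collapsible-section markup (collapsible-block / open-block), so verify them against the actual DOM before trusting the numbers:

var blocks = document.querySelectorAll('.collapsible-block');
var open = document.querySelectorAll('.collapsible-block.open-block'); // assumed "expanded" marker
console.log(open.length + ' of ' + blocks.length + ' sections are currently expanded');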

Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

TODO Note to self:

  • Write a short summary here about how we found it, what we knew then, what we know now.

Can we see the scrollbar in the screenshots? If so, that might also confirm the theory about sections not collapsing. In that case, something went wrong with our patches that we thought were cache-safe, and section collapsing was disabled on the site, which may have caused undesired performance regressions that are worth understanding (and which would possibly justify the fact that we collapse content by default on mobile).

For the record –

The first alerts fired on Mon 16 September, 4 hours after the new branch was deployed (1.34.0-wmf.22). This branch was originally scheduled to roll out on Thursday, but was blocked due to unrelated regressions with PHP 7.2 and MediaWiki core (T220747).

The first alert was the synthetic/webpagereplay one, and the RUM/navtiming alert fired shortly after.

webpagereplay alert
cap-wpr-1st_alert.png (1×2 px, 235 KB)
navtiming alert
cap-rum-2nd_alert.png (1×2 px, 260 KB)

Our RUM/navtiming alerts fire based on a 24-hour moving average, compared to that same moving average shifted one week back. This means we sometimes get false alarms: a very short-lived problem earlier on the same day may have caused very high times to be reported, which keeps the average elevated for hours after it was resolved. Conversely, if an issue last week caused very low times to be reported, the next week may see a false alarm when this week's normal times are compared against the abnormally low baseline from the week prior.
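As an illustration of that comparison logic (a minimal sketch; the 15% threshold below is a placeholder, not the real alert config):

// 24-hour moving average of an hourly metric series, compared against the
// same moving average one week (168 hours) earlier.
function movingAverage(values, windowSize) {
    return values.map(function (value, i) {
        var slice = values.slice(Math.max(0, i - windowSize + 1), i + 1);
        return slice.reduce(function (a, b) { return a + b; }, 0) / slice.length;
    });
}

function checkAlert(hourlyValues) {
    var avg = movingAverage(hourlyValues, 24);
    var now = avg[avg.length - 1];
    var weekAgo = avg[avg.length - 1 - 168];
    if (weekAgo !== undefined && now > weekAgo * 1.15) {
        console.log('ALERT: +' + Math.round((now / weekAgo - 1) * 100) + '% week-over-week');
    }
}

// checkAlert(hourlyLoadEventEnd); // hourly values, oldest first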

We confirmed the validity of the RUM regression on the dedicated dashboard (linked from the alert panel) where we could see that the regression was real:

cap-rum-breakdown.png (1×2 px, 331 KB)

We confirmed this in two ways. 1) On the 24h graph used for alerting (bottom row), we see that the yellow "last week" line looked pretty normal; the abnormal one is this week (green). 2) On the higher rows, with a 1h moving average and with no moving average, we also see that the issue was not short-lived or intermittent, but rather real and persistent.

After that, we looked at some other related metrics from navtiming on that same dedicated dashboard, where we saw that the overall page load time was also affected (domComplete, loadEventEnd). This indicated that it was not merely an issue of re-ordering work (e.g. some processing that normally happens after FirstPaint now happening before it), but that more work appears to be done overall, some of it before the browser would start rendering the page.

The fact that we got it from the synthetic test as well, and that we did not have an annotation on the synthetic dashboard about a browser upgrade happening in our test lab, meant that we could rule out any factor of upstream vendors or user-controlled factors being responsible (e.g. new browser release, or device system update).

At this point, we ruled out a few more things (at least for the initial investigation) by looking at which monitoring did not alert. For example, our RUM/navtiming monitoring for desktop pages did not fire. This suggested that the change might be specific to mobile, although it is not unusual for mobile devices and networks to be more sensitive to a small regression, so it could also have been a regression too small to affect desktop (not a universal truth, and not something we could optimise for, as desktop devices and WiFi can be just as slow or slower than mobile; but at least for aggregated metrics it holds up).
The synthetic monitoring for desktop pages, where we have very precise measurements and low-noise data, also did not fire. This added more weight to the idea of it being specific to mobile, given that our synthetic monitoring should have picked it up if desktop were affected (still not impossible, but unlikely).

At this point we started examining the synthetic-mobile results more closely, looking to see which wikis and which pages were impacted. All three enwiki pages we monitor (Barack Obama, Facebook, Sweden) were affected:

Barack Obama | Facebook | Sweden
Capture d'écran 2019-09-17 12.55.52.png (519×1 px, 80 KB)
Capture d'écran 2019-09-17 13.08.11.png (518×1 px, 80 KB)
Capture d'écran 2019-09-17 13.07.19.png (517×1 px, 77 KB)

The WebPageTest and Browsertime synthetic runs both capture detailed information about the test runs. Gilles and Peter analysed these, and found that the HTML did not change on these articles, and neither did the images. That left only two things: JS and CSS. The JS payload decreased (from known perf work going on that week, T229245). The CSS payload increased slightly. Gilles and Jon further investigated the changes in CSS payload and the related Git commits that rolled out that week.

While that was ongoing, I noticed that although the RUM graphs still looked bad, the synthetic monitoring had started to recover. In fact, it turned out the synthetic graphs had recovered long ago, but this was obscured by a moving average (the moving average made sense for the noisier RUM data, but wasn't needed for the lab results).

24h moving avg | 2h median
WPR/mobile/Obama
Screenshot 2019-09-17 at 15.34.30 1.png (1×2 px, 112 KB)
Screenshot 2019-09-17 at 15.34.56.png (1×2 px, 105 KB)
WPR/mobile/Facebook
Screenshot 2019-09-17 at 15.43.09.png (1×2 px, 92 KB)
Screenshot 2019-09-17 at 15.43.04.png (1×2 px, 109 KB)

For the synthetic tests, two of the three monitored articles had recovered. The regression both started and stopped quite suddenly. However, the Sweden article had not yet recovered at this time. It wasn't until halfway through Wed Sept 18 (the third day) that the Sweden article suddenly recovered.

cap-Sweden-recover.png (1×2 px, 267 KB)

We confirmed from the synthetic results that the HTML of the Sweden article changed from a cached copy generated by MediaWiki 1.34.0-wmf.21 to that of 1.34.0-wmf.22. This confirmed our suspicions that it was related to an interaction between a cached page and newly deployed stylesheets referenced from all pages.

The RUM data had also slowly started to recover at this point, although not by much. To confirm beyond a doubt that it was cache-induced, Gilles monitored the RUM data from two wikis in particular (eswiki and ruwiki) and purged the article HTML cache from Varnish for one whole wiki (eswiki). The next day we could see that one had recovered but the other hadn't. We cache HTML for 4-7 days, so one would expect the vast majority of traffic from real users to be on the new HTML by day 4 or 5; however, the RUM regression was still noticeable after 6 days. Follow-up: this is something to understand better. We might be wrong in our assumption that most cached HTML expires after 4 days.

For good measure we slowly purged the HTML for all wikis from Varnish over the course of ~2-3 hours. The regression is now no longer visible, and we can start to see a potential improvement (similar to the one on desktop) from unrelated perf work that week.

The article purge isn't over, FYI. The script on mwmaint1002, which processes wikis in alphabetical order, is working through Commons at the moment, and I think it's going to be stuck there for too long. Instead, I'm going to purge the biggest-traffic wikis one by one, starting with enwiki. We will let natural expiry take care of the long tail of smaller wikis.

Just reopening it to keep track of when the purge of the top wikis will be finished. I'm doing ruwiki right now. When the top 10 is done, we will have covered 80% of the traffic, which should be enough. Clearly the RUM metrics have made a recovery already. This is just to verify if we were missing out on a performance improvement thanks to the latest startup size reduction deployed on Sep 16 alongside the regression.

Purging of namespace 0 for the top 10 wikis is done.

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks