Page MenuHomePhabricator

Investigate Firefox SpeedIndex/First Visual Change alert 2019-09-03
Closed, ResolvedPublic

Description

We got an SpeedIndex alert on Firefox on WebPageReplay earlier today.

The alerting graph looks like this:

Screen Shot 2019-09-03 at 8.22.45 PM.png (662×1 px, 105 KB)

Also looking at the First Visual change looks worrying:

Screen Shot 2019-09-03 at 8.22.52 PM.png (650×1 px, 99 KB)

Checking a little more in deep, it seems it correlates to a release:

Screen Shot 2019-09-03 at 8.24.23 PM.png (1×1 px, 182 KB)

Looks ok in Chrome though:

Screen Shot 2019-09-03 at 8.26.11 PM.png (1×1 px, 143 KB)

And I can see the same thing for all three URLS on enwiki on WebPageTest too:

Screen Shot 2019-09-03 at 8.27.38 PM.png (636×1 px, 103 KB)

Screen Shot 2019-09-03 at 8.27.10 PM.png (638×1 px, 79 KB)

At last checked my own instance. Same thing there, only visible on Firefox, not Chrome:

Screen Shot 2019-09-03 at 8.29.46 PM.png (1×1 px, 149 KB)

And nothing changed in TTFB. I wonder if we can see this in RUM too?

Note: I think we run different Firefox versions, both 68 and 69, so its not to a specific version.

Lets look into RUM and what has been going out.

Event Timeline

Peter renamed this task from Investigate Firefox SpeedIndex/First Visual Change alert to Investigate Firefox SpeedIndex/First Visual Change alert 2019-09-03.Sep 3 2019, 6:32 PM
Peter created this task.

Did a quick check and this only happens for enwiki (not on none of the others we measure). All pages has an increase in the HTML (+1 or 2 kb) for that release but I haven't been able to identify what yet.

Besides the fact that some code was updated (version module version changes), I notice that the slower recent runs have an extra request that older ones don't have:

https://meta.wikimedia.org/w/load.php?lang=en&modules=ext.globalCssJs.site&skin=vector&version=07sqt3r

Oddly, that file is mostly empty, containing only:

mw.loader.implement("ext.globalCssJs.site@1r8qw7v");

@Krinkle any idea why this started showing up? What's the purpose of this module if it's empty?

Could be completely unrelated to the apparent performance regression in Firefox, but it seems bizarre enough to me to warrant investigation.

The big deployment that went out on the 3rd is "all wikis to 1.34.0-wmf.20".

globalCssJs was touched by a bot in this release: https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.20#GlobalCssJs

Having looked at the WPT runs closely, it does coincide very well with the 1.34.0-wmf.20 deployment.

Something peculiar I've noticed is a significant increase in payload size of the main document, which Peter pointed out already, by about 2kb compressed, on all 3 articles, when looking at the last run before the deployment and the first run after.

Unfortunately the body doesn't seem to be available in WPT runs :( When I try to open it, the dialog states:

Turn on the "Save Response Bodies" option in the advanced settings to capture text resources.

It would have been nice to be able to diff the HTML to see what made it suddenly grow by 2kb. Maybe webpagereplay kept a copy of that?

The behaviour of known-empty modules not being skipped in some cases was an intentional side-effect of https://gerrit.wikimedia.org/r/530119 which also rolled out with the 1.34.0-wmf.20 branch.

However, I didn't think about the use case of user-controlled non-personal cross-wiki modules, aka ext.globalCssJs.site. It exposes a flaw in an assumption I made, which is that modules that aren't embeddable and not otherwise user-specific, would be fine to request regardless.

The idea was that for modules like site and gadgets, omitting the name from the RLPAGEMODULES list in the article HTML isn't useful because it wouldn't save an http request (they go in the same batch as anything else) and it would lead to corruption when the module starts being used by the wiki. That is, you'd have a situation where cached pages skip the module and falsely mark it as "successfully loaded" - which then breaks all sorts of guarantees.

The not-skipping of such modules for site etc isn't new. But ext.globalCssJs.site is a corner case that last week's commit unintentionally shifted from skipped to not skipped. And the reason thats an issue is because unlike site it isn't part of the main batch request. Logically it is totally part of the main batch, but internally it gets split into its own request because it comes from a different domain name (no way to batch them from the client-side perspective).

Will need to think about this for a bit.. No obvious solution without re-introducing the DB overhead we fixed. Afaik there is a way to disable this module, which I thought we did in prod. Maybe that's a way out of this.

Change 534532 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/GlobalCssJs@master] Hooks: Restructure onBeforePageDisplay() to avoid early return

https://gerrit.wikimedia.org/r/534532

Change 534533 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/GlobalCssJs@master] Hooks: Only register/load ext.globalCssJs.site for UseGlobalSiteCssJs

https://gerrit.wikimedia.org/r/534533

Looking at the webpagereplay host, we don't see to keep any data there besides the most recent run.

The internet archive has snapshots of the Facebook article before and after the change:

https://web.archive.org/web/20190902090151/https://en.wikipedia.org/wiki/Facebook
https://web.archive.org/web/20190904213433/https://en.wikipedia.org/wiki/Facebook

Unfortunately the wayback machine modifies the HTML heavily, so it's unclear that the changes would be visible.

I tried to diff the HTML of these 2 snapshots and the only difference I got was the different URL path for the 2 snapshots in subresources. Whatever caused the increase in compressed body size cannot be found with the internet archive's copies.

I'm out of ideas on how to get the HTML diff. My only hope is that Peter's private set up saved that data...

After combing through the change commit titles in 1.34.0-wmf.20, I don't see anything that could explain a significant HTML size increase.

There's also the possibility - although there is nothing related in the SAL - that the compressed size changed due to a gzip/transport change, but that the underlying HTML hasn't.

For getting the HTML content: We only have it turned on for WebPageTest using Chrome (I don't think it work for Firefox, or at least it didn't when we enabled it for Chrome). I'll dig into it see what I can find. For WebPageReplay runs we should keep em for two weeks if I remember correctly, so I'll check that too.

For the new setup, let me enable so we keep HTML for both Chrome & Firefox to make it easier to find things, our S3 cost is low so it shouldn't be a problem.

I briefly looked at the graph and things have changed, the Obama page renders faster now:

Grafana - WebPageTest drilldown.jpg (630×1 px, 69 KB)

The facebook page is also back but it took longer time:

Grafana - WebPageTest drilldown (1).jpg (644×1 px, 74 KB)

But Sweden still has the same, what's up with that:

Grafana - WebPageTest drilldown (2).jpg (640×1 px, 48 KB)

This should be a run before the change: http://wpt.wmftest.org/result/190903_YE_34/
And this after: http://wpt.wmftest.org/result/190903_9Q_6W/1/details/#waterfall_view_step1

But this is for Chrome, but lets use it to compare the HTML. Getting the HTML, format it in Code and then using the compare functionality between two files worked fine in Code: We got a lot of errors like this:

<span class="cs1-visible-error error citation-comment">Cite news requires
                     <code class="cs1-code">&#124;newspaper=</code> (<a
                       href="/wiki/Help:CS1_errors#missing_periodical"
                       title="Help:CS1 errors"
                       >help</a
                     >)</span

We got 311 span class="cs1-visible-error"

Yes, I noticed that error coming up a lot when inspecting the current HTML manually. It seems to be coming from a template: https://en.wikipedia.org/wiki/Help:CS1_errors#missing_periodical

https://en.wikipedia.org/w/index.php?title=Module:Citation/CS1&action=history

So it seems like this great increase of HTML size was introduced by this template change which rolled out a first time at 11:06, 3 September 2019‎, then again at 13:09, 3 September 2019‎

In other words, this makes it unrelated to the deployment.

I will try leaving on-wiki comments explaining that the cost of that change.

Left a note on the responsible editor's page https://en.wikipedia.org/wiki/User_talk:Trappist_the_monk#Significant_HTML_size_increase_due_to_September_3_Cite_template_change

What might have happened is that editors were fast to edit the Barack Obama and Facebook articles to do whatever makes these errors go away, but not on the Sweden one. I'll double check that it's the case on the current versions of the articles.

So yes, the bulk of the errors on the Sweden article are cs1-hidden-error, which are not shown to readers. There's still 300+ of them.

On the current version of the Obama article, there are no cs1-visible-error spans anymore. Meaning that basically people cleaned those up quickly. In fact it seems to have been done in a single edit: https://en.wikipedia.org/w/index.php?title=Barack_Obama&type=revision&diff=913889260&oldid=913887531&diffmode=source

The timing of which (21:07, 3 September 2019) corresponds to the drop seen in your graph.

The Facebook article, however, still contains 3 visible and 103 hidden errors, and has no recent edits addressing them. As its "back to normal" status is more recent, we shall see how it goes in the next couple of days.

I suspect that the hidden errors will take a lot longer to be addressed, and that an article like Sweden might be affected for a while.

Looking at the Barack Obama changes, though, it seems like this is something a bot could fix everywhere... maybe it's planned? We'll see what Trappist responds.

Peter removed Peter as the assignee of this task.Sep 6 2019, 9:20 AM
Gilles claimed this task.

The editors replied, this gets cleaned up by "bots and gnomes". These spikes happen regularly upon changes to those templates, this one was just bigger than usual. I'm closing this, as this is a very likely explanation, based on the rise and drops corresponding exactly for the Barack Obama article.

Change 534532 merged by jenkins-bot:
[mediawiki/extensions/GlobalCssJs@master] Hooks: Restructure onBeforePageDisplay() to avoid early return

https://gerrit.wikimedia.org/r/534532

Change 534533 merged by jenkins-bot:
[mediawiki/extensions/GlobalCssJs@master] Hooks: Only register/load ext.globalCssJs.site for UseGlobalSiteCssJs

https://gerrit.wikimedia.org/r/534533

@Gilles what do you think about creating an upstream bug in Firefox for this? The change only affected Firefox and not Chrome and that so its somehow interesting? I can create the issue.

Sure, why not. You can link to historic revisions of the Obama article, when it got the extra errors due to the template update, and when an editor fixed them.