
Desktop performance graph jumped significantly
Closed, Resolved, Public

Description

Starting 3/30/22, the desktop performance Fully loaded graph jumped from an average of 2.7 seconds to 3.5 seconds on production:
https://grafana.wikimedia.org/d/000000205/reading-web-performance?orgId=1&from=now-30d&to=now&viewPanel=90

Screen Shot 2022-04-06 at 11.12.58 AM.png (900×1 px, 207 KB)

This is also reflected on the Beta Cluster, where the jump is more dramatic (roughly tripling):
https://grafana.wikimedia.org/d/000000205/reading-web-performance?orgId=1&from=now-30d&to=now&viewPanel=108

Screen Shot 2022-04-06 at 11.35.43 AM.png (820×1 px, 178 KB)

As far as we know, there are no banners running at the moment.

Event Timeline

I recommend using the Navigation Timing and WebPageTest dashboards.

Use them to determine, for example, whether real users appear impacted, whether the issue is region- or browser-specific, whether the more controlled WebPageReplay environment observes it, and whether it correlates with other pages or wikis.
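For a quick before/after read on a series outside of Grafana, you can also query the Graphite render API directly and compare the two periods. This is a minimal sketch assuming the graphite.wikimedia.org render endpoint; the metric path is assembled from the dashboard variables linked later in this task and is only a guess at the real series name.

```python
# Sketch: compare a Graphite series before and after the jump on 2022-03-30.
# The metric path below is a placeholder assembled from the drill-down
# dashboard variables; substitute the series that backs the panel you are
# investigating (RUM Navigation Timing or synthetic WebPageTest).
import json
import statistics
import urllib.parse
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"
TARGET = (  # hypothetical metric path
    "sitespeed_io.webpagetest.en_wikipedia_org._wiki_Barack_Obama"
    ".cable.firstView.fullyLoaded.median"
)

def fetch(target, start, end):
    """Return the non-null values of one Graphite target for a date range."""
    query = urllib.parse.urlencode(
        {"target": target, "from": start, "until": end, "format": "json"}
    )
    with urllib.request.urlopen(f"{GRAPHITE}?{query}") as resp:
        datapoints = json.load(resp)[0]["datapoints"]  # [[value, timestamp], ...]
    return [value for value, _ts in datapoints if value is not None]

before = fetch(TARGET, "20220320", "20220329")
after = fetch(TARGET, "20220331", "20220406")
print("median before:", statistics.median(before))
print("median after: ", statistics.median(after))
```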

As for probable causes:

  • There are no annotations in the screenshot for deploys to synthetic monitoring, suggesting it isn't due to a change in how we measure things.
  • The Beta Cluster graph spiked at the same time, which means it likely wasn't a change to MediaWiki core or an extension, as those ride the train some days later, unless the change was backported or made via configuration.

The SAL (wikitech, tool) can help identify a shortlist of probable causes among the deployments around that time. These are also shown in the screenshot (the vertical lines), though SAL contains more detail about what each Scap event deployed.

Hi @cjming, that seems to correlate with when Chrome 100 rolled out on WebPageTest. I found that by using the drill-down dashboard for WebPageTest:
https://grafana.wikimedia.org/d/000000057/webpagetest-drilldown?orgId=1&from=1648564576262&to=1648596853128&var-base=sitespeed_io&var-path=webpagetest&var-group=en_wikipedia_org&var-page=_wiki_Barack_Obama&var-browser=&var-location=us-east-chrome&var-connectivity=cable&var-view=firstView and zooming into the time when the change first appeared. Then I clicked "Show each test" and waited a second. The metadata for each test run then loads and you will see green vertical lines; if you hover over them, you will see a screenshot and the browser version (at the top right). The tests before that one used Chrome 99:

chrome-100.jpg (1×1 px, 216 KB)

I also found other things with Chrome 100: the Largest Contentful Paint was affected on WebPageTest, and I could also see that the response end metric was slower; see T305122. We also have other tools, but I couldn't find the same Largest Contentful Paint change there; I will have a look first thing tomorrow at fully loaded etc. too. The response end metric should also be visible in our RUM data if it has increased, so I will look there too.

Peter added a project: WebPageTest.

I found the root cause by looking at the waterfall graph (the waterfall shows how requests and responses are handled):

Screenshot 2022-04-07 at 16.23.11.png (1×2 px, 257 KB)

If you look at the last request, it picks up a Chrome "feature" where Chrome calls home. For the other tools (the ones I created) I disabled that call; you can do that either by adding an extra Chrome flag or by routing that domain to localhost. I'll look at that tomorrow for WebPageTest. I do wonder what information they collect and why; I've seen people ask on Twitter but no answer.
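For reference, here is a rough Python sketch of those two options (an extra Chrome flag and routing the domain to localhost) when launching Chrome by hand. The flag is the one mentioned later in this task; the googleapis hostname is my assumption about which endpoint the call-home request hits, so verify it against the waterfall before relying on it.

```python
# Sketch: launch Chrome with the call-home fetch disabled, and additionally
# resolve the suspected endpoint to localhost so the request never leaves
# the machine. The hostname below is an assumption, not confirmed here.
import subprocess

CHROME = "google-chrome"  # or "chromium", depending on the host
SUSPECT_DOMAIN = "optimizationguide-pa.googleapis.com"  # assumed call-home endpoint

subprocess.run([
    CHROME,
    "--disable-fetching-hints-at-navigation-start",            # flag named later in this task
    f"--host-resolver-rules=MAP {SUSPECT_DOMAIN} 127.0.0.1",   # route the domain to localhost
    "https://en.wikipedia.org/wiki/Barack_Obama",
])
```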

Also, if you @cjming or someone else on your team have time for a run-through of our tools, I can do a short demo of what I first look for when a metric seems wrong/broken/regressed.

Let's keep this task open until I've pushed the change and we can see that the metric goes back.

I recommend using the Navigation Timing and WebPageTest dashboards to determine whether real users appear impacted, whether the issue is region- or browser-specific, whether the more controlled WebPageReplay environment observes it, and whether it correlates with other pages or wikis.

The SAL (wikitech, tool) can help identify a shortlist of probable causes among the deployments around that time. These are also shown in the screenshot (the vertical lines), though SAL contains more detail about what each Scap event deployed.

thanks @Krinkle -- I'll make a note of checking those dashboards + SAL in the future against spikes on our Grafana boards

Also, if you @cjming or someone else on your team have time for a run-through of our tools, I can do a short demo of what I first look for when a metric seems wrong/broken/regressed.

hi @Peter - that would be fantastic if you don't mind (anytime at your convenience) - I'll let my team know too as I'm sure some would be interested as well

I pushed the change now. The Chrome flag I added was --disable-fetching-hints-at-navigation-start, and I also blocked that domain to be 100% sure. I'll have a look later tonight and check whether it fixed the problem.
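For comparison, here is a hedged sketch of what an equivalent mitigation could look like when submitting a test through the WebPageTest HTTP API; the cmdline and block parameters reflect my reading of that API, and the instance URL, API key, and blocked domain below are placeholders rather than the actual agent configuration that was changed.

```python
# Sketch: submit a WebPageTest run with an extra Chrome flag and a blocked
# domain. Endpoint, API key, and domain are placeholders; adjust to the
# actual instance and verify the parameter names against its API docs.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "https://en.wikipedia.org/wiki/Barack_Obama",
    "f": "json",
    "k": "API_KEY",                                    # placeholder
    "cmdline": "--disable-fetching-hints-at-navigation-start",
    "block": "optimizationguide-pa.googleapis.com",    # assumed call-home domain
})
with urllib.request.urlopen(f"https://wpt.example.org/runtest.php?{params}") as resp:
    print(resp.read().decode())
```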

@cjming I'll get back on email about when we can have the session!

That fixed the problem:

blocking.jpg (1×2 px, 248 KB)

I'm going to create an upstream task for Chrome and ask about the behaviour. I understand that they beacon back user data, but in some cases it seems to affect metrics, and that seems wrong.