
Increased first visual change/first paint on group 1/en.wiki desktop both with synthetic and real user measurements
Closed, Resolved · Public · Estimated Story Points: 1

Description

We got a big jump on desktop alerts today on group 1:

Screenshot 2023-06-29 at 10.08.01.png (1×2 px, 203 KB)

Let me look into what happened.

Event Timeline

I could see this both with our WebPageReplay tests and our direct tests. When I compared the HAR files, nothing stands out except that https://ca.wikipedia.org/w/load.php?lang=ca&modules=startup&only=scripts&raw=1&skin=vector-2022 has changed and that we get longer CPU long tasks before first paint with the new changes.

Looking at the data in Prometheus just for group 1 gives me data that is too unstable to actually say anything.

This is also visible in the metrics from our real users, looking at p75 first paint for all users on desktop. Green is the current value, yellow is what it looked like one week back; they usually match pretty well:
https://grafana.wikimedia.org/d/000000038/navigation-timing-by-platform?orgId=1&var-metric=firstPaint&var-platform=desktop&var-users=anonymous&var-percentile=All&from=now-7d&to=now&viewPanel=30
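
For reference, the firstPaint value in these dashboards ultimately comes from the browser's Paint Timing API; a quick way to read it for a single page view in a Chromium-based browser (the RUM instrumentation itself may collect and beacon it differently) is:

```js
// Read the first-paint entry for the current page view (Chromium exposes
// 'first-paint'; other browsers may only expose 'first-contentful-paint').
const paint = performance
  .getEntriesByType('paint')
  .find((entry) => entry.name === 'first-paint');
console.log('firstPaint:', paint ? Math.round(paint.startTime) : 'n/a', 'ms');
```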

firspaint-7days.jpg (1×2 px, 353 KB)

firstpaint-2-days.jpg (1×2 px, 292 KB)

This is also visible when I look at the median:

median.jpg (1×2 px, 296 KB)

The synthetic test alerts also fired on en.wiki when the changes were made there.

It's also affecting Largest Contentful Paint. In this graph we can see the share of traffic that has a Largest Contentful Paint better than 4 seconds (the Google threshold for poor performance).

We can see that about one percent more of the traffic we measure now fails to stay under that limit (i.e. has poor performance according to Google):
https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1&viewPanel=125

4lcp.jpg (1×2 px, 303 KB)

For the 2.5 s "needs improvement" threshold it looks like this (something like a 1.3% drop):

Screenshot 2023-06-30 at 09.05.18.png (1×2 px, 514 KB)

https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1&var-geo_country=All&var-geo_continent=All&var-ua_family=All&var-mw_skin=All&var-mw_group=All&var-mw_context=anonymous_mainspace_view&var-percentile=0.95&var-timespan=6h&viewPanel=124
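
As a rough sketch of how a single page view would be bucketed against those thresholds (2.5 s "good", 4 s "poor"), using the Largest Contentful Paint API; the actual RUM instrumentation may collect and aggregate this differently:

```js
// Observe LCP and classify it against Google's thresholds. The last entry
// reported before the user interacts is the final LCP candidate.
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const lcp = entries[entries.length - 1];
  const seconds = lcp.startTime / 1000;
  const bucket =
    seconds <= 2.5 ? 'good' : seconds <= 4 ? 'needs improvement' : 'poor';
  console.log('LCP:', seconds.toFixed(2), 's ->', bucket);
}).observe({ type: 'largest-contentful-paint', buffered: true });
```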

Attached are two traces that we can drag and drop into the Chrome DevTools Performance tab. I can see that the latest trace, after the regression, touches more elements in the Recalculate Style and Layout phases (the Layout phase is longer), but we don't get any deeper info than that. I wonder if there's a way to configure Chrome to get more info.

Before the change:

Layout-before.png (482×1 px, 101 KB)

After:

Layout-after.png (712×1 px, 200 KB)

Traces:

Peter renamed this task from "Increased first visual change on group 1 desktop" to "Increased first visual change/first paint on group 1/en.wiki desktop both with synthetic and real user measurements". Jun 30 2023, 10:07 AM

The HAR files have the content of the HTML/JS so I can make a diff, or maybe someone else knows what could have caused it: https://www.mediawiki.org/wiki/MediaWiki_1.41/wmf.15

Peter removed Peter as the assignee of this task. Jun 30 2023, 11:16 AM

I did a quick check and couldn't see anything obvious in the diffs. I'll sign off the task for now (I'm on vacation the first three days of next week) but I'll check in and see if I can help in any way.

Did anything significant ship in the parser?

> Did anything significant ship in the parser?

There were a number of patches related to the media work I'm doing (search Arlo on https://www.mediawiki.org/wiki/MediaWiki_1.41/wmf.15) but that stuff isn't enabled on enwiki yet.

Now that it's been a couple of days this is easy to spot in the overall traffic. You can see that in the following graph:
https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1&viewPanel=25

And attaching p50/p75/p95:

Screenshot 2023-07-03 at 19.53.33.png (1×2 px, 764 KB)

Screenshot 2023-07-03 at 19.53.48.png (1×2 px, 718 KB)

Screenshot 2023-07-03 at 19.54.09.png (1×2 px, 699 KB)

It's also pretty cool that we can see this in the CPU longtasks that we are tracking:
https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1&viewPanel=135

What we can see here is roughly a 3% decrease in the share of our users whose CPU long tasks stay under 1 second.

Screenshot 2023-07-03 at 20.32.12.png (724×2 px, 288 KB)
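
For context, this data comes from the browser's Long Tasks API; a minimal sketch of how long tasks can be observed per page view (the actual RUM code may aggregate them differently, e.g. taking the longest task rather than the sum):

```js
// Collect long task (>50 ms) durations for the current page view.
let totalLongTaskTime = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    totalLongTaskTime += entry.duration;
  }
}).observe({ entryTypes: ['longtask'] });

// At load time, report how much main-thread time was blocked by long tasks.
window.addEventListener('load', () => {
  console.log('long task time so far:', Math.round(totalLongTaskTime), 'ms');
});
```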

This also affects loadEventEnd, so we have those alerts firing.

Jdlrobson set the point value for this task to 1.

Moving to sprint for further investigation. We shouldn't spend more than half a day trying to collect as much information about this issue as possible to help us narrow down what the cause might be.

I deployed a config change on June 29 that affects a relevant code path, though I don't see how it would cause this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/933453.

The change should only affect job runner load.

The graph matches the last deployment of MediaWiki 1.41.0-wmf.15 to all wikis on June 29th:

18:16 	<brennen@deploy1002> 	rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.15 refs T340243

If I'm reading the flamegraphs from June 28 and June 30 correctly, it seems like ParserOutputAccess::getParserOutput is a little slower on June 30, though I'm not sure that fully explains everything we're seeing in the charts.


I pulled down an old version of the Estocolm article from May using the Wayback Machine (https://web.archive.org/web/20230519031646/https://ca.wikipedia.org/wiki/Estocolm) and compared it with July; the number of DOM nodes is of the same order of magnitude, so I don't think this relates to the number of DOM nodes.
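
For reference, a quick way to do that comparison from the console on each version of the page (a sketch; the original comparison may have been done differently):

```js
// Count all element nodes in the document.
console.log('DOM nodes:', document.querySelectorAll('*').length);
```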

@Arlolra pointed out an interesting theory: previously the language list was "display: none" and it is now "visibility: hidden", which might explain a significant rendering change on page load. I introduced some CSS to test his hypothesis: https://ca.wikipedia.org/wiki/MediaWiki:Vector-2022.css, but I'm not sure how quickly we'd expect a result here (and there doesn't seem to be any change in the graph so far).
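
To illustrate the cost difference behind the theory: with display: none the dropdown's subtree is skipped during layout entirely, while visibility: hidden keeps all of its roughly 230 items in the layout tree, so they still have to be laid out before first paint. A rough console micro-benchmark (a sketch; the selector is an assumption and may not match the production markup):

```js
// Compare forced layout cost with the two hiding strategies on the language
// dropdown. Reading offsetHeight forces a synchronous style recalc + layout.
const menu = document.querySelector('#p-lang-btn .vector-menu-content'); // assumed selector

function layoutCost(prop, value) {
  menu.style.display = '';
  menu.style.visibility = '';
  menu.style[prop] = value;
  const start = performance.now();
  document.body.offsetHeight; // force layout
  return performance.now() - start;
}

console.log('display: none      ->', layoutCost('display', 'none').toFixed(1), 'ms');
console.log('visibility: hidden ->', layoutCost('visibility', 'hidden').toFixed(1), 'ms');
```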

I expanded the templates of that ca:Estocolm page and then added the resulting wikitext to a local wiki. Then I artificially added some languages in https://github.com/wikimedia/mediawiki/blob/master/includes/skins/Skin.php#L1187 so that the language selector had a similar count of around ~230 items.

Reverting the following patch then clearly accounts for the difference in the number of nodes in the trace at T340715#8980086:
https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/931998/2/resources/skins.vector.styles/components/LanguageDropdown.less

Change 936302 had a related patch set uploaded (by Jdlrobson; author: Jdlrobson):

[mediawiki/skins/Vector@master] Performance: Language dropdown menu should be hidden when ULS is enabled

https://gerrit.wikimedia.org/r/936302

The change in https://ca.wikipedia.org/wiki/MediaWiki:Vector-2022.css has led to the graph restoring itself! So I think we can safely conclude that @Arlolra's theory is correct here (nice!):

Screenshot 2023-07-07 at 8.24.54 AM.png (1×1 px, 221 KB)

The change in https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/931998/ made the language dropdown visibility: hidden, which had implications for layout. The above patch should fix this.

Change 936302 merged by jenkins-bot:

[mediawiki/skins/Vector@master] Performance: Language dropdown menu should be hidden when ULS is enabled

https://gerrit.wikimedia.org/r/936302

(Still need to verify)

The graphs are all looking good to me. @Peter feel free to reopen if anything still looks off.
Graphs over a longer time range, for example:

Screenshot 2023-07-14 at 9.01.59 AM.png (1×2 px, 785 KB)