Collect Layout Instability API data
Closed, Resolved (Public)

Description

Since the origin trial ended and the final implementation shipped, the shape of the API has changed. The collection code needs to be updated.

Event Timeline

Change 496684 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Collect Layout Stability API jank scores

https://gerrit.wikimedia.org/r/496684

Change 496684 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Collect Layout Stability API jank scores

https://gerrit.wikimedia.org/r/496684

Change 499152 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Element Timing for Images and Layout Stability on ruwiki

https://gerrit.wikimedia.org/r/499152

Change 499775 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Extend Layout Stability API origin trial

https://gerrit.wikimedia.org/r/499775

Change 499775 merged by jenkins-bot:
[mediawiki/vagrant@master] Extend Layout Stability API origin trial

https://gerrit.wikimedia.org/r/499775

Change 499152 merged by jenkins-bot:
[operations/mediawiki-config@master] Element Timing for Images and Layout Stability on ruwiki

https://gerrit.wikimedia.org/r/499152

Mentioned in SAL (#wikimedia-operations) [2019-03-29T07:01:07Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216598 T216594 Element Timing for Images and Layout Stability on ruwiki (duration: 00m 51s)

The first LayoutJank EL events have made it to the schema.

Looking at some data, what we're really missing here is some context. We're getting LayoutJank, which might influence user perception, but we don't know which page it's coming from. We should record the articleid in the LayoutJank schema, IMHO.

Now, I noticed something quite peculiar, which is the over-representation of action=edit values, which suggests jank happening when editing articles.

While the general NavigationTiming ratio of action=edit is 0.3%:

SELECT event.action, COUNT(*) AS count FROM event.navigationtiming WHERE year = 2019 AND month = 3 GROUP BY event.action;

Total: 20316641

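| action | count | percentage |
| --- | --- | --- |
| NULL | 232357 | 1.14 |
| history | 11185 | 0.05 |
| unprotect | 2 | 0 |
| watch | 48 | 0 |
| single-view | 1 | 0 |
| purge | 171 | 0 |
| nosuchaction | 6 | 0 |
| mcrrestore | 1 | 0 |
| revert | 3 | 0 |
| delete | 345 | 0 |
| rollback | 265 | 0 |
| markpatrolled | 3 | 0 |
| view | 19999906 | 98.4 |
| protect | 21 | 0 |
| submit | 8618 | 0.04 |
| edit | 63267 | 0.3 |
| unwatch | 1 | 0 |
| info | 441 | 0 |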

In the LayoutJank schema, it's 1.2%:

SELECT nt.event.action, COUNT(*) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 GROUP BY nt.event.action;

Total: 403616

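| action | count | percentage |
| --- | --- | --- |
| NULL | 1897 | 0.47 |
| delete | 8 | 0 |
| edit | 4956 | 1.2 |
| history | 174 | 0.04 |
| info | 29 | 0 |
| rollback | 3 | 0 |
| submit | 245 | 0.06 |
| view | 396304 | 98.19 |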

This suggests that the edit action is quite prone to getting layout instability.
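
To put a number on that over-representation, here is a minimal Python sketch that recomputes both shares from the counts above. The variable names are mine; keep in mind that the LayoutJank figures count joined events rather than distinct pageviews, so pageviews with many events weigh more.

```python
# Counts copied from the two result sets above.
nav_total, nav_edit = 20316641, 63267   # NavigationTiming events, March 2019
jank_total, jank_edit = 403616, 4956    # LayoutJank events joined to NavigationTiming


def share_pct(part, total):
    """Return part / total as a percentage."""
    return 100.0 * part / total


nav_edit_pct = share_pct(nav_edit, nav_total)
jank_edit_pct = share_pct(jank_edit, jank_total)
over_representation = jank_edit_pct / nav_edit_pct

print(f"edit share in NavigationTiming: {nav_edit_pct:.2f}%")
print(f"edit share in LayoutJank:       {jank_edit_pct:.2f}%")
print(f"over-representation factor:     {over_representation:.1f}x")
```

In other words, action=edit shows up roughly four times more often among LayoutJank events than in NavigationTiming overall.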

While we don't have articleid, we have transferSize, which is a good proxy to knowing how large an article is. Let's find out if the larger the article, the more layout instability we get:

CREATE TABLE summed_layoutjank AS SELECT event.pageviewtoken, SUM(event.fraction) AS total_fraction FROM event.layoutjank WHERE year = 2019 GROUP BY event.pageviewtoken;

SELECT CORR(slj.total_fraction, nt.event.transfersize) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.navigationtiming nt ON slj.pageviewtoken = nt.event.pageviewtoken WHERE nt.year = 2019;
| correlation | count |
| --- | --- |
| 0.013033870613197522 | 125548 |

This is pretty low. Article size can't explain layout instability alone. Is it more correlated when we look only at the edit action?
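
For reference, Hive's CORR is the Pearson correlation coefficient. Below is a minimal Python sketch of that statistic; the sample pairs are made up for illustration and are not taken from our data. Squaring the coefficient reported above (about 0.013) gives roughly 0.00017, so transferSize would linearly account for well under 0.1% of the variance in total fraction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (total_fraction, transfersize) pairs, for illustration only.
total_fraction = [0.10, 0.40, 0.05, 0.30, 0.20]
transfer_size = [20000, 21000, 90000, 15000, 60000]
r = pearson(total_fraction, transfer_size)

# Share of variance a linear relationship would explain for the
# coefficient reported by the query above.
explained = 0.013033870613197522 ** 2
print(f"sample r = {r:.3f}, explained variance for r=0.013: {explained:.5f}")
```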

SELECT CORR(slj.total_fraction, nt.event.transfersize) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.navigationtiming nt ON slj.pageviewtoken = nt.event.pageviewtoken WHERE nt.year = 2019 AND nt.event.action = 'edit';
| correlation | count |
| --- | --- |
| 0.07568789058676258 | 1027 |

A bit better but still very low. If it's VE, though, I don't know if the article data is part of transferSize at all.

Finally, how does the total fraction correlate to survey responses? (only looking at pageviews where there was layout instability)

SELECT CORR(slj.total_fraction,CASE qr.event.surveyresponsevalue WHEN 'ext-quicksurveys-example-internal-survey-answer-positive' THEN -1 ELSE 1 END) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.quicksurveysresponses qr ON slj.pageviewToken = qr.event.pageviewToken WHERE qr.year = 2019 AND qr.event.surveyresponsevalue IN ('ext-quicksurveys-example-internal-survey-answer-positive', 'ext-quicksurveys-example-internal-survey-answer-negative');
correlationcount
-0.0148769449103818534566

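A low correlation, and a negative one. Note that with this encoding (positive answer = -1, any other answer = 1), a negative coefficient means that a higher total instability fraction goes with slightly more positive answers, which is the opposite of what should be expected (the higher the total instability fraction, the less likely people would be happy about the performance of the page). At this magnitude it's effectively no correlation either way.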

It turns out that VE sessions (including the visual source editor) are counted as "view", which is a problem, IMHO, because NavigationTiming looks at the action GET parameter but not at veaction. This is particularly problematic on ruwiki, where VE seems to be the default (at least for new editors).
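
To illustrate the distinction, here is a minimal Python sketch of how a pageview URL could be classified if veaction were taken into account. This is not how NavigationTiming (which is client-side JavaScript) does it today; the function name and the "ve-" prefix are hypothetical, and giving veaction precedence over action is an assumption.

```python
from urllib.parse import urlparse, parse_qs


def effective_action(url):
    """Return the action to record for a pageview URL.

    Assumption: a veaction parameter (e.g. "edit" or "editsource") takes
    precedence over action, and a URL with neither counts as "view".
    The "ve-" prefix is a made-up label, not an existing schema value.
    """
    params = parse_qs(urlparse(url).query)
    if "veaction" in params:
        return "ve-" + params["veaction"][0]
    return params.get("action", ["view"])[0]


print(effective_action("https://ru.wikipedia.org/wiki/Foo?veaction=edit"))
print(effective_action("https://ru.wikipedia.org/w/index.php?title=Foo&action=edit"))
print(effective_action("https://ru.wikipedia.org/wiki/Foo"))
```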

Actually we do have the revid in NavigationTiming, so I can tell which pages are affected (but for now, won't be able to tell if they were VE or not). Let's see if some pages are more affected than others:

SELECT nt.event.revid AS revid, COUNT(DISTINCT(lj.event.pageviewtoken)) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 GROUP BY nt.event.revid ORDER BY count DESC LIMIT 10;

The first one is a very plain-looking article. Looking at the history, though, it had 11 edits yesterday. The second one, while a bit more complex, has also had a lot of edits in recent days.

Let's see how many of the layoutjank occurrences for that top revid were from logged-in users:

SELECT nt.event.isanon AS isanon, COUNT(DISTINCT(lj.event.pageviewtoken)) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 AND nt.event.revid = 98369394 GROUP BY nt.event.isanon;
| isanon | count |
| --- | --- |
| false | 2 |
| true | 162 |

This suggests that most of the layout instability was coming from anonymous users. Could it simply be that those articles saw a lot of edits because they were getting a lot of traffic around that time? Let's see if those revisionids show up in the top 10 seen in navtiming over that period:

SELECT event.revid AS revid, COUNT(*) AS count FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND dt >= '2019-04-02T21:42:22Z' AND dt <= '2019-04-03T09:25:30Z' GROUP BY event.revid ORDER BY count DESC LIMIT 10;
| revid | count |
| --- | --- |
| NULL | 3623 |
| 0 | 1561 |
| 112969012 | 753 |
| 98984576 | 563 |
| 889268954 | 529 |
| 98369394 | 289 |
| 115001713 | 255 |
| 114962564 | 210 |
| 890634189 | 138 |
| 170599184 | 115 |

While that person doesn't have an article in Wikipedias other than Russian, I see that she was the subject of Google's Doodle yesterday in Russia: https://www.google.com/doodles/sofia-mogilevskayas-116th-birthday which certainly explains the sudden spike of interest in that particular article.

In short, we're getting a lot of layout instability events for that revisionid because it's the 4th most visited article/revision over that period. 164 distinct pageviews out of 289 over that period with layout instability sounds like a lot. Let's see what ratio of Chrome 73 navtiming pageviews are getting layout instability events on a given day:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73'  AND wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

Which gives us 34072 / 37005 = 92% of pageviews that are capable of recording layout instability getting at least one such event! That sounds like a lot, particularly when you see an article as simple as that biography getting so many of them.

I've just noticed that there's an ongoing CentralNotice campaign served to me when I browse ruwiki:

[Screenshot: Capture d'écran 2019-04-04 13.38.43.png (117×941 px, 13 KB)]

It's in English, though, and might not target users in Russia. Let's find out if CentralNotice appearance is correlated to layout instability:

SELECT COUNT(DISTINCT(cn.event.pageviewtoken)), MIN(lj.dt), MAX(lj.dt) FROM event.centralnoticetiming cn JOIN event.layoutjank lj ON cn.event.pageviewtoken = lj.event.pageviewtoken WHERE cn.year = 2019 AND lj.year = 2019 AND cn.month = 4 AND lj.month = 4 AND cn.day = 2 AND lj.day = 2 AND cn.useragent.browser_family = 'Chrome' AND cn.useragent.browser_major = '73' AND cn.wiki = 'ruwiki' AND lj.wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.centralnoticetiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

When a CentralNotice banner is displayed on a given pageview, layout instability is experienced 7787 / 8432 = 92% of the time. Same ratio as before.

We're still left with a lot of pageviews without CentralNotice banners that get layout instability:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

34072 total pageviews with layout jank - 7787 = 26285 pageviews with layout jank and no CN banner. 37005 pageviews that might have had layout jank - 8432 that showed the banner = 28573 total CN-free pageviews. This works out to 26285 / 28573 = 92% of banner-free pageviews getting layout instability.
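
The same arithmetic as a small Python sketch, using the counts returned by the queries above. The variable names are mine. The subtraction assumes that every banner pageview is also part of the NavigationTiming set, which the queries don't strictly guarantee.

```python
# Distinct pageview counts for ruwiki, Chrome 73, 2019-04-02.
capable = 37005          # NavigationTiming pageviews able to report jank
unstable = 34072         # pageviews with at least one LayoutJank event
banner = 8432            # pageviews that displayed a CentralNotice banner
banner_unstable = 7787   # banner pageviews that were also unstable


def pct(part, total):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / total)


overall_rate = pct(unstable, capable)
banner_rate = pct(banner_unstable, banner)
# Assumption: banner pageviews are a subset of the capable pageviews.
no_banner_rate = pct(unstable - banner_unstable, capable - banner)

print(overall_rate, banner_rate, no_banner_rate)
```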

With the baseline being 92%, it doesn't seem like CentralNotice makes things worse. Or maybe it does in terms of how many layout instability events occurred?

Let's find out:

SELECT COUNT(*) AS count, COUNT(DISTINCT(event.pageviewtoken)) AS uniques FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

We have 109123 / 34072 = 3.2 LayoutJank events per unstable pageview in general.

Looking only at unstable pageviews that got CN banners:

SELECT COUNT(*) AS count, COUNT(DISTINCT(lj.event.pageviewtoken)) AS uniques FROM event.layoutjank lj JOIN event.centralnoticetiming cn ON lj.event.pageviewtoken = cn.event.pageviewtoken WHERE lj.year = 2019 AND cn.year = 2019 AND lj.month = 4 AND cn.month = 4 AND lj.day = 2 AND cn.day = 2 AND cn.useragent.browser_family = 'Chrome' AND cn.useragent.browser_major = '73' AND lj.wiki = 'ruwiki' AND cn.wiki = 'ruwiki';

We get 30744 / 7787 = 3.9 LayoutJank events per unstable pageview that displayed a CentralNotice banner.

It also means that unstable pageviews that didn't get a banner had (109123 - 30744) / (34072 - 7787) = 78379 / 26285 ≈ 2.98 LayoutJank events on average.

At last, something that makes sense! When the pageview is unstable - which is the case 92% of the time... - the CentralNotice banner contributes roughly one extra LayoutJank event on average :)
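
For later reference, the per-pageview averages can be recomputed with this short Python sketch. The counts come from the two queries above; the variable names are mine.

```python
# Event and distinct-pageview counts for ruwiki, Chrome 73, 2019-04-02.
events_all, unstable_all = 109123, 34072        # all unstable pageviews
events_banner, unstable_banner = 30744, 7787    # unstable pageviews with a CN banner

per_pv_all = events_all / unstable_all
per_pv_banner = events_banner / unstable_banner
per_pv_no_banner = (events_all - events_banner) / (unstable_all - unstable_banner)

# Extra LayoutJank events attributable to the banner, on average.
extra_from_banner = per_pv_banner - per_pv_no_banner

print(f"all: {per_pv_all:.2f}, banner: {per_pv_banner:.2f}, "
      f"no banner: {per_pv_no_banner:.2f}, extra: {extra_from_banner:.2f}")
```

The difference between the banner and banner-free averages comes out just under one event per pageview.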

I've just figured out that the Layout Instability API origin trial token only started getting served on 2019-03-28. Which means that the missing 8% are probably visits to articles that haven't had their cache updated yet. Which I assume probably means that close to 100% of our pageviews get layout instability events...

Let's verify this theory by looking at the ratio of pageviews getting events on 2019-04-03 (instead of 2019-04-02 previously): for that next day the ratio becomes 93.7%

It's, as one would expect, increasing, as more and more pages on ruwiki start serving the origin trial token.

Mentioned in SAL (#wikimedia-operations) [2019-04-05T04:49:01Z] <gilles> T216594 Start purge of namespace 0 on ruwiki

Given the lack of ability to introspect origin trials and determine if a given pageview is in the experiment or not, I'm purging articles on ruwiki to ensure that they all serve the up-to-date origin trial tokens. Once that's done, we will be able to verify the true extent of layout instability on the ruwiki traffic.

With the purging and the time that has gone by, we should be at a point where 100% of desktop ruwiki pageviews are getting the origin trial. It looks like 99.5% of our desktop pageviews get layout jank:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 23 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

36168

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 23 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki' AND event.mobileMode IS NULL;

36366

And there's still the possibility that the 0.5% remaining are clients that are lying about their UA, for example.

Let's look at the distribution by amount of layout jank events per pageview, for pageviews that get jank.

On ruwiki (desktop):

| layoutjank events | 1 | 2 | 3 | 4 | 5 | > 5 |
| --- | --- | --- | --- | --- | --- | --- |
| % of pageviews | 9.96 | 53.15 | 16.06 | 9.76 | 3.29 | 7.77 |

On mobile, surprisingly, the situation is a lot better, with "only" 76.5% of pageviews getting layout jank events at all. And within those that do:

On eswiki (mobile):

| layoutjank events | 1 | 2 | 3 | 4 | 5 | > 5 |
| --- | --- | --- | --- | --- | --- | --- |
| % of pageviews | 34.18 | 33.01 | 21.94 | 6.91 | 18.66 | 2.08 |
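
A minimal Python sketch of how such a distribution can be derived from raw events, assuming the input is simply one pageview token per LayoutJank event. The function name and the sample tokens are made up for illustration.

```python
from collections import Counter


def jank_distribution(pageview_tokens):
    """Given one pageview token per LayoutJank event, return the percentage
    of unstable pageviews that had 1, 2, 3, 4, 5 or more than 5 events."""
    events_per_pageview = Counter(pageview_tokens)
    buckets = Counter()
    for n in events_per_pageview.values():
        buckets[str(n) if n <= 5 else "> 5"] += 1
    total = len(events_per_pageview)
    labels = ["1", "2", "3", "4", "5", "> 5"]
    return {label: round(100 * buckets[label] / total, 2) for label in labels}


# Made-up tokens: "a" and "d" have 1 event each, "b" has 2, "c" has 7.
sample = ["a", "b", "b"] + ["c"] * 7 + ["d"]
dist = jank_distribution(sample)
print(dist)
```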

Change 507545 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Renew origin trial tokens for ruwiki

https://gerrit.wikimedia.org/r/507545

Change 507545 merged by jenkins-bot:
[operations/mediawiki-config@master] Renew origin trial tokens for ruwiki

https://gerrit.wikimedia.org/r/507545

Mentioned in SAL (#wikimedia-operations) [2019-05-01T11:22:05Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216499 T216598 T216594 Renew origin trial tokens for ruwiki (duration: 01m 14s)

While investigating worst offenders in terms of summed fraction, I discovered 2 bugs (or at least very big shortcomings) of the API, filed upstream:

As it stands, the bug experienced on the mobile site is creating so much noise that it's pretty much useless to look at that data. Which is probably why all the top offenders in terms of summed fraction for a pageview were from the mobile site. I'll have to redo the investigation of worst offenders while looking only at the desktop site.

I've discovered yet another bug/big shortcoming, this time on desktop. The Multimedia Viewer bottom panel scroll animation is another source of a ton of small LayoutJank events. I've filed another bug about that: https://bugs.chromium.org/p/chromium/issues/detail?id=958828

I've also narrowed down the pattern that was triggering the previous bug on desktop. It seems to happen whenever there are hoverable links inside a multi-column <ol> element:

In terms of top summed fraction desktop offenders, https://ru.wikipedia.org/wiki/Модуль_Юнга remains a mystery: it doesn't have any image that can be opened with Multimedia Viewer, nor a multi-column list of references.

I've just discovered, however, that merely resizing the page results in a ton of LayoutJank events... https://bugs.chromium.org/p/chromium/issues/detail?id=958832 which could probably explain that last mysterious one.

Change 509630 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/509630

Change 509630 merged by jenkins-bot:
[mediawiki/vagrant@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/509630

Gilles raised the priority of this task from Medium to High (May 27 2019, 5:03 AM).
Gilles changed the task status from Open to Stalled (Jun 18 2019, 9:22 AM).

The upstream bugfixes have been committed:

https://bugs.chromium.org/p/chromium/issues/detail?id=958795#c_ts1556814267
https://bugs.chromium.org/p/chromium/issues/detail?id=958832#c4

But until these make their way to Chrome stable, the current origin trial data is useless for us; it's all noise.

Whether we'll be able to revisit this largely depends on the fix for the second bug (events occurring on resize) being backported to 76, which is being discussed right now.

The main issue is fixed in 76, which is part of the origin trial and will soon be stable: https://bugs.chromium.org/p/chromium/issues/detail?id=958795#c9

The other issue I reported (media viewer) needs to be investigated further on our end to verify whether or not this is really a layout change, and if it is, how we can avoid it: https://bugs.chromium.org/p/chromium/issues/detail?id=958828#c4

Also, the shape of the exposed data might have changed since I set up the origin trial; this needs to be double-checked.

Change 527837 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/527837

Change 527837 merged by jenkins-bot:
[operations/mediawiki-config@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/527837

Mentioned in SAL (#wikimedia-operations) [2019-08-03T09:35:58Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216499 T216594 Renew origin trial tokens (duration: 00m 48s)

Gilles changed the task status from Stalled to Open (Aug 20 2019, 3:44 PM).
Gilles lowered the priority of this task from High to Low.
Gilles raised the priority of this task from Low to Medium.

The name of the entry type changed in 76, which probably means we stopped collecting data. And it's going to change again in the final implementation...

Nevertheless, I was able to verify on ruwiki that https://bugs.chromium.org/p/chromium/issues/detail?id=958795 is indeed fixed, and we won't get layout-shift reports on ul/li elements anymore.

As for https://bugs.chromium.org/p/chromium/issues/detail?id=958828 after re-reading the spec, I agree that this is a legitimate report. The CSS animation that triggers the layout-shift events affects the width of an element, therefore (deliberately) affecting the layout. We will simply have to filter out or ignore those reports that come from the Media Viewer drawer.

There isn't a lot of time left in the origin trial, but I will try to fix the entryType naming, so we can see what the data looks like now that the biggest bug has been fixed.

Change 531491 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Fix Layout Instability origin trial entryType

https://gerrit.wikimedia.org/r/531491

Change 531491 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Fix Layout Instability origin trial entryType

https://gerrit.wikimedia.org/r/531491

Looking at the data for the first 3 days of September, I see:

32% of ruwiki desktop pageviews have at least 1 layout instability event. 6% have at least 1 instability event with a fraction greater than 0.5 (half the viewport being pushed down).
For eswiki mobile, it's 17% and 4% respectively.
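
These two figures are per-pageview ratios. A minimal Python sketch of that computation follows, assuming a list of all eligible pageview tokens and a list of (token, fraction) events; the function name and the sample data are made up.

```python
def instability_shares(pageviews, events, threshold=0.5):
    """Return (% of pageviews with any instability event,
    % of pageviews with at least one event whose fraction > threshold).

    pageviews: iterable of pageview tokens (the denominator).
    events: iterable of (pageview_token, fraction) pairs.
    Events whose token is not in pageviews are ignored.
    """
    pageviews = set(pageviews)
    events = [(token, fraction) for token, fraction in events if token in pageviews]
    unstable = {token for token, _ in events}
    severe = {token for token, fraction in events if fraction > threshold}
    total = len(pageviews)
    return 100 * len(unstable) / total, 100 * len(severe) / total


# Made-up data: 4 pageviews, 2 of them unstable, 1 with a fraction above 0.5.
sample_pageviews = ["a", "b", "c", "d"]
sample_events = [("a", 0.1), ("a", 0.6), ("b", 0.2), ("z", 0.9)]
shares = instability_shares(sample_pageviews, sample_events)
print(shares)
```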

I'd be curious how that breaks down by action and namespace, particularly for (action=view and ns=0) vs special pages, non-view actions, and non-article spaces (e.g. File/User pages). I suspect it would only be a fairly small percentage, but could also uncover issues with search, recent changes, history, editor, sign up etc.

Only looking at namespace/action combinations that had at least 10 instability events over the course of 3 days (otherwise percentages are a bit meaningless).

Desktop (ruwiki)

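| Namespace | Action | % of pageviews with instability | % of pageviews with instability + fraction > 0.5 |
| --- | --- | --- | --- |
| 0 | view | 23.06 | 2.01 |
| 6 | view | 18.85 | 1.64 |
| 10 | view | 17.72 | 3.8 |
| 0 | edit | 16.5 | 0.3 |
| 14 | view | 15.59 | 1.98 |
| 1 | view | 12.37 | 1.03 |
| 0 | submit | 10.17 | 0 |
| 4 | view | 7.5 | 0.42 |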

Mobile (eswiki)

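| Namespace | Action | % of pageviews with instability | % of pageviews with instability + fraction > 0.5 |
| --- | --- | --- | --- |
| 0 | view | 19.8 | 4.93 |
| 4 | view | 14.62 | 4.03 |
| 104 | view | 12.06 | 3.68 |
| 14 | view | 8.7 | 3.83 |
| 6 | view | 8.66 | 3.56 |
| 2 | view | 8.14 | 1.3 |
| 100 | view | 7.21 | 0.48 |
| 0 | edit | 4.73 | 0.71 |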

@Krinkle does anything stand out?

Examples of queries used, for later reference:

SELECT nt.event.action, nt.event.namespaceId, COUNT(DISTINCT(nt.event.pageviewtoken)) FROM event.layoutjank AS lj JOIN event.navigationtiming AS nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND lj.month = 9 AND lj.day < 4 AND nt.year = 2019 AND nt.month = 9 AND nt.day < 4 AND lj.wiki = 'eswiki' AND nt.event.mobileMode ='stable' AND nt.useragent.browser_family = 'Chrome Mobile' AND nt.useragent.browser_major >= 73 AND nt.useragent.browser_major <= 76 GROUP BY nt.event.action, nt.event.namespaceId;

SELECT nt.event.action, nt.event.namespaceId, COUNT(DISTINCT(nt.event.pageviewtoken)) FROM event.navigationtiming AS nt WHERE nt.year = 2019 AND nt.month = 9 AND nt.day < 4 AND nt.wiki = 'eswiki' AND nt.event.mobileMode ='stable' AND useragent.browser_family = 'Chrome Mobile' AND nt.useragent.browser_major >= 73 AND nt.useragent.browser_major <= 76 GROUP BY nt.event.action, nt.event.namespaceId;

SELECT nt.event.action, nt.event.namespaceId, COUNT(DISTINCT(nt.event.pageviewtoken)) FROM event.layoutjank AS lj JOIN event.navigationtiming AS nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND lj.month = 9 AND lj.day < 4 AND nt.year = 2019 AND nt.month = 9 AND nt.day < 4 AND lj.wiki = 'eswiki' AND nt.event.mobileMode = 'stable' AND nt.useragent.browser_family = 'Chrome Mobile' AND nt.useragent.browser_major >= 73 AND nt.useragent.browser_major <= 76 AND lj.event.fraction > 0.5 GROUP BY nt.event.action, nt.event.namespaceId;
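
The percentages in the tables come from dividing the first (or third) result set by the second, per action/namespace pair. A minimal Python sketch of that step, with made-up counts. The function name is mine, and applying the minimum of 10 to the unstable count is an assumption about how the cut-off was applied.

```python
def instability_by_group(unstable_counts, total_counts, min_count=10):
    """Combine two {(action, namespace): distinct pageviews} result sets into
    (namespace, action, % of pageviews with instability) rows, sorted by
    percentage in descending order. Groups below min_count are skipped."""
    rows = []
    for key, unstable in unstable_counts.items():
        total = total_counts.get(key)
        if not total or unstable < min_count:
            continue
        action, namespace = key
        rows.append((namespace, action, round(100 * unstable / total, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up counts, for illustration only.
unstable_counts = {("view", 0): 200, ("edit", 0): 12, ("view", 2): 3}
total_counts = {("view", 0): 1000, ("edit", 0): 250, ("view", 2): 40}
rows = instability_by_group(unstable_counts, total_counts)
for namespace, action, pct in rows:
    print(namespace, action, pct)
```
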
Gilles renamed this task from Layout Stability API origin trial to Collect Layout Instability API data (Apr 8 2020, 9:13 AM).
Gilles updated the task description.

Change 587495 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Migrate LayoutJank origin trial collection code to layout-shift

https://gerrit.wikimedia.org/r/587495

Change 587495 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Migrate LayoutJank origin trial collection code to layout-shift

https://gerrit.wikimedia.org/r/587495

Change 594135 had a related patch set uploaded (by Gilles; owner: Gilles):
[analytics/refinery@master] LayoutJank schema is deprecated, now LayoutShift

https://gerrit.wikimedia.org/r/594135

Change 594135 merged by Gilles:
[analytics/refinery@master] LayoutJank schema is deprecated, now LayoutShift

https://gerrit.wikimedia.org/r/594135