
Layout Stability API origin trial
Open, Stalled, High · Public

Description

There's an origin trial for the Layout Stability API until September for Chrome 73-76: https://gist.github.com/skobes/2f296da1b0a88cc785a4bf10a42bca07

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 20 2019, 8:41 AM
Gilles triaged this task as Normal priority. · Feb 20 2019, 8:44 AM
Gilles updated the task description. · Mar 14 2019, 10:02 AM

Change 496684 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Collect Layout Stability API jank scores

https://gerrit.wikimedia.org/r/496684

Change 496684 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Collect Layout Stability API jank scores

https://gerrit.wikimedia.org/r/496684

Change 499152 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Element Timing for Images and Layout Stability on ruwiki

https://gerrit.wikimedia.org/r/499152

Change 499775 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Extend Layout Stability API origin trial

https://gerrit.wikimedia.org/r/499775

Change 499775 merged by jenkins-bot:
[mediawiki/vagrant@master] Extend Layout Stability API origin trial

https://gerrit.wikimedia.org/r/499775

Change 499152 merged by jenkins-bot:
[operations/mediawiki-config@master] Element Timing for Images and Layout Stability on ruwiki

https://gerrit.wikimedia.org/r/499152

Mentioned in SAL (#wikimedia-operations) [2019-03-29T07:01:07Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216598 T216594 Element Timing for Images and Layout Stability on ruwiki (duration: 00m 51s)

The first LayoutJank EL events have made it to the schema.

Looking at some data, what we're really missing here is some context. We're getting LayoutJank, which might influence user perception, but we don't know which page it's coming from. We should record the articleid in the LayoutJank schema, IMHO.

Now, I noticed something quite peculiar, which is the over-representation of action=edit values, which suggests jank happening when editing articles.

While the general NavigationTiming ratio of action=edit is 0.3%:

SELECT event.action, COUNT(*) AS count FROM event.navigationtiming WHERE year = 2019 AND month = 3 GROUP BY event.action;

Total: 20316641

action          count      percentage
NULL            232357     1.14
history         11185      0.05
unprotect       2          0
watch           48         0
single-view     1          0
purge           171        0
nosuchaction    6          0
mcrrestore      1          0
revert          3          0
delete          345        0
rollback        265        0
markpatrolled   3          0
view            19999906   98.4
protect         21         0
submit          8618       0.04
edit            63267      0.3
unwatch         1          0
info            441        0

In the LayoutJank schema, it's 1.2%:

SELECT nt.event.action, COUNT(*) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 GROUP BY nt.event.action;

Total: 403616

action     count    percentage
NULL       1897     0.47
delete     8        0
edit       4956     1.2
history    174      0.04
info       29       0
rollback   3        0
submit     245      0.06
view       396304   98.19

This suggests that the edit action is quite prone to getting layout instability.
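As a rough cross-check of the two percentages, the share of action=edit can be recomputed from the raw counts returned by the two queries. This is only a sketch: the variable names are made up for illustration, and the counts are copied from the results above.

```python
# Counts copied from the two query results above.
navtiming_total = 20316641   # all NavigationTiming events, March 2019
navtiming_edit = 63267       # ... of which action=edit
layoutjank_total = 403616    # LayoutJank events with fraction > 0
layoutjank_edit = 4956       # ... of which action=edit

navtiming_share = navtiming_edit / navtiming_total
layoutjank_share = layoutjank_edit / layoutjank_total

# How much more often "edit" shows up among LayoutJank events than in general.
over_representation = layoutjank_share / navtiming_share

print(f"navtiming edit share:  {navtiming_share:.2%}")
print(f"layoutjank edit share: {layoutjank_share:.2%}")
print(f"over-representation:   {over_representation:.1f}x")
```

By these numbers, action=edit is roughly four times more common among LayoutJank events than among NavigationTiming events overall.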

While we don't have articleid, we have transferSize, which is a good proxy for how large an article is. Let's find out whether larger articles get more layout instability:

CREATE TABLE summed_layoutjank AS SELECT event.pageviewtoken, SUM(event.fraction) AS total_fraction FROM event.layoutjank WHERE year = 2019 GROUP BY event.pageviewtoken;

SELECT CORR(slj.total_fraction, nt.event.transfersize) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.navigationtiming nt ON slj.pageviewtoken = nt.event.pageviewtoken WHERE nt.year = 2019;
correlation            count
0.013033870613197522   125548

This is pretty low. Article size can't explain layout instability alone. Is it more correlated when we look only at the edit action?

SELECT CORR(slj.total_fraction, nt.event.transfersize) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.navigationtiming nt ON slj.pageviewtoken = nt.event.pageviewtoken WHERE nt.year = 2019 AND nt.event.action = 'edit';
correlation           count
0.07568789058676258   1027

A bit better but still very low. If it's VE, though, I don't know if the article data is part of transferSize at all.

Finally, how does the total fraction correlate to survey responses? (only looking at pageviews where there was layout instability)

SELECT CORR(slj.total_fraction,CASE qr.event.surveyresponsevalue WHEN 'ext-quicksurveys-example-internal-survey-answer-positive' THEN -1 ELSE 1 END) AS correlation, COUNT(*) AS count FROM summed_layoutjank slj JOIN event.quicksurveysresponses qr ON slj.pageviewToken = qr.event.pageviewToken WHERE qr.year = 2019 AND qr.event.surveyresponsevalue IN ('ext-quicksurveys-example-internal-survey-answer-positive', 'ext-quicksurveys-example-internal-survey-answer-negative');
correlationcount
-0.0148769449103818534566

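A low correlation, and a negative one. Note that with the encoding used in the query (positive answer = -1, negative answer = 1), a negative coefficient means that a higher total instability fraction goes with slightly more positive answers. That is the opposite of what one would expect (the higher the total instability fraction, the less likely people would be happy about the performance of the page), but the value is so close to zero that the sign tells us very little.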
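To make the sign convention of the CASE expression concrete, here is a small sketch with made-up data. The `pearson` and `encode` helpers and both toy datasets are hypothetical; only the -1/1 mapping is taken from the query.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def encode(answer):
    # Same mapping as the CASE expression: positive -> -1, anything else -> 1.
    return -1 if answer == "positive" else 1

def correlation(rows):
    """Correlation between total fraction and encoded survey answer."""
    fractions = [fraction for fraction, _ in rows]
    answers = [encode(answer) for _, answer in rows]
    return pearson(fractions, answers)

# Made-up pageviews as (total_fraction, survey answer) pairs.
unhappy_when_unstable = [(0.0, "positive"), (0.1, "positive"),
                         (0.6, "negative"), (0.9, "negative")]
happy_when_unstable = [(0.0, "negative"), (0.1, "negative"),
                       (0.6, "positive"), (0.9, "positive")]

print("unhappy when unstable:", round(correlation(unhappy_when_unstable), 3))
print("happy when unstable:  ", round(correlation(happy_when_unstable), 3))
```

With this encoding, data where instability goes with negative answers produces a positive coefficient, and the reverse produces a negative one.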

It turns out that VE pageviews (including the visual source editor) are counted as "view", which is a problem, IMHO: NavigationTiming looks at the action GET parameter, but not at veaction. This is particularly problematic on ruwiki, where VE seems to be the default (at least for new editors).
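The difference between the two parameters can be sketched as follows. This is not NavigationTiming's code (which is JavaScript); `effective_action` is a hypothetical helper, and the "ve-" prefix and example URLs are assumptions made purely for illustration.

```python
from urllib.parse import urlparse, parse_qs

def effective_action(url):
    """Hypothetical helper: prefer veaction over action, default to 'view'."""
    params = parse_qs(urlparse(url).query)
    if "veaction" in params:
        # VE sessions carry veaction (e.g. edit, editsource) and no action.
        return "ve-" + params["veaction"][0]
    return params.get("action", ["view"])[0]

examples = [
    "https://ru.wikipedia.org/wiki/Foo",
    "https://ru.wikipedia.org/w/index.php?title=Foo&action=edit",
    "https://ru.wikipedia.org/w/index.php?title=Foo&veaction=edit",
    "https://ru.wikipedia.org/w/index.php?title=Foo&veaction=editsource",
]
for url in examples:
    print(effective_action(url), url)
```

Looking only at `action`, the last two URLs would fall back to "view", which is the misclassification described above.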

Gilles added a comment. · Edited · Apr 4 2019, 9:51 AM

Actually we do have the revid in NavigationTiming, so I can tell which pages are affected (but for now, won't be able to tell if they were VE or not). Let's see if some pages are more affected than others:

SELECT nt.event.revid AS revid, COUNT(DISTINCT(lj.event.pageviewtoken)) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 GROUP BY nt.event.revid ORDER BY count DESC LIMIT 10;

The first one is a very plain-looking article. Looking at the history, though, it had 11 edits yesterday. The second one, while a bit more complex, has also had a lot of edits in recent days.

Let's see how many of the layoutjank occurrences for that top revid were from logged-in users:

SELECT nt.event.isanon AS isanon, COUNT(DISTINCT(lj.event.pageviewtoken)) AS count FROM event.layoutjank lj JOIN event.navigationtiming nt ON lj.event.pageviewtoken = nt.event.pageviewtoken WHERE lj.year = 2019 AND nt.year = 2019 AND lj.event.fraction > 0 AND nt.event.revid = 98369394 GROUP BY nt.event.isanon;
isanon   count
false    2
true     162

This suggests that most of the layout instability was coming from anonymous users. Could it simply be that those articles saw a lot of edits because they were getting a lot of traffic around that time? Let's see if those revisionids show up in the top 10 seen in navtiming over that period:

SELECT event.revid AS revid, COUNT(*) AS count FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND dt >= '2019-04-02T21:42:22Z' AND dt <= '2019-04-03T09:25:30Z' GROUP BY event.revid ORDER BY count DESC LIMIT 10;
revid       count
NULL        3623
0           1561
112969012   753
98984576    563
889268954   529
98369394    289
115001713   255
114962564   210
890634189   138
170599184   115

While that person doesn't have an article in Wikipedias other than Russian, I see that she was the subject of Google's Doodle yesterday in Russia: https://www.google.com/doodles/sofia-mogilevskayas-116th-birthday which certainly explains the sudden spike of interest in that particular article.

In short, we're getting a lot of layout instability events for that revision ID because it's the 4th most visited article/revision over that period. Still, 164 distinct pageviews with layout instability out of 289 over that period sounds like a lot. Let's see what ratio of Chrome 73 navtiming pageviews are getting layout instability events on a given day:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73'  AND wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

Which gives us 34072 / 37005 = 92% of pageviews that are capable of recording layout instability getting at least one such event! That sounds like a lot, particularly when you see an article as simple as that biography getting so many of them.

Gilles added a comment. · Edited · Apr 4 2019, 12:25 PM

I've just noticed that there's an ongoing CentralNotice campaign served to me when I browse ruwiki.

It's in English, though, and might not target users in Russia. Let's find out if CentralNotice appearance is correlated to layout instability:

SELECT COUNT(DISTINCT(cn.event.pageviewtoken)), MIN(lj.dt), MAX(lj.dt) FROM event.centralnoticetiming cn JOIN event.layoutjank lj ON cn.event.pageviewtoken = lj.event.pageviewtoken WHERE cn.year = 2019 AND lj.year = 2019 AND cn.month = 4 AND lj.month = 4 AND cn.day = 2 AND lj.day = 2 AND cn.useragent.browser_family = 'Chrome' AND cn.useragent.browser_major = '73' AND cn.wiki = 'ruwiki' AND lj.wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.centralnoticetiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

When a CentralNotice banner is displayed on a given pageview, layout instability is experienced 7787 / 8432 = 92% of the time. Same ratio as before.

We're still left with a lot of pageviews without CentralNotice banners that get layout instability:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

34072 total pageviews with layout jank - 7787 = 26285 pageviews with layout jank and no CN banner. 37005 pageviews that might have had layout jank - 8432 that showed the banner = 28573 total CN-free pageviews. This works out to 26285 / 28573 = 92% of banner-free pageviews getting layout instability.
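The three ratios can be recomputed side by side. This is a sketch using the counts quoted above; the variable names are made up, and the subtraction assumes (as the reasoning above does) that every pageview in centralnoticetiming is also counted in the navigationtiming total.

```python
capable = 37005          # Chrome 73 ruwiki pageviews in navigationtiming
unstable = 34072         # ... with at least one LayoutJank event
banner = 8432            # pageviews that showed a CentralNotice banner
banner_unstable = 7787   # ... with at least one LayoutJank event

# Assumption: banner pageviews are a subset of the capable pageviews.
no_banner = capable - banner
no_banner_unstable = unstable - banner_unstable

rates = {
    "all": unstable / capable,
    "banner": banner_unstable / banner,
    "no banner": no_banner_unstable / no_banner,
}
for label, rate in rates.items():
    print(f"{label:>9}: {rate:.1%}")
```

All three rates land within half a percentage point of each other, which is why the banner doesn't appear to change how often a pageview is unstable.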

With the baseline being 92%, it doesn't seem like CentralNotice makes things worse. Or maybe it does in terms of how many layout instability events occurred?

Gilles added a comment. · Edited · Apr 4 2019, 1:29 PM

Let's find out:

SELECT COUNT(*) AS count, COUNT(DISTINCT(event.pageviewtoken)) AS uniques FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 2 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

We have 109123 / 34072 = 3.2 LayoutJank events per unstable pageview in general.

Looking only at unstable pageviews that got CN banners:

SELECT COUNT(*) AS count, COUNT(DISTINCT(lj.event.pageviewtoken)) AS uniques FROM event.layoutjank lj JOIN event.centralnoticetiming cn ON lj.event.pageviewtoken = cn.event.pageviewtoken WHERE lj.year = 2019 AND cn.year = 2019 AND lj.month = 4 AND cn.month = 4 AND lj.day = 2 AND cn.day = 2 AND cn.useragent.browser_family = 'Chrome' AND cn.useragent.browser_major = '73' AND lj.wiki = 'ruwiki' AND cn.wiki = 'ruwiki';

We get 30744 / 7787 = 3.9 LayoutJank events per unstable pageview that displayed a CentralNotice banner.

It also means that unstable pageviews that didn't get a banner had (109123 - 30744) / (34072 - 7787) = 2.98, i.e. about 3.0 LayoutJank events.

At last, something that makes sense! When the pageview is unstable - which is the case 92% of the time... - the CentralNotice banner contributes almost exactly one extra LayoutJank event on average :)
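For reference, the per-pageview averages work out as follows. This is a sketch from the counts quoted above; the variable names are made up for illustration.

```python
total_events, total_unstable = 109123, 34072   # all unstable pageviews
banner_events, banner_unstable = 30744, 7787   # unstable pageviews with a banner

overall = total_events / total_unstable
with_banner = banner_events / banner_unstable
# Banner-free figures are obtained by subtraction, as in the text above.
without_banner = (total_events - banner_events) / (total_unstable - banner_unstable)

# Average number of additional events on pageviews that showed a banner.
extra_per_banner = with_banner - without_banner

print(f"overall:        {overall:.2f}")
print(f"with banner:    {with_banner:.2f}")
print(f"without banner: {without_banner:.2f}")
print(f"extra events:   {extra_per_banner:.2f}")
```

The gap between banner and banner-free pageviews comes out just under one event.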

Gilles added a comment. · Edited · Apr 4 2019, 1:49 PM

I've just figured out that the Layout Instability API origin trial token only started getting served on 2019-03-28, which means that the missing 8% are probably visits to articles that haven't had their cache updated yet. If so, close to 100% of our pageviews probably get layout instability events...

Let's verify this theory by looking at the ratio of pageviews getting events on 2019-04-03 (instead of 2019-04-02 previously): for that next day the ratio becomes 93.7%.

It's, as one would expect, increasing, as more and more pages on ruwiki start serving the origin trial token.

Mentioned in SAL (#wikimedia-operations) [2019-04-05T04:49:01Z] <gilles> T216594 Start purge of namespace 0 on ruwiki

Given the lack of ability to introspect origin trials and determine if a given pageview is in the experiment or not, I'm purging articles on ruwiki to ensure that they all serve the up-to-date origin trial tokens. Once that's done, we will be able to verify the true extent of layout instability on the ruwiki traffic.

Gilles added a comment. · Edited · Apr 24 2019, 11:00 AM

With the purging and the time that has gone by, we should be at a point where 100% of desktop ruwiki pageviews are getting the origin trial. It looks like 99.5% of our desktop pageviews get layout jank:

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.layoutjank WHERE year = 2019 AND month = 4 AND day = 23 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki';

36168

SELECT COUNT(DISTINCT(event.pageviewtoken)) FROM event.navigationtiming WHERE year = 2019 AND month = 4 AND day = 23 AND useragent.browser_family = 'Chrome' AND useragent.browser_major = '73' AND wiki = 'ruwiki' AND event.mobileMode IS NULL;

36366

And there's still the possibility that the 0.5% remaining are clients that are lying about their UA, for example.

Let's look at the distribution of the number of layout jank events per pageview, for pageviews that get jank.

On ruwiki (desktop):

layoutjank events    1      2       3       4      5      > 5
%age of pageviews    9.96   53.15   16.06   9.76   3.29   7.77

On mobile, surprisingly, the situation is a lot better, with "only" 76.5% of pageviews getting layout jank events at all. And within those that do:

On eswiki (mobile):

layoutjank events    1       2       3       4      5       > 5
%age of pageviews    34.18   33.01   21.94   6.91   18.66   2.08
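A quick sanity check on both distributions: since the buckets cover every unstable pageview, each row should add up to roughly 100. This is a sketch; the values are read from the two tables above assuming two decimals per bucket, and the 0.1 tolerance for rounding is an arbitrary choice.

```python
# Values as listed in the two tables above, for 1, 2, 3, 4, 5 and > 5 events.
distributions = {
    "ruwiki (desktop)": [9.96, 53.15, 16.06, 9.76, 3.29, 7.77],
    "eswiki (mobile)": [34.18, 33.01, 21.94, 6.91, 18.66, 2.08],
}

def total_percentage(values):
    """Sum of the percentage buckets, rounded to two decimals."""
    return round(sum(values), 2)

for wiki, values in distributions.items():
    total = total_percentage(values)
    # Allow a little slack for rounding in the individual buckets.
    status = "ok" if abs(total - 100) <= 0.1 else "does not add up to 100"
    print(f"{wiki}: {total} ({status})")
```

The ruwiki row adds up as expected. The eswiki figures as listed come to noticeably more than 100, so one of them (possibly the 18.66 for 5 events) may be off; that reading is a guess, not something the data here confirms.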

Change 507545 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Renew origin trial tokens for ruwiki

https://gerrit.wikimedia.org/r/507545

Change 507545 merged by jenkins-bot:
[operations/mediawiki-config@master] Renew origin trial tokens for ruwiki

https://gerrit.wikimedia.org/r/507545

Mentioned in SAL (#wikimedia-operations) [2019-05-01T11:22:05Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216499 T216598 T216594 Renew origin trial tokens for ruwiki (duration: 01m 14s)

Gilles added a comment. · Edited · May 2 2019, 2:35 PM

While investigating the worst offenders in terms of summed fraction, I discovered 2 bugs (or at least very big shortcomings) in the API, which I filed upstream.

As it stands, the bug experienced on the mobile site is creating so much noise that it's pretty much useless to look at that data. Which is probably why all the top offenders in terms of summed fraction for a pageview were from the mobile site. I'll have to redo the investigation of worst offenders while looking only at the desktop site.

I've discovered yet another bug/big shortcoming, this time on desktop. The Multimedia Viewer bottom panel scroll animation is another source of a ton of small LayoutJank events. I've filed another bug about that: https://bugs.chromium.org/p/chromium/issues/detail?id=958828

I've also narrowed down the pattern that was triggering the previous bug on desktop: it seems to happen whenever there are hoverable links inside a multi-column <ol> element.

In terms of top summed fraction desktop offenders, this one remains a mystery: https://ru.wikipedia.org/wiki/Модуль_Юнга (it doesn't have any image that can be opened with Multimedia Viewer, nor a multi-column list of references).

I've just discovered, however, that merely resizing the page results in a ton of LayoutJank events (https://bugs.chromium.org/p/chromium/issues/detail?id=958832), which could probably explain that last mysterious one.

Change 509630 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/509630

Change 509630 merged by jenkins-bot:
[mediawiki/vagrant@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/509630

Gilles raised the priority of this task from Normal to High. · May 27 2019, 5:03 AM
Gilles changed the task status from Open to Stalled. · Jun 18 2019, 9:22 AM

The upstream bugfixes have been committed:

https://bugs.chromium.org/p/chromium/issues/detail?id=958795#c_ts1556814267
https://bugs.chromium.org/p/chromium/issues/detail?id=958832#c4

But until these make their way to Chrome stable, the current origin trial data is useless for us: it's all noise.

Whether we'll be able to revisit this largely depends on the second bug (events occurring on resize) being backported to 76, which is being discussed right now.

Gilles added a comment. · Edited · Fri, Jul 26, 8:46 PM

The main issue is fixed in 76, which is part of the origin trial and will soon be stable: https://bugs.chromium.org/p/chromium/issues/detail?id=958795#c9

The other issue I reported (media viewer) needs to be investigated further on our end to verify whether or not this is really a layout change, and if it is, how we can avoid it: https://bugs.chromium.org/p/chromium/issues/detail?id=958828#c4

Also, the shape of the exposed data might have changed since I set up the origin trial, which needs to be double-checked.

Change 527837 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/527837

Change 527837 merged by jenkins-bot:
[operations/mediawiki-config@master] Renew origin trial tokens

https://gerrit.wikimedia.org/r/527837

Mentioned in SAL (#wikimedia-operations) [2019-08-03T09:35:58Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T216499 T216594 Renew origin trial tokens (duration: 00m 48s)