Page MenuHomePhabricator

Why did first paint/fully loaded time drop (in a good way!) on mobile?
Closed, ResolvedPublic

Description

Moving median for first paint dropped on 26th/27th April as did total page load time for 95th and 99th percentile. CSS dropped for Barack Obama/Facebook around this time. Webpagetest runs showed no change in first paint.

Screen Shot 2018-05-01 at 1.54.18 PM.png (350×1 px, 84 KB)

Screen Shot 2018-05-01 at 1.55.24 PM.png (335×715 px, 53 KB)

Screen Shot 2018-05-01 at 1.55.53 PM.png (253×1 px, 78 KB)

Screen Shot 2018-05-01 at 2.05.24 PM.png (263×1 px, 79 KB)

Graphs: https://grafana-admin.wikimedia.org/dashboard/db/mobile-2g?orgId=1
Could the CSS drop be related to the real world device data or is this just relating to our instrumentation?

During this period there are several changes that may have caused this:

Any way we can work out what caused this and do more of it?

Event Timeline

On late April 26th group 2 wikis moved to 1.32.0-wmf.1

This included the following NavigationTiming changes on the client side:

https://gerrit.wikimedia.org/r/#/c/428551/
https://gerrit.wikimedia.org/r/#/c/428549/
https://gerrit.wikimedia.org/r/#/c/427931/
https://gerrit.wikimedia.org/r/#/c/427460/

Tomorrow group 2 wikis are moving to 1.32.0-wmf.2 which will include a bunch more NavigationTiming changes...

Of these changes https://gerrit.wikimedia.org/r/#/c/428551/ seems like the prime suspect to me. While the old code was quite agressive, it didn't trust the fact that the load event would happen on the document. The new code does trust that the document load event will happen if document.readyState != 'complete'. It also trusts that by the time the timeout runs in both situations, loadEventEnd will be set.

https://developer.mozilla.org/en-US/docs/Web/API/Document/readyState mentioned that the "complete" state indicates that the "load event is about to fire". I don't know if the browser has to fire it immediately, though.

Maybe we should start with a more cautious migration away from the old agressive setInterval, and only start doing the old agressive loop waiting on loadEventEnd once we've either caught the onload event or the readyState is "complete"? Kind of a middle ground compared to the current change. And if the assumptions of the new code are correct, it won't loop at all, as it will find loadEventEnd right away after the timeout. I'll write a patch doing that.

Change 430302 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430302

Thanks all for digging into this!

[..] mentioned that the "complete" state indicates that the "load event is about to fire". I don't know if the browser has to fire it immediately, though.

As far as I understand it, there are no standardised steps for the browser to perform in-between there. They're two public interfaces coming from the same internal transition. There may be additional work going on internally of course, but there are no public events for them, which means no code can run there. The only code that can run after onreadystate-complete and before onload would be a promise handler for a promise that was resolved within the onreadystate handler (given the micro-task queue being cleared before ticking the outer event loop).

It's certainly possible for browsers to behave in non-standard ways though. So let's try it out, but if it doesn't change anything, I'd rather not keep it as defensive code in this case, as I'm afraid it'd do more harm than good (e.g. infinite polling for buggy performance.timing).

Regarding report rates, if loadEventEnd is missing, its report rate should be lower than e.g. responseStart and domComplete. Yet, it seems all our metric report rates went down ~equally on April 26/27:

https://graphite.wikimedia.org/render?width=1200&height=600&from=20180423&until=20180430&target=aliasByNode(frontend.navtiming2.{responseStart,domComplete,firstPaint,loadEventEnd}.overall.sample_rate,1,2)

render.png (600×1 px, 168 KB)

I agree that it's an unlikely cause, but worth ruling out. Have any changes on the backend/ingestion pipeline happened around that time?

Change 430302 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430302

Change 430423 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430423

Change 430423 abandoned by Gilles:
Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430423

Change 430424 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430424

Change 430424 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Don't trust loadEventEnd being set immediately

https://gerrit.wikimedia.org/r/430424

It's hard to say if the backported "fix" had an effect yet, since it was deployed around the lowest traffic point of the day. But so far at this time we have equivalent figures as yesterday, which are 1/3 lower than 7 days ago.

Change 430651 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Reset NavigationTiming@1.32.0-wmf.1 to state of 1.31.0-wmf.30

https://gerrit.wikimedia.org/r/430651

Change 430651 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Reset NavigationTiming@1.32.0-wmf.1 to state of 1.31.0-wmf.30

https://gerrit.wikimedia.org/r/430651

Seeing the report rate drop in both NavigationTiming and SaveTming events (from Statsd/Graphite perspective):

We (@Gilles, @Imarlier and myself) went over different parts of the stack and ruled out various changes to the processing backends (navtimig, coal, their change of hardware and deployment methods; T159354, T186774, T191994).

That leaves: Kafka, EventLogging, Varnish, our client-code, and then actual devices/browsers.

Kafka metrics show the same drop for our schemas:

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=1524564000000&to=1525309200000

Screen Shot 2018-05-03 at 21.42.15.png (874×2 px, 150 KB)

And the above branch revert of NavigationTiming@1.32.0-wmf.1 restored the volume and typical median/p75 values so I guess it was our client code after all.

https://graphite.wikimedia.org/render?width=1200&height=600&from=00:0020180503&until=20:4020180503&target=frontend.navtiming.domComplete.overall.sample_rate

render-2.png (600×1 px, 69 KB)

Change 430802 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Restore NavigationTiming@1.32.0-wmf.1 back to its original branch points

https://gerrit.wikimedia.org/r/430802

Change 430805 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@master] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430805

Chris407 renamed this task from Why did first paint/fully loaded time drop (in a good way!) on mobile? to Heyy.May 3 2018, 9:12 PM

Change 430806 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430806

Gilles renamed this task from Heyy to Why did first paint/fully loaded time drop (in a good way!) on mobile?.May 3 2018, 9:13 PM
Gilles changed the visibility from "Public (No Login Required)" to "Custom Policy".
Gilles removed a subscriber: Chris407.

Change 430805 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430805

Change 430821 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.2] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430821

@demon At this point in time, wmf.1 is fine because earlier today I reverted NavigationTiming's wmf.1 branch back to the state of 1.31.0-wmf.30. That helped isolate the problem.

Master is fixed now, through https://gerrit.wikimedia.org/r/430805 (3abd83ff).

To apply to wmf.1:

To apply to wmf.2:

Aklapper changed the visibility from "Custom Policy" to "Public (No Login Required)".May 3 2018, 10:12 PM

Change 430821 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.2] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430821

Change 430802 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] Restore NavigationTiming@1.32.0-wmf.1 back to its original branch points

https://gerrit.wikimedia.org/r/430802

Change 430806 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.32.0-wmf.1] ext.navigationTiming: Listen for window.onload, not document.onload

https://gerrit.wikimedia.org/r/430806

Mentioned in SAL (#wikimedia-operations) [2018-05-03T23:09:32Z] <krinkle@tin> Synchronized php-1.32.0-wmf.1/extensions/NavigationTiming/: If293a156cac / T193570 (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-05-03T23:12:40Z] <krinkle@tin> Synchronized php-1.32.0-wmf.2/extensions/NavigationTiming/: If293a156ca / T193570 (duration: 01m 16s)

Vvjjkkii renamed this task from Why did first paint/fully loaded time drop (in a good way!) on mobile? to dudaaaaaaa.Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Gilles as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from dudaaaaaaa to Why did first paint/fully loaded time drop (in a good way!) on mobile?.Jul 2 2018, 2:52 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Gilles.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.