Page MenuHomePhabricator

First paint time regression on Wikimedia page views with wmf.23 roll-out
Closed, ResolvedPublic

Description

The roll-out of 1.27.0-wmf.23 to all Wikipedias coincided with a significant (~150ms) increase in first paint time for Wikimedia page views. First paint time is the time it takes the user to see anything on the screen after clicking on a link. Keeping it low is crucial to good user experience.

In the graph below, the green line shows median first paint time in milliseconds over the course of the past 24 hours (at the time of writing). The blue line shows what the value was that time the day before. You can see they diverge shortly after 19:00 UTC and converge around 4:20 UTC.

first-paint-2016-05-05.png (300×800 px, 105 KB)

Event Timeline

ori triaged this task as High priority.May 6 2016, 6:35 AM
Legoktm renamed this task from First paint time regression on Wikimedia page views to First paint time regression on Wikimedia page views with wmf.23 roll-out.May 6 2016, 6:50 AM
Legoktm subscribed.

Was this mobile and desktop or just desktop?

Jdforrester-WMF subscribed.

Provisionally tagging this as blocking the MW 1.27 release (in case it's a fault in core or a tarball extension).

While the most impressive week-over-week deviation was indeed while wmf.23 was deployed (max seems to be around 170ms), shortly before the wmf.23 deployment I see gaps as big as 90ms.

We might be dealing with 2 separate issues. It's hard to say when exactly the permanent-looking one started. Eyeballing it, I would say between 00:00 and 02:00 GMT on the 5th.

Alright, I think I've found the culprit for wmf.23. This commit is part of the tag:

https://phabricator.wikimedia.org/rENTId261d407b434fa64a1bd4415b4687ce0fd408414

It removes (some, maybe not all) freak negative firstPaint values (which are all extreme values like -1462475223582) that were previously weighing down on the median.

I've looked at the NavigationTiming_15485142 table for confirmation in the EventLogging database:

event_firstPaint < 0 versus event_firstPaint > 0

April 28th: 22807 / 188214 (10.8%)
April 29th: 20678 / 170444 (10.8%)
April 30th: 18492 / 150651 (10.9%)
May 1st: 19918 / 168984 (10.5%)
May 2nd: 22563 / 191845 (10.5%)
May 3rd: 22601 / 191365 (10.6%)
May 4th: 22670 / 190948 (10.6%)
May 5th: 17256 / 182801 (8.6%)
May 6th: 12260 / 174094 (6.6%)
May 7th: 12852 / 152083 (7.8%)
May 8th: 14342 / 168481 (7.8%)

Which means that while wmf.23 was deployed during a portion of the 5th/6th, there was a lower share of freak huge negative values, and the median was pushed back up, closer to its "real value".

Now, it looks like this hasn't gone back to its normal level after reverting back to wmf.22. But maybe a lot of users still have the newer NavigationTiming JS cached since the revert? This would explain why it seems like the effect kept going, but isn't as strong as when wmf.23 was deployed.

All in all, given how extreme the negative values that are less present now are, I'm fairly confident that this commit was the source of the spike. I think we can safely move back to wmf.23. When we do, we can expect the firstPaint graphs to spike again, as less freak values make it through the pipeline.

@ori by email:

These values don't make it to Graphite (see NavTiming. Py) so I don't think thats it. You can test by cherry picking the change to wmf22

To which I replied:

I see the frontend.navtiming prefix in there. Do the coal.* figures also go through navtiming.py and its sanitation?

@ori said:

No, they dont -- good point

I think my theory still holds, since it's much harder to see the same regression on the navtiming graphs. It might be there on desktop, but if it is it's nowhere near as big as the difference seen on coal. And on the navtiming figures the regression is unaffected by the revert to wmf22. It's about the same (hard to see) amount throughout wmf23 and after.

Theory confirmed by backporting the change to wmf22:

Screenshot 2016-05-09 23.25.26.png (419×1 px, 110 KB)

The gap that forms after the deployment of the backport is about 100ms.

Now, wmf23 will be redeployed later today, and we can see if that had any separate regression.

Gilles claimed this task.

The gap on the coal metric that appeared during the backport stayed the same after the wmf23 redeploy.

Looking at navtiming firstPaint, comparing the last 7 days to the previous 7, it doesn't look like wmf23 introduced any particular regression.