Investigate Google Search Console Speed report trends
Closed, ResolvedPublic

Description

This new tool shows a worrying trend for us (screenshot courtesy of @JKatzWMF):

image.png (698×1 px, 76 KB)

It seems to be based on CrUX data, which means FCP (first contentful paint) and FID (first input delay).

The pattern in the graph above bears a strong resemblance to the report rate we've been getting from the different Chrome Mobile versions rolled out over the same period:

Capture d'écran 2019-11-12 12.42.33.png (426×1 px, 130 KB)

Now, looking at First Paint, which for us is pretty much the same as FCP on Chrome Mobile over that period, it does seem to get gradually worse over time:

Capture d'écran 2019-11-12 12.44.28.png (402×1 px, 74 KB)

Now, Safari doesn't support First Paint and Firefox Mobile gets too little traffic, which leaves us with Opera Mobile as the only "other mobile browser" comparison for this. Opera is also Chromium-based, but lagging behind in terms of version (the latest stable version seems to be based on Chromium 73). And there we see:

Capture d'écran 2019-11-12 12.49.25.png (405×1 px, 69 KB)

At this point, this suggests that the last 2 versions of Chrome are likely responsible for the regression. That is no less serious, since Google takes page speed into account for ranking purposes (which explains them releasing this new tool).

Event Timeline

As far as I can see, CrUX data doesn't contain browser version. So we won't be able to verify that theory without Google's help. I'll wait until I have access to the search console to get all the details before I file a Chrome bug about this.

Further data from our own performance metrics suggests that this is likely a Chrome 77/78 problem. A similar First Paint regression seems to happen on desktop Chrome over that period:

Capture d'écran 2019-11-12 13.18.22.png (402×1 px, 68 KB)

Edge, which is the desktop browser from which we collect the second most First Paint data, doesn't show the same trend:

Capture d'écran 2019-11-12 13.17.14.png (402×1 px, 64 KB)

I think that a regression that large would have been noticed by Google if it affected all sites, so it's likely to be a performance regression related to our content.

If I remember correctly, I saw different patterns in the synthetic testing: some URLs slower, some faster. The good thing is that when I pushed our new setup we were still running 77, so let me update to 78 to see if we can get some more help there.

Gilles claimed this task.
Gilles triaged this task as High priority.
Gilles moved this task from Inbox to Doing (old) on the Performance-Team board.

Here are the 2 graphs on top of each other for Mobile:

Capture d'écran 2019-11-14 10.29.28.png (361×1 px, 90 KB)

The pattern is clear for the big drop, with the exception of the change around September 2. It also seems clear that the migration to 78 gets a little of that performance back, but far from all.

Digging further into the sub-graphs of the search console, I find these potential one-time events that could be on our end and could have contributed to the loss in the overall URL classification:

First Input Delay

Mobile regression on September 2. 2 million pages that used to take less than 100ms now take between 100 and 300ms. There is no significant shift in Chrome version on that day. The level goes back to normal as suddenly as it started on October 30.

Looking at the Server Admin Log on September 1 and 2, I don't see any significant deployment on our end. Nothing that could explain such a shift.

First Contentful Paint

Desktop regression on October 4. The number of pages where FCP is longer than 3s triples on that day and stays high from that point on. Also no significant shift of Chrome version on that day. However, I'm seeing some weird spikes in other Desktop graphs that suggest that either we are right on the edge of the thresholds set by Google or the report behaves oddly when dealing with a smaller amount of data.

Looking at our own data for Chrome and higher percentiles of First Paint, nothing stands out on October 4, suggesting that it's probably Google's report acting funny:

Capture d'écran 2019-11-14 10.56.14.png (414×1 px, 67 KB)

If there were such a huge impact on FCP, we should see it in our own RUM collection of First Paint over that period.

Conclusion

So far, there is nothing that confirms that the changes seen in that experimental Google Search Console report are real. In fact, some of them really look like artifacts. Isolated events aren't reflected in our own metric collection. And the big long term shift for mobile seems highly correlated to Chrome version updates.

That being said, we don't track FID and FCP specifically on our end. We should improve our instrumentation to collect both of those, as well as potential causes for FID, in order to have better certainty in the future.

We should also report our findings to Google. I've asked the CrUX Google person I know where we should report this.

When I look at the pages on desktop that have a slow First Contentful Paint, most of the example URLs are from ar.wikipedia.org. One example URL is https://ar.wikipedia.org/wiki/فيسبوك

On my machine we have a couple of long tasks before first paint on that wiki; my guess is that this is the problem. They vary from just over 50 to 500 ms and sometimes delay the first paint by one second. That could easily be more on another machine. I'm thinking something may have changed with right-to-left handling, or it could be that we have some room for improvement there.

Adding the "new" limits so we have them:

Screen Shot 2019-11-17 at 10.21.16 AM.png (386×868 px, 79 KB)

@Peter could you put together some figures with the real device you were using browsing very minimal websites, which we could show to Google? I'm going to make separate bug reports and that one will be about how unrealistic the targets are on very common devices.

Yes, I could do that. What I'm thinking of measuring is the parse time of JavaScript (so long tasks/max potential FID); is that how you see it too? So I'm going to do a simple page with a lot of HTML plus one with only limited JS and run them on the Alcatel phone. I'll do that tomorrow.

Yes, demonstrating how high max potential FID is on such a device browsing a couple of simple pages should be enough to prove our point.

@Gilles I did some testing. First I tried google.com on my Alcatel One phone, just accessing the URL and measuring the maxPotentialFID. The thing is, Google ships quite a lot of JavaScript; it differed between runs but was something like 700-800kb unpacked. For five runs the maxPotentialFID on that phone was: 462, 512, 502, 493, 590 ms.

That means RED on the scale. However, that is the maximum: it only applies if a user tries to interact after first contentful paint (which happened at 682-769 ms in my tests) at exactly the moment when the longest CPU task happens. I think this tells us two things: max potential FID isn't the best metric, and a lot of websites will have a problem on lower-end phones.
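To make the "RED" reading concrete, here is a minimal sketch that buckets the five maxPotentialFID runs above. It assumes the 100 ms and 300 ms limits mentioned earlier in this task; the bucket names, the handling of values exactly on a limit, and the `classify_fid` helper are illustrative assumptions, not Google's definition.

```python
from statistics import median

# Limits as quoted in this task (100 ms and 300 ms). How values exactly
# on a limit are treated is an assumption made for this sketch.
FAST_LIMIT_MS = 100
SLOW_LIMIT_MS = 300


def classify_fid(delay_ms: float) -> str:
    """Return an illustrative colour bucket for one input delay in ms."""
    if delay_ms < FAST_LIMIT_MS:
        return "green"
    if delay_ms <= SLOW_LIMIT_MS:
        return "yellow"
    return "red"


# The five maxPotentialFID runs reported above, in milliseconds.
runs_ms = [462, 512, 502, 493, 590]
labels = [classify_fid(value) for value in runs_ms]

print(labels)
print("median:", median(runs_ms), "ms")
```

Every run lands well above the upper limit, so the bucket would not change unless the longest task shrank by more than a third.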

Then I tried to capture real first input delay. I navigate to https://www.google.com on my phone using my wifi at home, wait for it to finish loading, then press "o" (I want to search for Obama) and collect the FID.

I first tested it on my Alcatel One phone. I did 11 runs. The minimum FID is 24 ms, median 24 ms, max 2.1 s! Looking at the individual results, 9 runs are fast (under 100 ms), one is 200 ms and one is really off, taking two seconds. I did some more runs to verify, and that off value happens now and then.

I then tried doing the same on a Moto G5. The span was 8 ms to 200 ms. Two out of eleven would be categorized as yellow.

That was my mistake, I measured the wrong way. The FID is 10-20 ms all the time.

After a lot of back and forth with Google, it seems like this is not related to Chrome versions (they provided graphs privately demonstrating this). It seems, however, very likely to be correlated to at least one particular banner that ran on dewiki for WLM during the whole month of September.

They also told me that the Speed report uses a 28-day rolling aggregate, which explains the ramp-up effect and the apparent regression that lasts after September. Dewiki gets a lot of traffic, so this sort of thing would show up in the global Wikipedia stats.
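To illustrate why a 28-day rolling aggregate produces a ramp-up and a tail, here is a minimal sketch. The daily series is hypothetical: it reuses the 25 ms and 275 ms p95 figures from this task as a simple step, and it averages daily values, which is an assumption made only for illustration. How Google actually combines samples inside the window is not described here.

```python
def rolling_mean(values, window=28):
    """Trailing mean over the last `window` values (shorter at the start)."""
    smoothed = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            # Drop the value that just fell out of the trailing window.
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed


# Hypothetical daily metric: 30 baseline days, 30 days with the banner
# running, then 40 baseline days again.
daily = [25.0] * 30 + [275.0] * 30 + [25.0] * 40
smoothed = rolling_mean(daily)

banner_start, banner_end = 30, 59  # indices of first and last banner day
print("first banner day:", round(smoothed[banner_start], 1))
print("last banner day:", round(smoothed[banner_end], 1))
print("day after banner ends:", round(smoothed[banner_end + 1], 1))
print("27 days after banner ends:", round(smoothed[banner_end + 27], 1))
print("28 days after banner ends:", round(smoothed[banner_end + 28], 1))
```

In this toy series the aggregate only reaches its peak once the whole window sits inside the banner period, and it stays above baseline for 27 days after the banner stops, which is the same shape as a regression that appears to linger after September.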

The regression Google sees on FID on dewiki matches exactly the dates the banner actually ran, based on what I see in Turnilo (Sep 1 - 30):

Screenshot 2019-12-06 at 11.53.18.png (1×1 px, 247 KB)

The good news is that this is over, the bad news is that with our fundraising campaign at full speed right now, we might also have ongoing CentralNotice banners that make our FID regress. Not to mention the fact that it's very frequent for community banners to run as well, and there's probably a lot of code reuse by copying and pasting past banners.

The main action item for us is a Q3 goal already, which is to track FID ourselves: T238091: Collect First Input Delay. This way we won't be dependent on these phenomena being visible only in Google tools.

And I've filed a task to investigate why this particular banner increased FID so much: T239982: Investigate why the wlm_2019_de banner increased the p95 FID from 25 to 275ms. Hopefully, if we manage to track down a root cause, this might turn into a recommendation we can communicate to people who create banners.