Page MenuHomePhabricator

Make the logo's loading priority higher
Closed, ResolvedPublic

Assigned To
Authored By
Gilles
Jun 1 2015, 4:27 PM
Referenced Files
Tokens
"Like" token, awarded by Volker_E.

Description

See Improving time-to-logo performance with preload on the Wikimedia Performance blog.

Currently the logo is a background-image, which browsers load after <img> elements on the page. Which makes the logo, an above-the-fold element, load after all the below-the-fold images.

One way we could solve this is using link prefetching. When I last experimented with that feature, the browser started loading the content right away. This would allow us to achieve loading the logo sooner without having to touch its CSS.

Links:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Gilles changed the task status from Stalled to Open.Dec 7 2016, 7:20 PM
Gilles raised the priority of this task from Medium to High.

We should revisit this with link rel="preload", which also supports media queries and can be passed as an HTTP header. In fact I'm pretty sure that https://gerrit.wikimedia.org/r/#/c/215061/2 had all we needed, but was mistakenly using rel="prefetch", where we needed preload. It had the right parameters for preload and everything.

Change 215061 restored by Gilles:
Preload the logo using a link prefetch call

Reason:
Try again with rel="preload"

https://gerrit.wikimedia.org/r/215061

Change 345516 had a related patch set uploaded (by Gilles):
[mediawiki/vagrant@master] Add HD logos

https://gerrit.wikimedia.org/r/345516

Change 345516 merged by jenkins-bot:
[mediawiki/vagrant@master] Add HD logos

https://gerrit.wikimedia.org/r/345516

Change 346925 had a related patch set uploaded (by Krinkle; owner: Gilles):
[mediawiki/vagrant@trusty-compat] Add HD logos

https://gerrit.wikimedia.org/r/346925

Change 346925 merged by jenkins-bot:
[mediawiki/vagrant@trusty-compat] Add HD logos

https://gerrit.wikimedia.org/r/346925

Change 215061 merged by jenkins-bot:
[mediawiki/core@master] Preload the logo using link rel="preload" http header

https://gerrit.wikimedia.org/r/215061

We should collect before and after in WebPageTest, attach the waterfall and best case the video here too.

I've secured a hight quality video, let me merge it with the latest one when the release is live!

I got this warning in the console after the patch: "The resource http://localhost:3080/w/images/thumb/3/31/Red.png/120px-Red.png was preloaded using link preload but not used within a few seconds from the window's load event. Please make sure it wasn't preloaded for nothing." It seems the actual logo is loaded from http://localhost:3080/w/images/thumb/3/31/Red.png/120px-Red.png?82c42 (with the ?82c42). My setting is $wgLogo = "/w/images/thumb/3/31/Red.png/120px-Red.png"; (that file is uploaded to the wiki and it exists).

Looks like OutputPage::transformResourcePath() in called in ResourceLoaderSkinModule when generating the CSS, but not in OutputPage when generating the preload header.

Change 347970 had a related patch set uploaded (by Krinkle):
[mediawiki/core@master] Ensure logo preload transforms urls if needed

https://gerrit.wikimedia.org/r/347970

Change 347970 merged by jenkins-bot:
[mediawiki/core@master] Ensure logo preload transforms urls if needed

https://gerrit.wikimedia.org/r/347970

@Peter can you record another high-quality video? I believe this gets deployed later today. With the switchover there's extra latency right now for WPT that wasn't there when you recorded your reference video.

yep, I got it automatically running on my test instance, I can show you later. Need to make script to join the videos, can fix that when we have them. Yep we got those extra 100 ms, so the videos will not be perfect to compare, I don't remember now when we will switch back?

Actually, as the wiki page states, the traffic will switch back on May 1 and May 2. This is what will affect the latency in WPT.

Cool, lets have one video after the release and one after the switch and let us compare and cross our fingers :)

Adding for future use. Add the two videos side by side:

ffmpeg \
  -i video1.mp4 \
  -i video2.mp4 \
  -filter_complex '[0:v]pad=iw*2:ih[int];[int][1:v]overlay=W/2:0[vid]' \
  -map [vid] \
  -c:v libx264 \
  -crf 23 \
  -preset veryfast \
  combined.mp4

And then slow down the speed so you can see the difference:

ffmpeg -i  combined.mp4 -filter:v "setpts=5.0*PTS" combined-slow.mp4

Have you had a chance to look at it today, now that the change has been deployed?

I've checked before but I didn't see the header?

@Peter @Gilles Yeah, the weekly deploy train arrives at en.wikipedia.org on Thursday (tomorrow). But at this point it should be live on mediawiki.org, test wikis, and non-Wikipedia wikis.

I can look more later today to see if I can find one that have perfect match with timings, but this is quite ok. Check the video and feel proud @Gilles :) Will be interesting to follow the rest of the metrics the coming days.

To the left is one without the preload, to the right the one with preconnect.

Without the preload:

Screen Shot 2017-04-28 at 6.40.51 AM.png (1×2 px, 322 KB)

And the preload (logo picked up at number 2):

Screen Shot 2017-04-28 at 6.48.17 AM.png (1×2 px, 321 KB)

The Barack Obama page is the one where you can visually see it, here's what it looks like in Facebook and Sweden (no diff). The same here, to the left the old, to the right the one with the change:

All above is with Chrome. With FF we could a small win for the Obama page:

Getting the following warning on en.m.wikipedia.org page views.

The resource /static/images/project-logos/enwiki-2x.png was preloaded using link preload but not used within a few seconds from the window's load event. Please make sure it wasn't preloaded for nothing.

I guess control of this should be moved to the skin so that skins that don't use the logo (e.g. Modern/CologneBlue), or display a custom logo (e.g. Minerva MobileFrontend) without using ResourceLoaderSkinModule, can override or disable it.

NavigationTiming shows an improvement of firstPaint on desktop (saved about 50ms, roughly 9%). This is more than I expected.

On mobile no negative effect is visible on the metrics, despite fetching the logo for nothing, probably because Opera Mini, Safari and older versions of the Android browsers don't support the header.

We can see the regression on synthetic testing for first view both start visual render and SpeedIndex but for second view it is different depending on the page I don't see anything here: https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?orgId=1&from=now-7d&to=now&var-wiki=enwiki-mobile&var-users=anonymous&var-page=Elizabeth-second&var-location=us-east-1&var-browser=Google_Chrome-emulateMobile&var-view=firstView

It would be interesting for mobile timings how much is cached.

I double-checked that the performance improvement wasn't due to the Chrome 58 auto-update which happened last week and ramped up during that period. Picking only Chrome 57 (previous version) shows the same trend: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser?refresh=5m&orgId=1&from=now-30d&to=now&var-metric=firstPaint&var-browserFamily=Chrome&var-browserVersion=57&var-m=median

58 is a lot noisier and the outliers hide the trend, but hovering the values seems to confirm the same thing: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser?refresh=5m&orgId=1&from=1492688392790&to=now&var-metric=firstPaint&var-browserFamily=Chrome&var-browserVersion=58&var-m=median

I'll focus on fixing the mobile site and optional skins first, but we have a nice blog post to write up with all of this! It's crazy how much RUM firstPaint can get affected by things that in theory shouldn't affect it that much.

Change 351306 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/351306

Change 351307 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/skins/Vector@master] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/351307

Change 351306 merged by jenkins-bot:
[mediawiki/core@master] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/351306

Change 351307 merged by jenkins-bot:
[mediawiki/skins/Vector@master] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/351307

Change 352546 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.29.0-wmf.21] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/352546

I've just noticed that fetchStart on Chrome has been dramatically affected as well (by the same amount in fact), at the same time firstPaint was, and also affecting desktop while not doing anything on mobile:

https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser?refresh=5m&orgId=1&from=now-30d&to=now&var-metric=fetchStart&var-browserFamily=Chrome&var-browserVersion=all&var-m=median

fetchStart moving is even more odd than firstPaint, though, as it's supposed to happen before app cache lookups, i.e. before the HTTPS request is even sent. Hopefully today's SWAT should answer whether this was caused by the logo preload or not.

I wonder if it couldn't "go deeper" than fetchStart and actually be a change in unload or redirect timings. Unfortunately when I point Grafana to those coal metrics I get an NaN error. I wonder if we really collect those.

We collect Navigations Timings in synthetic too and I can't see any change there the 26th.

Ok, so looking into fetchStart on Chrome, there is something funny going on. If you open a new tab to a site, its value is significant (testing with my own connection, 60-80). If you're already on a site and you navigate to another page on that same site, the value is almost 0. That's true with cache enabled or disabled.

Beyond getting an explanation for that, it means that a change in traffic pattern of first views/repeat views could affect the median.

I think we discussed it before that the way we are measuring right now with using Navigation start as the start time, our metrics can be affected with for example if Google change the behaviour (we measure when people click in the link on Google, they do their redirect thing and the user ends up on Wikipedia). But realistically I think we maybe pushed a change to how we measure, I haven't checked yet if any changes went out in the Navigation Timing extension?

Correlation of fetchStart/firstPaint over the last year on Chrome:

firstPaint vs fetchStart.JPG (537×1 px, 50 KB)

It seems like some "natural" trends match, while not all drops likely to be induced by changes do.

Indeed, I was thinking of external factors as well. But what I see is counter-intuitive compared to the spec, it's exactly the other way around in fact. I think Ori was commenting on what the spec says it should do, not the actual values returned by Chrome, which seem backwards to me.

I think I finally understand. Nic and Philip are right, and Chrome seems to have "read between the lines" because the spec isn't explicit enough about fetchStart. The unload and redirect times in the spec are pretty clear about the fact that if it's not same-origin, it should be zero. It doesn't mention anything about same-origin regarding fetchStart, however.

I'm guessing that Chrome decided to implement it this way:

  • if not same origin, unload and redirect times are at zero, but we still count them implicitly as part of fetchStart
  • if same origin, unload and redirect times have explicit values, which are deducted from fetchStart, which in that case is only concerned with time until that point spent on things other than unload and redirect, which is very close to 0

It's kind of weird when you consider fetchStart values that actually mean something different in both cases, but it makes sense for the purpose we need it for. I.e. if you deduct the fetchStart value from firstPaint, it gives you the true firstPaint, while not ignoring the unload/redirect times that happen with same-origin traffic.

Now, we need to verify if other browsers behave the same way.

And at least for Chrome, indeed it means that if Google makes its self-redirect more efficient (or gets rid of it), it would affect our fetchStart and firstPaint metrics. Which might be what we're seeing here.

Assuming that this theory is confirmed, we can reconstruct the firstPaint "that should have been" in our historical data (at least for Chrome) by deducting fetchStart from firstPaint. Doing it on last year's data we get (restricted to Chrome):

realFirstPaint.JPG (538×1 px, 60 KB)

And in that case, no effect caused by the last deployment. Which is more in line with synthetic testing.

In fact, since we can generate those graphs, we don't really need to change our data collection.

I've tried writing a POC for the fetchStart issue and I can confirm that an unload prompt on a site linking to us doesn't affect fetchStart: https://jsfiddle.net/ofmbahje/1/

I tried Bing and Google and for both the fetchStart values on landing on wikipedia was very low. Those sites perform well, though. I'd have to manufacture a purposefully slow http redirect to be sure.

That little POC also confirms that when navigating same-origin, fetchStart doesn't include the unload value. And firstPaint does include the same-origin inefficient unload time.

All in all so far the only case where I can a "high" fetchStart is really when I open a new tab to the page: https://dl.dropboxusercontent.com/u/109867/test/index.html

In that context, an improvement of Chrome's performance could affect that figure. Although this drop we've seen recently was experienced with Chrome 57 (previous version) which had been out for a while.

Cool. We miss internal redirects if we remove it, I haven't checked if something has been changed there lately?

Change 352546 merged by jenkins-bot:
[mediawiki/core@wmf/1.29.0-wmf.21] Toggle logo preloading via the skin

https://gerrit.wikimedia.org/r/352546

Mentioned in SAL (#wikimedia-operations) [2017-05-08T18:33:50Z] <catrope@tin> Synchronized php-1.29.0-wmf.21/includes: T100999 (duration: 01m 24s)

So far, with the logo preload turned off for a couple of hours, no change in the metrics. We'll see tomorrow, but I think this confirms that the logo preload had nothing to do with the fetchStart/firstPaint change.

Hmm I think the report rate is a little lower since the change happened:

reportrate.png (1×2 px, 262 KB)

ah you already investigated it in T164749 let me also continue there.

Yeah I think that's just the global traffic really dipping, which has made things super confusing. This was really the perfect storm of: us deploying a huge amount of changes after a week of deployment was skipped, a new Chrome version ramping up, traffic in Turkey getting blocked and what looks like a global traffic dip that Turkey alone can't explain. And who knows what else...

It's hard to tell if the global traffic drop is responsible for the decrease of the actual values or not. We'll only know when traffic picks up again.

Anyway, to close the discussion on this particular task, I think now enough time has passed since yesterday's SWAT to declare that the logo preloading has absolutely nothing to do with the fetchStart/firstPaint change. I don't think I'm going to bother backporting the second part with the patch that reactivates it, it'll just go out with the train this week.

While there are mysteries remaining to be solved about fetchStart, I'm glad sanity is restored here and RUM matches what we've seen in synthetic testing and what we expected from the start. I.e. that the logo does show up sooner on large articles, which is good, and doesn't have a negative or positive impact on firstPaint.

Probably still worth writing a short and sweet blog post about it, with the video. If we had a way to track "time to logo", it would have definitely improved :)

Probably still worth writing a short and sweet blog post about it, with the video. If we had a way to track "time to logo", it would have definitely improved :)

Do it frame by frame: ffmpeg -i video.mp4 %04d.jpg :)

Right :) But I meant as RUM. The Obama article is quite an extreme and it would have been nice to know how much it improved time-to-logo for users in practice on average.