Page MenuHomePhabricator

Anonymous fully-loaded and first paint (moving medium) page time significantly increased on 3G Barack Obama en.wiki on November 3rd, 06:00
Closed, ResolvedPublic

Description

95th and 99th percentile fully-loaded times up by 3s and 15s, respectively. The 3G speed indices are up by 200 milliseconds. Anon first paint is up by over a second.

https://grafana.wikimedia.org/dashboard/db/mobile-2g?orgId=1&from=now-30d&to=now

Addtional references:

Event Timeline

Maybe related to the banner with an image?:

Screen Shot 2017-11-08 at 7.31.36 PM.png (133×843 px, 23 KB)

Jdlrobson renamed this task from Anonymous fully-loaded page time significantly increased on 3G Barack Obama en.wiki on November 3rd, 06:00 to Anonymous fully-loaded and first paint (moving medium) page time significantly increased on 3G Barack Obama en.wiki on November 3rd, 06:00.Nov 9 2017, 8:00 PM
Jdlrobson triaged this task as High priority.
Jdlrobson added subscribers: Krinkle, Jdlrobson.

Adding fundraising team.

This is a considerable increase given first paint has been pretty stable for some time.

The module ext.centralNotice.display is now loaded on the page and the 2nd biggest module in the mobile site 23.0 KiB". It also introduces ext.centralNotice.kvStore which is 6.5kb ; ext.centralNotice.impressionDiet which is 2.5KB, ext.centralNotice.geoIP which is 1.5kb and given there is no such impact on the beta cluster, CentralNotice seems like a good theory (in particular the banner "This November is Wikipedia Asian month".

This seems to be confirmed when comparing these 2 runs - http://wpt.wmftest.org/video/compare.php?tests=171109_W8_JH,171103_X7_D - the banner does impact first paint. Although this suggests number of requests is the issue not number of bytes.

I wonder if we can reduce the impact - at least on first paint by taking the same approach suggested by @Krinkle in T176211#3688010 and not load it during the critical rendering path. The banner already creates a repaint/reflow of the entire mobile page.

This seems like something critical to iron out in this period where we are likely to run many banner tests. What do the performance team think?

Hi sorry I was on leave last week and I think we missed this one.

First the change in First Paint in RUM is because we fixed an bug that was quite ... hmm a bug. We maybe haven't communicated this? Short story: when we tried out HTTPS (before it was turned on for everyone) we turned off collecting navigation timing metrics for all measurements where the user had a ssl setup connection. And when we turned that on for everyone we missed changing how we measure so all requests where a SSL connection was setup was disregarded (meaning we didn't get any measurements for first view :( Checkout T179555 and https://gerrit.wikimedia.org/r/#/c/387997/

For the changes in WebPageTest: Everytime we add banners we will have an increase in fully loaded, but that is ok right? I mean we load more content so the time must increase. Then we can of course maybe do it smarter/improve. Instead I think we should focus on Last Visual Change (that is when the last change within the view port happens), that is more user centric metric. In this case the new last visual change is when the full page is bumped down by the banner (that is also why Speed Index is changed).

We'll sync in todays weekly meeting that we communicate the change in the RUM measurements better.

Also I want to do a hangout with you all about the automatic alerts we have setup, I can help you get going with it. It has worked out ok for us so far so I think it's time to share it so you also can use it.

Jdlrobson lowered the priority of this task from High to Medium.Nov 13 2017, 5:08 PM

If the first paint update is unrelated, then yes, it would be expected that fully loaded time changes with a banner being loaded. Are we 100% sure that's the case.
A knowledge share sounds like it would be a good idea. Generally 9AM PST would be a good time to do that. How best to organize? Do you want to set something up in my calendar and I'll add additional attendees?

Hi! As regards CentralNotice, no significant changes have gone out in that part of the codebase (client-side code running on nearly all pageviews) recently. None of the dependencies are new.

  • ext.centralNotice.geoIP makes geo data easily accessible from JS. It's used in many places and should be rolled out of CentralNotice (T102848).
  • ext.centralNotice.kvStore lets us use localStorage instead of cookies for essential data needed to determine whether or not to show a bannner.
  • ext.centralNotice.impressionDiet is only included as a dependency if the "impression diet" feature is used in a campaign that the pageview may be eligible for. Used by Fundraising and a lot of community campaigns.

None of the above should cause extra requests, since they should arrive on the client bundled with other RL modules.

In theory, CentralNotice shouldn't be delaying first paint, since runs in a non-blocking manner.

Regarding banners loading increasing time until fully loaded, yes, that should be the case... Indeed there is a banner in the "after" webpagetest linked in the task description, vs. none in the "before" test.

However, I don't see any changes in campaign settings or campaigns that were scheduled to start exactly at the time of the bumps in the Grafana graphs.

Hope this is helpful!!!! Thanks much! :)

Hmmmm, for the 99th percentile, I wonder why load times decrease just before they increase?

In theory, CentralNotice shouldn't be delaying first paint, since runs in a non-blocking manner.

It sounds likely. Hopefully we'll see no difference when the banner is stopped.

First the change in First Paint in RUM is because we fixed an bug that was quite ... hmm a bug. We maybe haven't communicated this?

Is there a phabricator ticket for tracking this?

Hmmmm, for the 99th percentile, I wonder why load times decrease just before they increase?

Yeh that is weird...

Hmmmm, for the 99th percentile, I wonder why load times decrease just before they increase?

You can look at @Krinkle investigation at T179555 to get the full picture. The drop before the raise happened because we started to include zeros in secureConnectStart, but all browsers that supports Navigation Timing v2 always reports zeroes (in Navigation Timing v1 the browser could exclude secureConnectStart if it wasn't used). That made our metrics even more wrong, so then we did a push to to include navigation timings for both secureConnectStart meaning both first view and the rest of the views. We should have communicated that better.

Is there a phabricator ticket for tracking this?

Not specific for that only bug (we have been fixing multiple bugs the last weeks). You can follow the work in https://phabricator.wikimedia.org/T104902 ... let me create a specific issue for that bug so we have it for the future.

alerts

I synced with Gilles yesterday: Lets talk and start use the alerts in the end of the quarter. We are working on using proxies to get more stable metrics (T176361) and it looks really good and that will make it easier for all of us to catch code regressions. I'll get in touch when it's ready!

Jdlrobson claimed this task.

Sounds like we can resolve this as the spikes can be explained and are unrelated?