Collect Navigation Timing metrics if the user leaves and loadEventEnd hasn't happened
Closed, Resolved, Public

Description

Extend the Navigation Timing extension to collect metrics when the user leaves before loadEventEnd has happened. Check best practices (e.g. Boomerang and the Beacon API) for how to do it.

Make sure to send these values under separate keys, so we don't interfere with the current metrics and can easily compare the two.

This is one step to make sure we measure metrics for mobile users on slow connections.

Event Timeline

Peter claimed this task.
Peter raised the priority of this task from to Medium.
Peter updated the task description.
Peter added a project: Performance-Team.
Peter added subscribers: Peter, Jdlrobson.

Beware that even now, the number of data points we receive for loadEventEnd is consistently lower than for loadEventStart, which is rather mysterious and one of many currently known issues.

https://grafana.wikimedia.org/dashboard/db/navigation-timing?panelId=25&fullscreen

Screen Shot 2016-02-08 at 20.33.59.png (1×2 px, 267 KB)

(loading is loadEventStart and totalPageLoadTime is loadEventEnd; these are to be renamed at some point.)

Thanks, yes that's really annoying. I'll sync with you when I've done the changes.

Hmm, I wonder if this is worth doing. On a throttled 2G connection, the Navigation Timing extension only finishes loading after 40 seconds. Maybe we should rethink how we collect the metrics.

Screen Shot 2016-02-09 at 10.17.20 AM.png (382×1 px, 138 KB)

I've prepared a change that sends a failure event if the browser is closed, just to make it easy to see whether this is an issue at all. However, if it takes 40+ seconds until the Navigation Timing extension is available on slow connections, then we should focus on moving the code so that it runs earlier.

Let's sync on this on Thursday.

Summary:

  • Loading pages with many images (Obama etc.) on a slow connection (2G) makes loadEventEnd happen late. You can try this yourself by throttling with devtools in Chrome. Using regular 2G on https://en.m.wikipedia.org/wiki/Barack_Obama, loadEventEnd happens at 50-55 seconds and firstPaint at 5 seconds. So on a mobile phone with 2G there's a good risk that we actually miss these values, since we filter out values higher than 60s.
  • Another problem is that on a slow connection the Navigation Timing extension is low priority. When I test in WPT with a 2G connection (SPDY), the extension is downloaded after 40 seconds. This means that if we add a catch for users who navigate away from the page, we will not catch them if they leave before 40 s.

@Peter thanks for confirming this. I was struggling to understand how the full load time was under 10s here: https://grafana.wikimedia.org/dashboard/db/mobile-2g?panelId=44&fullscreen

Update: We added a limit higher than 60s so we can see whether we are actually missing values. When I checked a couple of weeks ago I couldn't see any better metrics (I think we picked up one more, or something like that). Let's check that again.

Per navtiming.py, the new experimental bucket accepts values up to 180s instead of 60s.
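To illustrate the difference between the two limits, here is a minimal sketch of such a sanity check. It is not the actual navtiming.py code: the function names, the millisecond units and the inclusive upper bound are assumptions made for the example. Only the 60s and 180s limits and the bucket names come from this task.

```python
# Illustrative sketch only; names and units are assumptions, not navtiming.py's API.
MAX_MS_DEFAULT = 60_000        # current limit: 60 seconds
MAX_MS_EXPERIMENTAL = 180_000  # experimental limit: 180 seconds


def is_sane(value_ms, max_ms):
    """Return True if value_ms is a positive number no larger than max_ms."""
    return isinstance(value_ms, (int, float)) and 0 < value_ms <= max_ms


def buckets_for(value_ms):
    """Return the names of the buckets that would accept this timing value."""
    names = []
    if is_sane(value_ms, MAX_MS_DEFAULT):
        names.append("navtiming")
    if is_sane(value_ms, MAX_MS_EXPERIMENTAL):
        names.append("navtiming-experimental")
    return names
```

With this sketch, a loadEventEnd of 90 seconds would be dropped from the main bucket but kept in the experimental one, which is the kind of data point the experiment is looking for.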

The median is mostly unaffected:

Screen Shot 2016-03-30 at 03.27.44.png (1×2 px, 260 KB)

The 95th percentile is up to 1-2 seconds higher in that bucket:

Screen Shot 2016-03-30 at 03.22.47.png (1×2 px, 303 KB)

The 99th percentile is up to 20 seconds higher in that bucket:

Screen Shot 2016-03-30 at 03.27.16.png (1×2 px, 238 KB)

While the value range is up to 20 seconds larger, the actual data rate is not much different.

The experimental bucket does include more data points (otherwise it'd be broken), but only 1 or 2 more per minute at most. In most minutes the rate is the same.

Screen Shot 2016-04-21 at 19.25.02.png (957×2 px, 251 KB)

Over 24 hours, the experimental bucket contained (101141-100900=) 241 more data points. Not much, though remember this is on a sampling of 1:1000.
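For scale, the arithmetic above can be spelled out as follows. This is a rough back-of-the-envelope sketch: it assumes the 1:1000 sampling is uniform, so each sampled data point stands for roughly 1000 page loads, and the variable names are just for illustration.

```python
# Data point counts over 24 hours, taken from the comment above.
experimental_points = 101_141
main_points = 100_900
sampling_factor = 1000  # 1:1000 sampling, assumed uniform

# Data points accepted only by the experimental bucket.
extra_points = experimental_points - main_points

# Relative increase in accepted data points, as a percentage.
extra_percent = round(100 * extra_points / main_points, 2)

# Rough estimate of the daily page loads those extra points represent.
estimated_page_loads = extra_points * sampling_factor

print(extra_points, extra_percent, estimated_page_loads)
```

So the extra 241 data points are a small fraction of the sample, but under the uniform sampling assumption they would correspond to a few hundred thousand real page loads per day.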

Given that the median is unchanged and there is no weird unexpected skew elsewhere, I'd say this experiment is a success and should be safe to apply to the main navtiming bucket (and remove navtiming-experimental).

I know we don't like changing the logic of navtiming but in this case the median hardly changes and it'll definitely improve telemetry on the percentiles per the previous comment.

Yep, this makes sense. Thanks for the analysis!

Change 284743 had a related patch set uploaded (by Ori.livneh):
Promote 'experimental' sanity check to be the default

https://gerrit.wikimedia.org/r/284743

Change 284743 merged by Ori.livneh:
Promote 'experimental' sanity check to be the default

https://gerrit.wikimedia.org/r/284743

I wonder if we have raised the limit far enough that we now collect all the data?

We increased the max limit for metrics to 120 seconds, and the 99th percentile rose to about 40-55 s. Before, when the limit was 60 seconds, the 99th percentile was around 30 s. That's sometimes almost half of the max limit. What happens if we increase it to 240 seconds? :)
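The reason a higher limit pushes the high percentiles up can be sketched with made-up numbers. The data set below is entirely synthetic and the nearest-rank percentile is an assumption for the example; it is not how our dashboards compute percentiles.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


def p99_with_limit(values, limit_s):
    """99th percentile after discarding values above limit_s (seconds)."""
    kept = [v for v in values if v <= limit_s]
    return percentile(kept, 99)


# Synthetic load times in seconds: mostly fast, with a slow tail.
load_times = [5] * 970 + [50] * 15 + [100] * 15

p99_limit_60 = p99_with_limit(load_times, 60)    # the 100 s values are filtered out
p99_limit_120 = p99_with_limit(load_times, 120)  # the 100 s values are kept
```

In this toy data set the 99th percentile is 50 s with the 60 s limit and 100 s with the 120 s limit. The slow tail was always there, the lower limit just hid it, which is why the percentile can keep rising each time the limit goes up.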

Great work guys. Really glad to see this happen and for us to represent our users better!

I'm curious: could any of these experiments have impacted the data we collected for German Wikipedia via NavigationTiming in the past month? The 95th percentile increased there over the last month.

No, we kept the change completely separate, so the new metrics are in another namespace.

Sending metrics when the user leaves before loadEventEnd has happened would make things much more complex, and it isn't worth it for now. The increased collection time limit is fine.