Record interaction to next paint
Open, Needs Triage, Public

Description

One of the upcoming Core Web Vitals metrics that we do not collect today is Interaction to Next Paint (INP). It is a better measure of the lag users experience than First Input Delay. We should start measuring it so we can keep track of it.

Instead of storing the data in the navtiming schema as in T264032 and sending it when we fetch Navigation Timing data, we want to push it throughout the page lifecycle to a new schema (as we have done before with some other metrics).

Instrumenting Interaction to Next Paint is a little more complicated than the other metrics we collect. We can find some inspiration for how to do it in https://github.com/GoogleChrome/web-vitals/blob/main/src/onINP.ts
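To give a feel for why this is more involved than a single timestamp, here is a minimal sketch of the bookkeeping, loosely modelled on my reading of the linked onINP.ts. The names `InteractionEntry`, `InpEstimator` and `MAX_TRACKED` are hypothetical, and the "skip one outlier per 50 interactions" rule is an assumption about the approach rather than a statement of how the library or Chrome does it. In a browser the entries would come from a PerformanceObserver for `event` entries; here they are passed in by hand so the logic can be read on its own.

```typescript
// Hypothetical subset of what an event-timing entry exposes.
interface InteractionEntry {
  interactionId: number; // 0 means the event is not part of an interaction
  duration: number;      // milliseconds
}

interface TrackedInteraction {
  id: number;
  latency: number;
}

// Assumption: only the slowest few interactions need to be remembered.
const MAX_TRACKED = 10;

class InpEstimator {
  private longest: TrackedInteraction[] = [];
  private seen: Record<number, boolean> = {};
  private interactionCount = 0;

  // Feed every observed entry; several events can share one interactionId.
  add(entry: InteractionEntry): void {
    if (entry.interactionId === 0) {
      return;
    }
    if (!this.seen[entry.interactionId]) {
      this.seen[entry.interactionId] = true;
      this.interactionCount += 1;
    }
    const match = this.longest.filter((i) => i.id === entry.interactionId)[0];
    if (match) {
      // Keep the longest event seen for this interaction.
      match.latency = Math.max(match.latency, entry.duration);
    } else {
      this.longest.push({ id: entry.interactionId, latency: entry.duration });
    }
    // Slowest first, then drop everything beyond the tracked limit.
    this.longest.sort((a, b) => b.latency - a.latency);
    this.longest.splice(MAX_TRACKED);
  }

  // Current estimate: the worst interaction, skipping one outlier per 50 interactions.
  estimate(): number | undefined {
    if (this.longest.length === 0) {
      return undefined;
    }
    const index = Math.min(
      this.longest.length - 1,
      Math.floor(this.interactionCount / 50)
    );
    return this.longest[index].latency;
  }
}
```

The point of the sketch is that the value can keep changing for as long as the user interacts with the page, which is what makes the question of when to beacon it important.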

To collect the metric we need to do a couple of things (you can see the full picture at https://wikitech.wikimedia.org/wiki/Performance#/media/File:WMF_Performance_Team_infrastructure_2022.png):

  1. Add the collection of the actual metric in the NavigationTiming extension https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/NavigationTiming/+/refs/heads/master - you need to set up MediaWiki, make the changes in the extension, add tests for them and run the tests.
  1. Then we need to make sure that the data is stored in a new schema https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/secondary/+/refs/heads/master/jsonschema/analytics/legacy
  1. And then we need to handle the data when it arrives and send it to Graphite/Prometheus. That happens in navtiming.py: https://gerrit.wikimedia.org/r/plugins/gitiles/performance/navtiming/
  1. When the data has started to arrive we can make a new dashboard/graph in Grafana where we can look at the new metric.

Event Timeline

We can follow the pattern we used to implement First Input Delay in https://phabricator.wikimedia.org/T238091

Google announced that INP will become a Core Web Vitals metric in 2024:
https://web.dev/inp-cwv/

I propose we start collecting it ASAP so we can create a plan for how to make the metric better.

I've asked on the Performance Slack channel whether there is any work under way to simplify how to get INP in Chrome before 2024 (when it becomes a Google Core Web Vital). If not, this is the current implementation that we need to work from: https://github.com/GoogleChrome/web-vitals/blob/main/src/onINP.ts

I've seen the implementation change a little over time, so let's aim for the newest version.

@Krinkle do you see any way we can do this without needing to massage the data when it arrives at the server? What is the latest point at which we can beacon back data today? I'm thinking maybe it's OK to lose some data just to get this going, since looking at CrUX data this is the metric where we have the most need for improvement.

So https://chromestatus.com/feature/5690553554436096 just finished its origin trial. Maybe we should wait for it to reach Chrome and then collect INP only for Chrome. Even though that sucks, it would help us minimise the code needed to collect INP and keep track of it ourselves.

The difficult part here is that browsers don't currently offer a reliable "at most once" callback for computing and beaconing something that is both widely supported and fires relatively late, close to the end of a browser tab's lifetime. This is basically the classic "onBeforeUnload" problem, and the reasoning behind the Page Lifecycle API has moved further away from that classic model. That's good, in that it is honest and accurately represents how mobile browsers work, but it also means this is not going to get easier. The first and only thing I see in Web APIs that would make this possible is the Pending Beacon API, but that's still a draft.

What we have today is the Page Visibility API, which fairly accurately tells you on both mobile and desktop that a pageview is no longer in the foreground (e.g. the user switched or minimised apps, or switched tabs). This doesn't mean that the page is closed, so the user can switch back to it and it can continue its life later.

With that, I see two practical options today:

  1. Find a clever way to structure or process the data.

This means we can send the (potentially incomplete) data we have whenever the tab gets hidden. This is how Analytics' Session Length metric works today. It periodically indicates that a pageview existed that lasted up to N minutes. Because earlier events are naturally a subset of the later ones, there is a way to process this data such that the same tab sending multiple beacons does not cause problems with the data.

For us this could mean we send the INP metric from pagehide, associated with a pageviewID, and then require some periodic processing in the backend to ignore earlier beacons from the same client. This, however, cannot be done in real time, because once increments are aggregated and sent to Prometheus we can't undo that. And unlike session length data, this timing information is bucketed in ways that we can't accumulate and still keep accurate.
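As a rough illustration of that periodic processing step, here is a minimal sketch that keeps only the most recent beacon per pageview before anything is aggregated. The `InpBeacon` shape and its field names (`pageviewId`, `sentAt`, `inp`) are hypothetical and not taken from any existing schema, and the real backend processing lives in navtiming.py rather than in TypeScript; this only shows the idea.

```typescript
// Hypothetical beacon shape; the field names are assumptions, not an existing schema.
interface InpBeacon {
  pageviewId: string; // identifies the pageview that sent the beacon
  sentAt: number;     // timestamp in milliseconds
  inp: number;        // INP value in milliseconds at the time of sending
}

// Keep only the most recent beacon for each pageview, ignoring earlier ones.
function latestPerPageview(beacons: InpBeacon[]): InpBeacon[] {
  // No prototype, so pageview IDs can never collide with built-in properties.
  const latest: Record<string, InpBeacon> = Object.create(null);
  for (let i = 0; i < beacons.length; i++) {
    const beacon = beacons[i];
    const previous = latest[beacon.pageviewId];
    if (!previous || beacon.sentAt >= previous.sentAt) {
      latest[beacon.pageviewId] = beacon;
    }
  }
  return Object.keys(latest).map((id) => latest[id]);
}
```

This only works on a complete batch, for example all beacons from a closed time window, which is the reason it cannot feed a real-time pipeline.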

  1. Measure only up to the first tab hide.

This would mean we don't measure the same way. Rather than building up and tuning the INP percentile in the browser tab until it is officially closed (like Chrome internals do), we would only measure until the first tab hide. This means we may miss some datapoints from people who switch tabs a lot or who, for any reason, did not do the "main" action until after switching back. But maybe it's good enough, and it would support real-time processing and avoid attaching sensitive pageViewIDs to this data.
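A minimal sketch of the second option, under stated assumptions: `FirstHideReporter` and its method names are hypothetical, and it tracks the single worst interaction latency as a simplified stand-in for the INP percentile. The visibility state and the send function are passed in so the at-most-once logic can be exercised outside a browser.

```typescript
type SendFn = (inp: number) => void;

// Reports once, at the first transition to "hidden", then ignores everything after.
class FirstHideReporter {
  private worst: number | undefined = undefined;
  private reported = false;
  private send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  // Call for every interaction latency (milliseconds) measured on the page.
  onInteraction(latency: number): void {
    if (this.reported) {
      return; // interactions after the first hide are not measured
    }
    this.worst = this.worst === undefined ? latency : Math.max(this.worst, latency);
  }

  // Call with the current visibility state whenever it changes.
  onVisibilityChange(state: string): void {
    if (state !== "hidden" || this.reported) {
      return;
    }
    this.reported = true;
    if (this.worst !== undefined) {
      this.send(this.worst);
    }
  }
}

// Possible browser wiring (not executed here):
//   document.addEventListener("visibilitychange", () =>
//     reporter.onVisibilityChange(document.visibilityState));
```

Because each pageview sends at most one value, the backend can aggregate it as it arrives and no pageview identifier is needed to de-duplicate.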

Peter added a subscriber: larissagaulia.