Page MenuHomePhabricator

[Bug] The statsv client should send a request when the page unloads so we are not losing events
Closed, ResolvedPublic

Description

The statsv client maintains an in-memory queue of metrics to send to the server. They will be sent within 2 seconds of being queued. It is possible we might be losing some where the browser is closed within that time.

Problem

If the queue length is under 50 (an internally set maximum batch size), then the metrics are sent to the server/dequeued two seconds after the last metric was added to the queue (A). If the queue length is over 50, then a batch of metrics are sent to the server/dequeued every two seconds until either the queue is empty or the preconditions for A are met.

However, if the queue has metrics in it and the page unloads, then in principle those metrics aren't sent to the server and are lost.

Steps to reproduce

  1. Visit https://en.wikipedia.org/wiki/Cult_of_Luna
  2. Open your browser's JavaScript console
  3. Type mw.track( 'counter.test.statsv.foo2', 1 )
  4. Wait for 2 seconds
  5. Observe that an HTTP GET request is sent to a URL like: https://en.wikipedia.org/beacon/statsv?test.statsv.foo2=1c
  6. Type mw.track( 'counter.test.statsv.foo2', 2 )
  7. Refresh the page within 2 seconds (i.e. before the request has a chance to send)
  8. Observe that no HTTP GET request has been sent and the event dropped

AC

  • When the page unloads, any metrics in the queue should be sent to the server
  • When the page loses visibility, any metrics in the queue should be sent to the server

QA steps

  1. Ensure that you're browser is preserving network logs

preserve-log.png (64×555 px, 19 KB)

  1. Visit https://en.wikipedia.beta.wmflabs.org/
  2. Open your browser's JavaScript console
  3. Type mw.track( 'counter.test.statsv.foo2', 1 )
  4. Wait for 2 seconds
  5. Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=1c
  6. Type mw.track( 'counter.test.statsv.foo2', 2 ) again
  7. Refresh the page immediately
  8. Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=2c

Signoff steps

  1. Announce the change to wikitech (?)
  2. Add an annotation to https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1 at the time when the change is deploying

Notes

  1. @phuedx: The AC around page visibility comes from a conversation with @Krinkle. IIRC, in some mobile browsers, the unload event doesn't fire before the JavaScript VM is torn down after a tab is closed. However, we're more confident that the visibilitychange event does fire (see T208980#4792762).
  2. The change should live in WikimediaEvents/modules/all/ext.wikimediaEvents.statsd.js

QA Results

ACStatusDetails
1T208980#5139193
2T208980#5139193

Event Timeline

@Krinkle: Please feel free to correct me where I'm wrong/add more detail about this 🙂 If there are any sources for the browser/JavaScript VM behaviour differences between desktop and mobile devices, that'd be great!

LGTM.

For desktop, adding the unload handler should cover all cases users would run into under normal usage (apart from forced/failed scenarios such as force quit, browser crashing, or device shutting down).

For mobile, I believe the unload handler quite often doesn't actually end up being fired when a tab is closed depending on how the user does it because mobile browsers have a way to slowly take a way resources and call frequency of a background page, until eventually it ends up in a frozen state that would not yet be permanently closed (as such, no unload). If the user resumed it, it would be the same page still. But, because it is frozen, the user can also close it without a way for unload to fire. Effectively similar to a force quit of the whole browser from the outside (except just for one tab).

Adding the visiblitychange handler I suspect would indeed make it very likely for our event processor to get called at least once between changing tabs/apps on mobile and the user potentially closing the (frozen) tab. This is based on the assumption that a mobile browser will fire the visiblitychange event when changing tabs, and that the operating system (Android/iOS) also allows the browser to fire the visiblitychange event when switching apps without changing tabs.

I recommend we confirm these assumptions in the wild on a few real devices and browsers.

If visiblitychange ends up being unreliable or if we find there's some technical and spec-supported reason for it to not always do what we want for this use case, there is also the Page Lifecycle API coming up which is tailored for exactly this "freezing" behaviour that we see in mobile browsers. Further reading/listening/viewing:

Jdlrobson triaged this task as Medium priority.Dec 13 2018, 10:51 PM

Can you expand the problem statement here? Does this relate to ReadingDepth or something else? It would be great to tie this back to our products and any bugs we're seeing.

Some QA steps would also be great to help aid an estimation of how we know what done looks like.

Can you expand the problem statement here? Does this relate to ReadingDepth or something else? It would be great to tie this back to our products and any bugs we're seeing.

The StatsV client (the client-side part of the process that converts counter.foo, timer.foo, and guage.foo analytics events on the client to metrics in Graphite, plotted in Grafana) silently drops metrics that it's queued to send to the client when the page unloads. This bug impacts all StatsV-based instrumentation run by all Grade A browsers on all OSes.

Some QA steps would also be great to help aid an estimation of how we know what done looks like.

See T208980#4830604.

Got it thanks!
What reading web products using statsv that might be impacted? Edit counting is the only one I'm aware of from https://grafana.wikimedia.org/d/000000566/reading-web-dashboard?orgId=1 Are there others?

What reading web products using statsv that might be impacted?

Page Previews.

My bad this is a more generic version of the more specific T212970. Do we want tasks for every single schema or are we expecting to make this change somewhere generic?

@Jdlrobson I don't think that's entirely true. They're not asking for the same thing.

T212970 is asking to reduce the number of HTTP requests by not sending each increment separately. That can be closed as resolved because we already don't do that. Since the first version, statsv has done batching and coalescing. All use of Statsv automatically benefits from this.

This task, on the other hand, is effectively asking to do less batching, by sending metrics off immediately in cases where we expect the tab to get frozen soon (on mobile) and thus can't be that it'll come back to finish the debounce and send the metric.

IIRC, in some mobile browsers, the unload event doesn't fire before the JavaScript VM is torn down after a tab is closed. However, we're more confident that the visibilitychange event does fire

Any reason to choose the visibilitychange event over the beforeunload event?

@nray: See @Krinkle's detailed explanation here: T208980#4792762

Thank you, I didn't read that closely enough. Sounds like in addition to the unload event not firing in common scenarios on mobile devices the beforeunload event wouldn't fire either.

Jdlrobson renamed this task from [Bug] The statsv client should send a request when the page unloads to [Bug] The statsv client should send a request when the page unloads so we are not losing events.Feb 25 2019, 10:24 PM
Jdlrobson updated the task description. (Show Details)

@Krinkle @phuedx Where does this code live? core?

It's in WikimediaEvents (statsd.js).

Not sure if this is a reading task/performance task or task that needs us both but looks clear to me now.

@nray: See @Krinkle's detailed explanation here: T208980#4792762

Thank you, I didn't read that closely enough. Sounds like in addition to the unload event not firing in common scenarios on mobile devices the beforeunload event wouldn't fire either.

I could've been more helpful here, @nray. Sorry!

Indeed, if you look at the state diagram that Google provide in their documentation, it's possible that a page will never fire the unload event if it transitions from HIDDEN to DISCARDED but will fire the beforeunload event.

Many developers treat the unload event as a guaranteed callback and use it as an end-of-session signal to save state and send analytics data, but doing this is extremely unreliable, especially on mobile! The unload event does not fire in many typical unload situations, including closing a tab from the tab switcher on mobile or closing the browser app from the app switcher.

The same documentation cautions against using the "legacy" beforeunload and unload events (see https://developers.google.com/web/updates/2018/07/page-lifecycle-api#the-unload-event) for the reasons above as well as performance reasons:

Furthermore, the mere presence of a registered unload event handler (via either onunload or addEventListener()) can prevent browsers from being able to put pages in the page navigation cache for faster back and forward loads.

The documentation recommends that we use the pagehide event wherever possible:


function onUnload( /* ... */ ) {
}

if ( window.hasOwnProperty( 'onpagehide' ) ) {
    $( window ).on( 'pagehide', ( e ) => {
      // The page isn't going to be persisted to the browser's BFCache and is about to be unloaded.
      if ( !e.persisted ) {
        onUnload( e );
      }
    } );
} else {
    $( window ).on( 'unload', onUnload );
}

Change 500046 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/WikimediaEvents@master] statsv: Send queued events when document unloads

https://gerrit.wikimedia.org/r/500046

Thank you @phuedx, that is a very helpful diagram explanation. One more question I had while looking at: If we only rely on the pagehide event, isn't it possible to miss the path that goes from HIDDEN => freeze => FROZEN? It looks like hidden can go directly to freeze in that diagram; should we care about that path?

Change 500046 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] statsv: Send queued events when document unloads

https://gerrit.wikimedia.org/r/500046

Does this need QA/QA steps?

I'll duplicate the reproduction steps with links to the BC, since the change has been deployed there. I'll also update the signoff steps.

phuedx lowered the priority of this task from Medium to Low.Apr 8 2019, 1:51 PM
Edtadros subscribed.

Test Result

Status: ✅ PASS
OS: macOS Mojave
Browser: Chrome
Device: MBP
Emulated Device: iPhoneX

Test Artifact(s):

QA steps

Ensure that you're browser is preserving network logs

Visit https://en.wikipedia.beta.wmflabs.org/
Open your browser's JavaScript console
Type mw.track( 'counter.test.statsv.foo2', 1 )
Wait for 2 seconds
✅ AC1: Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=1c

Screen Shot 2019-04-25 at 3.39.29 PM.png (963×1 px, 270 KB)

Type mw.track( 'counter.test.statsv.foo2', 2 ) again
Refresh the page immediately
✅ AC2: Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=2c

Screen Shot 2019-04-25 at 3.40.39 PM.png (1×1 px, 374 KB)

Thanks to @Krinkle both for discussing this issue and for the prompt review and merge.