Page MenuHomePhabricator

[Bug] The statsv client should send a request when the page unloads so we are not losing events
Closed, ResolvedPublic

Description

The statsv client maintains an in-memory queue of metrics to send to the server. They will be sent within 2 seconds of being queued. It is possible we might be losing some where the browser is closed within that time.

Problem

If the queue length is under 50 (an internally set maximum batch size), then the metrics are sent to the server/dequeued two seconds after the last metric was added to the queue (A). If the queue length is over 50, then a batch of metrics are sent to the server/dequeued every two seconds until either the queue is empty or the preconditions for A are met.

However, if the queue has metrics in it and the page unloads, then in principle those metrics aren't sent to the server and are lost.

Steps to reproduce

  1. Visit https://en.wikipedia.org/wiki/Cult_of_Luna
  2. Open your browser's JavaScript console
  3. Type mw.track( 'counter.test.statsv.foo2', 1 )
  4. Wait for 2 seconds
  5. Observe that an HTTP GET request is sent to a URL like: https://en.wikipedia.org/beacon/statsv?test.statsv.foo2=1c
  6. Type mw.track( 'counter.test.statsv.foo2', 2 )
  7. Refresh the page within 2 seconds (i.e. before the request has a chance to send)
  8. Observe that no HTTP GET request has been sent and the event dropped

AC

  • When the page unloads, any metrics in the queue should be sent to the server
  • When the page loses visibility, any metrics in the queue should be sent to the server

QA steps

  1. Ensure that you're browser is preserving network logs

  1. Visit https://en.wikipedia.beta.wmflabs.org/
  2. Open your browser's JavaScript console
  3. Type mw.track( 'counter.test.statsv.foo2', 1 )
  4. Wait for 2 seconds
  5. Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=1c
  6. Type mw.track( 'counter.test.statsv.foo2', 2 ) again
  7. Refresh the page immediately
  8. Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=2c

Signoff steps

  1. Announce the change to wikitech (?)
  2. Add an annotation to https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1 at the time when the change is deploying

Notes

  1. @phuedx: The AC around page visibility comes from a conversation with @Krinkle. IIRC, in some mobile browsers, the unload event doesn't fire before the JavaScript VM is torn down after a tab is closed. However, we're more confident that the visibilitychange event does fire (see T208980#4792762).
  2. The change should live in WikimediaEvents/modules/all/ext.wikimediaEvents.statsd.js

QA Results

ACStatusDetails
1T208980#5139193
2T208980#5139193

Event Timeline

phuedx created this task.Nov 7 2018, 6:34 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 7 2018, 6:34 PM
phuedx updated the task description. (Show Details)Nov 7 2018, 6:42 PM

@Krinkle: Please feel free to correct me where I'm wrong/add more detail about this 🙂 If there are any sources for the browser/JavaScript VM behaviour differences between desktop and mobile devices, that'd be great!

Krinkle added a comment.EditedDec 3 2018, 3:44 AM

LGTM.

For desktop, adding the unload handler should cover all cases users would run into under normal usage (apart from forced/failed scenarios such as force quit, browser crashing, or device shutting down).

For mobile, I believe the unload handler quite often doesn't actually end up being fired when a tab is closed depending on how the user does it because mobile browsers have a way to slowly take a way resources and call frequency of a background page, until eventually it ends up in a frozen state that would not yet be permanently closed (as such, no unload). If the user resumed it, it would be the same page still. But, because it is frozen, the user can also close it without a way for unload to fire. Effectively similar to a force quit of the whole browser from the outside (except just for one tab).

Adding the visiblitychange handler I suspect would indeed make it very likely for our event processor to get called at least once between changing tabs/apps on mobile and the user potentially closing the (frozen) tab. This is based on the assumption that a mobile browser will fire the visiblitychange event when changing tabs, and that the operating system (Android/iOS) also allows the browser to fire the visiblitychange event when switching apps without changing tabs.

I recommend we confirm these assumptions in the wild on a few real devices and browsers.

If visiblitychange ends up being unreliable or if we find there's some technical and spec-supported reason for it to not always do what we want for this use case, there is also the Page Lifecycle API coming up which is tailored for exactly this "freezing" behaviour that we see in mobile browsers. Further reading/listening/viewing:

Jdlrobson triaged this task as Normal priority.Dec 13 2018, 10:51 PM
Jdlrobson added a subscriber: Jdlrobson.EditedDec 18 2018, 12:29 AM

Can you expand the problem statement here? Does this relate to ReadingDepth or something else? It would be great to tie this back to our products and any bugs we're seeing.

Some QA steps would also be great to help aid an estimation of how we know what done looks like.

phuedx updated the task description. (Show Details)Dec 18 2018, 10:07 AM

Can you expand the problem statement here? Does this relate to ReadingDepth or something else? It would be great to tie this back to our products and any bugs we're seeing.

The StatsV client (the client-side part of the process that converts counter.foo, timer.foo, and guage.foo analytics events on the client to metrics in Graphite, plotted in Grafana) silently drops metrics that it's queued to send to the client when the page unloads. This bug impacts all StatsV-based instrumentation run by all Grade A browsers on all OSes.

Some QA steps would also be great to help aid an estimation of how we know what done looks like.

See T208980#4830604.

Peter added a subscriber: Peter.Dec 18 2018, 11:05 AM

Got it thanks!
What reading web products using statsv that might be impacted? Edit counting is the only one I'm aware of from https://grafana.wikimedia.org/d/000000566/reading-web-dashboard?orgId=1 Are there others?

What reading web products using statsv that might be impacted?

Page Previews.

Jdlrobson reopened this task as Open.

My bad this is a more generic version of the more specific T212970. Do we want tasks for every single schema or are we expecting to make this change somewhere generic?

Krinkle added a comment.EditedFeb 11 2019, 11:46 PM

@Jdlrobson I don't think that's entirely true. They're not asking for the same thing.

T212970 is asking to reduce the number of HTTP requests by not sending each increment separately. That can be closed as resolved because we already don't do that. Since the first version, statsv has done batching and coalescing. All use of Statsv automatically benefits from this.

This task, on the other hand, is effectively asking to do less batching, by sending metrics off immediately in cases where we expect the tab to get frozen soon (on mobile) and thus can't be that it'll come back to finish the debounce and send the metric.

nray added a subscriber: nray.Feb 12 2019, 2:21 AM

IIRC, in some mobile browsers, the unload event doesn't fire before the JavaScript VM is torn down after a tab is closed. However, we're more confident that the visibilitychange event does fire

Any reason to choose the visibilitychange event over the beforeunload event?

@nray: See @Krinkle's detailed explanation here: T208980#4792762

nray added a comment.Feb 14 2019, 5:16 PM

@nray: See @Krinkle's detailed explanation here: T208980#4792762

Thank you, I didn't read that closely enough. Sounds like in addition to the unload event not firing in common scenarios on mobile devices the beforeunload event wouldn't fire either.

@Krinkle @phuedx Where does this code live? core?

Jdlrobson renamed this task from [Bug] The statsv client should send a request when the page unloads to [Bug] The statsv client should send a request when the page unloads so we are not losing events.Feb 25 2019, 10:24 PM
Jdlrobson updated the task description. (Show Details)

@Krinkle @phuedx Where does this code live? core?

It's in WikimediaEvents (statsd.js).

Not sure if this is a reading task/performance task or task that needs us both but looks clear to me now.

phuedx updated the task description. (Show Details)Feb 26 2019, 9:43 AM
Jdlrobson removed Jdlrobson as the assignee of this task.Mar 4 2019, 11:55 PM
phuedx added a comment.EditedMar 28 2019, 5:34 PM

@nray: See @Krinkle's detailed explanation here: T208980#4792762

Thank you, I didn't read that closely enough. Sounds like in addition to the unload event not firing in common scenarios on mobile devices the beforeunload event wouldn't fire either.

I could've been more helpful here, @nray. Sorry!

Indeed, if you look at the state diagram that Google provide in their documentation, it's possible that a page will never fire the unload event if it transitions from HIDDEN to DISCARDED but will fire the beforeunload event.

Many developers treat the unload event as a guaranteed callback and use it as an end-of-session signal to save state and send analytics data, but doing this is extremely unreliable, especially on mobile! The unload event does not fire in many typical unload situations, including closing a tab from the tab switcher on mobile or closing the browser app from the app switcher.

The same documentation cautions against using the "legacy" beforeunload and unload events (see https://developers.google.com/web/updates/2018/07/page-lifecycle-api#the-unload-event) for the reasons above as well as performance reasons:

Furthermore, the mere presence of a registered unload event handler (via either onunload or addEventListener()) can prevent browsers from being able to put pages in the page navigation cache for faster back and forward loads.

The documentation recommends that we use the pagehide event wherever possible:


function onUnload( /* ... */ ) {
}

if ( window.hasOwnProperty( 'onpagehide' ) ) {
    $( window ).on( 'pagehide', ( e ) => {
      // The page isn't going to be persisted to the browser's BFCache and is about to be unloaded.
      if ( !e.persisted ) {
        onUnload( e );
      }
    } );
} else {
    $( window ).on( 'unload', onUnload );
}

Change 500046 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/WikimediaEvents@master] statsv: Send queued events when document unloads

https://gerrit.wikimedia.org/r/500046

nray added a comment.Mar 29 2019, 5:07 PM

Thank you @phuedx, that is a very helpful diagram explanation. One more question I had while looking at: If we only rely on the pagehide event, isn't it possible to miss the path that goes from HIDDEN => freeze => FROZEN? It looks like hidden can go directly to freeze in that diagram; should we care about that path?

Change 500046 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] statsv: Send queued events when document unloads

https://gerrit.wikimedia.org/r/500046

phuedx updated the task description. (Show Details)Apr 4 2019, 8:12 AM

Does this need QA/QA steps?

I'll duplicate the reproduction steps with links to the BC, since the change has been deployed there. I'll also update the signoff steps.

phuedx updated the task description. (Show Details)Apr 4 2019, 10:33 AM
phuedx lowered the priority of this task from Normal to Low.Apr 8 2019, 1:51 PM
phuedx assigned this task to Edtadros.Apr 8 2019, 5:12 PM
Edtadros reassigned this task from Edtadros to ovasileva.Apr 25 2019, 10:42 PM
Edtadros added a subscriber: Edtadros.

Test Result

Status: ✅ PASS
OS: macOS Mojave
Browser: Chrome
Device: MBP
Emulated Device: iPhoneX

Test Artifact(s):

QA steps

Ensure that you're browser is preserving network logs

Visit https://en.wikipedia.beta.wmflabs.org/
Open your browser's JavaScript console
Type mw.track( 'counter.test.statsv.foo2', 1 )
Wait for 2 seconds
✅ AC1: Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=1c

Type mw.track( 'counter.test.statsv.foo2', 2 ) again
Refresh the page immediately
✅ AC2: Observe that an HTTP GET request has been sent to a URL like: https://en.wikipedia.beta.wmflabs.org/beacon/statsv?test.statsv.foo2=2c

phuedx claimed this task.Apr 26 2019, 8:36 AM
phuedx added a subscriber: ovasileva.
phuedx updated the task description. (Show Details)Apr 26 2019, 1:10 PM
  1. Add an annotation to https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1 at the time when the change is deploying

phuedx closed this task as Resolved.Apr 26 2019, 1:24 PM

Thanks to @Krinkle both for discussing this issue and for the prompt review and merge.