Page MenuHomePhabricator

More robust timing reporting in client-side EditAttemptStep instrumentation
Open, Needs TriagePublic

Description

In T251772 we identified that in some cases ready_timing has a negative value (and so negative that it can cause events to be invalid because it's outside the range of acceptable values). Per

logEditEvent( 'ready', {
	timing: readyTime - window.performance.timing.navigationStart
} );

As such, I'd speculate that window.performance.timing.navigationStart was weirdly massive.

Since ready_timing is "Timing information about action=ready – time in milliseconds since the editor was initialised." does it ever make sense to send a negative number? What about imposing a lower bound of 0 client-side and sending null when the timing is negative? Does { "timing": -47610868 } (which is the MIN(event.ready_timing) observed) have any meaning to us? (This is not a rhetorical question, I'm genuinely asking here.)

This task is about updating the client-side instrumentation so that instead of sending negative timing values (which would eventually be invalid once legacy EventLogging is ported over to modern Event Platform, per T252592) the instrumentation sends null. A negative value does not tell us anything useful, so it's more meaningful at analysis time to have missing data than useless data.

Event Timeline

mpopov moved this task from Triage to Tracking on the Product-Analytics board.

Copying over a comment from @mpopov which gives us a sense for the impact of this issue if left unresolved:

Looking at negative vs positive/non-zero timings for action=ready:

monthready_timing <0ready_timing ≥0
227 events4576588 events
339 events5711270 events
435 events6570861 events
52 events956814 events

Open question

  • What level of effort would be involved with implementing the behavior this task is asking for?

updating the client-side instrumentation so that instead of sending negative timing values (which would eventually be invalid once legacy EventLogging is ported over to modern Event Platform, per T252592) the instrumentation sends null.

  • Where (read: what editing interfaces) would this change need to be implemented?
    • This should be implemented in the visual editor as well as the wikitext editor, across platforms.

Change 596706 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/WikiEditor@master] Omit negative timing data in events

https://gerrit.wikimedia.org/r/596706

Change 596709 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/VisualEditor@master] Omit negative timing data in events

https://gerrit.wikimedia.org/r/596709

Open question

  • What level of effort would be involved with implementing the behavior this task is asking for?

I just uploaded a couple of very minor patches (implementing this behavior) for review.

If it's only happened once, I'm inclined to call it some sort of fluke.

That said... it came from WikiEditor, so we can tell that the offending code was:

logEditEvent( 'ready', {
	timing: readyTime - window.performance.timing.navigationStart
} );

As such, I'd speculate that window.performance.timing.navigationStart was weirdly massive.

From ext.wikiEditor.js

				readyTime = Date.now();
				logEditEvent( 'ready', {
					timing: readyTime - window.performance.timing.navigationStart

The problem here most likely is due to mixing of two very different clocks. One measures wall time and is subject to system clock adjustements, the other is monotonic and has no expected absolute baseline. Even when the result is non-negative, the result is not (strictly) meaningful.

The readyTime here needs to use window.performance.now() instead.

I would generally discourage filtering of such values client-side. It needs to happen server-side (at ingestion, or query-time) no matter what, since anyone can submit phoney packets anyway. Doing it both ways would duplicate the logic (adding more sync/maintenance cost), is easy to forget about and is yet another thing for consumers of the data to factor in when trying to understand the meaning of the data. It also adds to the to the weight of the client, which is always nice to keep down where possible :)

I would generally discourage filtering of such values client-side. It needs to happen server-side (at ingestion, or query-time) no matter what, since anyone can submit phoney packets anyway. Doing it both ways would duplicate the logic (adding more sync/maintenance cost), is easy to forget about and is yet another thing for consumers of the data to factor in when trying to understand the meaning of the data. It also adds to the to the weight of the client, which is always nice to keep down where possible :)

In this case it is better to have the client perform this tiniest of checks and omit a negative timing from the event data so that the event as a whole does not get rejected at ingestion time, although if switching to window.performance.now() would get rid of the possibility entirely then this is a moot point.

The problem here most likely is due to mixing of two very different clocks. One measures wall time and is subject to system clock adjustements, the other is monotonic and has no expected absolute baseline. Even when the result is non-negative, the result is not (strictly) meaningful.

The readyTime here needs to use window.performance.now() instead.

@DLynch: what do you think about switching to performance.now(), rather than including the check-and-omit logic?

@DLynch: what do you think about switching to performance.now(), rather than including the check-and-omit logic?

I can't think of any particular objection to it. It's sub-millisecond, so we just need to make sure we remember to convert it.

Change 596709 abandoned by Bearloga:
[mediawiki/extensions/VisualEditor@master] Omit negative timing data in events

Reason:

https://gerrit.wikimedia.org/r/596709

Change 596706 abandoned by Bearloga:
[mediawiki/extensions/WikiEditor@master] Omit negative timing data in events

Reason:

https://gerrit.wikimedia.org/r/596706