Page MenuHomePhabricator

EditAttemptStep sent event with "ready_timing": -18446744073709543000
Closed, ResolvedPublic


This caused Refine into Hive to fail for year=2020,month=5,day=2,hour=23. This is indeed an integer as the schema specifies, but it does not fit in the range of a Long.

Numeric value (-18446744073709543000) out of range of long (-9223372036854775808 - 9223372036854775807)

The offending event is:

{"dt": "2020-05-02T23:32:32Z", "event": {"action": "ready", "editing_session_id": "xxxxx", "editor_interface": "wikitext", "integration": "page", "is_oversample": false, "mw_version": "1.35.0-wmf.28", "page_id": 0, "page_ns": 0, "page_title": "Ma_ma3na_ism_mohamed", "page_token": "xxxxx", "platform": "desktop", "ready_timing": -18446744073709543000, "revision_id": 0, "session_token": "xxxxx", "user_class": "IP", "user_editcount": 0, "user_id": 0, "version": 1}, "ip": "xxxxx", "recvFrom": "cp3060.esams.wmnet", "revision": 19802113, "schema": "EditAttemptStep", "seqId": xxxxx, "userAgent": {"browser_family": "Opera", "browser_major": "24", "browser_minor": "0", "device_family": "Other", "is_bot": false, "is_mediawiki": false, "os_family": "Windows", "os_major": "XP", "os_minor": null, "wmf_app_version": "-"}, "uuid": "0acd9b1126a45420bdb04346af4e574d", "webHost": "", "wiki": "arwiki"}

I believe this has only happened once, so maybe it is a fluke.

I reran the Refine job for this hour and dropped the malformed record.

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_analytics --table_whitelist_regex='EditAttemptStep' --since='2020-05-02T22:00:00' --until='2020-05-03T00:00:00' --ignore_failure_flag=true --dataframereader_options='mode:DROPMALFORMED'

Event Timeline

Ottomata created this task.May 4 2020, 2:10 PM
Restricted Application added a project: Analytics. · View Herald TranscriptMay 4 2020, 2:10 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Nuria edited subscribers, added: nshahquinn-wmf; removed: Neil9830409.

Hi @DLynch , can we look at what may have caused this issue?
Wanted to be sure if this was just a lone incident or something bigger. cc @ppelberg

DLynch added a comment.May 4 2020, 7:46 PM

If it's only happened once, I'm inclined to call it some sort of fluke.

That said... it came from WikiEditor, so we can tell that the offending code was:

logEditEvent( 'ready', {
	timing: readyTime - window.performance.timing.navigationStart
} );

As such, I'd speculate that window.performance.timing.navigationStart was weirdly massive.

mpopov added a subscriber: mpopov.May 5 2020, 2:40 PM
logEditEvent( 'ready', {
	timing: readyTime - window.performance.timing.navigationStart
} );

As such, I'd speculate that window.performance.timing.navigationStart was weirdly massive.

Since ready_timing is "Timing information about action=ready – time in milliseconds since the editor was initialised." does it ever make sense to send a negative number? What about imposing a lower bound of 0 client-side and sending null when the timing is negative? Does { "timing": -47610868 } (which is the MIN(event.ready_timing) observed) have any meaning to us? (This is not a rhetorical question, I'm genuinely asking here.)

Looking at negative vs positive/non-zero timings for action=ready:

monthready_timing <0ready_timing ≥0
227 events4576588 events
339 events5711270 events
435 events6570861 events
52 events956814 events

The major offenders are:

editor interfacebrowserevents
wikitextChrome 4957
wikitextChrome 8014
wikitextChrome 796
wikitextOpera 365
wikitextChrome 783
wikitextDragon 583
wikitextChrome 813

Is there anything particular about WikiEditor (since we have no records of negative ready timings from any other editor interface) why readyTime = would be earlier than the moment "after the user agent finishes prompting to unload the previous document" or "immediately before the user agent starts checking any relevant application caches"?

Since it doesn't appear to be a 100% reliable API even in modern browsers for the case of WikiEditor, perhaps we can at least implement that rule client-side to (1) avoid storing values which don't mean anything to us anyway (please feel free to correct this) and (2) prevent cases like this. Granted it won't prevent errors when readyTime >>> window.performance.timing.navigationStart but it's something. Thoughts?

+1 @mpopov, the client should never send negative numbers, no matter what the browsers are telling you :) I'd come up with an extra flag on the schema or some way to signal different magnitudes of negative numbers or something.

Milimetric moved this task from Incoming to Radar on the Analytics board.May 7 2020, 3:49 PM
mpopov closed this task as Resolved.May 12 2020, 8:00 PM
mpopov claimed this task.

Thanks for your input, folks! I've split the outcomes into:

  • T252593: making the client-side instrumentation more robust in this regard
  • T252592: updating the schema (once it's ported to MEP) for more robust server-side validation

and will be closing this task since nothing is left to be done here.

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:43 AM