Page MenuHomePhabricator

EventLogging validation errors for EditAttemptStep
Open, Needs TriagePublic

Description

Errors from logstash:

  • None is not of type 'integer'
  • '.event.abort_timing' should be integer

Logstash search: https://logstash.wikimedia.org/goto/317e3a351a88fb9e361996847880321c

Maybe a dozen errors per minute. It isn't obvious to me which field is failing the validation.

Event Timeline

matmarex added subscribers: DLynch, matmarex.

The missing field is loaded_timing.

It's not just loaded_timing, there's also ready_timing.

I think getting a null there would have to mean loaded is firing before init, because of this code:

			case 'ready':
				return timeStamp - timing.init;
			case 'loaded':
				return timeStamp - timing.init;

Making this happen just means there needs to be a route to calling activateTarget which doesn't call trackActivateStart. Or maybe an overlap with a shutdown? If the init for a new session happens after the abort for the old one, it'd wipe out the stored timing from the init.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

I see what you mean, thanks. It doesn't look like EventLogging supports type: ["integer", "null"] so the fix will have to be a code tweak like you were suggesting...

Only mentioning this as inspiration, not to cause stress: we've been successfully correcting a few other schema errors, and as a result this one became the new, first-place source of errors with approximately 300k in the last week. It accounts for over half of all EventLogging processor errors: https://logstash.wikimedia.org/goto/1b8685471942f292d5c629dfda566ac4

Looking at logstash, this seems to have slowed down greatly without us having made any deliberate changes -- I see 2,567 reports of it in the last 7 days. I also noticed that the abort_timing field is triggering it as well.

I further noticed a different EditAttemptStep validation message with notably more reports, and filed a separate task for that one.

I observe that this clearly started on the train that finished on September 10th 2020: https://logstash.wikimedia.org/goto/593caf4c624c014de46c35db4a0e180b

i.e. when 1.36.0-wmf.8 rolled out to production.

On that same release was the fix to T261664 ("EditAttemptStep validation error '0' is not of type 'integer'"), which is at least suspicious, so I think that a likely issue is that patch having some faulty assumption that's causing the increase in errors.

Looking at it... that patch had the negative effect of changing edits on brand new pages to be sending revision_id: NaN rather than revision_id: undefined (which I'm assuming EventLogging filtered out better than it did the previously-mentioned nulls).

Change 644290 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/VisualEditor@master] Give revision_id a fallback that'll validate

https://gerrit.wikimedia.org/r/644290

revision_id: undefined (which I'm assuming EventLogging filtered out better than it did the previously-mentioned nulls).

If revision_id is not required, and you don't want to send it, you should omit including it in the event data.

Schema definition says it should be 0 for create-a-new-page cases, which is what this one covers.

Change 644290 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Give revision_id a fallback that'll validate

https://gerrit.wikimedia.org/r/644290

ppelberg added a subscriber: ppelberg.

@awight / @DLynch / @Ottomata, we're just coming across this ticket in triage and wonder: Does the patch David wrote [i] resolve the issue(s) Adam observed or whether more work is needed?


i. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/644290/

Assuming I'm right about it, it'll resolve the "it's back!" issue, but the original issue the ticket was opened for is unaddressed.

Still seeing about 1k errors per day for this schema, now with these complaints:

  • '.event.abort_timing' should be integer
  • '.event.loaded_timing' should be integer
  • '.event.ready_timing' should be integer
  • '.event.first_change_timing' should be integer
  • '.event.save_intent_timing' should be integer

Looking at some example events, the problem is that we're sending an explicit "abort_timing": null, but the schema specifies that the fields are not nullable, only *omittable*.

I think that counts as the original issue, with the "it's back!" one resolved.

Quickly skimming the errors in logstash, I notice that the clear majority is action=abort + abort_type=nochange. This would seem to imply that in ve.init.mw.trackSubscriber.js's computeDuration, it attempting return timeStamp - timing.ready; is failing. I suspect this means that most of these errors are when someone navigates away from an edit page after the editor has finished activating but before it has become 'ready'... which should be a very narrow window. I'll need to do some more investigation to work out why it's apparently not.

Updated logstash link in the description.

I can reproduce this locally by clicking "Edit" then pressing Escape quickly. It takes a few tries to get the timing right, but it's reproducible.

I think this can occur because:

  • Tearing down and setting up the editor are both actually multi-step processes, with a few promises in the middle
  • Other code may run after a promise is resolved but before the promise success callback runs
  • We log the 'abort' event at the beginning of the teardown process; in contrast, we log 'ready' and 'loaded' at the end of the setup process.

As a consequence, if the user action to close the editor and the network response with article data arrive at just the right time, it's possible that we will:

  1. Send the 'init' event
  2. Set up the editor
  3. Send the 'abort' event (which expects the 'ready' event to have been sent, since the editor is set up)
  4. Tear down the editor
  5. And only then send the 'ready' and 'loaded' events (which expect the 'abort' event to not have been sent)

Most of our code can handle that, but apparently not the event logging. The events 'abort', 'ready' and 'loaded' will all have messed up timing fields.

Change 708347 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

I haven't really proved that the above is exactly what happens, but I think at this point it's a better use of time to just try it and see if that fixes the problem. I couldn't reproduce it locally, but there might be other scenarios that are just harder to reproduce.

Also, moving where the tracking happens will cause the 'abort' event to be logged a bit later than previously. I doubt that we're using the abort timing for any analysis though, so it shouldn't matter.

I don't think we've ever done any analysis that'd depend on the timing of the abort event at that kind of fine-grained level. Things like "X% of sessions ended with an abort, vs Y% that ended with a saveSuccess" is where it normally comes up.

Change 708347 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

We should check Logstash after this is deployed to confirm that the issue is fixed.

hi - just fyi, this error is still occurring ~1100 in the last 12 hours

Screen Shot 2021-08-06 at 10.26.56 AM.png (2×3 px, 791 KB)

That seems to be consistent with the pre-deploy rate of errors, so I don't think this patch actually touched whatever's causing this.

I looked into this again, and I can reproduce the error locally when using the mobile version, and my default mode is visual, and I close the loading screen before the editor loads.

Change 730071 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

Change 730071 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

This has massively dropped since 10/21 (when the deploy would have reached all the wikis):

image.png (338×426 px, 17 KB)

...but it's still not completely gone. We've clearly hit the major cause, though.