Page MenuHomePhabricator

EventLogging validation errors for EditAttemptStep
Open, Needs TriagePublic

Description

From logstash,
None is not of type 'integer'

Maybe a dozen errors per minute. It isn't obvious to me which field is failing the validation.

Event Timeline

matmarex added subscribers: DLynch, matmarex.

The missing field is loaded_timing.

It's not just loaded_timing, there's also ready_timing.

I think getting a null there would have to mean loaded is firing before init, because of this code:

			case 'ready':
				return timeStamp - timing.init;
			case 'loaded':
				return timeStamp - timing.init;

Making this happen just means there needs to be a route to calling activateTarget which doesn't call trackActivateStart. Or maybe an overlap with a shutdown? If the init for a new session happens after the abort for the old one, it'd wipe out the stored timing from the init.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

I see what you mean, thanks. It doesn't look like EventLogging supports type: ["integer", "null"] so the fix will have to be a code tweak like you were suggesting...

Only mentioning this as inspiration, not to cause stress: we've been successfully correcting a few other schema errors, and as a result this one became the new, first-place source of errors with approximately 300k in the last week. It accounts for over half of all EventLogging processor errors: https://logstash.wikimedia.org/goto/1b8685471942f292d5c629dfda566ac4

Looking at logstash, this seems to have slowed down greatly without us having made any deliberate changes -- I see 2,567 reports of it in the last 7 days. I also noticed that the abort_timing field is triggering it as well.

I further noticed a different EditAttemptStep validation message with notably more reports, and filed a separate task for that one.

I observe that this clearly started on the train that finished on September 10th 2020: https://logstash.wikimedia.org/goto/593caf4c624c014de46c35db4a0e180b

i.e. when 1.36.0-wmf.8 rolled out to production.

On that same release was the fix to T261664 ("EditAttemptStep validation error '0' is not of type 'integer'"), which is at least suspicious, so I think that a likely issue is that patch having some faulty assumption that's causing the increase in errors.

Looking at it... that patch had the negative effect of changing edits on brand new pages to be sending revision_id: NaN rather than revision_id: undefined (which I'm assuming EventLogging filtered out better than it did the previously-mentioned nulls).

Change 644290 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/VisualEditor@master] Give revision_id a fallback that'll validate

https://gerrit.wikimedia.org/r/644290

revision_id: undefined (which I'm assuming EventLogging filtered out better than it did the previously-mentioned nulls).

If revision_id is not required, and you don't want to send it, you should omit including it in the event data.

Schema definition says it should be 0 for create-a-new-page cases, which is what this one covers.

Change 644290 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Give revision_id a fallback that'll validate

https://gerrit.wikimedia.org/r/644290

ppelberg added a subscriber: ppelberg.

@awight / @DLynch / @Ottomata, we're just coming across this ticket in triage and wonder: Does the patch David wrote [i] resolve the issue(s) Adam observed or whether more work is needed?


i. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/644290/

Assuming I'm right about it, it'll resolve the "it's back!" issue, but the original issue the ticket was opened for is unaddressed.

Still seeing about 1k errors per day for this schema, now with these complaints:

  • '.event.abort_timing' should be integer
  • '.event.loaded_timing' should be integer
  • '.event.ready_timing' should be integer
  • '.event.first_change_timing' should be integer
  • '.event.save_intent_timing' should be integer

Looking at some example events, the problem is that we're sending an explicit "abort_timing": null, but the schema specifies that the fields are not nullable, only *omittable*.

I think that counts as the original issue, with the "it's back!" one resolved.

Quickly skimming the errors in logstash, I notice that the clear majority is action=abort + abort_type=nochange. This would seem to imply that in ve.init.mw.trackSubscriber.js's computeDuration, it attempting return timeStamp - timing.ready; is failing. I suspect this means that most of these errors are when someone navigates away from an edit page after the editor has finished activating but before it has become 'ready'... which should be a very narrow window. I'll need to do some more investigation to work out why it's apparently not.