Page MenuHomePhabricator

EventLogging validation errors for EditAttemptStep
Open, Needs TriagePublic

Description

From logstash,
None is not of type 'integer'

Maybe a dozen errors per minute. It isn't obvious to me which field is failing the validation.

Event Timeline

awight created this task.Oct 31 2019, 10:30 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 31 2019, 10:30 PM
matmarex added subscribers: DLynch, matmarex.

The missing field is loaded_timing.

DLynch added a comment.Nov 4 2019, 5:47 PM

It's not just loaded_timing, there's also ready_timing.

I think getting a null there would have to mean loaded is firing before init, because of this code:

			case 'ready':
				return timeStamp - timing.init;
			case 'loaded':
				return timeStamp - timing.init;

Making this happen just means there needs to be a route to calling activateTarget which doesn't call trackActivateStart. Or maybe an overlap with a shutdown? If the init for a new session happens after the abort for the old one, it'd wipe out the stored timing from the init.

awight added a comment.Nov 5 2019, 9:58 PM

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

awight added a comment.Nov 6 2019, 8:52 AM

Thanks for looking into it! I'll just mention that, the schema can be changed to make the fields optional, which will then give you proper metrics showing how often this condition occurs.

I'm pretty sure they're already optional, or at least they don't have required: true. It's just that EventLogging doesn't treat a null that makes its way into the event as being equivalent to the value not having been provided.

I see what you mean, thanks. It doesn't look like EventLogging supports type: ["integer", "null"] so the fix will have to be a code tweak like you were suggesting...

awight added a comment.EditedNov 13 2019, 10:14 AM

Only mentioning this as inspiration, not to cause stress: we've been successfully correcting a few other schema errors, and as a result this one became the new, first-place source of errors with approximately 300k in the last week. It accounts for over half of all EventLogging processor errors: https://logstash.wikimedia.org/goto/1b8685471942f292d5c629dfda566ac4

Looking at logstash, this seems to have slowed down greatly without us having made any deliberate changes -- I see 2,567 reports of it in the last 7 days. I also noticed that the abort_timing field is triggering it as well.

I further noticed a different EditAttemptStep validation message with notably more reports, and filed a separate task for that one.