Page MenuHomePhabricator

Review EditAttemptStep validation failures in production
Closed, ResolvedPublic


This logstash search (WMF NDA login required) finds a few validation failures (~180 a day) in production for EditAttempStep, all of type None is not of type 'integer' talking about how *_timing is set to null; this is coming from a range of UAs and a number of different event types (abort, ready and loaded seen in the latest five as I look). They're all in VE (not 2010 WTE or 2017 WTE

Setting this value to -1 would get this data into the results but trivially able to be filtered out, which might be better than losing it, though I can't quite see how it's not being set anyway.

Event Timeline

Restricted Application added a project: Product-Analytics. · View Herald TranscriptDec 4 2018, 10:38 PM
kzimmerman moved this task from Triage to Tracking on the Product-Analytics board.Dec 6 2018, 9:35 PM
kzimmerman added a subscriber: kzimmerman.

@Jdforrester-WMF It appears this is a task for @Neil_P._Quinn_WMF to keep an eye on, but not an analysis request. Is that correct? Moving to our Tracking column for now.

@Jdforrester-WMF It appears this is a task for @Neil_P._Quinn_WMF to keep an eye on, but not an analysis request. Is that correct? Moving to our Tracking column for now.

Yes, unless the level of validation failures is significant in how it skews the reported numbers, I imagine it's a quality issue, not a request. :-)

Tagging @DLynch for your awareness in case this is an issue with our code.

JTannerWMF moved this task from To Triage to Freezer on the VisualEditor board.Dec 12 2018, 2:32 PM

Kibana suggests upon following the provided link: "Unable to completely restore the URL, be sure to use the share functionality". I'm sufficiently unfamiliar with using it that the rest of this reply is going to be based just on what was said in the ticket description.

Making the assumption that NaN would get nulled out in submission, a plausible reason for this is there being an obscure path where the init event can not occur, which would result in all future timing values based on it being NaN.

Quick testing with trackdebug confirms that on mobile switching back and forth between visual and source modes causes a flurry of events which we probably don't want:

On desktop VE switching between VE and NWE just doesn't log anything at all for EditAttemptStep. (There's not an exact parallel, since there's a not a way to switch to a non-VE source mode without triggering a pageload, which is what's happening on mobile. But it's probably close enough.)

As such, I'll quickly write up a patch that stops the abort/ready events from making it through in this situation.

Sidenote: from a feature analysis standpoint, this does imply that we're not tracking editor-mode switches at all. Should we be doing that?

Change 479247 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: avoid logging invalid null timings when switching editors

Change 479247 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: avoid logging invalid null timings when switching editors

Jdforrester-WMF renamed this task from Review EditAttempStep validation failures in production to Review EditAttemptStep validation failures in production.Dec 12 2018, 5:43 PM
Jdforrester-WMF updated the task description. (Show Details)
matmarex removed a project: Patch-For-Review.
matmarex added a subscriber: matmarex.

QA here would presumably involve waiting for the patch to be deployed, then verifying that the validation failures stopped.

There's also some intellectual verification -- turn on trackdebug, and verify that the timing keys in the logged events after a switch don't have any null / NaN values.

I tried to verify this, but this link does not work either. It just opens the main page (default dashboard) for me, as far as I can tell.

Sorry, the search you need to do is eventlogging_EventError AND event.schema:EditAttemptStep, and by pushing the search window out to 30 days I've validated that the last time we recorded this error in production was 2018-12-06T20:54:16, so AFAICT this is now fixed.

marcella closed this task as Resolved.Jan 15 2019, 2:24 PM
marcella claimed this task.