Page MenuHomePhabricator

Add tracking of errors
Closed, ResolvedPublic

Description

  • ErrorUnknown triggers a store action which delegates to the tracker service
  • use the "codes" of the errors to bucket them for tracking. Make sure there is a fallback bucket.
  • ensure the errors we know can happen have descriptive "codes"
  • add a dashboard
  • also informs the error code/type when showing them in the component

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 1 2020, 3:08 PM

I looked a bit more into the possible error types we have which don’t have more specific handlers, and I think they are:

  • INVALID_ENTITY_STATE_ERROR (validateEntityState action: entity has no statements for property)
  • UNSUPPORTED_DATAVALUE_TYPE (validateBridgeApplicability action: data type is string but data value type isn’t)
  • SAVING_FAILED (saveBridge action: entitySave failed)
  • APPLICATION_LOGIC_ERROR (initBridge action: any part of init failed; setTargetValue or saveBridge actions: called in wrong state; saveBridge action: statement mutation failed)

I think the .type of each error is the “error code” we can use to bucket the error. (It looks like we’ve been a bit inconsistent with the spelling of these errors so far – I see both lower_snake_case and SCREAMING_SNAKE_CASE – so I would map them all to a format more suitable for Grafana, lower-kebab-case, with some simple String.replace()s.) If any error doesn’t have a .type, we can fall back to "undefined" or something like that, though if I’m not mistaken it Shouldn’t Happen™.

That said, I don’t think it’s very sensible to call the errors during init – which can include loss of network connectivity, for example – “application logic errors”, so I think we should change that first, and introduce a new error type for the big Promise.all() in initBridge. (I feel like this topic rings a bell somewhere in my head, but I can’t remember if it was discussed anywhere. I don’t see anything relevant in the Git history, at least.)

Michael added a subscriber: Michael.Apr 2 2020, 1:52 PM

I looked a bit more into the possible error types we have which don’t have more specific handlers, and I think they are:

  • INVALID_ENTITY_STATE_ERROR (validateEntityState action: entity has no statements for property)
  • UNSUPPORTED_DATAVALUE_TYPE (validateBridgeApplicability action: data type is string but data value type isn’t)
  • SAVING_FAILED (saveBridge action: entitySave failed)
  • APPLICATION_LOGIC_ERROR (initBridge action: any part of init failed; setTargetValue or saveBridge actions: called in wrong state; saveBridge action: statement mutation failed)

I think the .type of each error is the “error code” we can use to bucket the error.

No, that is not the intention (at least not the one I had during story writing). This would be way too coarse. I think we should use the message of the actual Error object we are throwing, e.g. 'Entity flagged missing in response.'. _That_ is what I want to see in Grafana. The big error categories can maybe be part of the string that is then logged, like INVALID_ENTITY_STATE_ERROR.Entity flagged missing in response.
But the message that we log should be unique to the place where we throw the exception.

But we aren’t throwing any actual Error objects. At the point where this task is supposed to happen – in ErrorUnknown – what we have is a list of ApplicationError objects, which may contain nothing else other than the type.

I also understand this task to be about tracking, not logging. StatsD/Graphite/Grafana is not a logging pipeline into which we can throw arbitrary strings. The set of metrics we track should be limited (though I’m not sure what the technical restrictions are).

But we aren’t throwing any actual Error objects. At the point where this task is supposed to happen – in ErrorUnknown – what we have is a list of ApplicationError objects, which may contain nothing else other than the type.

Yes, this is what I meant in the task breakdown when I was talking about the challenge of making every Error unique. We also throw several Result not well-formed errors that are really not that helpful all on their own.

I also understand this task to be about tracking, not logging. StatsD/Graphite/Grafana is not a logging pipeline into which we can throw arbitrary strings. The set of metrics we track should be limited (though I’m not sure what the technical restrictions are).

This is a bit of both or rather a trade-off between logging (logstash/Kibana) and tracking (statsd/Grafana) when we have no tool that sits in the middle. We need actionable information of which specific error occurs how often in order to be able to decide which more specific error handling we should implement next.

Well, it sounds to me like you’re asking for a complete overhaul of our error handling, and I didn’t really sign up for that…

I’ll upload what I have for now.

Change 585513 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] bridge: introduce “initialization error” type

https://gerrit.wikimedia.org/r/585513

Change 585514 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] bridge: track unknown errors

https://gerrit.wikimedia.org/r/585514

FWIW I'll try to reason about what was written here until tomorrow. Be friends!

Well, it sounds to me like you’re asking for a complete overhaul of our error handling, and I didn’t really sign up for that…

I’ll upload what I have for now.

Permissions and bailouts aside, error handling is non-existent. Thus, not overhauling, but implementing a bit more for the first time. What we currently have are some exceptions being sprinkled over the code to make the compiler happy. At no point have these exceptions or errors been given much thought, and our current way of displaying them is utterly broken even for what little it was meant to do.

This story is the beginning of actually tying those pieces together in a way that they become useful and actionable. The very minimum we should be trying to achieve is to actually get an idea of what exactly is going wrong how often.

I agree that improving the exceptions themselves is indeed out of scope of this task (and story). But this task should ensure that the information we track is as meaningful and actionable as possible given what we have available. And for that, we need more than just "Initialization didn't work".

Well, I think the changes I uploaded do provide a way forward in that direction: where we need to track more specific information, split out an additional error .type. (In fact, the first change already does this, splitting INITIALIZATION_ERROR out of ErrorTypes.APPLICATION_LOGIC_ERROR.) Some of the existing error types are already highly specific: both INVALID_ENTITY_STATE_ERROR and UNSUPPORTED_DATAVALUE_TYPE are thrown in exactly one location each, so if we see them pop up, we know what happened (though not necessarily why it happened). SAVING_FAILED, on the other hand, is currently rather generic, and I expect we’ll refine it as part of T248087: Step 1: Improved generic error screen: on save (impact: high).

I also wonder if we shouldn’t maybe track all the error types in Graphite, not just the ones that end up in ErrorUnknown but also the ones that have other handlers, including the bailouts. Then this could happen as part of the addError action or the addApplicationErrors mutation.

I also wonder if we shouldn’t maybe track all the error types in Graphite, not just the ones that end up in ErrorUnknown but also the ones that have other handlers, including the bailouts. Then this could happen as part of the addError action or the addApplicationErrors mutation.

I agree with tracking the other bailout conditions besides datatype, but that should happen in a different chart on Grafana. In my mind, the tracking to be added in this tasks exists so we can decide which error condition currently collected in the generic error screen should get its own specific error screen. Adding the bailout "errors" to this mix doesn't help with that decision.

Well, I think the changes I uploaded do provide a way forward in that direction: where we need to track more specific information, split out an additional error .type. (In fact, the first change already does this, splitting INITIALIZATION_ERROR out of ErrorTypes.APPLICATION_LOGIC_ERROR.)

I'm fine with adding the INITIALIZATION_ERROR. It is tracking only "error.unknown.INITIALIZATION_ERROR" that I disagree with because that covers at least 10 different exceptions being thrown and we don't even handle each of the requests timing out explicitly.

My yes-model would be to track "error.unknown.${error.type}.${exception-message}". Where error.type could be INITIALIZATION_ERROR and exception-message could be 'mw.Api rejected but result does not contain error(s)'.

Well, I don’t think it’s acceptable to add exception messages to Grafana – that’s abusing the system, in my eyes. Grafana is for seeing when and how often certain events occur, not for getting detailed information about those events. I think EventLogging would be better suited for that kind of analysis (though I’m not very familiar with it).

Pablo-WMDE updated the task description. (Show Details)Apr 7 2020, 7:50 AM

I feel the conflict originates from discussing ways to achieve two (+) different things.

Looking at what is written and what we can anticipate (subjectively then) to be its intention, this story primarily cares about communicating to the user and enabling them to report problems.

Part of this communication is the (sticking to the wording from the story) "Error message" - as of now in master this is the type of the first of the applicationErrors.

These types are arbitrarily created by us and they aggregate root causes - this inevitably leads to some loss of insights but offers the possibility to enumerate them (there are only so many).

During the reporting of the bug we'll also put into the report the "Debug information" (full list of the applicationErrors and their "previous exceptions") - allowing us to gain EventLogging-like insights incl. the information which was "lost" when we aggregated the problems.

From a dev perspective it is absolutely desirable to get hold of EventLogging-like insights about errors without relying on the user filing a ticket - but that can't be the job of this task in my eyes.

The patch Lucas created is roughly in line with what I expected to come out of this when we talked about it in task break-down.

The way we "decide which error condition currently collected in the generic error screen should get its own specific error screen", can only be informed by the concrete problems found in the "Debug information" of reported problems the way this stands, true.

When/if we address that (or maybe continuously until then) I think it makes sense to further the distinction of logic and runtime errors as top level error types and consciously letting our specialized errors inherit from one or the other - this will help (on top of more logic) decide which errors are worth sending over the wire to some centralized logged infrastructure, which do not (a bit different compared to a "backend" project IMO).

Thanks @Pablo-WMDE for chiming in. It seems I have misunderstood the purpose of why we are adding this tracking to the generic error handler. Could you elaborate on the value that we are expecting from adding this tracking as described in this task?

Pablo-WMDE updated the task description. (Show Details)Apr 7 2020, 3:03 PM

@Michael I'm afraid there was not much left to add, all the things the pre-existing debate brought up were right one way or another. Chiming in on this meant not adding more but leaving some of it out to find a pragmatic way forward.

value that we are expecting from adding this tracking as described in this task

Not a big one, but a step in the right direction. Is it hard to imagine more fine grained insights? No. Is it better than nothing? I think so. Follow-ups are always an option (or, for all team members equally, damage prevention by insisting to discuss things further in task break-down).

Change 585513 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] bridge: introduce “initialization error” type

https://gerrit.wikimedia.org/r/585513

Change 585514 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] bridge: track unknown errors

https://gerrit.wikimedia.org/r/585514

Lucas_Werkmeister_WMDE closed this task as Resolved.Apr 9 2020, 3:35 PM

Tentatively closing this because I think we said further improvements to tracking would be part of another story.