
Alert RelEng when mw-client-error editing dashboard shows errors at a rate of over 1000 errors in a 12 hr period
Open, Medium, Public

Description

We missed this event happening last week.

It needs to be automated as it seems like (a) it's automatable (somehow) and (b) this is the kind of thing that humans are very bad at.

Tagging SRE Observability in case they know of some magic that I don't for this.

Event Timeline

Perhaps https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1&from=now-30d&to=now could be used, comparing the hour before and after a deploy? I don't know too much about alerting to know if this is possible though.

> Perhaps https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1&from=now-30d&to=now could be used, comparing the hour before and after a deploy? I don't know too much about alerting to know if this is possible though.

This is correct. It is currently not possible to alert on Kibana dashboards directly, but it is easy to alert on metrics which the linked dashboard tracks. I would venture to guess this task was submitted because the metric measures all client errors without differentiating between known and unknown errors.

There are a few problems:

  1. Error logs do not have an "I'm a known error" property.
  2. Log triage heavily depends on Kibana filters to hide known errors and surface unknown errors.
  3. Extracting data from Kibana is painful, if not impossible.

It seems we would greatly benefit from metrics enhanced with the information generated by the log triage process. That is, tracking unknown errors separately from known errors.

The log transformation pipeline could tag logs with the property of being known, but this approach raises some questions that need answering:

  1. What is the source of truth for known errors?
    1. How would we know that changes to the source of truth were authorized?
    2. Can the source of truth be abused to DoS the log transformation pipeline?
  2. Can the log transformation pipeline react to changes in the source of truth?
  3. Should the log transformation pipeline be reactive?
    1. All pipeline changes are controlled by Puppet and subject to code review and deployment. This style of management is not conducive to reactive changes.
  4. Because known error classification would happen outside of code review and deployment:
    1. How bad would it be if an unknown error was accidentally or intentionally classified as a known error?
    2. How likely is it for an unknown error to be unintentionally classified as a known error? (Like with a .+ regex)
    3. How would we know if a previously unknown error was accidentally or intentionally classified as a known error?
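To make the tagging idea concrete, here is a minimal sketch of what such a pipeline step might look like. The pattern list and field names (`KNOWN_ERROR_PATTERNS`, `known_error`) are hypothetical; the open question of where those patterns live (the "source of truth" above) is exactly what this task is debating.

```python
import re

# Hypothetical list of "known error" rules. In practice the source of
# truth for these patterns (Phabricator task, OpenSearch document, etc.)
# is still an open question, as is how changes to it are authorized.
KNOWN_ERROR_PATTERNS = [
    re.compile(r"Unable to load resource: .*\.css"),
    re.compile(r"TypeError: undefined is not a function"),
]

def tag_known(log_event: dict) -> dict:
    """Add a boolean 'known_error' field to a log event by matching its
    message against the known-error pattern list."""
    message = log_event.get("message", "")
    log_event["known_error"] = any(
        pattern.search(message) for pattern in KNOWN_ERROR_PATTERNS
    )
    return log_event
```

With a field like this in place, the metric could count `known_error: false` events separately, which is what an alert on unknown errors would key off.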

Please suggest other options and opinions. They are very much welcome.

colewhite triaged this task as Medium priority. Oct 20 2021, 5:18 PM

> ...
> Please suggest other options and opinions. They are very much welcome.

I think you've captured this problem perfectly. Thank you for this.

> The log transformation pipeline could tag logs with the property of being known, but this approach raises some questions that need answering:

Right now we use log_client_error_hits for an alert that tells us we've exceeded the maximum number of client errors considered healthy [1]. It doesn't cover the case where a deploy introduces new client errors without exceeding that maximum, which would nonetheless be very useful to catch.

We currently mark known bugs on this dashboard: https://logstash.wikimedia.org/app/kibana#/dashboard/AXDBY8Qhh3Uj6x1zCF56, so presumably a query reflecting "existing bugs" could be set up now.

If bugs that trigger alerts result in the train being rolled back and fixed, then presumably a query that's correct now will stay correct going forward. If not, that query is likely to change weekly as new bugs are introduced, and we'd need some way to expand it automatically.

[1] https://gerrit.wikimedia.org/g/operations/puppet/+/97d86d1ac49f84bfcd7ac6d0b4f975489e3ac6a2/modules/prometheus/files/es_exporter/20-client-error.cfg#1
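The gap described above, and the before/after-deploy comparison suggested earlier, could be sketched as follows. All names here (`new_error_signatures`, `should_alert`, the 1000-hit default) are illustrative assumptions, not an existing alert definition.

```python
def new_error_signatures(before_deploy: set[str], after_deploy: set[str]) -> set[str]:
    """Error signatures observed after a deploy that were never seen before it.

    This is the case the volume threshold misses: a deploy can introduce
    brand-new errors without pushing the total count past the healthy maximum.
    """
    return after_deploy - before_deploy

def should_alert(total_hits: int,
                 before_deploy: set[str],
                 after_deploy: set[str],
                 max_healthy_hits: int = 1000) -> bool:
    """Alert on either excess volume or the appearance of new signatures."""
    return (total_hits > max_healthy_hits
            or bool(new_error_signatures(before_deploy, after_deploy)))
```

The hard part, of course, is computing a stable "signature" per error, which is what the stack-trace discussion below is getting at.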

Capturing a couple of points from this morning's discussion:

  • Drastically increased volume of errors for a known thing might be a signal; i.e., sometimes it's the same error (or close) but some code change has caused a substantial increase in frequency.
  • We think the stack trace is often a more effective unique key for an error than the message itself.
    • Something something with AST?
    • Naive question: Would it be possible to use something like Exception::getTrace() on the MediaWiki side of things to build an error key that takes function calls / arguments into account without being pinned to specific files & line numbers? I'm imagining a hash that only changes if the actual code path changes but is resilient against minor physical details of file layout.

(Noting that stack trace thoughts above are more geared towards PHP errors than the client-side ones that triggered this specific task originally.)
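The "hash that only changes if the actual code path changes" idea might look like the sketch below. The frame shape is assumed to resemble PHP's Exception::getTrace() output (dicts with 'class', 'type', 'function', plus 'file'/'line'); the function name and key length are hypothetical.

```python
import hashlib

def error_key(frames: list) -> str:
    """Build a stable key from the call path only (class/function names),
    deliberately dropping file paths and line numbers so the key survives
    refactors that merely move code around.

    Each frame is assumed to be a dict shaped like one entry of PHP's
    Exception::getTrace(): {'class', 'type', 'function', 'file', 'line'}.
    """
    path = "|".join(
        f"{frame.get('class', '')}{frame.get('type', '')}{frame.get('function', '')}"
        for frame in frames
    )
    return hashlib.sha256(path.encode()).hexdigest()[:16]
```

Two traces that differ only in file names or line numbers yield the same key; a change in the actual call path yields a new one.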

Had a meeting about this today. Key takeaways for me:

  1. The workflow improvements outweigh the potential overall risk (severity and likelihood) of an overly-broad pattern match hiding severe issues. Even if one was encountered, we have a path to handle it.
  2. Phabricator is not ideal as the source of truth for matching rules, but providing linking is ideal.
  3. OpenSearch is a better source of truth to reduce dependencies.
  4. We should explore the proposal of a Dashboards app for managing mapping rules that tie tasks to (glob|regex) strings.
  5. The ideal solution would be in MediaWiki somehow exposing a representation in logs of the path (AST?) taken to reach a particular log message.
  6. Stack traces are a potential answer to "How did we reach this log?", though they raise questions about the reliability of parsing, extracting, and matching.
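The task-to-(glob|regex) mapping from takeaway 4 could be sketched like this. The rule store shape, the example task IDs, and `match_task` are all hypothetical; in the proposal the rules would live in OpenSearch and be managed via a Dashboards app.

```python
import fnmatch
import re
from typing import Optional

# Hypothetical rule store mapping Phabricator task IDs to patterns.
# 'kind' distinguishes glob from regex, mirroring the (glob|regex) idea.
RULES = [
    {"task": "T123456", "kind": "glob",
     "pattern": "Unable to load *.css"},
    {"task": "T234567", "kind": "regex",
     "pattern": r"^TypeError: .+ is not a function$"},
]

def match_task(message: str) -> Optional[str]:
    """Return the task ID of the first rule matching the message, if any.

    A log that matches a rule would be treated as a known error linked to
    that task; anything unmatched stays an unknown error.
    """
    for rule in RULES:
        if rule["kind"] == "glob" and fnmatch.fnmatch(message, rule["pattern"]):
            return rule["task"]
        if rule["kind"] == "regex" and re.search(rule["pattern"], message):
            return rule["task"]
    return None
```

An overly broad pattern (the `.+` regex worry above) would silently swallow new errors, which is why takeaway 1 weighs that risk against the workflow gains.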