
Set up fatal error queue for change propagation
Closed, Resolved (Public)

Description

When a job fails fatally during processing in change propagation, we need to post an event to a special fatal_error topic (name TBD).

Fatal errors include, but are not limited to:

  1. Error parsing a JSON message. In this case we can't even attempt to execute a rule, and it's 100% clear that the error is on the producer side.
  2. Error in rule template evaluation. We can't even attempt to execute the rule, and it's clear that there's a bug somewhere in the system.
  3. Retry limit exceeded for the rule. These messages must be reviewed manually, as some manual action might be needed: a code change, a manual re-render of HTML, a manual purge of some URIs, etc.
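
To make the three cases above concrete, here is a minimal sketch of how a job failure could be classified. It is an illustration only, not the change-propagation code: `FatalReason`, `classify`, `render_template`, and `MAX_RETRIES` are hypothetical names, and Python string formatting stands in for the real rule template engine.

```python
import json
from enum import Enum
from typing import Optional


class FatalReason(Enum):
    """Hypothetical labels for the three fatal cases listed above."""
    INVALID_MESSAGE = "invalid_message"
    TEMPLATE_ERROR = "template_error"
    RETRY_LIMIT_EXCEEDED = "retry_limit_exceeded"


MAX_RETRIES = 3  # assumed limit, not taken from the service config


def render_template(template: str, message) -> str:
    # Stand-in for rule template evaluation; raises if a field is missing.
    return template.format_map(message)


def classify(raw: str, template: str, retries_done: int) -> Optional[FatalReason]:
    """Return the fatal reason for a job, or None if it is not fatal."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        # Case 1: the producer sent something that is not valid JSON.
        return FatalReason.INVALID_MESSAGE
    try:
        render_template(template, message)
    except (KeyError, IndexError, ValueError, TypeError, AttributeError):
        # Case 2: the rule template cannot be evaluated for this message.
        return FatalReason.TEMPLATE_ERROR
    if retries_done >= MAX_RETRIES:
        # Case 3: the rule was attempted but kept failing.
        return FatalReason.RETRY_LIMIT_EXCEEDED
    return None
```

The order of the checks mirrors the list: a message that cannot be parsed never reaches template evaluation, and the retry limit only matters once the rule could be attempted at all.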

To resolve this task we need the following pieces:

  • Set up a generic fatal_error event schema. It might be useful not only within the change propagation service but in other services too. EventLogging already has a schema for a generic error, EventError; however, we might consider expanding it with optional fields: the HTTP error response if the consumer was issuing an HTTP request when it failed, a stack trace, and the number of retries attempted. In general I think our new schema should extend the EventError schema and be backwards compatible with it. Alternatively, we could simply allow additional parameters in the EventError schema to let services extend it arbitrarily. @mobrovac @Ottomata what do you think?
  • Emit events to the fatal_error topic from the change-propagation service.
  • Check whether we could reuse or adapt the analytics solution that takes fatal_error events and posts them to logstash, or set up a new consumer for that.
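
As a rough sketch of the "extend, but stay backwards compatible" idea from the first bullet: the event always carries a small set of base fields, and the extra fields are added only when they are available. All field names here (`raw_event`, `message`, `stack`, `retries`, `http_status`) and the function `build_fatal_error_event` are assumptions for illustration; they are not copied from the EventError schema or from the merged patch.

```python
import json
import traceback
from typing import Optional


def build_fatal_error_event(raw_event: str,
                            error: Exception,
                            retries: Optional[int] = None,
                            http_status: Optional[int] = None) -> dict:
    """Build an error event; field names are hypothetical."""
    # Base fields: always present, so consumers that only know
    # the base schema keep working.
    event = {
        "raw_event": raw_event,
        "message": str(error),
    }
    # Optional extensions: omitted entirely when not available.
    if error.__traceback__ is not None:
        event["stack"] = "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        )
    if retries is not None:
        event["retries"] = retries
    if http_status is not None:
        event["http_status"] = http_status
    return event
```

A possible use, where the failure comes from parsing a bad message:

```python
try:
    json.loads("{bad")
except ValueError as err:
    event = build_fatal_error_event("{bad", err, retries=2, http_status=503)
    print(json.dumps(event, indent=2))
```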

Event Timeline

Change 284986 had a related patch set uploaded (by Ppchelko):
Create a general error event schema

https://gerrit.wikimedia.org/r/284986

Another use case for the "dead letter" queue is manually triggering a retry of events. This can be useful to recover from a bug that caused certain events to temporarily fail.
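
A minimal sketch of such a manual retry, assuming each error event keeps the original message in a `raw_event` field (a hypothetical field name) and that `handler` is whatever function processes a message once the bug is fixed. It works on an in-memory list rather than a real topic:

```python
from typing import Callable, Iterable, List


def replay_failed_events(error_events: Iterable[dict],
                         handler: Callable[[str], None]) -> List[dict]:
    """Re-run the handler on each failed message.

    Returns the error events that still fail, so they can stay
    in the queue for another manual review.
    """
    still_failing = []
    for error_event in error_events:
        try:
            handler(error_event["raw_event"])
        except Exception:
            # Still broken after the fix; keep it for review.
            still_failing.append(error_event)
    return still_failing
```

Returning the remaining failures instead of raising keeps one bad event from blocking the replay of the rest.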

Change 284986 merged by Ottomata:
Create a general error event schema

https://gerrit.wikimedia.org/r/284986

A schema was defined, the topic was created, and change-prop now emits events to this topic. Resolving.