Set up fatal error queue for change propagation
Closed, Resolved (Public)

Authored By

	• Pchelolo
	Apr 22 2016, 9:58 PM

Description

If job processing in change propagation fails fatally, we need to post an event to a special fatal_error topic (name TBD).

Fatal errors include, but are not limited to:

An error in parsing a JSON message: in this case we can't even attempt to execute a rule, and it's 100% clear that there's an error on the producer side.
An error in rule template evaluation: again we can't even attempt to execute the rule, and it's clear that there's a bug somewhere in the system.
Retry limit exceeded for the rule: these messages must be manually reviewed, as some manual action might be needed, such as changes in the code, a manual rerender of HTML, or a manual purge of some URIs.
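To make the three cases concrete, here is a minimal sketch of how a consumer might classify them. It is not the change-propagation code: `FatalKind`, `RetryableError`, `process`, and the callables `expand_template` and `execute` are hypothetical names, and the retry limit of 3 is an assumed value.

```python
import json
from enum import Enum


class FatalKind(Enum):
    """Hypothetical labels for the three fatal cases listed above."""
    INVALID_JSON = "invalid_json"
    TEMPLATE_ERROR = "template_error"
    RETRY_LIMIT = "retry_limit_exceeded"


class RetryableError(Exception):
    """Hypothetical: raised by a rule when a later attempt might succeed."""


def process(raw, expand_template, execute, max_retries=3):
    """Return None on success, or the FatalKind to report to the fatal topic."""
    try:
        message = json.loads(raw)
    except ValueError:
        # The producer sent something unparseable; no rule can be attempted.
        return FatalKind.INVALID_JSON

    try:
        request = expand_template(message)
    except Exception:
        # Template evaluation failed, so there is a bug somewhere in the system.
        return FatalKind.TEMPLATE_ERROR

    # One initial attempt plus up to max_retries retries.
    for _ in range(max_retries + 1):
        try:
            execute(request)
            return None
        except RetryableError:
            continue
    return FatalKind.RETRY_LIMIT
```

Only the third case consumes retries; the first two are reported immediately because retrying cannot change the outcome.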

To resolve this task we need the following pieces:

Set up a generic fatal_error event schema. It might be useful not only within the change propagation service, but in other services too. EventLogging already has a schema for a generic error, EventError; however, we might consider expanding it with optional fields: the error HTTP response, if the consumer was issuing an HTTP request when the failure occurred; a stack trace, which might also be useful; and the number of retries attempted. In general I think our new schema should extend the EventError schema and be backwards compatible with it. Alternatively, we could simply allow additional parameters in the EventError schema to let services extend it arbitrarily. @mobrovac @Ottomata what do you think?
Emit events to the fatal_error topic from the change-propagation service.
Check whether we could reuse or adapt the analytics solution that takes fatal_error events and posts them to logstash, or set up a new consumer for that.
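As a rough illustration of the "extend but stay backwards compatible" idea, the sketch below builds an event with a small set of always-present fields and adds the optional ones only when they are known. All field names (`emitted_at`, `message`, `raw_event`, `stack`, `http_status`, `retries`) are assumptions for illustration; they are not taken from the EventError schema or from the schema merged for this task.

```python
import traceback
from datetime import datetime, timezone


def build_fatal_error_event(raw_event, message, exc=None, http_status=None, retries=None):
    """Build a fatal error event as a plain dict (field names are hypothetical)."""
    # Base fields: always present, so a consumer that only knows the base
    # error shape can still read the event.
    event = {
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "raw_event": raw_event,
    }
    # Optional extensions: added only when the information exists.
    if exc is not None:
        event["stack"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    if http_status is not None:
        event["http_status"] = http_status
    if retries is not None:
        event["retries"] = retries
    return event
```

Leaving optional keys out entirely, rather than setting them to null, keeps the event valid under a stricter base schema that does not know about them, provided that schema allows additional properties.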

Details

	Subject	Repo	Branch	Lines +/-
	Create a general error event schema	mediawiki/event-schemas	master	+72 -1


Event Timeline

• Pchelolo created this task. Apr 22 2016, 9:58 PM

Restricted Application added a subscriber: Aklapper. Apr 22 2016, 9:58 PM

Change 284986 had a related patch set uploaded (by Ppchelko):
Create a general error event schema

https://gerrit.wikimedia.org/r/284986

gerritbot added a project: Patch-For-Review. Apr 22 2016, 11:27 PM

Another use case for the "dead letter" queue is manually triggering a retry of events. This can be useful to recover from a bug that caused certain events to temporarily fail.
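A manual retry along these lines could be sketched as follows. This assumes each dead-letter event carries the original payload and the topic it came from; the field names `original_topic` and `raw_event`, and the `publish` callable, are hypothetical.

```python
def replay(fatal_events, should_retry, publish):
    """Republish selected dead-letter events and return how many were replayed.

    fatal_events: iterable of dicts with 'original_topic' and 'raw_event'
                  keys (hypothetical field names).
    should_retry: predicate chosen by the operator, e.g. "failed while the
                  bug was live".
    publish:      callable(topic, payload) that re-enqueues the event.
    """
    replayed = 0
    for event in fatal_events:
        if should_retry(event):
            publish(event["original_topic"], event["raw_event"])
            replayed += 1
    return replayed
```

Keeping the selection predicate separate from the publishing step lets an operator dry-run the selection before anything is re-enqueued.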

• mobrovac moved this task from Backlog to In Progress Before Value Streams Kickoff (August 15th) on the Event-Platform board. Apr 29 2016, 11:04 AM

Change 284986 merged by Ottomata:
Create a general error event schema

https://gerrit.wikimedia.org/r/284986

A schema was defined, the topic was created, and change-prop now emits events to this topic. Resolving.
