In case there was a fatal failure of a job processing in change propagation, we need to post an event to a special fatal_error (name TBD) topic.
Fatal errors include, but not limited to:
- Error in parsing a JSON message - in this case we can't event attempt to execute a rule, and it's 100% clear that there's an error on the producer side.
- Error in in rule templates evaluation - can't even attempt to execute the rule and it's clear that there's a bug somewhere in the system.
- Retry limit exceeded for the rule. These messages must be manually reviewed as some manual action might need to happen. Either some changes in the code, manual rerender of html, manual purge of some URIs etc.
To resolve this task we need the following pieces:
- Set up a generic fatal_error event schema. It might be useful not only within the change propagation service, but in other services too. EventLogging already has a schema for a generic error: EventError, however we might consider expanding it to include optionally: error http response if the consumer was issuing some HTTP request for a failure, stacktrace might also be useful, number of retries attempted. However in general I think our new schema should extend EventError schema and be backwards compatible with it. Alternatively we could simply allow additional parameters in the EventError schema to let services arbitrary extend it. @mobrovac @Ottomata what do you think?
- Emit events to the fatal_error topic from change-propagation service
- Look if we could reuse/adapt analytics solution that takes fatal_error events and posts them to logstash, or set up a new consumer for that.