The WDQS streaming updater captures the problems it encounters in three different streams.
These streams are then processed hourly by a reconciliation mechanism, which can emit "reconcile" events to attempt to fix the problems.
The issue is that if the problem streams are populated by different jobs (the production job and a test job), reconcile events might be processed by the production job even though the original problems were encountered in the test job.
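A minimal sketch of the ambiguity (all names and the tagging scheme here are hypothetical, not the actual updater configuration): when the reconciliation source is derived from the datacenter alone, problems reported by two different jobs into the same datacenter's streams become indistinguishable.

```python
def reconciliation_source(event, use_pipeline_field=False):
    """Derive the tag used to route a reconcile event back to a job.

    Hypothetical illustration: the real tag format in the updater differs.
    """
    if use_pipeline_field and "pipeline" in event:
        # Proposed behaviour: the originating job identifies itself.
        return f"wdqs_{event['pipeline']}@{event['datacenter']}"
    # Current behaviour: only the datacenter is known.
    return f"wdqs@{event['datacenter']}"

# A problem reported by the production job and one reported by a test job,
# both landing in eqiad-prefixed streams.
prod_problem = {"datacenter": "eqiad", "pipeline": "production"}
test_problem = {"datacenter": "eqiad", "pipeline": "test"}

# Datacenter-only tagging yields the same tag for both, so the production
# job ends up consuming reconcile events caused by the test job.
assert reconciliation_source(prod_problem) == reconciliation_source(test_problem)

# With a per-job field the tags differ and each job only sees its own.
assert reconciliation_source(prod_problem, True) != reconciliation_source(test_problem, True)
```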
This problem was uncovered after a failure of mirrormaker between kafka-main@eqiad and kafka-jumbo:
- the test pipeline running in dse-k8s saw several thousand late events while mirrormaker was going up and down
- these late events were pushed as errors to eqiad.rdf-streaming-updater.lapsed-action in kafka-jumbo
- the reconciliation mechanism processed these events and shipped new reconcile events to eventgate-main with the tag wdqs_source_tag_prefix@eqiad
- the production job running in wikikube@eqiad processed the events tagged as wdqs_source_tag_prefix@eqiad
- the consumers running close to blazegraph had thousands of reconcile events to process (a very slow operation), which caused the wdqs machines to lag behind
When reporting a problem, the updater should tag these events more precisely, so that reconcile events can be tagged appropriately without relying on the datacenter to determine the provenance of the original error.
AC:
- the updater job has a new --pipeline argument that accepts a string
- the side output events have a new field named 'pipeline' or 'emitter_id'
- the reconcile process can map this new field to a corresponding reconciliation_source
- the updater job has a way to stop emitting side outputs (for testing purposes)
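The acceptance criteria above can be sketched as follows (the field name, the mapping table, and the drop-on-unknown behaviour are assumptions for illustration, not the actual reconcile process): side-output events carry the value passed via --pipeline, and the reconcile process maps that value to a reconciliation_source instead of inferring one from the datacenter.

```python
# Hypothetical mapping from the new per-job field to a reconciliation_source.
# Real pipeline names and source names would come from configuration.
PIPELINE_TO_SOURCE = {
    "production": "wdqs_main_reconcile",
    "test": "wdqs_test_reconcile",
}

def route_problem(event):
    """Return the reconciliation_source tag for a side-output event,
    or None if the event comes from an unknown pipeline (dropped rather
    than guessed from the datacenter)."""
    source = PIPELINE_TO_SOURCE.get(event.get("pipeline"))
    if source is None:
        return None
    return f"{source}@{event['datacenter']}"

# Events from the test job are routed to the test source only; events
# without a recognised pipeline are dropped.
assert route_problem({"pipeline": "test", "datacenter": "eqiad"}) == "wdqs_test_reconcile@eqiad"
assert route_problem({"pipeline": "staging", "datacenter": "eqiad"}) is None
```

Dropping events from unmapped pipelines also gives test deployments a cheap way to opt out of reconciliation entirely, complementing the last acceptance criterion.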