When processing change events, the update pipeline must ensure that these changes reach the ingestion pipeline in order.
Reordering implies some buffering, which might be done either with a Flink window or with a custom process function using timers; there are pros and cons to both:
- windowing might need fewer timers but will hold a larger state
- a custom process function might have to register more timers but could allow optimizing (merging) the events on the fly
Reordering should mainly use the rev_id to sort events, falling back to the MediaWiki timestamp when the rev_id does not change (e.g. delete vs. undelete).
To store the re-ordering state efficiently, the events will have to be partitioned using the key `[wiki_id, page_id]`; a sketch of the process-function approach follows.
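A minimal sketch of the custom process-function approach, assuming Flink event time and a hypothetical `ChangeEvent` POJO (the real event schema and class names will differ):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Buffers events per (wiki_id, page_id) and re-emits them ordered by (rev_id, timestamp). */
public class ReorderFunction
        extends KeyedProcessFunction<String, ReorderFunction.ChangeEvent, ReorderFunction.ChangeEvent> {

    /** Hypothetical change-event POJO; stands in for the real event schema. */
    public static class ChangeEvent {
        public String wikiId;
        public long pageId;
        public long revId;
        public long eventTimeMs; // MediaWiki event timestamp in ms
    }

    private final long bufferDelayMs;
    private transient ListState<ChangeEvent> buffer;

    public ReorderFunction(Duration bufferDelay) {
        this.bufferDelayMs = bufferDelay.toMillis();
    }

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
            new ListStateDescriptor<>("reorder-buffer", ChangeEvent.class));
    }

    @Override
    public void processElement(ChangeEvent event, Context ctx, Collector<ChangeEvent> out) throws Exception {
        // Hold the event back for the configured delay before emitting it.
        buffer.add(event);
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + bufferDelayMs);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ChangeEvent> out) throws Exception {
        List<ChangeEvent> events = new ArrayList<>();
        buffer.get().forEach(events::add);
        // Sort by rev_id, falling back to the event timestamp when rev_id is identical
        // (e.g. delete vs. undelete of the same revision).
        events.sort(Comparator.comparingLong((ChangeEvent e) -> e.revId)
            .thenComparingLong(e -> e.eventTimeMs));
        // The merge/optimization step described below would run here before emitting.
        events.forEach(out::collect);
        buffer.clear();
    }
}
```

The keyed stream would then look roughly like `events.keyBy(e -> e.wikiId + ":" + e.pageId).process(new ReorderFunction(Duration.ofMinutes(2)))`.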
The optimization step should merge events related to the same page using a set of rules that will have to be specified and documented (possibly in this ticket); an illustrative sketch is below.
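To make the discussion concrete, one possible set of rules is sketched below. It is purely illustrative (not an agreed-upon rule set); the point is only that once the events for a page are sorted, most of them can be dropped:

```java
import java.util.List;

/** Sketch of merging buffered events for one (wiki_id, page_id) into a single event.
 *  The rules are placeholders: the real set still has to be specified and documented. */
public class PageEventMerger {

    /** Hypothetical change kinds; the real event model may differ. */
    public enum Kind { EDIT, DELETE, UNDELETE }

    /** Minimal event view used by this sketch. */
    public static class PageChange {
        public final long revId;
        public final long eventTimeMs;
        public final Kind kind;

        public PageChange(long revId, long eventTimeMs, Kind kind) {
            this.revId = revId;
            this.eventTimeMs = eventTimeMs;
            this.kind = kind;
        }
    }

    /**
     * Example rule: with the input already sorted by (rev_id, event time) by the
     * re-ordering step, only the most recent state of the page matters, so the
     * last event wins. A trailing DELETE discards earlier edits; a trailing
     * EDIT or UNDELETE supersedes everything before it.
     */
    public static PageChange merge(List<PageChange> sortedEvents) {
        return sortedEvents.get(sortedEvents.size() - 1);
    }
}
```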
Having an idea of how many events we could merge (deduplicate) would be useful (trade-off between state size, latency and de-duplication efficiency).
This might be doable with a quick analysis of the `cirrusSearchLinksUpdate` job backlog (from HDFS if these jobs are ingested there, or directly from kafka jumbo) over time windows of 1, 2, 5 and 10 minutes, counting how many events can be de-duplicated; a rough approach is sketched below.
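Assuming the backlog events can be exported with their wiki, page id and timestamp (the export itself is not shown here), the de-duplication ratio for a given buffering delay could be approximated as the share of events falling into an already-seen (wiki, page, tumbling-window) bucket:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Back-of-the-envelope estimate of how many backlog events would be merged
 *  for a given buffering delay (1, 2, 5, 10 minutes, ...). */
public class DedupEstimator {

    /** One backlog event reduced to the fields the estimate needs. */
    public static class Event {
        public final String wikiId;
        public final long pageId;
        public final long eventTimeMs;

        public Event(String wikiId, long pageId, long eventTimeMs) {
            this.wikiId = wikiId;
            this.pageId = pageId;
            this.eventTimeMs = eventTimeMs;
        }
    }

    /** Fraction of events that would collapse into another event for the same
     *  page within a tumbling window of windowMs. */
    public static double dedupRatio(List<Event> events, long windowMs) {
        Set<String> distinctBuckets = new HashSet<>();
        for (Event e : events) {
            distinctBuckets.add(e.wikiId + "/" + e.pageId + "/" + (e.eventTimeMs / windowMs));
        }
        return events.isEmpty() ? 0.0 : 1.0 - (double) distinctBuckets.size() / events.size();
    }
}
```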
AC:
- the preparation job has a buffering operator to properly re-order events
- the preparation job should merge events related to the same page into a single event using a set of rules that will have to be documented
- the buffering operator should have a tunable buffer delay
- the buffering operator should expose metrics about how many events are de-duplicated (see the sketch below)
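For the last two points, the buffering operator sketched earlier would take its delay as a constructor (i.e. job-configuration) parameter and register a Flink counter; metric and class names below are made up:

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;

/** Skeleton for the two operator-level acceptance criteria: a tunable buffer
 *  delay and a counter of de-duplicated events. */
public abstract class BufferingOperatorSkeleton<K, E> extends KeyedProcessFunction<K, E, E> {

    // Tunable delay, passed in from the job configuration.
    protected final long bufferDelayMs;
    // Exposed through the Flink metric system.
    protected transient Counter deduplicatedEvents;

    protected BufferingOperatorSkeleton(Duration bufferDelay) {
        this.bufferDelayMs = bufferDelay.toMillis();
    }

    @Override
    public void open(Configuration parameters) {
        deduplicatedEvents = getRuntimeContext()
            .getMetricGroup()
            .counter("reordering-deduplicated-events"); // hypothetical metric name
    }

    /** Called when a page's buffer is flushed: record how many events were merged away. */
    protected void recordDeduplicated(int bufferedCount, int emittedCount) {
        deduplicatedEvents.inc(bufferedCount - emittedCount);
    }
}
```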