When processing change events, the update pipeline must ensure that these changes reach the ingestion pipeline in order.
Reordering implies some buffering, which might be done either with a Flink window or with a custom process function using timers; there are pros and cons to both:
- windowing might need fewer timers but will hold a larger state
- a custom process function might have to register more timers but could allow optimizing (merging) the events on the fly
Reordering should mainly use the rev_id to sort events, falling back to the MediaWiki timestamp when the rev_id does not change (e.g. delete vs. undelete).
To store the re-ordering state efficiently, the events will have to be partitioned using the key `[wiki_id, page_id]`; a sketch of the process-function approach follows.
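A minimal sketch of the custom process-function approach, assuming Flink event time and a hypothetical `ChangeEvent` POJO (the real event schema and class names will differ):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Buffers events per (wiki_id, page_id) and re-emits them ordered by (rev_id, timestamp). */
public class ReorderFunction
        extends KeyedProcessFunction<String, ReorderFunction.ChangeEvent, ReorderFunction.ChangeEvent> {

    /** Hypothetical change-event POJO; stands in for the real event schema. */
    public static class ChangeEvent {
        public String wikiId;
        public long pageId;
        public long revId;
        public long eventTimeMs; // MediaWiki event timestamp in ms
    }

    private final long bufferDelayMs;
    private transient ListState<ChangeEvent> buffer;

    public ReorderFunction(Duration bufferDelay) {
        this.bufferDelayMs = bufferDelay.toMillis();
    }

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
            new ListStateDescriptor<>("reorder-buffer", ChangeEvent.class));
    }

    @Override
    public void processElement(ChangeEvent event, Context ctx, Collector<ChangeEvent> out) throws Exception {
        // Hold the event back for the configured delay before emitting it.
        buffer.add(event);
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + bufferDelayMs);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ChangeEvent> out) throws Exception {
        List<ChangeEvent> events = new ArrayList<>();
        buffer.get().forEach(events::add);
        // Sort by rev_id, falling back to the event timestamp when rev_id is identical
        // (e.g. delete vs. undelete of the same revision).
        events.sort(Comparator.comparingLong((ChangeEvent e) -> e.revId)
            .thenComparingLong(e -> e.eventTimeMs));
        // The merge/optimization step described below would run here before emitting.
        events.forEach(out::collect);
        buffer.clear();
    }
}
```

The keyed stream would then look roughly like `events.keyBy(e -> e.wikiId + ":" + e.pageId).process(new ReorderFunction(Duration.ofMinutes(2)))`.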
The optimization step should merge events related to the same page using a set of rules that will have to be specified and documented (possibly in this ticket); an illustrative sketch is below.
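To make the discussion concrete, one possible set of rules is sketched below. It is purely illustrative (not an agreed-upon rule set); the point is only that once the events for a page are sorted, most of them can be dropped:

```java
import java.util.List;

/** Sketch of merging buffered events for one (wiki_id, page_id) into a single event.
 *  The rules are placeholders: the real set still has to be specified and documented. */
public class PageEventMerger {

    /** Hypothetical change kinds; the real event model may differ. */
    public enum Kind { EDIT, DELETE, UNDELETE }

    /** Minimal event view used by this sketch. */
    public static class PageChange {
        public final long revId;
        public final long eventTimeMs;
        public final Kind kind;

        public PageChange(long revId, long eventTimeMs, Kind kind) {
            this.revId = revId;
            this.eventTimeMs = eventTimeMs;
            this.kind = kind;
        }
    }

    /**
     * Example rule: with the input already sorted by (rev_id, event time) by the
     * re-ordering step, only the most recent state of the page matters, so the
     * last event wins. A trailing DELETE discards earlier edits; a trailing
     * EDIT or UNDELETE supersedes everything before it.
     */
    public static PageChange merge(List<PageChange> sortedEvents) {
        return sortedEvents.get(sortedEvents.size() - 1);
    }
}
```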
Having an idea of how many events we could merge (deduplicate) would be useful (trade-off between state size, latency and de-duplication efficiency).
This might be doable with a quick analysis of the `cirrusSearchLinksUpdate` job backlog (from HDFS if these jobs are ingested there, or directly from kafka jumbo) over time windows of 1, 2, 5 and 10 minutes, counting how many events can be de-duplicated; a rough approach is sketched below.
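Assuming the backlog events can be exported with their wiki, page id and timestamp (the export itself is not shown here), the de-duplication ratio for a given buffering delay could be approximated as the share of events falling into an already-seen (wiki, page, tumbling-window) bucket:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Back-of-the-envelope estimate of how many backlog events would be merged
 *  for a given buffering delay (1, 2, 5, 10 minutes, ...). */
public class DedupEstimator {

    /** One backlog event reduced to the fields the estimate needs. */
    public static class Event {
        public final String wikiId;
        public final long pageId;
        public final long eventTimeMs;

        public Event(String wikiId, long pageId, long eventTimeMs) {
            this.wikiId = wikiId;
            this.pageId = pageId;
            this.eventTimeMs = eventTimeMs;
        }
    }

    /** Fraction of events that would collapse into another event for the same
     *  page within a tumbling window of windowMs. */
    public static double dedupRatio(List<Event> events, long windowMs) {
        Set<String> distinctBuckets = new HashSet<>();
        for (Event e : events) {
            distinctBuckets.add(e.wikiId + "/" + e.pageId + "/" + (e.eventTimeMs / windowMs));
        }
        return events.isEmpty() ? 0.0 : 1.0 - (double) distinctBuckets.size() / events.size();
    }
}
```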
AC:
- the preparation job has a buffering operator to properly re-order events
- the preparation job should merge events related to the same page into a single event using a set of rules that will have to be documented
- the buffering operator should have a tunable buffer delay
- the buffering operator should expose metrics about how many events are de-duplicated (see the sketch below)
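For the last two points, the buffering operator sketched earlier would take its delay as a constructor (i.e. job-configuration) parameter and register a Flink counter; metric and class names below are made up:

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;

/** Skeleton for the two operator-level acceptance criteria: a tunable buffer
 *  delay and a counter of de-duplicated events. */
public abstract class BufferingOperatorSkeleton<K, E> extends KeyedProcessFunction<K, E, E> {

    // Tunable delay, passed in from the job configuration.
    protected final long bufferDelayMs;
    // Exposed through the Flink metric system.
    protected transient Counter deduplicatedEvents;

    protected BufferingOperatorSkeleton(Duration bufferDelay) {
        this.bufferDelayMs = bufferDelay.toMillis();
    }

    @Override
    public void open(Configuration parameters) {
        deduplicatedEvents = getRuntimeContext()
            .getMetricGroup()
            .counter("reordering-deduplicated-events"); // hypothetical metric name
    }

    /** Called when a page's buffer is flushed: record how many events were merged away. */
    protected void recordDeduplicated(int bufferedCount, int emittedCount) {
        deduplicatedEvents.inc(bufferedCount - emittedCount);
    }
}
```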