As a platform engineer, I want to create a stream assembling all known streams produced by MW EventBus to reflect the changes to the state of a page in a single and easily consumable stream.
As of today EventBus emits events related to the actions made to a page into different streams. The nature of our multi-DC kafka setup also makes use of different streams (kafka topics) that may route the same logical stream into different kafka cluster and thus different topics (e.g. MW DC switch, switching the event-gate dns discovery record).
Reading multiple streams to capture state changes of a MW page is far from trivial:
- with different streams the events are likely to be processed in a unpredictable order forcing consumers to use a relatively big state or to attempt to re-order events to avoid applying incoherent changes (i.e. re-import a revision on a page whose delete event was processed just before).
- streams that can become idle are challenging because you never know if you have to wait for possible incoming data or continue processing other streams (if attempting to do some event-time re-ordering)
If we were to design this from scratch we would probably implement a single stream holding all possible state changes related to a page. Doing this today would require EventBus to emit such events to a single stream but also relying on the upcoming stretch kafka stretch cluster (T307944).
Changing EventBus might be relatively easy but putting in place the kafka stretch cluster is something that might take time. We might have an opportunity here to use a minimal stream processor to mimic what it would look like to have such stream in place and evaluate what benefits it could offer:
- a single stream to consume: predictable and repeatable ingestion
- ordered: no need to hold any state on the consumer to determine whether or not an event is safe to apply or has been superseded by a previously seen event
The service must
- Listen to:
- mediawiki.revision-create (should include page-create)
- mediawiki.page-suppress (similar to page-delete but for "suppressed deletes")
- Re-order: assuming a partitioning scheme based on the wiki+page_id, re-order events based on:
- event time (for delete v.s. undelete which unfortunately do not have any other ways to sort by)
- Format all these streams into a single consolidated stream
- Output the resulting events to kafka
Service deployed running on POC instance of Flink in YARN, producing to the kafka test cluster in eqiad. (no SLO’s)
- Data modeling exercise for new consolidated stream
Why are we doing this?
- Simplify event stream consumption. Consumers can listen to a single & well ordered stream that represents the state of a page (minus the content) rather than a page action (current design).