The aim is to give internal teams across the foundation a way to access MediaWiki state changes together with the associated page content (roughly 3 hours after the event), without waiting for the monthly dumps or making calls to the MW API to enrich the data. Ideally the data would be available from a single table.
Proposed Process (to be discussed)
- Agree on a schema for page state changes - ongoing here - this will consolidate multiple existing streams into one and enrich them with page content
- Build the logic to populate this stream from MW (~page-change-stream)
- Build a Flink job to consume the stream, enrich it, and publish to a new stream (~page-change-stream-content)
- By default the stream events will be stored in the Analytics Hive tables
- The page content will be too large to store and query efficiently as Parquet, so it needs a special case to be stored as Avro
- Retain the events long term (event_sanitized)
- Build a one-time backfill process that takes the existing dumps, reformats them to the new schema, and populates the Hive tables (how far back we go, etc. is still to be discussed)
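The enrichment step above (consume a page-change event, attach the page content, emit an enriched event) can be sketched in plain Python. This is only an illustration of the transform the Flink job would perform; the event field names, the `fetch_content` helper, and the stream shapes are all assumptions, not the schema under discussion:

```python
# Sketch of the enrichment transform. In production this would run inside
# a Flink job; all field names here are illustrative assumptions.

def fetch_content(wiki: str, revision_id: int) -> str:
    """Hypothetical content lookup (in the real job, a call to the MW API
    or a content store); stubbed out here for illustration."""
    return f"wikitext of {wiki} revision {revision_id}"

def enrich(event: dict) -> dict:
    """Produce a ~page-change-stream-content event from a
    ~page-change-stream event by embedding the page content."""
    enriched = dict(event)  # copy rather than mutate the input event
    enriched["content"] = fetch_content(event["wiki"], event["revision_id"])
    return enriched

page_change = {
    "wiki": "enwiki",
    "page_id": 42,
    "revision_id": 1001,
    "change_kind": "edit",
}
print(enrich(page_change)["content"])
```

Downstream consumers would then read the enriched stream directly, with no MW API call of their own.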
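The one-time backfill amounts to a per-record schema mapping from dump records to the new event shape. The sketch below assumes hypothetical field names on both sides, since the real schema is still being agreed:

```python
# Sketch of the backfill mapping: convert one record from an existing dump
# into the proposed consolidated schema. All field names are illustrative
# assumptions.

def dump_record_to_event(record: dict) -> dict:
    """Map a (hypothetical) dump record onto the new event schema so
    backfilled rows land in the same Hive tables as the live stream."""
    return {
        "wiki": record["wiki"],
        "page_id": record["page_id"],
        "revision_id": record["rev_id"],
        "change_kind": "edit",      # assumed: dump revisions map to edits
        "event_time": record["timestamp"],
        "content": record["text"],  # content comes straight from the dump,
                                    # so no MW API call is needed
    }

record = {"wiki": "enwiki", "page_id": 7, "rev_id": 99,
          "timestamp": "2002-01-26T15:28:12Z", "text": "Hello, world"}
print(dump_record_to_event(record))
```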
For now, page change streams won't be ordered. A later plan would be to build a Flink job that orders the streams and publishes to an ordered stream (the enrichment job could then be switched to read from this stream, giving us an ordered and enriched stream).
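Conceptually, the later ordering job would buffer events for a bounded time and release them sorted by event time. The sketch below is a simplified stand-in for what Flink does with watermarks, under the assumption that no event arrives more than `delay` seconds late:

```python
import heapq

# Simplified stand-in for event-time ordering with a bounded buffer.
# Events may arrive out of order; we hold up to `delay` seconds of events
# and emit them in timestamp order.

def ordered(events, delay):
    """Yield (timestamp, payload) pairs in timestamp order, assuming no
    event arrives more than `delay` seconds late."""
    heap = []
    for ts, payload in events:
        heapq.heappush(heap, (ts, payload))
        # Anything older than the newest timestamp minus the delay can no
        # longer be displaced by a late arrival, so it is safe to emit.
        while heap and heap[0][0] <= ts - delay:
            yield heapq.heappop(heap)
    while heap:  # input exhausted: flush the remaining buffer in order
        yield heapq.heappop(heap)

arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "d"), (5, "e")]
print(list(ordered(arrivals, delay=2)))
# → [(1, 'a'), (2, 'b'), (3, 'c'), (5, 'e'), (6, 'd')]
```

Flink's watermarking generalizes this idea; the fixed `delay` here corresponds to a bounded-out-of-orderness watermark strategy.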