
[Shared Event Platform] [SPIKE] Decide on page state change storing and backfill approach
Closed, Duplicate · Public

Description

NOTE: The purpose of this ticket is to capture discussions and decisions on providing a way to access MediaWiki state change events and page content in the existing Analytics Hive tables.

The aim is to give internal teams across the Foundation a way to access MediaWiki state changes together with the associated page content (around 3 hours after the event), without having to wait for the monthly dumps or make calls to the MW API to enrich the data. Ideally the data would be available from a single table.

Proposed Process (to be discussed)

  • Agree on schema for page state changes - ongoing here - this will consolidate multiple existing streams into one and enrich with page content
  • Build the logic to populate this stream from MW (~page-change-stream)
  • Build Flink job to consume stream and enrich, publish to new stream (~page-change-stream-content)
  • By default the stream events will be stored in the Analytics Hive tables
  • The page content will be too large to efficiently store and query as parquet, so needs a special case to be stored in avro
  • Retain the events long term (event_sanitized)
  • Build a one-time backfill process that takes the existing dumps, formats them to the new schema, and populates the Hive tables (how far back to go is still to be discussed)
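The backfill step above boils down to a pure transformation from a dump record to the new schema. A minimal sketch, assuming illustrative field names (`page_id`, `rev_id`, `content`, etc.); the real schema is still being agreed in the linked discussion:

```python
# Hypothetical sketch of the one-time backfill transform.
# All field names are illustrative assumptions, not the agreed schema.

def dump_record_to_page_change(dump_rec: dict) -> dict:
    """Map one revision record from the existing dumps to the
    assumed page-change schema used by the Hive tables."""
    return {
        "page_id": dump_rec["page_id"],
        "page_title": dump_rec["title"],
        "rev_id": dump_rec["revision_id"],
        "rev_timestamp": dump_rec["timestamp"],
        "change_kind": "edit",        # dumps capture revisions, not deletes/moves
        "content": dump_rec["text"],  # large; stored in the avro table, not parquet
    }
```

Because the transform is stateless per record, the backfill could run as a batch job over the dumps regardless of how far back we decide to go.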

For now, page change streams won't be ordered. A later plan is to build a Flink job that orders the streams and publishes to an ordered stream (the enrichment job could then be switched to read from this stream, giving a stream that is both ordered and enriched).
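The ordering job described above could work roughly as follows. This is a minimal, non-Flink sketch that buffers events per page and emits them in revision order; a real Flink job would instead key the stream by `page_id` and use keyed state with event-time watermarks rather than a global sort:

```python
from collections import defaultdict

def order_events(events: list[dict]) -> list[dict]:
    """Group unordered page-change events by page and emit them
    sorted by rev_id within each page. Field names (page_id, rev_id)
    are assumptions standing in for the agreed schema."""
    by_page: dict[int, list[dict]] = defaultdict(list)
    for ev in events:
        by_page[ev["page_id"]].append(ev)
    ordered = []
    for page_id in sorted(by_page):
        ordered.extend(sorted(by_page[page_id], key=lambda e: e["rev_id"]))
    return ordered
```

Note that ordering is only guaranteed per page here, which is likely the property consumers actually need; a total order across pages would be much more expensive to maintain.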

Event Timeline

> The page content will be too large to efficiently store and query as parquet, so needs a special case to be stored in avro

I'd like to understand this better. How does avro help here more than a compressed parquet file would?

> I'd like to understand this better. How does avro help here more than a compressed parquet file would?

@JAllemandou ? IIUC, it is unlikely that page content can be queried efficiently in a columnar way; i.e. there are rarely going to be aggregations or groupings on page content. More likely, rows will be processed one at a time and transformed?

@lbowmaker shared with me the following Slack thread with @JAllemandou's rationale: https://wikimedia.slack.com/archives/C02BB8L2S5R/p1654174524991399?thread_ts=1654106678.906859&cid=C02BB8L2S5R

tl;dr: page content is typically big, and parquet wants to read a full row group into memory, which may cause OOMs. Makes sense!
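A rough back-of-the-envelope calculation of why the row-group behaviour matters; both the row-group size and the average content size below are assumptions chosen for illustration, not measured values:

```python
# Illustrative arithmetic: why large page content is painful in parquet.
# Parquet readers typically materialize a whole row group's column chunk
# to read any row in it, whereas avro container files stream record by record.

rows_per_group = 100_000        # assumed parquet row-group row count
avg_content_bytes = 50 * 1024   # assumed average wikitext size (~50 KiB)

content_chunk_bytes = rows_per_group * avg_content_bytes
print(f"content column chunk per row group: {content_chunk_bytes / 2**30:.1f} GiB")
# ~4.8 GiB held in memory just to touch the content column of one row group.
```

Even with compression, the decompressed chunk has to fit in the reader's memory, which is the OOM risk @JAllemandou describes.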

One of the goals of https://phabricator.wikimedia.org/T309784 was to provide some guarantee about event ordering.
We should think a bit about access patterns and decide whether we want to attempt ordering on the page state change stream (e.g. as part of this spec), or leave it to consumers. Ideally, I'd like an SLO that captures what clients can expect.

I created a phab to capture this behaviour in the context of Mediawiki Stream Enrichment: https://phabricator.wikimedia.org/T311603