As discussed in T401260#11230961, we'd like to use the existing Druid mediawiki_history_reduced dataset to serve global editor editing-related metrics. The metrics to serve are:
- Edit Metrics
- Total edit count within a date range.
- Total number of days edited within a date range.
- Longest daily consecutive edit streak within a date range.
- List of edited pages within a date range.
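To make the metric definitions concrete, here is a minimal Python sketch of how they might be computed from a list of edit events. The event shape and field names (dt, page_id) are assumptions for illustration only; in practice these would come from Druid queries.

```python
from datetime import date

# Hypothetical edit events for one user within the requested date range.
# Field names (dt, page_id) are assumptions for illustration only.
edits = [
    {"dt": date(2025, 8, 1), "page_id": 10},
    {"dt": date(2025, 8, 1), "page_id": 11},
    {"dt": date(2025, 8, 2), "page_id": 10},
    {"dt": date(2025, 8, 5), "page_id": 12},
]

total_edits = len(edits)                        # total edit count
days_edited = sorted({e["dt"] for e in edits})  # distinct days with >= 1 edit
edited_pages = sorted({e["page_id"] for e in edits})

# Longest run of consecutive calendar days with at least one edit.
longest = streak = 1 if days_edited else 0
for prev, cur in zip(days_edited, days_edited[1:]):
    streak = streak + 1 if (cur - prev).days == 1 else 1
    longest = max(longest, streak)

print(total_edits, len(days_edited), longest, edited_pages)
```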
In order to compute these metrics daily, and to support global user metrics, we'll have to make a few changes to mediawiki_history_reduced in Druid.
mediawiki_history_reduced (and its upstream mediawiki_history) are event-based monthly snapshot tables. This means:
- event-based: each row represents a single event happening, e.g. an edit.
- monthly snapshot: all records are fully regenerated each month from sqooped MediaWiki MariaDB tables.
To compute global editor metrics, we need to add user_central_id and page_id to mediawiki_history_reduced.
To compute these daily, we'll need to append new events to mediawiki_history_reduced at least daily.
While it would be very nice to have these changes applied to the upstream Hive tables, the minimum requirement is that they are satisfied for the Druid datasource used for serving. We may choose to e.g. add user_central_id to mediawiki_history etc. (T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables) and source this info in Druid from there, but doing so is not required.
Incremental updates to the upstream Hive tables are surely out of scope, so we will just apply daily updates to the Druid dataset. To do this, we have a few options:
1. Incrementally update the latest snapshot
- Determine the 'latest snapshot dataset', e.g. mediawiki_history_reduced_2025_08 in Druid.
- Compute daily mediawiki history 'event' data from event.mediawiki_page_change_v1, starting from the latest time in the latest snapshot.
- Load the daily mediawiki history events into the latest snapshot dataset.
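For the 'starting from latest time' step, one option is to ask Druid's SQL API for the maximum __time already present in the snapshot datasource. A minimal sketch (the datasource name is just an example, and the surrounding DAG plumbing is omitted):

```python
import json


def latest_time_query(datasource: str) -> dict:
    """Build a payload for Druid's SQL API (POST /druid/v2/sql/) asking
    for the maximum event time currently in the given datasource."""
    return {
        "query": f'SELECT MAX(__time) AS latest_time FROM "{datasource}"',
        "resultFormat": "object",
    }


# The daily DAG would POST this to the Druid broker and start its
# event computation from the returned timestamp.
payload = latest_time_query("mediawiki_history_reduced_2025_08")
print(json.dumps(payload))
```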
Pros:
- No new Druid datasources to maintain
- Incremental lambda arch: each month the new snapshot will supersede the previous, giving us eventual consistency.
Cons:
- The daily loading Airflow DAG needs to align its time with the latest time in each new monthly snapshot. This may require some complicated sensor work. We could avoid this by having the daily loading DAG write to both the current and the future month's datasets; this is still a bit awkward, but means we don't have to do any complicated alignment.
- Not clear how/if we can roll back to old snapshots if we introduce bugs.
- mediawiki_history_reduced contains 'digest' event types. These are pre-aggregated rows stored alongside the event rows in the data itself. If we don't update the digest aggregations as new event rows are added, the digests will no longer match aggregations computed over the events. This is a little awkward, and Joseph recommends possibly splitting the digest rows out into their own dataset anyway.
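For the dual-write workaround mentioned in the cons above (writing daily data to both the current and the future month's snapshot datasource), the DAG could derive both target names from its execution date. A hypothetical sketch, assuming the mediawiki_history_reduced_YYYY_MM naming convention used in the example above:

```python
from datetime import date


def snapshot_targets(run_date: date) -> list[str]:
    """Return the current and next month's snapshot datasource names.
    The naming convention is an assumption based on the example
    mediawiki_history_reduced_2025_08 above."""
    year, month = run_date.year, run_date.month
    if month == 12:
        nxt_year, nxt_month = year + 1, 1
    else:
        nxt_year, nxt_month = year, month + 1
    return [
        f"mediawiki_history_reduced_{year}_{month:02d}",
        f"mediawiki_history_reduced_{nxt_year}_{nxt_month:02d}",
    ]


print(snapshot_targets(date(2025, 8, 15)))
```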
2. Incrementally update a new mediawiki_history_reduced_latest dataset
- When loading a new mediawiki_history_reduced snapshot into Druid, also load it into a new (digest-less) mediawiki_history_reduced_latest dataset.
- Compute daily mediawiki history 'event' data from event.mediawiki_page_change_v1, starting from the latest time in the latest snapshot.
- Load daily mediawiki history events into mediawiki_history_reduced_latest dataset.
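The daily load step could use Druid native batch ingestion in append mode. A heavily trimmed sketch of what such a spec might look like; the input path is a placeholder, and the real spec would carry full timestamp, dimension, and metric definitions:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "mediawiki_history_reduced_latest",
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none"
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": true,
      "inputSource": {
        "type": "hdfs",
        "paths": "/placeholder/path/to/daily_events"
      }
    }
  }
}
```

The key bit is "appendToExisting": true, which makes Druid add new segments for the interval rather than replacing the existing ones.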
Pros:
- Incremental lambda arch: each month the new snapshot will supersede the previous, giving us eventual consistency.
- No digest awkwardness in mediawiki_history_reduced_latest
- The Airflow sensor is less complicated (we can just keep loading daily into mediawiki_history_reduced_latest).
- No dataset switching needed.
Cons:
- New Druid datasource to maintain
- Dataset storage and segment cache duplication: existing AQS and Global Editor Metrics will not use the same segment cache.
- Difficult to roll back mediawiki_history_reduced_latest.
3. Single lambda style mediawiki_history_reduced
This is the same as Option 2, but without the cons of extra dataset storage and segment cache duplication. Existing AQS usages would be migrated to the same dataset.
We'd have to migrate the digest (pre-aggregation) rows to a new, separate monthly snapshot dataset.
We could implement Option 2 and then migrate to Option 3 later.
Pros:
- Single incremental eventually consistent mediawiki_history_reduced to maintain.
- The Airflow sensor is less complicated (we can just keep loading daily into mediawiki_history_reduced).
- No dataset switching needed.
Cons:
- Rollbacks will require manual backfills / dataset replacements.
- We need to create a new monthly digest/aggregation dataset in Druid.
We could also consider computing and streaming mediawiki_history_reduced events into Druid in real time, but that would require running new streaming enrichment jobs and Druid ingestion jobs, which is probably out of scope for this task.
Done is:
- page_id and user_central_id fields are added to Druid mediawiki_history_reduced.
- Druid mediawiki_history_reduced is updated daily