For WE3.3.7, we need to compute the following daily metrics:
For each global user editor account:
- Edit Metrics
- Total edit count within a date range.
- Total number of days edited within a date range.
- Longest daily consecutive edit streak within a date range.
- ~~List of edited pages within a date range.~~ (requirement removed)
- Pageview Metrics:
- Total number of pageviews within a date range to pages edited by this user.
- List of most viewed pages edited by this user.
This task is about computing these metrics on an ongoing basis in the Data Lake and then writing them daily to a serving layer datastore (Druid or Cassandra as of 2025).
A separate task will encompass backfilling necessary data.
Edit related metrics
These can be computed on an ongoing basis from the event.mediawiki_page_change_v1 or mediawiki_content_history_v1 tables, and from pageviews_hourly.
While there are varying product needs, the smallest granularity requested is daily.
Per editor metrics are already computable from the monthly snapshot Druid mediawiki_history_reduced dataset. We need a few changes to this Druid dataset to support our use cases; these will be tracked in T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes.
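Of the edit metrics above, the longest consecutive edit streak is the least obvious to express in SQL. A sketch of the classic gaps-and-islands approach, assuming a hypothetical per-editor daily table `edits_per_editor_daily(user_central_id, edit_date)` derived from the sources above (table and column names are illustrative, not from the actual schema):

```sql
-- Consecutive days share the same (edit_date - row_number) anchor,
-- so grouping by that anchor yields one row per streak.
WITH distinct_days AS (
  SELECT DISTINCT user_central_id, edit_date
  FROM edits_per_editor_daily
  WHERE edit_date BETWEEN DATE '{start_date}' AND DATE '{end_date}'
),
islands AS (
  SELECT
    user_central_id,
    DATE_SUB(edit_date, ROW_NUMBER() OVER (
      PARTITION BY user_central_id ORDER BY edit_date
    )) AS streak_anchor
  FROM distinct_days
)
SELECT user_central_id, MAX(streak_length) AS longest_streak
FROM (
  SELECT user_central_id, streak_anchor, COUNT(*) AS streak_length
  FROM islands
  GROUP BY user_central_id, streak_anchor
) streaks
GROUP BY user_central_id;
```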
Pageview related metrics
For the 2 pageview related metrics, we'll need to create new serving layer datasets. The serving layer (cassandra) data models are being designed and reviewed at https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft#Data_model_(if_structured).
For the Data Lake pipeline, we'll need to compute some 'intermediate' tables from which we can fill the serving layer tables.
Get all pages a user edited before a date
To compute these, we need a lookup table that, given a user_central_id and a date, returns the list of pages that user has edited on or before that date.
Option 1
This table might look something like:
Table name: user_edited_pages (?)
| Field | Type | Description |
|---|---|---|
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | bigint | |
| earliest_edit_dt | timestamp | Timestamp on which the user first edited this page. |
| latest_edit_dt | timestamp | Timestamp on which the user last edited this page. |
From this, we could compute the list of pages a user has edited with `WHERE earliest_edit_dt <= {date_in_question}`.
We could backfill this table using either mediawiki_history or mediawiki_content_history_v1 (with T406515), and update it daily from either mediawiki_content_history_v1 (with T406515) or from event.mediawiki_page_change_v1.
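A sketch of the Option 1 lookup query, using the proposed schema above (`{user_central_id}` and `{date_in_question}` are placeholders to be filled by the pipeline):

```sql
-- Pages this user had edited on or before the date in question.
SELECT wiki_id, page_id
FROM user_edited_pages
WHERE user_central_id = {user_central_id}
  AND earliest_edit_dt <= TIMESTAMP '{date_in_question}';
```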
Option 2
However, this user_edited_pages is a mutable dataset. We need to discuss whether we can/want to maintain and update it as an Iceberg table. Alternatively, we could design an append-only daily table like:
Table name: user_edited_pages_daily (?)
| Field | Type | Description |
|---|---|---|
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | string | |
| year | string | |
| month | string | |
| day | string | |
Each day, this table would store, for each user, all pages that user edited that day. When computing the list of pages edited on or before a date, we could `SELECT DISTINCT wiki_id, page_id` over all partitions up to that date.
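The Option 2 lookup might then look like the following sketch, assuming the year/month/day partition columns are stored as strings per the schema above:

```sql
-- Distinct pages this user edited on or before the date in question,
-- reconstructing a date from the string partition columns.
SELECT DISTINCT wiki_id, page_id
FROM user_edited_pages_daily
WHERE user_central_id = {user_central_id}
  AND MAKE_DATE(CAST(year AS INT), CAST(month AS INT), CAST(day AS INT))
      <= DATE '{date_in_question}';
```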
Decision: Option 2 T407559: Global Editor Metrics - Data Pipeline - edit_per_editor_per_page_daily
Pageview intermediate tables
Once we have a lookup for user_edited_pages, we should be able to compute daily metric tables from which we can write to the serving layer (cassandra).
We can either:
- store a daily count of pageviews per editor per page and from this compute the total pageviews per editor and top k pages viewed tables and write that to cassandra
- pre-compute and store the total pageview count and top k pages in the Data Lake, and write that to cassandra.
In order to compute the top k metric, we'll need to have a complete picture of pageview counts per user per page anyway. So, as long as storage for this in the Data Lake isn't a huge issue (is it?), we should probably do that. If we do, this table may look like
Table name: pageviews_per_editor_per_page
| Field | Type | Description |
|---|---|---|
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | bigint | |
| views | bigint | |
| year | string | |
| month | string | |
| day | string | |
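A sketch of how this table could be populated daily, joining the day's pageviews to the user_edited_pages_daily lookup. The column names on the pageviews side (and whether pageviews_hourly exposes a joinable wiki_id and page_id) are assumptions to be checked against the real schema:

```sql
-- Daily load: pageviews to pages each editor has edited so far.
INSERT INTO pageviews_per_editor_per_page
SELECT
  e.user_central_id,
  e.wiki_id,
  e.page_id,
  SUM(p.view_count) AS views,
  '{year}' AS year, '{month}' AS month, '{day}' AS day
FROM pageviews_hourly p
JOIN (
  -- All pages each user has edited up to and including this day
  -- (partition filter on year/month/day omitted for brevity).
  SELECT DISTINCT user_central_id, wiki_id, page_id
  FROM user_edited_pages_daily
) e
  ON p.wiki_id = e.wiki_id
 AND p.page_id = e.page_id
WHERE p.year = '{year}' AND p.month = '{month}' AND p.day = '{day}'
GROUP BY e.user_central_id, e.wiki_id, e.page_id;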
From this Hive table, we can then compute:
- Sum of daily pageviews per user
- e.g. `SELECT SUM(views) ... GROUP BY user_central_id ...`
- Top k list of most viewed pages edited by this user per month
- e.g. `ORDER BY views DESC LIMIT 10`, grouped per user
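The two aggregations sketched above, written out in full against pageviews_per_editor_per_page for a single month (placeholders and k=10 are illustrative):

```sql
-- Total pageviews per editor for the month.
SELECT user_central_id, SUM(views) AS total_views
FROM pageviews_per_editor_per_page
WHERE year = '{year}' AND month = '{month}'
GROUP BY user_central_id;

-- Top 10 most viewed edited pages per editor, via a ranking window.
SELECT user_central_id, wiki_id, page_id, page_views
FROM (
  SELECT
    user_central_id, wiki_id, page_id,
    SUM(views) AS page_views,
    ROW_NUMBER() OVER (
      PARTITION BY user_central_id ORDER BY SUM(views) DESC
    ) AS rn
  FROM pageviews_per_editor_per_page
  WHERE year = '{year}' AND month = '{month}'
  GROUP BY user_central_id, wiki_id, page_id
) ranked
WHERE rn <= 10;
```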
That data could then be stored in Cassandra tables for serving. See https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft#Data_model_(if_structured) for serving layer data model
Done is:
- Data pipeline to daily update editing metrics in mediawiki_history_reduced in Druid.
- Data pipeline to daily update pageviews_per_editor in Cassandra
- Data pipeline to monthly update pageviews_top_pages_per_editor_monthly in Cassandra