
Global Editor Metrics - Data Pipeline
Closed, Resolved · Public

Description

For WE3.3.7, we need to compute the following daily metrics:

For each global user editor account:

  • Edit Metrics
    • Total edit count within a date range.
    • Total number of days edited within a date range.
    • Longest daily consecutive edit streak within a date range.
    • List of edited pages within a date range. (requirement removed)
  • Pageview Metrics:
    • Total number of pageviews within a date range to pages edited by this user.
    • List of most viewed pages edited by this user.
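Of these, the longest daily consecutive edit streak is the only metric that needs more than a straight aggregation. A minimal sketch in Python (illustrative only, assuming edit timestamps have already been bucketed to calendar days):

```python
from datetime import date, timedelta

def longest_daily_streak(edit_days):
    """Longest run of consecutive calendar days with at least one edit."""
    days = sorted(set(edit_days))          # dedupe multiple edits per day
    best = streak = 1 if days else 0
    for prev, cur in zip(days, days[1:]):
        streak = streak + 1 if cur - prev == timedelta(days=1) else 1
        best = max(best, streak)
    return best

# Edits on Jan 1-3 and Jan 5 give a longest streak of 3 days.
streak = longest_daily_streak(
    [date(2025, 1, 1), date(2025, 1, 2), date(2025, 1, 3), date(2025, 1, 5)]
)
```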

This task is about computing these metrics on an ongoing basis in the Data Lake and then writing them daily to a serving layer datastore (Druid or Cassandra as of 2025).

A separate task will encompass backfilling necessary data.

Edit-related metrics

These can be computed on an ongoing basis from the event.mediawiki_page_change_v1 or mediawiki_content_history_v1 tables, together with pageview_hourly.

While there are varying product needs, the smallest granularity requested is daily.

Per editor metrics are already computable from the monthly snapshot Druid mediawiki_history_reduced dataset. We need a few changes to this Druid dataset to support our use cases, these will be tracked in T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes.


Pageview-related metrics

For the two pageview-related metrics, we'll need to create new serving layer datasets. The serving layer (cassandra) data models are being designed and reviewed at https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft#Data_model_(if_structured).

For the Data Lake pipeline, we'll need to compute some 'intermediate' tables from which we can fill the serving layer tables.

Get all pages a user edited before a date

To compute these, we need a lookup table that, given a user_central_id and a date, will give us the list of pages that user has edited on or before that date.

Option 1

This table might look something like:

Table name: user_edited_pages (?)

| Field | Type | Description |
| --- | --- | --- |
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | bigint | |
| earliest_edit_dt | timestamp | Timestamp on which the user first edited this page. |
| latest_edit_dt | timestamp | Timestamp on which the user last edited this page. |

From this, we could compute the list of pages a user has edited with WHERE earliest_edit_dt <= {date_in_question}.
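As a sketch of how the Option 1 lookup would behave, using an in-memory sqlite table as a stand-in for the Data Lake table (column names follow the task; the sample rows and helper function are illustrative):

```python
import sqlite3

# In-memory sqlite stand-in for the proposed user_edited_pages table;
# column names follow the task, sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_edited_pages (
        user_central_id  INTEGER,
        wiki_id          TEXT,
        page_id          INTEGER,
        earliest_edit_dt TEXT,
        latest_edit_dt   TEXT
    )""")
conn.executemany(
    "INSERT INTO user_edited_pages VALUES (?, ?, ?, ?, ?)",
    [
        (42, "enwiki", 534366, "2025-03-01", "2025-09-01"),
        (42, "enwiki", 999999, "2025-08-15", "2025-08-15"),
    ],
)

def pages_edited_on_or_before(conn, user_central_id, date_in_question):
    # 'on or before' the date, hence <= rather than <
    rows = conn.execute(
        """SELECT wiki_id, page_id
           FROM user_edited_pages
           WHERE user_central_id = ? AND earliest_edit_dt <= ?""",
        (user_central_id, date_in_question),
    )
    return sorted(rows)
```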

We could backfill this table using either mediawiki_history or mediawiki_content_history_v1 (with T406515), and update it daily from either mediawiki_content_history_v1 (with T406515) or from event.mediawiki_page_change_v1.

Option 2

However, this user_edited_pages is a mutable dataset. We need to discuss whether we can/want to maintain and update it as an Iceberg table. Alternatively, we could design an append-only daily table.

Table name: user_edited_pages_daily (?)

| Field | Type | Description |
| --- | --- | --- |
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | string | |
| year | string | |
| month | string | |
| day | string | |

Each day, this table would store, for each user, all pages that user edited that day. When computing the list of pages edited before a date, we could SELECT DISTINCT wiki_id, page_id over all rows on or before that date.
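A sketch of the Option 2 read path, again with sqlite standing in for the Data Lake and the year/month/day partition columns collapsed into a single day column for brevity:

```python
import sqlite3

# Sqlite stand-in for the append-only daily table; the year/month/day
# partition columns are collapsed into one 'day' column for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_edited_pages_daily (
        user_central_id INTEGER, wiki_id TEXT, page_id INTEGER, day TEXT
    )""")
conn.executemany(
    "INSERT INTO user_edited_pages_daily VALUES (?, ?, ?, ?)",
    [
        (42, "enwiki", 534366, "2025-03-01"),
        (42, "enwiki", 534366, "2025-04-07"),  # same page, edited again later
        (42, "enwiki", 999999, "2025-08-15"),
    ],
)

# Pages user 42 edited on or before 2025-05-01; DISTINCT collapses the
# repeated (wiki_id, page_id) rows into one.
pages = sorted(conn.execute(
    """SELECT DISTINCT wiki_id, page_id
       FROM user_edited_pages_daily
       WHERE user_central_id = 42 AND day <= '2025-05-01'"""
))
```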

Decision: Option 2. See T407559: Global Editor Metrics - Data Pipeline - edit_per_editor_per_page_daily.

Pageview intermediate tables

Once we have a lookup for user_edited_pages, we should be able to compute daily metric tables from which we can write to the serving layer (cassandra).

We can either:

  • store a daily count of pageviews per editor per page, and from this compute the total pageviews per editor and top k pages viewed tables and write those to cassandra, or
  • pre-compute and store the total pageview count and top k pages in the Data Lake, and write those to cassandra.

In order to compute the top k metric, we'll need a complete picture of pageview counts per user per page anyway. So, as long as storage for this in the Data Lake isn't a huge issue (is it?), we should probably do that. If we do, this table may look like:

Table name: pageviews_per_editor_per_page

| Field | Type | Description |
| --- | --- | --- |
| user_central_id | bigint | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | string | |
| page_id | bigint | |
| views | bigint | |
| year | string | |
| month | string | |
| day | string | |

From this Hive table, we can then compute:

  • Sum of daily pageviews per user
    • e.g. SELECT SUM(views) ... GROUP BY user_central_id...
  • Top k list of most viewed pages edited by this user per month
    • e.g. ORDER BY views DESC LIMIT 10, grouped per user
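The two aggregations can be sketched like this (sqlite and heapq stand in for the Hive/Spark versions; the sample rows are illustrative and use a single day column instead of year/month/day):

```python
import sqlite3
from heapq import nlargest

# Sqlite stand-in for the pageviews_per_editor_per_page Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pageviews_per_editor_per_page (
        user_central_id INTEGER, wiki_id TEXT, page_id INTEGER,
        views INTEGER, day TEXT
    )""")
conn.executemany(
    "INSERT INTO pageviews_per_editor_per_page VALUES (?, ?, ?, ?, ?)",
    [
        (42, "enwiki", 1, 100, "2025-04-01"),
        (42, "enwiki", 1, 50, "2025-04-02"),
        (42, "enwiki", 2, 30, "2025-04-01"),
    ],
)

# 1) Total pageviews per user.
totals = dict(conn.execute(
    """SELECT user_central_id, SUM(views)
       FROM pageviews_per_editor_per_page
       GROUP BY user_central_id"""))

# 2) Top k most viewed pages per user (k=10 in the task; k=1 here).
per_page = conn.execute(
    """SELECT user_central_id, wiki_id, page_id, SUM(views) AS views
       FROM pageviews_per_editor_per_page
       GROUP BY user_central_id, wiki_id, page_id""").fetchall()
top_k = nlargest(1, (r for r in per_page if r[0] == 42), key=lambda r: r[3])
```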

That data could then be stored in Cassandra tables for serving. See https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft#Data_model_(if_structured) for the serving layer data model.


Done is

  • Data pipeline to daily update editing metrics in mediawiki_history_reduced in Druid.
  • Data pipeline to daily update pageviews_per_editor in Cassandra
  • Data pipeline to monthly update pageviews_top_pages_per_editor_monthly in Cassandra

Related Objects

Event Timeline

We'd like to generate a few days of editor_metrics_per_user_per_page_daily data ASAP in order to better understand how big this data might get over time. The size of this data is quite hard to estimate, as there are quite a few variables that can influence the number of records per day (# of distinct editors, # of pages each editor has edited, # of pages that get pageviews, etc. etc.).

T403664: EventBus - Add central user id to MediaWiki events was deployed last week, so since 2025-09-19, all Hive event.mediawiki_page_change_v1 records have the revision.editor.user_central_id field. We should create a test table with this schema, and write SQL to compute the metrics for just a few days as soon as we can.

@mforns !

From pageview_hourly documentation's description of its page_id field:

MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview, as it depends on whether the page_id is received in the X-Analytics header. As of 2017-06-12, page_id is populated on access methods desktop and mobile web requests, but not mobile app. This means that >95% of pageview requests have a page_id so far.

How should we join event.mediawiki_page_change_v1 with pageview_hourly then? I suppose where page_id=page_id OR page_title=page_title?

Ahoelzl triaged this task as High priority. Sep 25 2025, 9:42 PM

After T405039#11203135, @amastilovic was doing some work to generate some daily sample data to estimate per user and page pageview count sizes. Today, as noted in T401260#11230961, we decided to change directions (if we can, we think we can!).

So, we think it is not worth spending more effort generating per user per page pageview sample data at this time. We'll meet tomorrow and do some tasking on how to move forward most efficiently with the new approach.

Relevant Decisions from a meeting today:

user_central_id
Daily source dataset decision

We had originally decided to use event.mediawiki_page_change_v1 as the incremental datasource, since this dataset now has user_central_id. However, mediawiki_content_history_v1 would be better, since it has daily consistency reconciliation, and would be more accurate.

We discussed that we could join mediawiki_content_history_v1 with event.mediawiki_page_change_v1 to look up the relevant user_central_id, but that it would be much more ideal if mediawiki_content_history_v1 just had a user_central_id field. Dan thinks adding this field would be less work than testing and getting the join right.

(This decision was also documented at T406069#11247231)

@Ottomata Regarding the naming of user_edited_pages as described in T407559 - why do we need the zero at the end of the table name? user_edited_pages_daily0 looks unnecessary to me, but I could be convinced otherwise.

Daily source dataset decision

Add user_central_id to mediawiki_content_history_v1 and use it as incremental daily datasource.

Marcel and I just discussed this more. Conclusion: We need more than user_central_id in mediawiki_content_history_v1 to do this right.

mediawiki_content_history_v1 does not have any field to indicate whether the user was a (self-identified) bot, nor whether the user is a real user vs IP editor vs temp account editor (we could determine this manually from the format of user_text).

We had planned to generate mediawiki_history_reduced in Druid daily from mediawiki_content_history_v1, but because we cannot determine user_type from it alone, we'd have to do some complicated joining to look it up. E.g., daily join mediawiki_content_history_v1 with mediawiki_page_change_v1 to find the associated page_change event and find the editor's user_type there. If the event is missing in mediawiki_page_change_v1, then join with mediawiki_user_history to look up the user's type there. If it is missing from there too (likely because the user is new this month and mediawiki_user_history is a monthly dataset), then we can't determine user_type at all.
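The fallback chain described above, as a hedged sketch (dict lookups stand in for the actual joins; all field names here are illustrative, not real schemas):

```python
# Dict lookups stand in for the joins described above; all names are
# illustrative, not actual schemas.
def resolve_user_type(rev, page_change_events, user_history):
    # 1) Prefer the matching mediawiki_page_change_v1 event, which carries
    #    the performer's user_type.
    event = page_change_events.get((rev["wiki_id"], rev["rev_id"]))
    if event is not None:
        return event["user_type"]
    # 2) Fall back to the (monthly) mediawiki_user_history snapshot.
    user = user_history.get((rev["wiki_id"], rev["user_text"]))
    if user is not None:
        return user["user_type"]
    # 3) Event missing and user new this month: cannot determine user_type.
    return None
```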

We think mediawiki_content_history_v1 is just too complicated for our timelines, so we are going to use event.mediawiki_page_change_v1 instead. This will suffer from rare missed events, and missed events due to potential outages. We did vet this with the product teams earlier and they said this was okay. We were trying to do better, but I think it is too complicated.

Decision: We will use event.mediawiki_page_change_v1 as our daily incremental data source.

Additionally, I'd like to use this kind of thing as a blocker for reconciled mediawiki_page_change and/or "incremental mediawiki_history", as we had always intended to do. The fact that our eventually consistent page change dataset is so Dumps-schema focused is an error, and we should fix it so we don't keep banging into this wall.

user_edited_pages_daily0 looks unnecessary to me, but I could be convinced otherwise.

WIP typo! I do that when I'm creating test tables so I can increment as I make changes. The final one will not have the 0, but maybe _v1 if we want to version it!


Got it - thanks.

Change #1199067 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add HQL for wmf_readership.pageview_per_editor_per_page_daily table

https://gerrit.wikimedia.org/r/1199067

Change #1196892 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add HQL for edit_per_editor_per_page_daily and pageview_per_editor_per_page_daily

https://gerrit.wikimedia.org/r/1196892

Change #1199067 abandoned by Ottomata:

[analytics/refinery@master] Add HQL for wmf_readership.pageview_per_editor_per_page_daily table

Reason:

merging this patchset into https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1196892

https://gerrit.wikimedia.org/r/1199067

@mforns and I were debugging our pageviews/per_editor queries yesterday, and we ran into a very unexpected issue with pageview_hourly. This issue is explored and (will be) documented in T408798: Spike: investigate incorrect page_id values in pageview_hourly.

For our purpose, we are okay. We can still use pageview_hourly.page_id to join on edited pages page_ids. There are 3 caveats though:

From Pageview_hourly docs

  1. pageviews from apps are not counted. This will be true for all of our global per editor pageview metrics, including top K pages viewed.

As of 2017-06-12, page_id is populated on access methods desktop and mobile web requests, but not mobile app. This means that >95% of pageview requests have a page_id so far.

  2. "pageviews per editor's edited pages" counting will be different from the existing "AQS pageviews per article" counting

In case of redirects, the page_id we received is the one of the redirected-to page. This means that, for instance, on the English Wikipedia the same page_id 534366 is associated with the different page_titles Barack_Obama (original content page), Barack_obama (redirect to main content page), Barack_Hussein_Obama (another redirect), Barack_H._Obama (again another redirect) ...

The existing AQS pageviews/per-article API reports pageviews based on whatever the page_title happens to be at the time of the pageview in the webrequest logs.

We need to associate past edits on pages with current pageviews, so we can't rely on the mutable page_title to join between different time periods. So, we sum over page_id, rather than page_title. This means that in our new pageviews/v3/per_editor API, pageview counts to a page will be slightly higher, especially if the page has a lot of redirected views.

Arguably, the new way of counting redirects is more correct, since MediaWiki itself is reporting the page_id of a redirect as the target's page_id. We will be counting a pageview to the target in the same way MediaWiki associates the target page_id with the requested redirect page.

  3. In some rare cases, we may overcount pageviews, because for some very strange reason:
    • It is possible to diff revisions belonging to different pages AND
    • diff views like this are counted as pageviews (if they have the title param set; they don't have to!),
    • and the page_id set by MW is one of the two pages being diffed (I think it is the page that has the revision_id provided in the diff query param.)

We think that this last caveat is unlikely to be a problem, as no normal human would construct a diff URL like that, and these kinds of pageviews are identified as bots so we filter them out from our counts anyway.

For intermediate Data Lake tables, Add HQL for edit_per_editor_per_page_daily and pageview_per_editor_per_page_daily (1196892) should be good to go from a data model and load query perspective.

We have some issues with # of small files (convo in CR) in the backfilled edit_per_editor_per_page_daily table that we think we can mitigate with partitioning/rewriting, but the data model and queries should be okay.

I've been testing backfilling pageview_per_editor_per_page. Fab repartitioned Alek's test table at fab.edit_per_editor_per_page_daily and it performs better now. I can backfill a month of pageview data using this table in a little over 5 minutes.

spark3-sql --master yarn --driver-memory 50G --executor-cores 2 --executor-memory 16G --conf spark.dynamicAllocation.maxExecutors=32 --conf spark.executor.memoryOverhead=1024 --conf spark.yarn.maxAppAttempts=1 \
-f ./load_pageview_per_editor_per_page_daily.hql \
-d edit_per_editor_per_page_daily_table='fab.edit_per_editor_per_page_daily' \
-d pageview_hourly_table='wmf.pageview_hourly' \
-d destination_table='otto.pageview_per_editor_per_page_daily10' \
-d start_day=2025-04-01 \
-d end_day=2025-05-01

...

Time taken: 334.243 seconds

OK so I've now officially backfilled the wmf_contributors. and wmf_readership. tables, but the process I had to use in order for the number of files to be small enough is complicated enough that it warrants being documented somewhere:

  1. I've enabled the 'write.distribution-mode' = 'hash' table property for these two tables
TBLPROPERTIES (
    'format-version' = '2',
    'write.delete.mode' = 'copy-on-write',
    'write.parquet.compression-codec' = 'zstd',
    'write.distribution-mode' = 'hash'  -- <<<<<<<<<<<
)
  2. In order for the data backfill into these tables to work properly, I had to run Spark 3.3 with Iceberg 1.6.1. That particular setup is available in Jupyter under my own username (apart from Airflow SparkOperator), so I created the tables in my own user database and backfilled the data into them using a Jupyter notebook.
  3. Once the backfills were done, I ran the following SQL using sudo -u analytics spark3-sql -f load-edit.sql (so effectively using our regular Spark and Iceberg 1.2.1):
INSERT INTO wmf_contributors.edit_per_editor_per_page_daily
SELECT *
FROM amastilovic.edit_per_editor_per_page_daily
CLUSTER BY `day`
;

The key here is the CLUSTER BY instruction, which preserved the small number of files in each partition.

@JAllemandou @xcollazo

Change #1196892 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Add HQL for edit_per_editor_per_page_daily and pageview_per_editor_per_page_daily

https://gerrit.wikimedia.org/r/1196892

Revised: Daily source dataset decision

Further investigation showed that mediawiki_content_history_v1 is a viable source for the daily-updated mediawiki history reduced Druid dataset.
A derivative dataset, wmf_content.mediawiki_revision_history, with user_central_id information has been created.

Implementation details here:
https://phabricator.wikimedia.org/T410688

Change #1216874 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery@master] Add the monthly pageviews_per_editor HQL script

https://gerrit.wikimedia.org/r/1216874

Change #1216874 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Add the monthly pageviews_per_editor HQL script

https://gerrit.wikimedia.org/r/1216874

Change #1217302 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery@master] Make pageviews per editor HQL scripts able to do idempotent runs

https://gerrit.wikimedia.org/r/1217302

Change #1217302 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Make pageviews per editor HQL scripts able to do idempotent runs

https://gerrit.wikimedia.org/r/1217302