
Adapt MW Content pipelines to the removal of upstream revision.rev_sha1
Closed, Resolved (Public)

Description

The work from T389026: Rethink rev_sha1 field will remove revision.rev_sha1 from MariaDB.

We depend on it:

This will affect wmf_content.mediawiki_content_history_v1 as well.

And it will also affect T384382: Production-level file export (aka dump) of MW Content in XML.

Since RevisionSlots::computeSha1() is implemented in PHP, we will have to reimplement that algorithm on our side if we are to continue offering that field in the table and in File Export.

In this task we need to adapt our code.

Some questions:

  • It looks like the DumpsV1 code will need to compute this on the fly. Should we also compute it on the fly for File Export, and just remove the revision_sha1 field from wmf_content.mediawiki_content_history_v1?
  • Alternatively, should we compute it on ingestion into wmf_content.mediawiki_content_history_v1 to honor its schema?

Decision: Given that upstream MW is dropping the revision.rev_sha1 column, and that the corresponding column on our side is not used by any production workloads (code search, Slack thread), we agreed to drop revision_sha1 from wmf_content.mediawiki_content_history_v1 and wmf_content.mediawiki_content_current_v1, and to compute it on the fly for File Export to honor the XSD Schema.

Some resources:
Def of RevisionSlots::computeSha1()
Def of base_convert: https://github.com/wikimedia/base-convert/blob/master/src/Functions.php#L40
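
As a starting point, here is a minimal Python sketch of such a reimplementation, based on the linked definitions: the per-slot hash is the hex SHA-1 of the content re-encoded in base 36 and left-padded to 31 characters (mirroring Wikimedia's base_convert), and for multi-slot (MCR) revisions the per-slot hashes are folded left in slot-role order. The exact fold and padding semantics are assumptions to be verified against the PHP source before production use:

```python
import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36_sha1(data: bytes) -> str:
    """Hex SHA-1 re-encoded in base 36 and zero-padded to 31 chars,
    mirroring Wikimedia\\base_convert(sha1($data), 16, 36, 31)."""
    n = int(hashlib.sha1(data).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out.rjust(31, "0")

def compute_rev_sha1(slots: dict[str, bytes]) -> str:
    """Sketch of RevisionSlots::computeSha1(): slots keyed by role name
    (e.g. 'main', 'mediainfo'), iterated in sorted key order (PHP ksort),
    folded left: first slot's sha1 as-is, then base36_sha1(accu + slot_sha1)."""
    accu = None
    for role in sorted(slots):
        slot_sha1 = base36_sha1(slots[role])
        accu = slot_sha1 if accu is None else base36_sha1((accu + slot_sha1).encode())
    return accu
```

For the common single-slot (main only) case this reduces to the familiar base-36 SHA-1 of the content, which makes it easy to validate against existing rev_sha1 values before trusting the multi-slot path.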

Details

Related Changes in Gerrit:
Related Changes in GitLab:
  • main: bump event_schema_version for mw_content reconcile (repos/data-engineering/airflow-dags!1759, xcollazo, bump-mw-content-stream-schema → main)
  • main: bump mw content to pickup schema changes. (repos/data-engineering/airflow-dags!1758, xcollazo, drop-sha1-add-central-id → main)
  • main: bump refinery to pickup MCR fixes on mw content file export (repos/data-engineering/airflow-dags!1755, xcollazo, bump-refinery-on-mw-content → main)
  • Drop revision_sha1, add user_central_id (repos/data-engineering/mediawiki-content-pipelines!81, xcollazo, drop-sha1-add-central-id → main)
  • Draft: main: bump refinery to pickup MCR fixes for mw content file export (repos/data-engineering/airflow-dags!1751, xcollazo, bump-refinery → main)

Event Timeline

xcollazo renamed this task from "Adapt MW Content pipelines to removel of upstream revision_sha1" to "Adapt MW Content pipelines to the removal of upstream revision.rev_sha1". Sep 25 2025, 6:05 PM
xcollazo triaged this task as High priority.

Discussed this with @JAllemandou.

We came to these conclusions:

  • It seems the work from T389026 will keep honoring revision_sha1 on the MW API.
  • This may require changes to page_change to honor it there as well?
  • MW Content pipelines will need changes, since we were expecting revision_sha1 to be present both for the reconcile algorithm (consistency_check.py) and for generating the reconcile events (emit_reconcile_events_to_kafka.py). We could figure out how to calculate the field on the fly, or we could just drop revision_sha1.
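
To make the "just drop revision_sha1" option in that last bullet concrete, here is a toy illustration (all field names are hypothetical, not the actual consistency_check.py logic): reconciliation simply stops including revision_sha1 in the set of compared fields, so no reimplementation of the hash is needed on this path.

```python
# Hypothetical field list for illustration only; the real reconcile
# comparison in consistency_check.py may differ.
ALL_FIELDS = ["wiki_id", "revision_id", "user_id", "revision_sha1"]
DROPPED = {"revision_sha1"}
COMPARE_FIELDS = [f for f in ALL_FIELDS if f not in DROPPED]

def rows_match(expected: dict, actual: dict) -> bool:
    """Compare only the fields we still carry; revision_sha1 is excluded."""
    return all(expected.get(f) == actual.get(f) for f in COMPARE_FIELDS)
```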

Separately:

  • It looks like MW History will also need changes, and these changes also require a reimplementation of RevisionSlots::computeSha1().

This may require changes to page_change to honor it there as well?

page_change honors it, via MediaWiki's own calculation in RevisionStoreRecord::getSha1. So in page_change, revision.rev_sha1 should equal the result of RevisionSlots::computeSha1().

generate the reconcile events

If needed, I wonder if the reconciliation / content enrichment pipelines could enrich the event with rev_sha1 from a MediaWiki API response? Although, I suppose if you need the logic to compute it for backfill of mediawiki_content_history_v1 (and also for mediawiki_history), it would be better to do it in the Data Lake if you can?

It looks like we can get it.

If needed, I wonder if the reconciliation / content enrichment pipelines could enrich the event with rev_sha1 from a MediaWiki API response?

Right, although the easier fix seems to just drop the field.

Although, I suppose if you need the logic to compute it for backfill of mediawiki_content_history_v1 (and also for mediawiki_history), it would be better to do it in the Data Lake if you can?

For MW Content purposes, we do need it for File Export. But we could derive it by reimplementing RevisionSlots::computeSha1() in, say, a UDF, which mediawiki_history could reuse.

In our meeting, @JAllemandou figured that it would be an easy implementation, but we do need to make sure we get the ordering of sha1s right. We need to consult with MW folks on PHP's ksort behavior here, and on what exactly is being used as the sort key.

In our meeting, @JAllemandou figured that it would be an easy implementation, but we do need to make sure we get the ordering of sha1s right.

I have investigated a bit, and found from here that the ordering is by slot role. I'll test with that.
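
To illustrate why the ordering matters: the combined hash is a left fold, so swapping slot order changes the result. A quick sanity check (plain Python, with a simplified hex-digest chain standing in for the real base-36 encoding) also suggests that PHP's ksort() on string role keys should match Python's sorted() on the same role names:

```python
import hashlib

def chain(parts: list[str]) -> str:
    """Left fold: first part unchanged, then sha1(accu + part) for the rest.
    (Simplified stand-in for the base-36 chaining in computeSha1().)"""
    accu = None
    for p in parts:
        accu = p if accu is None else hashlib.sha1((accu + p).encode()).hexdigest()
    return accu

# Chained hashes are order-sensitive: slot order must match MediaWiki's.
assert chain(["aaa", "bbb"]) != chain(["bbb", "aaa"])

# PHP's ksort() sorts an associative array by key; for ASCII role names
# that is the same order Python's sorted() produces on the dict keys.
assert sorted({"mediainfo": 1, "main": 2}) == ["main", "mediainfo"]
```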

xcollazo changed the task status from Open to In Progress. Oct 10 2025, 1:28 AM
xcollazo claimed this task.

Change #1195330 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] MW Dumper: Add support for Multi-content Revisions (MCR)

https://gerrit.wikimedia.org/r/1195330

Change #1195330 merged by jenkins-bot:

[analytics/refinery/source@master] MW Dumper: Add support for Multi-content Revisions (MCR)

https://gerrit.wikimedia.org/r/1195330

Just added this decision to the description of this task:

Decision: Given that upstream MW is dropping the revision.rev_sha1 column, and that the corresponding column on our side is not used by any production workloads (code search, Slack thread), we agreed to drop revision_sha1 from wmf_content.mediawiki_content_history_v1 and wmf_content.mediawiki_content_current_v1, and to compute it on the fly for File Export to honor the XSD Schema.

From MR 81:

In this MR we:

  • Drop column revision_sha1 from both wmf_content.mediawiki_content_*_v1 tables.
  • Add column user_central_id to both wmf_content.mediawiki_content_*_v1 tables.
  • Modify the code as needed for the above.

For now, we do not consider reconciling user_central_id. We will do that work later.

As part of getting these changes to production, we will have to run the following ALTERs:

ALTER TABLE wmf_content.mediawiki_content_history_v1 DROP COLUMN revision_sha1;
ALTER TABLE wmf_content.mediawiki_content_current_v1 DROP COLUMN revision_sha1;

-- the AFTER clause makes the SELECT * output more natural
ALTER TABLE wmf_content.mediawiki_content_history_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
ALTER TABLE wmf_content.mediawiki_content_current_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;

Bug: T406515

Bug: T405641

Ran the following in production:

$ hostname -f
an-launcher1002.eqiad.wmnet
$ whoami
analytics
$ kerberos-run-command analytics spark3-sql

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_history_v1 DROP COLUMN revision_sha1;
25/10/20 19:13:21 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 2.315 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_current_v1 DROP COLUMN revision_sha1;
25/10/20 19:13:28 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.211 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_history_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
25/10/20 19:13:36 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.237 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_current_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
25/10/20 19:13:42 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.181 seconds

Sanity checks:

spark-sql (default)> select wiki_id, revision_id, user_id, user_central_id from wmf_content.mediawiki_content_history_v1 where wiki_id='simplewiki' limit 10;
wiki_id	revision_id	user_id	user_central_id
simplewiki	59	0	NULL
simplewiki	60	20	NULL
simplewiki	1480	20	NULL
simplewiki	4210	2	NULL
simplewiki	4522	11	NULL
simplewiki	5619	121	NULL
simplewiki	8877	11	NULL
simplewiki	10140	0	NULL
simplewiki	22416	11	NULL
simplewiki	23660	793	NULL
Time taken: 1.236 seconds, Fetched 10 row(s)

spark-sql (default)> select wiki_id, revision_id, user_id, user_central_id from wmf_content.mediawiki_content_current_v1 where wiki_id='simplewiki' limit 10;
wiki_id	revision_id	user_id	user_central_id
simplewiki	10580813	1678721	NULL
simplewiki	10580720	1678706	NULL
simplewiki	10579440	430706	NULL
simplewiki	10579698	1673561	NULL
simplewiki	10579097	1595360	NULL
simplewiki	10579676	1677895	NULL
simplewiki	10580818	805501	NULL
simplewiki	10580800	1011873	NULL
simplewiki	10579686	1673561	NULL
simplewiki	10579603	1150185	NULL
Time taken: 0.921 seconds, Fetched 10 row(s)

All DAGs were successful overnight, although I did need to make a manual change to the DagProperties of the reconcile pipeline. Will send a minor patch with the change now.