
Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1
Closed, Resolved, Public

Description

The work from T389026: Rethink rev_sha1 field will remove revision.rev_sha1 from MariaDB.
This field is currently used to compute identity-reverts in the mediawiki_history dataset.
The value can still be computed from the per-slot sha1 values available in the content table (accessed from the revision table via a join to the slots table and then to the content table).

As written in this comment, computing the sha1 value from the content-slot sha1s is straightforward.
An analysis of the code shows that slots are ordered alphabetically by slot role, and that only non-derived slots are to be considered.
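
The join described above can be sketched as a query. The table and column names below follow the standard MediaWiki core schema and should be treated as assumptions to verify; filtering to non-derived slots is omitted, since derived roles are defined in configuration rather than in the slot_roles table:

```python
# Hypothetical sketch of the join path described above, using standard
# MediaWiki core column names (an assumption; verify against the actual schema).
# Non-derived slot filtering is not shown: derived roles come from MediaWiki
# configuration, not from a column in slot_roles.
PER_SLOT_SHA1_QUERY = """
SELECT rev_id,
       role_name    AS slot_role,
       content_sha1 AS slot_sha1
  FROM revision
       JOIN slots      ON slot_revision_id = rev_id
       JOIN slot_roles ON role_id = slot_role_id
       JOIN content    ON content_id = slot_content_id
"""
```

This yields one row per (revision, slot) pair; the rows then have to be grouped per revision before the combined sha1 can be computed.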

Event Timeline

Proposal for this task:

Questions:

  • Do we have slots / content for archived revisions? I assume yes, but I have never checked.
  • Is it ok to write the UDF as a Spark-only UDF instead of a Hive UDF (compatible with Spark)? We don't use the Hive execution engine anymore and the code is more complicated for Hive...

For reference: example of base-36 sha1 creation with scala/spark:

math.BigInt(
  spark.sql("SELECT sha1(CONCAT('1jkqj7lxs8l999wu3jzlmzafwh6e2h4', 't15553qh44ewu6rrgv4xbrkpoutrvj3'))")
    .collect.map(r => r.getString(0)).head,
  16
).toString(36)
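
The same reduce can be sketched in plain Python. This is an illustration, not the refinery implementation: the alphabetical slot-role ordering follows the code analysis above, and the 31-character base-36 padding used by MediaWiki for stored sha1 values is an assumption to double-check:

```python
import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n, width=31):
    # MediaWiki stores sha1 as base 36 padded to 31 chars (assumption to verify).
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out.rjust(width, "0")

def combined_sha1(slots):
    """slots: (slot_role, content_sha1_base36) pairs for non-derived slots.
    Sorts by slot role, then reduces: a single slot keeps its sha1 as-is;
    otherwise each step hashes the concatenation of the accumulator and the
    next slot's sha1, re-encoding the digest in base 36 (as in the
    Scala/Spark example above)."""
    sha1s = [sha1 for _, sha1 in sorted(slots)]
    acc = sha1s[0]
    for s in sha1s[1:]:
        acc = to_base36(int(hashlib.sha1((acc + s).encode("ascii")).hexdigest(), 16))
    return acc
```

For a revision with a single main slot this reduces to the slot's own sha1, which should match today's revision.rev_sha1 value.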

@Milimetric and @mforns, may I ask you to review the plan and questions in the comment above please?

> Proposal for this task:

sounds mostly good, no notes except for the below:

>   • Update mediawiki_history Scala job to compute sha1 from content for both archive and revision
>     • Join archive/revision with slots, slot_roles and content to get, for each revision, an array of struct(slot_role, content_sha1) (non-derived content/slots only)
>     • Write a Spark UDF taking the array of structs as a parameter and returning the computed sha1
>       • Sort the array of structs by slot_role
>       • Reduce the array over the content sha1 values as described here

Here, maybe worth considering: shifting MW history down a level to think about slots instead of revisions. Then we wouldn't have to compute this overall sha1, we'd just be dealing with the individual slot sha1s. This would maybe be a bigger change in some cases but would be a better interpretation of the MW database model. With MCR adoption what it is now, maybe this doesn't matter. But if that takes off, we'd probably have to think about slots more anyway.

> Questions:
>
>   • Do we have slots / content for archived revisions? I assume yes, but I have never checked.

Yes, just checked:

select *
  from slots
       inner join
       (select ar_rev_id
          from archive
         where ar_timestamp > 2025
         limit 10
       ) a on ar_rev_id = slot_revision_id

>   • Is it ok to write the UDF as a Spark-only UDF instead of a Hive UDF (compatible with Spark)? We don't use the Hive execution engine anymore and the code is more complicated for Hive...

definitely Spark-only, no reason to take on tech debt

> shifting MW history down a level to think about slots instead of revisions. Then we wouldn't have to compute this overall sha1, we'd be just dealing with the individual slot sha1. This would maybe be a bigger change in some cases but would be a better interpretation of the MW database model. With MCR adoption what it is now, maybe this doesn't matter. But if that takes off, we'd probably have to think about slots more anyway.

I don't really manage to wrap my head around what this change would entail and what the dataset would look like after the change. Given the small adoption of MCR today, I'd go for the simple way of re-implementing rev_sha1. Happy to discuss more if you think the bigger change is worth the effort @Milimetric.

Hi!

Since we are making changes to MW history, could we do T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables at the same time? We need this for T403660: WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics, and it would be nice to include it in the main MW history dataset, rather than only in the downstream Druid serving-layer dataset.

Who is actively working on this?

Thank you!

Change #1196049 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update mediawiki_history job for rev_sha1 DB removal

https://gerrit.wikimedia.org/r/1196049

Change #1196049 merged by jenkins-bot:

[analytics/refinery/source@master] Update mediawiki_history job for rev_sha1 DB removal

https://gerrit.wikimedia.org/r/1196049

Change #1196469 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Fix mediawiki_history bug from previous patch

https://gerrit.wikimedia.org/r/1196469

Change #1196469 merged by jenkins-bot:

[analytics/refinery/source@master] Fix mediawiki_history bug from previous patch

https://gerrit.wikimedia.org/r/1196469

Change #1196485 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Fix mediawiki_history bug from previous patch

https://gerrit.wikimedia.org/r/1196485

Change #1196485 merged by jenkins-bot:

[analytics/refinery/source@master] Fix mediawiki_history bug from previous patch

https://gerrit.wikimedia.org/r/1196485

Change #1196518 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Improve mediawiki_history previous patch

https://gerrit.wikimedia.org/r/1196518

Change #1196631 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update sqoop for mediawiki_history

https://gerrit.wikimedia.org/r/1196631

Change #1196518 merged by jenkins-bot:

[analytics/refinery/source@master] Improve mediawiki_history previous patch

https://gerrit.wikimedia.org/r/1196518

Change #1196716 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Nullify sha1 from revision and archive in sqoop

https://gerrit.wikimedia.org/r/1196716

Change #1196716 merged by Mforns:

[analytics/refinery@master] Nullify sha1 from revision and archive in sqoop

https://gerrit.wikimedia.org/r/1196716

Change #1196631 merged by Btullis:

[operations/puppet@production] Update sqoop for mediawiki_history

https://gerrit.wikimedia.org/r/1196631

xcollazo changed the task status from Open to In Progress. Oct 17 2025, 3:16 PM

Change #1202191 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Fix mediawiki-history UDF NPE

https://gerrit.wikimedia.org/r/1202191

Change #1202191 merged by Joal:

[analytics/refinery/source@master] Fix mediawiki-history UDF NPE

https://gerrit.wikimedia.org/r/1202191

Change #1202334 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Add appropriate NULL checks to computeForTuples

https://gerrit.wikimedia.org/r/1202334

Change #1202334 merged by jenkins-bot:

[analytics/refinery/source@master] Add appropriate NULL checks to computeForTuples

https://gerrit.wikimedia.org/r/1202334