Page MenuHomePhabricator

Rethink rev_sha1 field
Closed, ResolvedPublic

Description

This field exists in every row of revision table and is the largest field of that table. Overall it takes ~20% of the revision table and given how large this table is (206GB in enwiki, 377GB in wikidata) this adds up to quite a large amount of storage (I estimate this would be on average 40GB per replica = 8TB in total).

I'm not against storage of checksums and I see the value but I'm not sure the status quo is taking all trade-offs into account. It's important to think of the purpose of this field. Is it proof for integrity or proof for authenticity? If it's the former, then this is quite an overkill. If it's latter, then without a Merkle-tree and something stronger than sha1 (which is provably not cryptographically secure) we can't provide that (i.e. status quo can't be used for proof of authenticity). My point is that these we store revisions, not TLS certificates.

Another problem with rev_sha1 is that it stores the hex value of the checksum as string (varbinary(31)) which is quite wasteful.

Proposal 0: Drop rev_sha1 and compute it on the fly from content_sha1. <- We are going with this.

Proposal 1: Store the checksum as an unsigned int. That would be enough to protect against random corruptions and even revert detection. To get the value, it'd be enough to take first 8 digits of the hex value, turn into a decimal and get reminder of max value of int in mysql.

  • Pros: It saves a lot of space. 28 bytes per row which will add up to ~35GB being saved on average from each replica.
  • Cons: Slightly higher chance of collision: 1 in 4B revisions might collide (theoretically, in practice it's higher)

Proposal 2: Store the checksum as an unsigned bigint instead.

  • Pros: Pretty low chance of collision.
  • Cons: Saves only 30GB per replica

Proposal 3: Normalize the value into another table (there is some repetitions due to reverts and such).

  • Pros: It removes a lot from revision table, specially good for innodb buffer pool.
  • Cons: It doesn't remove the data and conceptually it's better to keep the checksum near the object itself. Also since the pointer in revision has to be a bigint, it doesn't save much.

See also T158986: Migrate SHA-1 hashes to SHA-256 (tracking) and subtasks.

Details

Related Changes in Gerrit:
SubjectRepoBranchLines +/-
mediawiki/coremaster+350 -15
mediawiki/coremaster+393 -14
mediawiki/coremaster+3 -45
mediawiki/coremaster+2 -9
operations/puppetproduction+5 -11
mediawiki/coremaster+22 -13
mediawiki/extensions/CheckUsermaster+0 -4
mediawiki/coremaster+4 -0
mediawiki/extensions/FlaggedRevsmaster+0 -1
mediawiki/coremaster+3 -21
mediawiki/coremaster+36 -35
mediawiki/extensions/WikimediaEditorTasksmaster+0 -1
mediawiki/extensions/WikimediaMaintenancemaster+0 -95
mediawiki/extensions/WikimediaMaintenancemaster+88 -0
mediawiki/coremaster+51 -4
mediawiki/coremaster+78 -0
mediawiki/coremaster+48 -9
mediawiki/coremaster+36 -9
mediawiki/coremaster+33 -15
mediawiki/coremaster+8 -25
mediawiki/coremaster+25 -8
mediawiki/extensions/EventBusmaster+1 -2
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change #1191196 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] ApiQueryRecentChanges: Compute SHA1 from content hashes on the fly

https://gerrit.wikimedia.org/r/1191196

Change #1191187 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/EventBus@master] tests: Stop setting a custom sha1 value for the revision

https://gerrit.wikimedia.org/r/1191187

Adding the Wikidata tag as its explicitly mentioned above. Please try to add the Wikidata umbrella tag if a ticket may impact Wikidata. The team at WMDE is aware of this task now and have no concerns provided there is no performance impact.

Change #1191187 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] tests: Stop setting a custom sha1 value for the revision

https://gerrit.wikimedia.org/r/1191187

Change #1191321 merged by jenkins-bot:

[mediawiki/core@master] RevisionStore: Correctly respect the search limit for identical revs

https://gerrit.wikimedia.org/r/1191321

In case anyone else is confused about the archive table:

When a page is deleted rows from the revision table are denormalized into the archive table, but nothing happens to the slots and content tables. So the rev_id becomes ar_rev_id and still points to the same set of rows with an equal slots.slot_revision_id. So everything that's said above applies to the archive table as well.

Change #1192582 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] Revert "RevisionStore: Correctly respect the search limit for identical revs"

https://gerrit.wikimedia.org/r/1192582

Change #1192582 merged by jenkins-bot:

[mediawiki/core@master] Revert "RevisionStore: Correctly respect the search limit for identical revs"

https://gerrit.wikimedia.org/r/1192582

Change #1192841 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] Reapply "RevisionStore: Correctly respect the search limit for identical revs"

https://gerrit.wikimedia.org/r/1192841

Change #1192841 merged by jenkins-bot:

[mediawiki/core@master] Reapply "RevisionStore: Correctly respect the search limit for identical revs"

https://gerrit.wikimedia.org/r/1192841

It would be nice to have an entry in tech news along the lines of:

The field rev_sha1 in revision table is being removed in favor of content_sha1 in content table. See the announcement for more information.

Change #1191196 merged by jenkins-bot:

[mediawiki/core@master] ApiQueryRecentChanges: Compute SHA1 from content hashes on the fly

https://gerrit.wikimedia.org/r/1191196

Change #1194226 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] ApiQueryDeletedrevs: Compute SHA1 from content hashes on the fly

https://gerrit.wikimedia.org/r/1194226

Change #1195646 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] tests: Add coverage for ApiQueryDeletedrevs sha1 code

https://gerrit.wikimedia.org/r/1195646

Change #1195646 merged by jenkins-bot:

[mediawiki/core@master] tests: Add coverage for ApiQueryDeletedrevs sha1 code

https://gerrit.wikimedia.org/r/1195646

Zabe moved this task from Refine to In progress on the DBA board.

Change #1194226 merged by jenkins-bot:

[mediawiki/core@master] ApiQueryDeletedrevs: Compute SHA1 from content hashes on the fly

https://gerrit.wikimedia.org/r/1194226

Hi Mediawiki-core and Data-Persistence folks,
while working on adapting data-engineering pipelines to the rev_sha1 removal, I did an analysis of stored rev_sha1 values against computed values (we have replicated the re-hashing algorithm for the cluster to mimic the rev_sha1 old values).
I have found what I think could be bugs on itwikivoyage and dewikivoayge: empty sha1 values in the content tables. I also found some minor inconsistencies in other projects, mostly commonswiki.
I have stored some results in this spreadsheet for you to look at. Please don't hesitate to ask if you have questions or need finer grain results.

Thanks for the catch. I try to make sure it's properly fixed after a good cry.

Change #1196904 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/WikimediaMaintenance@master] Add script to repopulate content_sha1 from revision/archive table

https://gerrit.wikimedia.org/r/1196904

Change #1196904 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] Add script to repopulate content_sha1 from revision/archive table

https://gerrit.wikimedia.org/r/1196904

Change #1196950 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/WikimediaMaintenance@master] Fix issues with T389026

https://gerrit.wikimedia.org/r/1196950

Change #1196950 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] Fix a few issues with T389026.php

https://gerrit.wikimedia.org/r/1196950

Mentioned in SAL (#wikimedia-operations) [2025-10-27T14:03:32Z] <zabe@deploy2002> mwscript-k8s job started: extensions/WikimediaMaintenance/T389026.php --wiki=dewikivoyage # T389026

Mentioned in SAL (#wikimedia-operations) [2025-10-27T14:04:56Z] <zabe@deploy2002> mwscript-k8s job started: extensions/WikimediaMaintenance/T389026.php --wiki=itwikivoyage # T389026

Change #1198988 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/WikimediaMaintenance@master] Drop script for T389026

https://gerrit.wikimedia.org/r/1198988

Looks like dewikivoyage and itwikivoyage are fixed. (We expect rev_sha1 and comment_sha1 to contain the same value since both wikis only have one slot role.)

wikiadmin2023@10.192.32.16(dewikivoyage)> select count(*) from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id where content_sha1 = rev_sha1;
+----------+
| count(*) |
+----------+
|  1612372 |
+----------+
1 row in set (3.754 sec)

wikiadmin2023@10.192.32.16(dewikivoyage)> select count(*) from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id where content_sha1 != rev_sha1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (3.647 sec)

wikiadmin2023@10.192.32.16(dewikivoyage)> select count(*) from content join slots on slot_content_id = content_id join archive on slot_revision_id = ar_rev_id where content_sha1 = ar_sha1;
+----------+
| count(*) |
+----------+
|    95433 |
+----------+
1 row in set (0.412 sec)

wikiadmin2023@10.192.32.16(dewikivoyage)> select count(*) from content join slots on slot_content_id = content_id join archive on slot_revision_id = ar_rev_id where content_sha1 != ar_sha1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.335 sec)

wikiadmin2023@10.192.32.16(dewikivoyage)>
wikiadmin2023@10.192.32.16(itwikivoyage)> select count(*) from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id where content_sha1 = rev_sha1;
+----------+
| count(*) |
+----------+
|   767357 |
+----------+
1 row in set (2.876 sec)

wikiadmin2023@10.192.32.16(itwikivoyage)> select count(*) from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id where content_sha1 != rev_sha1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (1.980 sec)

wikiadmin2023@10.192.32.16(itwikivoyage)> select count(*) from content join slots on slot_content_id = content_id join archive on slot_revision_id = ar_rev_id where content_sha1 = ar_sha1;
+----------+
| count(*) |
+----------+
|   118995 |
+----------+
1 row in set (0.536 sec)

wikiadmin2023@10.192.32.16(itwikivoyage)> select count(*) from content join slots on slot_content_id = content_id join archive on slot_revision_id = ar_rev_id where content_sha1 != ar_sha1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.404 sec)

wikiadmin2023@10.192.32.16(itwikivoyage)>

Change #1198988 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] Drop script for T389026

https://gerrit.wikimedia.org/r/1198988

Change #1202292 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] Drop MutableRevisionRecord::setSha1 without deprecation

https://gerrit.wikimedia.org/r/1202292

Change #1202297 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] RevisionRecord: Stop using rev_sha1

https://gerrit.wikimedia.org/r/1202297

Change #1202335 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/WikimediaEditorTasks@master] tests: Stop setting rev_sha1

https://gerrit.wikimedia.org/r/1202335

Change #1202292 merged by jenkins-bot:

[mediawiki/core@master] Drop MutableRevisionRecord::setSha1 without deprecation

https://gerrit.wikimedia.org/r/1202292

Change #1202336 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] import: Duplicate detection without sha1 values

https://gerrit.wikimedia.org/r/1202336

Change #1202297 merged by jenkins-bot:

[mediawiki/core@master] RevisionRecord: Stop using rev_sha1

https://gerrit.wikimedia.org/r/1202297

Change #1202335 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEditorTasks@master] tests: Stop setting rev_sha1

https://gerrit.wikimedia.org/r/1202335

Change #1203152 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/FlaggedRevs@master] tests: Stop setting rev_sha1

https://gerrit.wikimedia.org/r/1203152

Change #1203161 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] Remove (ar|rev)_sha1 from query info

https://gerrit.wikimedia.org/r/1203161

Change #1203169 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] RevisionStore: Do not require rev rows to contain a sha1 value

https://gerrit.wikimedia.org/r/1203169

Change #1203152 merged by jenkins-bot:

[mediawiki/extensions/FlaggedRevs@master] tests: Stop setting rev_sha1

https://gerrit.wikimedia.org/r/1203152

Change #1191479 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/CheckUser@master] Stop querying rev_sha1 and ar_sha1

https://gerrit.wikimedia.org/r/1191479

Change #1203169 merged by jenkins-bot:

[mediawiki/core@master] RevisionStore: Do not require rev rows to contain a sha1 value

https://gerrit.wikimedia.org/r/1203169

Change #1191479 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Stop querying rev_sha1 and ar_sha1

https://gerrit.wikimedia.org/r/1191479

Change #1202336 merged by jenkins-bot:

[mediawiki/core@master] import: Duplicate detection without using rev_sha1

https://gerrit.wikimedia.org/r/1202336

Change #1191090 merged by Ladsgroup:

[operations/puppet@production] maintain-views: Hide rev_sha1 and ar_sha1 from wikireplicas

https://gerrit.wikimedia.org/r/1191090

Cookbook cookbooks.sre.wikireplicas.update-views run by ladsgroup: Started updating wiki replica views

Cookbook cookbooks.sre.wikireplicas.update-views started by ladsgroup completed:

  • an-redacteddb1001.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1017.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1018.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1019.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1020.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1013.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1014.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1015.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'
  • clouddb1016.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table archive'

Cookbook cookbooks.sre.wikireplicas.update-views run by ladsgroup: Started updating wiki replica views

Cookbook cookbooks.sre.wikireplicas.update-views started by ladsgroup completed:

  • an-redacteddb1001.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1017.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1018.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1019.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1020.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1013.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1014.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1015.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'
  • clouddb1016.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace --auto-depool --all-databases --table revision'

I'm impressed it didn't fail on revision table but it's done now \o/

Change #1203161 merged by jenkins-bot:

[mediawiki/core@master] Remove (ar|rev)_sha1 from query info

https://gerrit.wikimedia.org/r/1203161

Change #1206443 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] Stop writing to ar_sha1 and rev_sha1

https://gerrit.wikimedia.org/r/1206443

Change #1206443 merged by jenkins-bot:

[mediawiki/core@master] Stop writing to ar_sha1 and rev_sha1

https://gerrit.wikimedia.org/r/1206443

Change #1207211 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] Drop ar_sha1 from archive

https://gerrit.wikimedia.org/r/1207211

Change #1207216 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] Drop rev_sha1 from revision table

https://gerrit.wikimedia.org/r/1207216

Change #1207211 merged by jenkins-bot:

[mediawiki/core@master] Drop ar_sha1 from archive table

https://gerrit.wikimedia.org/r/1207211

Change #1207216 merged by jenkins-bot:

[mediawiki/core@master] Drop rev_sha1 from revision table

https://gerrit.wikimedia.org/r/1207216