Populate revision.rev_sha1 and archive.ar_sha1 on Wikimedia wikis
This is related to bug 34104. It doesn't make any sense to add the hash columns and then not populate them.

I'm not sure if there's a maintenance script written yet. If so, this bug can have the "shell" keyword. Otherwise, that'll need to be done first.

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:22 AM
bzimport set Reference to bz36081.

Yep, populateRevisionSha1 should take care of this.
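
For anyone following along, here is a rough sketch of the kind of value the script has to compute per revision. It assumes the columns hold the SHA-1 of the revision text written in base 36 and zero-padded to 31 characters, which is my understanding of MediaWiki's convention; the function name and the Python rendering are illustrative only and are not the maintenance script itself.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_sha1(text: bytes, pad_to: int = 31) -> str:
    """Return the SHA-1 of `text` as a zero-padded base-36 string.

    Hypothetical helper: the 31-character padding is an assumption
    about the column format, not taken from this thread.
    """
    value = int(hashlib.sha1(text).hexdigest(), 16)
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(BASE36_DIGITS[remainder])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits)).rjust(pad_to, "0")


print(base36_sha1(b"example revision text"))
```

A 160-bit hash always fits in 31 base-36 digits, so the padding only ever adds leading zeros.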

IIRC, this was going to take a long time to run. We should schedule a time slot.

The script is running now.

Just out of curiosity, are there any stats on how much of this is already done?
Is this expected to take days or weeks to finish?

There are some revisions from February with an empty sha1 on Portuguese Wikipedia:

It will most certainly take weeks to finish... I'm not sure if Aaron is running it foreachwiki in turn, or doing 1 per cluster or whatever...

I kicked the scripts again; some of them had died due to intermittent ExternalStorage problems.

The sizes don't seem to be properly generated for revisions imported from other wikis. See:

The first three edits there were imported from the Nostalgia Wikipedia:

That's not related to this bug. That sounds like a rev_parent_id problem.

That is bug 36976.

Aaron continues to run this. He often needs to restart it and babysit, but there is progress, so it'll eventually get done. It's probably not sensible to venture a guess on when it'll be done, since it'd be a wild guess, but it's probably best measured in weeks.

The script for the last remaining rev ID range just finished today.

The sha1 is still empty in some revisions of this page:

13 of the 70 revisions for that page, apparently.

mysql> select rev_id from revision where revision.rev_page = 224934 AND rev_sha1 = '';




13 rows in set (0.00 sec)
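
The same check can be turned into a quick "missing out of total" count per page. This is only a sketch: it runs against a toy in-memory SQLite table with made-up rows rather than the production database, and `count_missing_sha1` is a hypothetical helper name.

```python
import sqlite3


def count_missing_sha1(conn, page_id):
    """Return (missing, total) revision counts for one page."""
    missing, total = conn.execute(
        "SELECT COALESCE(SUM(rev_sha1 = ''), 0), COUNT(*) "
        "FROM revision WHERE rev_page = ?",
        (page_id,),
    ).fetchone()
    return missing, total


# Toy table; these rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, "
    "rev_page INTEGER NOT NULL, "
    "rev_sha1 TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO revision (rev_id, rev_page, rev_sha1) VALUES (?, ?, ?)",
    [(1, 10, "abc"), (2, 10, ""), (3, 10, ""), (4, 11, "")],
)

missing, total = count_missing_sha1(conn, 10)
print(f"{missing}/{total} revisions on page 10 have an empty rev_sha1")
```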

I've started scripts to catch any stragglers.

Possibly some revs (in ID range x-y) were restored (undeleted) after the first script had already read its batch of revs to update. That batch came from a snapshot in time, so revs in range x-y that were restored afterwards were not caught.
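
To make that suspected race concrete, here is a toy simulation. It is an assumption-laden sketch, not the real script: the table is a plain dict, the IDs and texts are invented, and it stores a hex SHA-1 for brevity rather than whatever format the real columns use.

```python
import hashlib


def fill_hashes(table, rev_ids):
    """Fill in sha1 for the given IDs where it is still empty; return count."""
    updated = 0
    for rev_id in rev_ids:
        row = table.get(rev_id)
        if row is not None and row["sha1"] == "":
            row["sha1"] = hashlib.sha1(row["text"]).hexdigest()
            updated += 1
    return updated


def stragglers(table):
    """Return the sorted IDs of rows whose sha1 is still empty."""
    return sorted(rev_id for rev_id, row in table.items() if row["sha1"] == "")


# Hypothetical revision table before the first run.
revision = {
    1: {"text": b"first", "sha1": ""},
    4: {"text": b"fourth", "sha1": ""},
}

# The first pass reads its batch of IDs from a point-in-time snapshot.
snapshot = list(revision)

# Revisions 2 and 3 are undeleted after the snapshot was taken.
revision[2] = {"text": b"second", "sha1": ""}
revision[3] = {"text": b"third", "sha1": ""}

first_pass = fill_hashes(revision, snapshot)  # only sees IDs 1 and 4
missed = stragglers(revision)                 # the restored revisions
second_pass = fill_hashes(revision, missed)   # the catch-up run

print(first_pass, missed, second_pass)
```

If this is what happened, a second pass that selects on the empty hash rather than on a fixed ID range should pick up everything the first pass missed.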

Second run has completed on all wikis. For enwiki:

  • rev_sha1 and ar_sha1 population complete [12420 revision rows, 25084506 archive rows].