
Populate revision.rev_sha1 and archive.ar_sha1 on Wikimedia wikis
Closed, Resolved, Public

Description

This is related to bug 34104. It doesn't make any sense to add the hash columns and then not populate them.

I'm not sure if there's a maintenance script written yet. If so, this bug can have the "shell" keyword. Otherwise, the script will need to be written first.


Version: unspecified
Severity: enhancement

Details

Reference
bz36081

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 12:22 AM
bzimport set Reference to bz36081.
bzimport added a subscriber: Unknown Object (MLST).

Yep, populateRevisionSha1 should take care of this.
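For readers unfamiliar with the column format: MediaWiki stores rev_sha1 and ar_sha1 as the SHA-1 of the revision text written in base 36 and left-padded to 31 characters. The following is a minimal Python sketch of that encoding only, not the populateRevisionSha1 script itself; the function name base36_sha1 is made up for illustration.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_sha1(text: bytes) -> str:
    """Return the SHA-1 of `text` in base 36, left-padded with zeros to 31 chars.

    Hypothetical helper for illustration; it mirrors the storage format of
    rev_sha1 / ar_sha1, not any particular maintenance script.
    """
    value = int(hashlib.sha1(text).hexdigest(), 16)
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(BASE36_DIGITS[remainder])
    # A 160-bit number needs at most 31 base-36 digits, so pad to that width.
    return "".join(reversed(digits)).rjust(31, "0")
```

An empty string in the column therefore means "not populated yet", since even an empty revision text hashes to a 31-character value.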

IIRC, this was going to take a long time to run. We should schedule a time slot.

The script is currently running now.

(In reply to comment #3)

The script is currently running now.

Orly

Just out of curiosity, are there any stats on how much of this is already done?
Is this expected to take days or weeks to finish?

There are some revisions from February with an empty sha1 on Portuguese Wikipedia:
https://pt.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=timestamp%7Csha1&titles=Matem%E1tica&rvstartid=29143621

Reedy added a comment. May 3 2012, 6:49 PM

(In reply to comment #5)

Just out of curiosity, are there any stats on how much of this is already done?
Is this expected to take days or weeks to finish?
There are some revisions from February with an empty sha1 on Portuguese
Wikipedia:
https://pt.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=timestamp%7Csha1&titles=Matem%E1tica&rvstartid=29143621

It will most certainly take weeks to finish. I'm not sure whether Aaron is running it with foreachwiki, one wiki at a time, or running one instance per cluster.

He7d3r added a comment. May 5 2012, 7:51 PM

(In reply to comment #7)

Ready? http://lists.wikimedia.org/pipermail/wikitech-l/2012-March/059044.html

Nope: see the link from comment 5.

aaron added a comment. May 22 2012, 8:27 PM

I kicked the scripts again. Some of them had died due to intermittent ExternalStorage problems.

The sizes don't seem to be properly generated for revisions imported from other wikis. See:
http://en.wikipedia.org/w/index.php?title=Church_of_England&dir=prev&action=history

The first three edits there were imported from the Nostalgia Wikipedia:
http://en.wikipedia.org/w/index.php?title=Special:Log&page=Church+of+England

aaron added a comment. Jun 8 2012, 6:23 AM

(In reply to comment #10)

The sizes don't seem to be properly generated for revisions imported from other
wikis. See:
http://en.wikipedia.org/w/index.php?title=Church_of_England&dir=prev&action=history
The first three edits there were imported from the Nostalgia Wikipedia:
http://en.wikipedia.org/w/index.php?title=Special:Log&page=Church+of+England

That's not related to this bug. That sounds like a rev_parent_id problem.

(In reply to comment #11)

(In reply to comment #10)

The sizes don't seem to be properly generated for revisions imported from other
wikis. See:
http://en.wikipedia.org/w/index.php?title=Church_of_England&dir=prev&action=history
The first three edits there were imported from the Nostalgia Wikipedia:
http://en.wikipedia.org/w/index.php?title=Special:Log&page=Church+of+England

That's not related to this bug. That sounds like a rev_parent_id problem.

That is bug 36976.

Aaron continues to run this. He often needs to restart it and babysit, but there is progress, so it'll eventually get done. It's probably not sensible to venture a guess on when it'll be done, since it'd be a wild guess, but it's probably best measured in weeks.

aaron added a comment. Aug 11 2012, 4:21 AM

The script for the last remaining rev ID range just finished today.

(In reply to comment #15)

The sha1 is still empty in some revisions of this page:
https://pt.wikipedia.org/w/api.php?action=query&prop=revisions&format=jsonfm&rvprop=ids%7Ctimestamp%7Cuser%7Csize%7Csha1%7Ccomment&rvlimit=3&titles=Nabla&rvstartid=28914057
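A check like the one linked above can be scripted. The sketch below builds the same kind of query URL and scans a decoded response for revisions whose sha1 is empty. The function names are hypothetical, the response layout assumed is the classic format=json shape (pages keyed by page ID, each with a revisions list), and no network request is made here.

```python
from urllib.parse import urlencode


def build_revisions_url(api_base, title, start_id, limit=3):
    """Build a prop=revisions query URL that asks for the sha1 of each revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "format": "json",
        "rvprop": "ids|timestamp|user|size|sha1|comment",
        "rvlimit": limit,
        "titles": title,
        "rvstartid": start_id,
    }
    return api_base + "?" + urlencode(params)


def revisions_missing_sha1(response):
    """Return the revids whose sha1 is present but empty in a decoded response.

    Assumes the format=json layout: response["query"]["pages"][<id>]["revisions"].
    Revisions whose hash is hidden (a "sha1hidden" key) are skipped, because an
    absent hash there does not mean the column is unpopulated.
    """
    missing = []
    for page in response.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            if "sha1hidden" in rev:
                continue
            if rev.get("sha1", "") == "":
                missing.append(rev["revid"])
    return missing
```

For example, build_revisions_url("https://pt.wikipedia.org/w/api.php", "Nabla", 28914057) yields a URL with the same parameters as the link above, using format=json instead of jsonfm.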

13/70 for that page, apparently:

mysql> select rev_id from revision where revision.rev_page = 224934 AND rev_sha1 = '';
+----------+
| rev_id   |
+----------+
| 10172058 |
| 10183927 |
| 11479284 |
| 12691641 |
| 12745322 |
| 12745331 |
| 12759878 |
| 26150163 |
| 26667605 |
| 26806035 |
| 26806040 |
| 28239870 |
| 28914057 |
+----------+
13 rows in set (0.00 sec)

I've started scripts to catch any stragglers.

Possibly some revs (in ID range x-y) were restored (undeleted) after the first script had already read its batch of revs to update (including some of ID range x-y) from a point-in-time snapshot, so it didn't catch them.
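To illustrate how a range-based pass can miss rows and why a second pass helps, here is a small self-contained simulation using an in-memory SQLite database. It is only a sketch under simplifying assumptions: the rev_text column is hypothetical (real revision text lives in the text table or ExternalStorage), the hash is stored as plain hex rather than base 36, and all function names are invented for the example.

```python
import hashlib
import sqlite3

EMPTY_SHA1_QUERY = "SELECT rev_id, rev_text FROM revision WHERE rev_sha1 = ''"


def _fill(conn, rows):
    """Write a hash for each (rev_id, rev_text) row and return how many were updated."""
    for rev_id, text in rows:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        conn.execute("UPDATE revision SET rev_sha1 = ? WHERE rev_id = ?", (digest, rev_id))
    conn.commit()
    return len(rows)


def populate_range(conn, start_id, end_id, batch_size=2):
    """First pass: walk the rev_id range in fixed-size batches."""
    updated = 0
    for low in range(start_id, end_id + 1, batch_size):
        high = min(low + batch_size - 1, end_id)
        rows = conn.execute(
            EMPTY_SHA1_QUERY + " AND rev_id BETWEEN ? AND ?", (low, high)
        ).fetchall()
        updated += _fill(conn, rows)
    return updated


def populate_stragglers(conn):
    """Second pass: pick up any row still lacking a hash, whatever its ID."""
    return _fill(conn, conn.execute(EMPTY_SHA1_QUERY).fetchall())


def count_missing(conn):
    return conn.execute("SELECT COUNT(*) FROM revision WHERE rev_sha1 = ''").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        "rev_id INTEGER PRIMARY KEY, rev_text TEXT, rev_sha1 TEXT NOT NULL DEFAULT '')"
    )
    conn.executemany(
        "INSERT INTO revision (rev_id, rev_text) VALUES (?, ?)",
        [(1, "a"), (2, "b"), (4, "d")],
    )
    first = populate_range(conn, 1, 4)
    # Simulate an undelete: rev 3 reappears after its ID range was processed.
    conn.execute("INSERT INTO revision (rev_id, rev_text) VALUES (3, 'c')")
    conn.commit()
    missing_before = count_missing(conn)
    second = populate_stragglers(conn)
    return first, missing_before, second, count_missing(conn)
```

Calling demo() returns (3, 1, 1, 0): three rows filled by the range pass, one restored row left empty, one row fixed by the straggler pass, and none remaining.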

aaron added a comment. Aug 19 2012, 7:16 PM

(In reply to comment #17)

I've started scripts to catch any stragglers.
Possibly some revs (in ID range x-y) were restored (undeleted) after the first
script had already read its batch of revs to update (including some of ID range
x-y) from a point-in-time snapshot, so it didn't catch them.

Second run has completed on all wikis. For enwiki:

  • rev_sha1 and ar_sha1 population complete [12420 revision rows, 25084506 archive rows].