Page MenuHomePhabricator

[MCR] Script for populating empty ar_rev_id fields
Closed, ResolvedPublic

Description

Populate empty ar_rev_id fields:

  • Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m. (e.g. enwiki has 508811 such rows, out of ~87793416 rows)
  • Reserve m (or m+k, for good measure) IDs in the revision table:
  • Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
  • Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
  • For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.

Make ar_text and ar_flags unused:

For each row in archive that has a non-null ar_text field, insert a row into the text table, copying ar_text to old_text and ar_flags to old_flags.

  • Set ar_text_id to the old_id from the newly created text row.
  • Set ar_text and ar_flags to the empty string everywhere.

see https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data

Event Timeline

aude removed aude as the assignee of this task.Dec 14 2017, 3:01 PM
aude moved this task from Duplicate to Inbox on the Multi-Content-Revisions board.
aude added projects: Wikidata, SDC General.
thiemowmde triaged this task as Medium priority.Dec 14 2017, 3:44 PM

Make ar_text and ar_flags unused

Code for this is already written, see https://gerrit.wikimedia.org/r/#/c/393928/.

Populate empty ar_rev_id fields:

  • Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m. (e.g. enwiki has 508811 such rows, out of ~87793416 rows)
  • Reserve m (or m+k, for good measure) IDs in the revision table:
  • Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
  • Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
  • For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.

For running as a maintenance script, I'm not so sure about this plan. If something goes wrong the list of "reserved" rows is likely to be lost and a new list would have to be reserved. Instead, I'd fetch a batch of archive rows needing IDs, reserve that many revision rows and assign them immediately, then repeat until there are no more batches.

Also, "Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter." won't work for all databases. It'll work for MySQL/MariaDB (at least with InnoDB), SQLite, and MSSQL, but for PostgreSQL and Oracle you'd instead have to update the sequence that generates the incrementing IDs.

Change 402932 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Populate ar_rev_id and make it non-nullable

https://gerrit.wikimedia.org/r/402932

Change 402932 merged by jenkins-bot:
[mediawiki/core@master] Populate ar_rev_id and make it non-nullable

https://gerrit.wikimedia.org/r/402932

The script is done, so marking this resolved. Actually running it is T191307: Run maintenance/populateArchiveRevId.php on all wikis.