Page MenuHomePhabricator

Update populateRevisionSha1.php to run over the content table
Closed, ResolvedPublic1 Estimated Story Points

Description

As identified in T200653, missing/empty rev_sha1 and content_sha1 values can cause various problems. Ensure that, between populateRevisionSha1.php script and populateContentTables.php, these values are populated. Or else if for some reason they cannot be populated, inform the user and provide helpful instructions.

populateRevisionSha1.php may be run in two situations:

  • initialize the rev_sha1 field for wikis being upgraded from a schema without that field, which was added in MediaWiki 1.19
  • fix missing rev_sha1 data on wikis that have already been upgraded

populateContentTables.php may be run in two situations:

  • initialize MCR content-related data for wikis being upgraded to an MCR schema
  • fix missing content-related data for wikis that have already been upgraded

There are some interdependencies to be aware of:

  • populateContentTables.php requires rev_sha1 to have been initialized
  • populateRevisionSha1.php requires access to the content, which for certain MCR stages requires the content and slot tables to have been initialized

populateRevisionSha1.php currently uses deprecated functions. These should be refactored.

Event Timeline

WDoranWMF set the point value for this task to 1.Apr 15 2019, 3:16 PM

The task description currently reads "content_sha1 needs to be populated by the populateRevisionSha1.php script". However, unless I'm misunderstanding things, this would cause a chicken-and-egg problem.

For a pre-MCR wiki being updated to the MCR schema, the content table is empty until it is filled in by populateContentTables.php. Per T217831, this doesn't (and shouldn't) happen until after populateRevisionSha1.php runs. This makes sense because populateContentTables.php uses revision.rev_sha1, and revision.rev_sha1 is filled in by populateRevisionSha1.php.

However, revision.rev_sha1 (which doesn't exist before MediaWiki 1.19) is initialized in populateRevisionSha1.php by hashing the content. Older versions of MediaWik used revision.rev_text_id to find the content, but MCR replaces that with slots/content tables. The new RevisionRecord class, by design, cannot find the content using revision.rev_text_id.

So (at least for older wikis) we need the content to populate revision.rev_sha1, and we need revision.rev_sha1 to populate the content.

Until now, populateRevisionSha1.php has been getting around this by using the deprecated Revision class (which does use revision.rev_text_id), but that is a short-term solution already in need of refactoring. Well, and maybe also by nobody updating ancient wikis that didn't have revision.rev_sha1 already populated.

The change I'm about to post modifies populateRevisionSha1.php as follows:

  • removes use of the deprecated Revision class, in favor of the new RevisionRecord class
  • if RevisionRecord is unable to find a sha1 (meaning this may be an older wiki being upgraded), fall back to a query that uses revision.rev_text_id to find the content
  • use SqlBlobStore and SlotRecord to calculate the necessary hash

This should populate revision.rev_sha1 in all cases, so that populateContentTables.php executes successfully.

(Most of the above is also true for the archive table, and fields ar_sha1 and ar_text_id, and the change handles that in the same way as the revision table and its corresponding fields.)

Some questions:

  1. am I thinking about all that correctly?
  1. does the update process actually run populateRevisionSha1.php before populateContentTables.php? I assumed it did, because that seems like the only way this makes sense, but looking at the code I wasn't as convinced.
  1. do I have to worry about content types other than text (external storage)?

Change 516730 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Update populateRevisionSha1.php for MCR schema changes

https://gerrit.wikimedia.org/r/516730

FYI I'm running this script on ~50 production wikis (the ones that have rows with ar_sha1 = '') as part of T219816

Jdforrester-WMF subscribed.

This missed the boat for the MW 1.33 release. Provisionally re-tagging to 1.34's release instead.

Change 516730 merged by jenkins-bot:
[mediawiki/core@master] Update populateRevisionSha1.php for MCR schema changes

https://gerrit.wikimedia.org/r/516730

This task is invalid as described, because the updater has always been running PopulateRevisionSha1 after PopulateContentTables. I changed PopulateContentTables to calculate the SHA1 (and the size) on the fly if needed and write it to the content table
(https://gerrit.wikimedia.org/r/520735), and Bill updated PopulateRevisionSha1 to make use of that information to populate rev_sha1 and ar_sha1 at the end of the update process (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/516730/).

NOTE: The confusion stems from the existence of DatabaseUpdater::$postDatabaseUpdateMaintenance. This feature seems ill conceived, and bound to cause more trouble in the future. We should probably get rid of it.

The relevant patches mentioned above are merged now.