Page MenuHomePhabricator

Revisions created while MW 1.35 maintenance/update.php is running risk becoming inaccessible without explanation, rendering articles uneditable
Open, Needs TriagePublicBUG REPORT

Description

Desciclopédia [tm] (and a few other Uncyclopedias [tm] in the same, non-WMF series) recently upgraded from MW 1.31 to MW 1.35 as the old MediaWiki versions were breaking badly under the now-current PHP 7.4. While no longer in the top-16000 (or even the top 200000) in Alexa rank, DP is still fairly large and has a very active editor community. In the second half of the twenty hours (or so) which it took for a maintenance/update.php run to complete, there would have been more than a hundred edits made by individual users. Many of these edits leave the database in an inconsistent state where the most recent revision of an article, made while the update script was running, is being displayed in regular page view as "There is no content on this page". The revision history shows every revision except the ones made during the maintenance/update.php run (all the text from the old versions is there and viewable individually) but any attempt to edit the page fails with an edit conflict (MW sees the broken revision as conflicting with any new edit anyone - including the administrators - attempts to make).

The text of these "lost" revisions still exists and can be manually recovered at the database level. For instance, if [[uncyclopedia:pt:Pikachu Libre]] is one of the affected articles, going through the individual database tables gives:

SELECT * FROM `page` WHERE `page_namespace` = 0 AND `page_title` = 'Pikachu_Libre';
--- finds page_id = 605765
SELECT * FROM `revision` WHERE `rev_page` = 605765 ORDER BY `revision`.`rev_id` DESC;
--- finds a half-dozen revisions, of which the most recent (rev_id=3957083, rev_page=605765,rev_comment_id=0,rev_actor=0,rev_timestamp=20201118202636) is not viewable from within MW 1.35 and which (predictably) falls within the timeframe affected by the update.php run, which was done on 2020-11-18
--- Now take a gamble at the `slots` table, betting everything on 3957083
SELECT * FROM `slots` WHERE `slot_revision_id` = 3957083 AND `slot_origin` = 3957083;
--- So what do I win? A slot_content_id of 3501628, which gives:
SELECT * FROM `content` WHERE `content_id` = 3501628;
--- gives content_address = tt:3921557, where
SELECT old_text FROM `text` WHERE `old_id` = 3921557;
--- returns the text of this "lost" revision, but minus the edit comment and the author's name; at this point the sysadmin is promptly arrested for violating CC-BY-NC-SA's attribution requirement, a copyright infringement and a serious criminal offence, and promptly sent up the river to prison

OK, great. That still doesn't put the article back into an editable state, nor does it restore the "missing" revision. It is possible to get the article to be editable again by forcing a valid, previous revision.rev_id from the same page into the page table:

SELECT * FROM `revision` WHERE `rev_page` = 605765 AND `rev_timestamp` < '20201118000000' ORDER BY `revision`.`rev_id` DESC;
--- found rev_id=3870070 as the most recent before the day of the "upgrade", so force that value into the page table as:
UPDATE `page` SET `page_latest` = '3870070' WHERE `page`.`page_id` = 605765;
--- and somehow the page becomes editable, but without the revision made during the update.php run. Now multiply this across hundreds of edits over several dozen pages, and it becomes clear that - while the problem is not irremediable - this is rapidly becoming a disaster.

Obviously, a few issues here:

  1. Why is the update.php script leaving articles in an uneditable state if they were edited while the update was running on the server? I presume this is happening during all of the huge schema changes which replaced (page, revision, text) with (page, revision, actor, slot, content, text) but it gets ugly on a busy production site.
  2. Why is there no automated maintenance script to fix these? At a minimum, there should be some means to go through the page/revision tables, find any entries for which page.page_latest points to something which isn't usable content (for whatever reason) and repoint page.page_latest to the latest usable revision so that the affected pages can be edited again.
  3. What happened to revision.rev_actor and revision.rev_commentid? The documentation at [[mw:Manual:Revision_table]] claims that these fields were used in MW 1.35 to replace the old rev_comment, rev_user and rev_user_text, but on the actual database (with the MW 1.35 maintenance/update.php applied) they're both '0' for every record - for every page or revision created either before, during or after the conversion. The data has been moved somewhere else, but where, and why doesn't the documentation on https://www.mediawiki.org/wiki/Manual:Revision_table reflect this?

I've been manually setting page.page_latest back to sane values at the database level, one page at a time, as I have a brazillion very dissatisfied users at this point.

This process needs to be automated, and the bug that allowed this situation to occur needs to be fixed. Any suggestions?

Event Timeline

Carlb created this task.Sat, Nov 21, 9:39 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSat, Nov 21, 9:39 PM
RhinosF1 added a subscriber: RhinosF1.EditedSat, Nov 21, 10:54 PM

Doesn't the update guide advise making the wiki read only while updates in progress

Carlb added a comment.EditedSat, Nov 21, 11:34 PM

When retrieving pages manually from the database, the names may contain Unicode. A string literal may therefore need the _latin1, _binary or _utf8mb4 modifiers, ie:

SELECT * FROM `page` WHERE `page_title` LIKE _binary'Seleção_Venezuelana_de_Futebol';

If manually changing page.page_latest to point to some other revision to "fix" a page becoming uneditable, the revision id may be obtained from the page history, by clicking the direct link to the last "good" version, ie:

https://desciclopedia.org/index.php?title=Azul_Linhas_Aéreas_Brasileiras&oldid=3899292

The affected page can be found in the page table:

SELECT * FROM `page` WHERE `page_namespace`= 0 AND `page_title` = _binary'Azul_Linhas_Aéreas_Brasileiras';

(in this example, page_id=300716) and patched to point to the last revision visible in the page history, for example:

UPDATE `page` SET `page_latest` = '3899292' WHERE `page`.`page_id` = 300716;

After making any database changes, refresh the page with action=purge:

https://desciclopédia.org/index.php?title=Azul_Linhas_Aéreas_Brasileiras&action=purge

Otherwise, the "broken" version may still appear due to caching.

That doesn't restore the "lost" or hidden revisions, but it will at least make the page editable and readable.

Doesn't the update guide advise making the wiki read only while updates in progress

It does on the wiki page, but it doesn't look like the UPGRADE file in the repo suggests it.

It is also currently somewhat broken (T151833: maintenance/update.php complains about locked database when $wgReadOnly is set) and it doesn't report the error properly if you don't see the note about the workaround method (T203914: Update.php script fails to report database read-only error).