Page MenuHomePhabricator

Create backlog extension for fixing rev_parent_id and ar_parent_id
Open, Needs TriagePublic

Description

If the following proposed extension still does not get created by the end of 2020, then I'll include the following in the Community Wishlist Survey 2021 on Meta. Unfortunately, the Community Wishlist Survey 2020 has a different format from previous years, and will not accept this proposal.

The extension only needs to be installed on wikis that previously used MediaWiki 1.30 or earlier unless there are any imported edits that also ought to have rev_parent_id fixed to correspond to the one on the source wiki.

Before creating the extension, we should make the "populateParentId" script in MediaWiki also populate missing ar_parent_id fields, at least for those archive rows that have a non-null ar_page_id field, where it is assumed that the equivalent of "has the same rev_page" for the archive table is "has the same ar_namespace, ar_title, and ar_page_id combination". Dealing with the null ar_page_id case is a bit trickier, because we need to know when each revision was deleted, so it is best to leave ar_parent_id null for such deleted revisions for now.

We now keep the old parent ID when restoring revisions, but this previously wasn't the case. We should start fixing rev_parent_id for all old restored revisions.

First, the extension needs 2 globals named "$wg(ExtensionName)119date" and "$wg(ExtensionName)131date". These should respectively be the date the wiki started using MW 1.19 (when deleted revisions started to have parent IDs saved in the archive table) or later and the date the wiki started using MW 1.31 wmf.15 (when undeletions started to keep the old parent ID) or later. We also need 2 tables named "backlog_temp_page" and "backlog_temp_revision". The former will have columns named "btp_id", "btp_namespace", "btp_title", and "btp_timestamp". The latter will have columns named "btr_id", "btr_rev_id", "btr_btp_id", "btr_old_parent_id", "btr_new_parent_id", and "btr_table".

Next, we need to create a script that will dump some pages and revisions (including deleted ones) into the 2 tables. All pages that have a "restored page" log entry with a timestamp on or earlier than the "$wg(ExtensionName)131date" global will be dumped into the "backlog_temp_page" table, with "btp_timestamp" being the log entry's timestamp (the earliest one if there is more than one such entry). The same will also be done for pages with later "restored page" log entries if they also have at least one "deleted page" log entry with a timestamp on or earlier than the "$wg(ExtensionName)119date" global, as well as pages with import log entries. Again, if a page has more than one "restored page" or import log entry, or both, then it will only be added once to the table, and "btp_timestamp" will be the timestamp of the earliest such entry. Targets of merge and move log entries from titles already in the table will also be added to the table, with "btp_timestamp" being the timestamp of the earliest such entry, and this will be done recursively. Merge and move log entries with timestamps earlier than the "btp_timestamp" for the source page will be ignored. Once the "backlog_temp_page" table is completely filled, it is then time to fill in the "backlog_temp_revision" table. All live and deleted revisions for pages in the "backlog_temp_page" table will be dumped into the "backlog_temp_revision" table, with "btr_old_parent_id" and "btr_new_parent_id" both initially being the current value of rev_parent_id or ar_parent_id, and "btr_table" being either "revision" or "archive".

While the extension is ongoing, it needs some hooks. When a page listed in the "backlog_temp_page" table is deleted, the "btr_table" field must be changed from "revision" to "archive" for all rows corresponding to the revisions that had just been deleted. When such a page is undeleted, the "btr_table" field must be changed from "archive" to "revision" for all rows corresponding to the revisions that had just been restored. When such a page is moved, the new title must replace the old one in the "backlog_temp_page" table, and the old title will be re-added to the table if it still has some deleted revisions. In the latter case, the "btr_btp_id" field will be replaced with the ID of the newly inserted "backlog_temp_page" row for all rows in the "backlog_temp_revision" table corresponding to the deleted revisions for the old title. Finally, when Special:MergeHistory is used with a source page that is already in the "backlog_temp_page" table, the target page must be added to the table if it is not already there, and the "btr_btp_id" field must be updated for all rows corresponding to the revisions that had just been merged. When all revisions are merged, the source page will be removed from the "backlog_temp_page" table if it does not have any deleted revisions.

Finally, we need a special page named "Special:FixParentIDs". The special page will require a page from the "backlog_temp_page" table to be fixed. Then, all of the page's revisions from the "backlog_temp_revision" table will be listed on the special page, with the timestamp, author, edit summary, and "minor edit" status shown. For each revision, there will be 2 radio buttons below it. One of them will say to keep the current parent ID, and the other one will say to change the parent ID to whatever the user thinks it should be. There will also be 2 buttons named "Save settings" and "Fix page". The former will only update the "btr_new_parent_id" fields, while the latter will also immediately fix rev_parent_id or ar_parent_id for all of the page's live and deleted revisions and remove the page and its revisions from the "backlog_temp_page" and "backlog_temp_revision" tables.

Some clues or strategies that may help find the correct parent ID include the following:

  • Edit summaries (e.g. revisions with "Created page with" in their edit summaries are expected to have rev_parent_id equal to zero)
  • Revision sizes (revisions that are "related" tend to have sizes close to each other, while those that are "unrelated" tend to have sizes far from each other)
  • Text similarities
  • If the page had previously been selectively restored, then try converting the deleted revisions to redacted revisions using RevDel and then restore them.
  • For imported revisions, try to look up the revisions from the source wiki.
  • Try to find all of a page's revisions that are not "ancestors" of its latest revision, and fix each one of those to have the preceding one (or 0 for the earliest one) as the parent. This strategy will only work, however, if exactly 2 page histories have been merged together. If more than 2 have been merged together, then one still needs careful examination.
  • Try looking up the deletion log entries' timestamps. The earliest revision with a timestamp following that of a deletion log entry should usually be a recreation, and hence have rev_parent_id 0. For pages that have been moved, one might need to check the logs for the previous titles.
  • For null revisions (e.g. moves and protections), rev_parent_id should be fixed to be the rev_id of another earlier revision that uses the same slot_content_id in the slots table (doesn't matter which one if there is more than one). We might also want to fix slot_origin to be the smallest revision ID using the same content row for all null revisions.
  • When in doubt, as the last resort, just fix the parent ID to be the rev_id of the preceding revision if one cannot find a better alternative.

We should then install the extension on all Wikimedia wikis, and set a goal. The goal is to get the entire backlog cleared by 2030 at the latest (assuming that T206587 is declined by then).

User talk page message

After completing a request to fix parent IDs for revisions from a particular page, messages will be left on the user talk pages of the affected editors to let them know that the parent ID has successfully been fixed for one or more of their edits.

The message will look something like the following in English (as usual, the four tildes will automatically be replaced with a signature and a timestamp):

== Check out the following page: Affected page ==

Hi, {{ROOTPAGENAME}}. I would like to let you know that I have fixed parent IDs for one or more of your edits to the page [[Affected page]]. The affected revision ID(s) is/are the following: (List of affected revision IDs).

Please check the history of the page to confirm that the size diff numbers have successfully been fixed. ~~~~

{{ROOTPAGENAME}} is used here so that it remains displayed correctly if the user had been renamed or if the message had been archived. Also, if the user talk page is a redirect to another page, then the message will be posted at the target page instead, and the possessive pronoun "your" will be replaced with the possessive form of the original editor's username (which might, for example, be a bot username). For usernames that do not end with "s", "'s" will be added automatically; for those that do, one must decide whether or not "'s" should be added. For imported edits with usernames having an interwiki prefix, no message will be posted.

Other languages will of course need a translated version of the message.

With the example "User:Calliopejen1/Bronces de Benín" below, Millars would receive messages on both the English Wikipedia and the Spanish Wikipedia that say that the parent ID had been fixed for four of his edits on the respective wikis. Also, since User talk:Xqbot (enwiki) and Usuario discusión:Xqbot (eswiki) redirect to User talk:Xqt (enwiki) and Usuario discusión:Xqt (eswiki) respectively, Xqt would receive messages that say that the parent ID had been fixed for one of Xqbot's edits on the respective wikis. The size diff number would then change from a "heavy" red negative number (-12,550) relative to the old parent ID to a "light" green positive number (+24) relative to the new parent ID.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 15 2019, 1:27 AM
GeoffreyT2000 updated the task description. (Show Details)Jun 7 2019, 5:08 PM
GeoffreyT2000 updated the task description. (Show Details)

The following example shows that for imported revisions, rev_parent_id might require fixing on both the source and the target wikis.

The subpage User:Calliopejen1/Bronces de Benín on the English Wikipedia was imported from the page Bronces de Benín on the Spanish Wikipedia. There is also a move comment that says "fusión de historiales", which, of course, is the Spanish translation of "merging histories". This together with the negative size diff for the edit at 23:23, 21 July 2010 by Xqbot suggests that we should fix rev_parent_id for some of the the edits between the 2 moves by Millars. The strategy is to separate the revisions containing <nowiki>'ed categories from the ones that contain live categories until the Xqbot edit.

The following fixes should therefore be done:

  • Revision ID 526466627 on enwiki and revision ID 38879757 on eswiki should both be fixed to have rev_parent_id 0 to make them show as page creations on Millars' contributions on the respective wikis.
  • Revision ID 526466630 on enwiki and revision ID 38881146 on eswiki should be fixed to have rev_parent_id 526466626 and 38879182 respectively.
  • Revision ID 526466631 on enwiki and revision ID 38884289 on eswiki should be fixed to have rev_parent_id 526466629 and 38881051 respectively.
  • Revision ID 526466643 on enwiki and revision ID 38956953 on eswiki should be fixed to have rev_parent_id 526466630 and 38881146 respectively.
  • Revision ID 526466644 on enwiki and revision ID 38975104 on eswiki should be fixed to have rev_parent_id 526466641 and 38922039 respectively.
  • Finally, the rest of the revisions' parent IDs should be left unchanged on both wikis.
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptAug 14 2019, 4:48 PM