Page MenuHomePhabricator

Add a "Reset parent IDs" checkbox to the undelete interface
Closed, DeclinedPublic

Description

With T183375, undeletion no longer sets the parent IDs for the undeleted revisions to the "expected" one given by the previous revision by ID (usually, but not always, also the previous revision by timestamp), but instead to the ar_parent_id values. For example, look at the history of Ernest Madu on Wikipedia, and you'll see that revision 820864133 has parent ID 0 (with an "N" mark in the user's contribs) instead of 820863993. While T183375 may have been a good idea, there are situations (such as selective undeletion or history splits) where it would be better to reset the parent IDs like they had always been.

Proposal
Add a new checkbox (unchecked by default) saying "Reset parent IDs" or something similar that, if checked, will fallback to the old behavior of resetting the parent revisions. If unchecked, ar_parent_id will be used as the rev_parent_id for the undeleted revisions, which is the current apparent behavior per T183375.

Alternative proposal
Always use the existing value of ar_parent_id unless it is "wrong" for any reason that can be automatically determined. If any of the following apply, the existing value will be ignored and the result of the getPreviousRevisionId function in RevisionStore.php will be used instead.

  1. The existing value is an unselected deleted revision for the same page title.
  2. The existing value is a revision, deleted or otherwise, from another different page title.
  3. The existing nonzero value is already used as the rev_parent_id for an earlier undeleted revision.
  4. The existing value is zero, but there are earlier selected revisions with the same ar_page_id.
  5. The existing value is a later selected revision, or a later live revision.
  6. The revision is the earliest one to be undeleted, and there are no earlier revisions in the page's live history (in this case, the rev_parent_id will be automatically set to zero).

Advantages (pros) of resetting parent IDs

  • Never makes revisions with broken parent revisions (T186280 and T193211; useful when removing unwanted edits with selective undeletion, although using revision deletion would be even better)
    • But don't use the checkbox when the only reason you are doing selective undeletion is that there are too many revisions to be undeleted at once, because it would cause the second disadvantage below
  • Always treats the oldest revision as a page creation (useful when splitting histories)
  • Allows fixing incorrect cases (e.g. Template talk:Db-g1/Archive 1 or Gema Switzerland on Wikipedia; usually caused by imports done in 2015 or earlier or undeletions of revisions deleted in pre-1.5 versions of MediaWiki; see also T38976)
  • May be used to correct previous selective undeletions that should have been full ones (otherwise, we might get 2 revisions with the same parent revision)
  • May also be used to fix some of the disadvantages below

Disadvantages (cons) of resetting parent IDs

  • May cause creation edits to suddenly become treated as non-creation edits if a deleted page has been recreated
    • To fix this, delete the page and then repeatedly undelete groups of revisions with the checkbox checked, starting from the latest group and then moving backward to earlier groups, where the groups consist of all revisions that originally belonged to one page ID
  • May conversely cause non-creation edits to become unexpectedly treated as creation edits
    • To fix this, delete the page and then undelete all of the revisions with the checkbox checked, trying repeatedly if necessary until it no longer gives an error (or just pick a small number n (e.g. 100) and then repeatedly undelete at most n revisions at a time with the checkbox checked, starting from the earliest n revisions, then the next n, etc., until there are no more revisions to be undeleted)
  • May cause minor edits to no longer appear minor in terms of the red negative or green positive "change in size" number in the history
    • To prevent this, always select the previous revision as well when selecting a minor revision
  • May cause null revisions resulting from imports to have rev_parent_id set to the latest revision imported at the time instead of the immediately preceding revision by timestamp with the same text
    • To fix this, delete the page, undelete all the non-imported revisions (including the null revision from the import) with the checkbox checked, and then undelete the imported revisions leaving the checkbox unchecked

Neutrality of resetting parent IDs

  • Won't make a difference in most cases, where the page had just one deletion log entry excluding revision visibility changes

In general, to determine whether the proposed checkbox should be used or not, ask yourself: Does resetting the parent revisions improve the structure of the page's history?. If the answer is "yes", then use the checkbox; otherwise don't. Theoretically, the checkbox could allow the parent revision for a given revision to be changed to any desired revision with a smaller ID and earlier timestamp. But as long as the checkbox is not being used abusively, this shouldn't be a problem to worry about. Another suggestion is to never do "bad history merges", because bad things could happen with parent IDs (as the history of the page Mord Fustang on Wikipedia shows); in particular, when a redirect was deleted to make way for a move, make sure not to accidentally undelete the deleted redirect revisions.

Event Timeline

What does "sets the revision parent id to an automatically pre-determined value based on comparing rev_id values" mean? What makes the old behaviour (whatever that is) a correct behaviour worth preserving?

@Lahwaacz I have fixed the description to clarify the meaning. I mean the previous revision by id (usually also by timestamp, but not always).

Whether to reset parent ids or not depends on the scenario. When undeleting pages that have been deleted and recreated several times, it might be better to leave the parent ids for the recreations as zero, but when doing history splits or merges, the better option is to reset the parent ids. The checkbox should be based on judgement, just like the "Assign edits to local users where the named user exists locally" checkbox when doing an import.

Please show me which PHP code sets the rev_parent_id to the ar_parent_id on undeletion.

Anyway, the page you linked is not a result of a simple deletion/undeletion, there has been history merging of multiple pages. The revision 820864133 has an edit summary "Created page with 'Dr. Ernest Madu (born February 18, 1960) is a nuclear cardiologist whose work focuses on providing affordable public healthcare in low-resource nations. As found...'" and an automatic link to Wikipedia:AES so that revision most likely had rev_parent_id equal to 0 from the start (i.e., this revision really created a new page). If the revision was ever deleted, you can't expect undeletion to assign different rev_parent_id from what it had before it was deleted.

Please show me which PHP code sets the rev_parent_id to the ar_parent_id on undeletion.

This code fetches the data from the archive table, [[https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/Storage/RevisionStore.php;2df1e9e5fc80aeb6bc02f205d0a3fa2e8eb8acea$1658|including ar_parent_id]]. That data is then passed to Revision::newFromArchiveRow, which a few calls later uses that to set the parent ID in the RevisionRecord object where it is later used when saving the revision back into the revision table.

The old code, prior to rMW745823287498: [MCR] Use RevisionStore::getArchiveQueryInfo in PageArchive, [[https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/page/PageArchive.php;b5370206ff602728c08403f14756d4339bf0a708$549-564|did not include ar_parent_id]] when selecting the data in the first place, so the saving code linked earlier would find a null parent ID and fall back to setting it to the ID of whichever revision currently in the revision table would be previous to the revision being undeleted.

I have changed the proposed default checked status to "unchecked" instead to match the default for the "Assign edits to local users where the named user exists locally" checkbox in Special:Import. In general, resetting may only be necessary in the case of selective undeletion, when one of the ar_parent_id values might be an unselected revision that will still remain in the archive table after undeletion.

The following 2 examples show that using the parent id from the archive table might be a problem in the case of selective undeletion: revision 821925874 in the history of Kurdish National Congress and revision 435909482 in the history of The Strange Familiar (which was actually originally part of "User:Lotsamuzak101/The Strange Familiar"). They look like they are new-page creations, but they do not actually show the "N" mark when viewing the user's contributions page. Their parent ids are actually 821819956 and 435904177 respectively (which you can find out by using exportation). Those two ids will display an error when you try to view them. Perhaps, in the mean time, we should add a maintenance script to change the parent ids for all revisions with "broken" parent ids to zero. Then they will show the "N" mark in the user's contributions page.

GTrang triaged this task as Medium priority.Feb 3 2018, 4:06 PM

@Graham87 @Addshore Would you two be able to take a look at this? As another example, the history of Talk:Po (disambiguation) clearly shows the earliest revision with a nonzero parent id despite the page being undeleted.

Yeah, that's really not a good thing. It used to be the case that deleting and undeleting pages would fix these byte count problems. I actually relied on this feature last night my time when doing a very convoluted history merge/split to the articles "Power set" and "Powerset (company)". A very problematic revision is revision ID number 196132373 from the "Powerset (company)" page with the edit summary "moved PowerSet to this, which was previously a redirect to Power set" (with links removed). It shows up in the history as having added 6,400 bytes to the page, despite both it and the previous revision having been 6,422 bytes. That's because the parent ID of the offending revision is 189554283, which now belongs to the Powerset page after my history merges/splits. If a maintenance script went through and made sure all parent IDs were in chronological order (and that was enforced from now on), that would solve quite a few problems such as this one ... AFAIK it would only negatively affect revisions with out-of-order dates, such as those in T4219.

What exactly is rev_parent_id supposed to be? Just a pointer to the previous revision in the article's history, or a pointer to the revision that this revision was based on, or what? rSVN19843: Revision table tree patch (No objections have been raised so i'm committing… is not clear.

I note the old behavior, if rev_parent_id is supposed to point to the previous revision, was buggy already: when you undeleted a revision its rev_parent_id would be reset, but the rev_parent_id of the following revision that should now be pointing to the newly-undeleted revision was not updated.

GTrang renamed this task from Undeletion no longer resets parent ids to Add an option to the undelete interface to reset parent ids.Feb 19 2018, 7:59 PM
GTrang updated the task description. (Show Details)
GTrang renamed this task from Add an option to the undelete interface to reset parent ids to Undeletion should allow resetting parent ids.Feb 19 2018, 8:11 PM

Change 419338 had a related patch set uploaded (by GeoffreyT2000; owner: GeoffreyT2000):
[mediawiki/core@master] Add an option to Special:Undelete to reset parent ids

https://gerrit.wikimedia.org/r/419338

The log events should say if the IDs were reset or not, otherwise you won't know what's happening on the wiki and debugging will be impossible without complex SQL queries like https://phabricator.wikimedia.org/T186280#4048072

There's no "should allow" that makes sense here.

  • If rev_parent_id is supposed to point to the parent revision at the time the revision was created, then it should never be reset. In that case, diffs and whatever else aren't working right with this interpretation of the field would need to be fixed.
  • If rev_parent_id is supposed to point to whichever revision is currently previous in the article's history, then it should always be reset. And brokenness with the previous behavior also needs fixing: undeletion should also reset the rev_parent_id of the revision following the undeleted revision, and similar resetting needs doing for imports, history merges, and anything else that moves revisions around or creates non-top revisions.
GTrang renamed this task from Undeletion should allow resetting parent ids to Add a "Reset parent ids" checkbox to the undelete interface.Mar 23 2018, 10:14 PM

Change 419338 abandoned by GeoffreyT2000:
Add an option to Special:Undelete to reset parent ids

Reason:
Need to establish consensus for the use of this checkbox, then, with an RfC.

https://gerrit.wikimedia.org/r/419338

Since Anomie's Code-Review -2 on the Gerrit patch that I have abandoned made this controversial, I decided to make this an RfC. If consensus is established, I will upload a new patch from scratch instead of restoring the abandoned patch.

The description of the advantages and disadvantages above is irrelevant to clarifying the meaning of the rev_parent_id field. You're assuming that the rev_parent_id is the only "unknown" and all other code is immutable, while you could modify other code to fix the wrong/undesired behaviour. You might even find out that the desired IDs can be computed differently, in which case using an ambiguous database field would be useless.

I think the following should be done to asses the current state before proposing changes:

  • find out which code depends on the rev_parent_id field
  • make sure that the ref_parent_id is not used with different interpretations in different parts of the code
Krinkle subscribed.

This RFC was triaged today in the TechCom meeting.

It seems that in its current state, this RFC does not have a clear problem statement that TechCom can help guide towards a solution. As such, I'm moving this to the Backlog. This is mainly because it mixes two distinct needs: Solving a technical problem, and providing a new ability to users.

I recommend reducing the scope of this RFC to just the technical problem (which does not involve adding new user abilities). For example "rev_parent_id should always point to a row currently in revision" or "there should be a linear chain of revisions where only 1 revision can have another revision as its parent", or something else.

Alternatively, if your primary interest is not about a specific technical problem but only about the end-result to a user, then I recommend reclassifying this task as a feature request or bug report for the MediaWiki-Page-deletion project, and untagging TechCom-RFC. Then, once there is agreement from project owners on how the feature should behave, you or another developer can create an RFC with a specific technical problem and (optionally) a proposed solution.

I tried to fix the description to make it look more like an actual RfC. This means that I moved this back to the "Inbox" column on the TechCom-RFC board.

GTrang renamed this task from Add a "Reset parent ids" checkbox to the undelete interface to Add a "Reset parent IDs" checkbox to the undelete interface.Apr 4 2018, 5:21 PM

We discussed this in the TechCom meeting and came to the conclusion that adding a mysterious checkbox to the interface is not the best way to solve this problem. Almost nobody will understand what this checkbox means (and can't without reading several pages of detailed technical documentation), and many of the use cases described are things that undeletion shouldn't be used for.

We think the best way to proceed is to 1) make a Special:SplitHistory page so people can split histories without having to do weird delete+undelete dances (there's already a Special:MergeHistory that was created for similar reasons), and 2) make this RFC be about ways to solve the larger issues around rev_parent_id. Several people (myself included) think it should probably be removed entirely, if we can find workarounds for the few things that it's currently used for. The use cases I'm aware of are marking the first revision, and computing the number of characters added/removed.

The Special:SplitHistory idea sounds like revision move (T23312), which would be really really cool.

Thinking about this, I decided to withdraw this RFC and will close this as declined. Instead, I will come up with another RFC for a third possible behavior for undeletion.