Page MenuHomePhabricator

RFC: How should we fix the undeletion system?
Open, MediumPublic

Description

The following supercedes T185167, which I had closed because it was determined to be too destructive.

With T183375, undeletion no longer sets the parent IDs for the undeleted revisions to the "expected" one given by the previous revision by ID (usually, but not always, also the previous revision by timestamp), but instead to the ar_parent_id values. For example, look at the history of Ernest Madu on Wikipedia, and you'll see that revision 820864133 has parent ID 0 (with an "N" mark in the user's contribs) instead of 820863993. But even this alone can still cause problems.

Flaws with the current interface

Some flaws include:

  • Failure to distinguish page IDs
  • Broken parent revisions
  • "Ancestor" revisions being scattered randomly across different titles and mixed in with unrelated histories, with no easy way to rejoin all the revisions into the history of one single title
    • This is the same problem that would occur when manually changing rev_page fields (with the current behavior, deleting A, moving B to A without redirect, undeleting a single revision for A, moving A back to B without redirect, and finally undeleting the rest of the revisions for A is effectively equivalent to simply changing the rev_page field for the single revision from A's page ID to B's page ID).
  • Takes forever to load for pages with thousands of deleted revisions
Proposals 1 and 2

Moved to T206587 to keep the main task's description to a reasonable size.

Proposal 3 (to be attempted for consensus if Proposals 1 and 2 both fail)

When undeleting, always set each undeleted revision's parent ID to the previous undeleted revision among those revisions that had the same ar_page_id field (which will usually be the same as the ar_parent_id field), or zero if there is no such revision, with the following exception: if selective undeletion had been used to effectively revert a page to an older revision and the page had not been edited since the selective undeletion, undeleting later revisions (which would imply a necessary update to the page_latest field in the page table) will leave the parent ID for the earliest undeleted revision as the page's current revision instead of changing the parent ID to zero. This can be done by making a variable that is initialized as an empty array and then changes each time a revision is being inserted. The key is the ar_page_id field and the value is the ID to be used as the parent ID. The exceptional case will not be applied if the page had been edited since the selective undeletion, in order to prevent "siblings" (two revisions with the same parent revision). Each distinct ar_page_id field would then be treated as a separate history (but nonetheless would still appear altogether in a single page's history).

Advantages of Proposal 3
  • Makes revisions that were originally from different page IDs distinguishable using rev_parent_id.
  • Works better for deleted revisions that have null ar_parent_id fields.
Disadvantage of Proposal 3
  • Not idempotent—when a page is undeleted, then re-deleted, and finally undeleted again, the parent IDs might be different from what they were after the first undeletion, because Special:Undelete will no longer know that two revisions had belonged to separate page IDs.
See also

The following is a different but related RFC.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Anomie added a comment.Jun 4 2018, 7:13 PM

Particularly in the case of Special:Contributions, I'm not sure that is correct. I at least would expect the number to indicate what the user did, that is how much of the page that edit by that user changed.

Given how change-averse the on-wiki communities can be, perhaps we should actually ask the people who use it for patrolling which behavior they expect before we go changing things.

B. We could record the size difference when the revision is created, perhaps in a change tag parameter.

I note that discussion in T185355: Normalize change tag schema seems to have been mostly in favor of dropping ct_params (except you, who wanted to "abuse" it), but it wasn't done at the time because Translate was still populating it. Translate no longer does so.

I still think that if you want a "revision_props" table to act like page_props, you should create that instead of conflating it with the user-visible change tags.

It would even be possible to change this to record the size of the diff, instead of the difference in size, which is more meaningful.

Possible, but perhaps not easy to do well. Sometimes the diff engine gets confused and shows a large diff for a small change.

  • The diff links shown along with the size difference (on the history page as well as the contributions page) use the diff=prev, which is implemented by Title::getRelativeRevisionID() to be based on the timestamp, not rev_parent_id. This would need to be changed to use rev_parent_id to work with option 2. Note that making diff=next also work with option 2 would need an index on rev_parent_id.

That proposal seems even more likely to break people's workflows. When paging through diffs you couldn't rely anymore on it not skipping revisions.

  • Options 1 however seems to be rather redundant, and produce misleading/inconsistent information in some cases.

It's a performance optimization, not "redundant".

And only "misleading/inconsistent" if you're still using the wrong definition for the field (or what the "size difference" is indicating).

  • Do we want to record the original "edit as intended", and if yes, where?

That question probably needs wider input than just whoever is likely to show up to the meeting.

Bawolff added a subscriber: Bawolff.Jun 4 2018, 7:14 PM

 A. We could compute size differences on the fly on history page, but not on the contributions page (at least not easily/efficiently)

There are edges cases where history pages need this - e.g. if you are filtering by a tag.

daniel added a comment.Jun 4 2018, 7:27 PM

For the record, my intent is to explore options and their consequences. I'm not proposing to actually implement any of them without consulting the community.

It's a performance optimization, not "redundant".

Well, redundancy is not necessarily bad - it can be used to improve resilience or, as you say, performance.

And only "misleading/inconsistent" if you're still using the wrong definition for the field (or what the "size difference" is indicating).

Well, the fact that I have a size difference and a diff link next to each other, and they my show completely different, even opposite things, seems at least confusing to me. But ok, that would not be the case of rev_parent_id was completely consistent with the sorted history.

The diff may still show an edit the user never made, though. I find that rather disturbing. I'd love to have at least *some* way to detect this.

GeoffreyT2000 updated the task description. (Show Details)Jun 5 2018, 2:22 PM
Anomie added a comment.Jun 7 2018, 2:51 PM

Use page creation log for the N marks.

Another option would be to apply an appropriate change tag to revisions that (re-)create the page.

Doing a selective undeletion on a Wikidata item is pointless, as Wikidata items cannot be moved nor can deleted Wikidata items be "recreated". The only way to make a deleted Wikidata item become blue-linked again is to undelete it. Existing Wikidata items should never have any deleted revisions. If there are any existing Wikidata items that have deleted revisions remaining due to a previous selective undeletion, then a bot should undelete them all as soon as possible (ASAP) before the alternative proposal gets approved.

daniel added a subscriber: Tgr.Jun 14 2018, 1:01 PM

@Tgr The meeting minutes say you are going to write up a proposal for fixing the parent id: tgr proposes: add a new field to store diff size, recompute it when the (timewise) previous revision changes. Use page creation log for the N marks. Use rev_parent_id for logical parents (and do not erase the information we have from the time period when it was actually used as a logical parent.).

Are you planning to do that?

Tgr added a comment.Jun 14 2018, 7:13 PM

I intend to, yeah.

Change 441045 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Change PageUpdater::hasEditConflict to take a parameter.

https://gerrit.wikimedia.org/r/441045

Tgr added a comment.Jun 21 2018, 12:04 PM

Beyond the use cases already considered, we also need to look at how Revision[Record]::getParentId() is used in hooks at such. It seems at least FlaggedRevs would break if that changed to be the logical parent.

Change 441045 was initially targeted incorrectly to this task, but it has since been fixed to target T197685 instead. So I have removed the "Patch for Review" project and reviewer gerritbot, for now. Also, we actually already have a script named populateParentId for populating null rev_parent_id fields. We just needed that script to allow fixing existing rev_parent_id fields, as well as populating or fixing ar_parent_id fields.

Vvjjkkii renamed this task from RFC: Use ar_page_id to determine the parent IDs for undeleted revisions to 1qdaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 1qdaaaaaaa to RFC: Use ar_page_id to determine the parent IDs for undeleted revisions.Jul 1 2018, 8:02 AM
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage.Jul 3 2018, 2:02 AM
GeoffreyT2000 updated the task description. (Show Details)Jul 6 2018, 2:59 PM
GeoffreyT2000 renamed this task from RFC: Use ar_page_id to determine the parent IDs for undeleted revisions to RFC: Overhaul undelete feature with a "pagearchive" table.Jul 8 2018, 7:26 PM
GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 8:52 PM
GeoffreyT2000 renamed this task from RFC: Overhaul undelete feature with a "pagearchive" table to RFC: Overhaul the undelete feature with a "pagearchive" table.Jul 8 2018, 9:14 PM
GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 9:20 PM
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 9:37 PM

I have renamed the task and rearranged the description to make it more focused on the alternative proposal. The new interface for undeletion will distinguish page IDs more clearly than the current interface, which lists all deleted revisions regardless of the value of their ar_page_id fields. It would also be simpler and easier to use. Instead of selecting revisions, all one has to do is to select a page ID to be undeleted and then click the "Restore" button.

daniel added a comment.EditedJul 10 2018, 10:51 AM

@GeoffreyT2000 You are proposing a change to how the deletion/undeletion system works conceptually. This is not something that can be decided by a technical RFC alone. Your proposal sounds sensible enough to me to be seriously considered, but for it to get resourcing for development, you'd have to show considerable demand from the admin community (or rather, communities). My point is: you are proposing a change in behavior, introducing and changing user-side concepts and processes. That's a feature request. It needs to be decided by product owners and UX people, not by software engineers. So it'S not a good fir for a TechCOm RFC.

Once there is community demand, and a developer team has agreed to work on an implementation, then it is time to consider a technical RFC. But that's only necessary if the solution is strategic, cross-cutting, or hard to undo. Which may or may not be the case, depending on how exactly this is implemented.

GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Aug 5 2018, 3:23 PM
GeoffreyT2000 renamed this task from RFC: Overhaul the undelete feature with a "pagearchive" table to RFC: How should we fix the undeletion system?.Oct 10 2018, 2:16 AM
Scott added a subscriber: Scott.Mar 21 2019, 12:12 PM
daniel moved this task from Under discussion to Backlog on the TechCom-RFC board.Jul 22 2019, 4:46 PM
daniel added a project: Core Platform Team.
daniel added a subscriber: CCicalese_WMF.

Lacks product level direction. Putting this into the RFC backlog for now. Tagging Core Platform Team for input from the product perspective. @CCicalese_WMF, do you have any thoughts on this?

Krinkle updated the task description. (Show Details)Jul 22 2019, 4:48 PM

(Apparently upstream Phabricator decided h3 headings from markdown should be twice as big as the page title. Lowering heading level as workaround.)

CCicalese_WMF triaged this task as Medium priority.Jul 23 2019, 7:47 PM

This will need more analysis before a solution is chosen. It may be superseded by T20493. Putting this task in the CPT backlog for now for consideration in a future initiative.

daniel added a subscriber: kostajh.Aug 28 2019, 8:32 AM
Izno added a subscriber: Izno.Sep 1 2019, 3:55 PM