Page MenuHomePhabricator

RFC: How should we fix the undeletion system?
Open, Needs TriagePublic

Description

The following supercedes T185167, which I had closed because it was determined to be too destructive.

With T183375, undeletion no longer sets the parent IDs for the undeleted revisions to the "expected" one given by the previous revision by ID (usually, but not always, also the previous revision by timestamp), but instead to the ar_parent_id values. For example, look at the history of Ernest Madu on Wikipedia, and you'll see that revision 820864133 has parent ID 0 (with an "N" mark in the user's contribs) instead of 820863993. But even this alone can still cause problems.

Flaws with the current interface

Some flaws include:

  • Failure to distinguish page IDs
  • Broken parent revisions
  • "Ancestor" revisions being scattered randomly across different titles and mixed in with unrelated histories, with no easy way to rejoin all the revisions into the history of one single title
    • This is the same problem that would occur when manually changing rev_page fields (with the current behavior, deleting A, moving B to A without redirect, undeleting a single revision for A, moving A back to B without redirect, and finally undeleting the rest of the revisions for A is effectively equivalent to simply changing the rev_page field for the single revision from A's page ID to B's page ID).
  • Takes forever to load for pages with thousands of deleted revisions

Proposals 1 and 2

Moved to T206587 to keep the main task's description to a reasonable size.

Proposal 3 (to be attempted for consensus if Proposals 1 and 2 both fail)

When undeleting, always set each undeleted revision's parent ID to the previous undeleted revision among those revisions that had the same ar_page_id field (which will usually be the same as the ar_parent_id field), or zero if there is no such revision, with the following exception: if selective undeletion had been used to effectively revert a page to an older revision and the page had not been edited since the selective undeletion, undeleting later revisions (which would imply a necessary update to the page_latest field in the page table) will leave the parent ID for the earliest undeleted revision as the page's current revision instead of changing the parent ID to zero. This can be done by making a variable that is initialized as an empty array and then changes each time a revision is being inserted. The key is the ar_page_id field and the value is the ID to be used as the parent ID. The exceptional case will not be applied if the page had been edited since the selective undeletion, in order to prevent "siblings" (two revisions with the same parent revision). Each distinct ar_page_id field would then be treated as a separate history (but nonetheless would still appear altogether in a single page's history).

Advantages of Proposal 3

  • Makes revisions that were originally from different page IDs distinguishable using rev_parent_id.
  • Works better for deleted revisions that have null ar_parent_id fields.

Disadvantage of Proposal 3

  • Not idempotent—when a page is undeleted, then re-deleted, and finally undeleted again, the parent IDs might be different from what they were after the first undeletion, because Special:Undelete will no longer know that two revisions had belonged to separate page IDs.

See also

The following is a different but related RFC.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I think there underlying problem is that there is a conflict between what the rev_parent_id field is currently being used for, and the many things it could (or was) originally used for.

The page history and user contributions pages are ordered by timestamp (not revision id). The comparison view (linked from there) also by default compares to the "previous" revision by timestamp (not parent id). The parent id, as @Anomie describes, is only used for two features: The "size" number (difference in byte length of this revision and the "parent" revision), and the new page "N" marker (if parent_id is 0, as for the edit that created a page).

I'm sorry for potentially regressing this discussion back towards a previous state, but I think that, before we get deeper into technical details, we have to decide what outcomes we want to achieve.

History page (size difference and new page marker)

From a reader and product perspective, I suspect it would make sense for the "size" and "new page" marker to be based on timestamps, the same way the rest of the history page is. This is currently not the case, and that could be considered a bug. If we focus on this as a desired outcome, we can then decide how to achieve that. Either by making parent_id work the way it needs to to achieve this result, or by other means (e.g. compute at run-time, maybe cached, maybe a new database field or table).

The result would be:

  • The comparison view and the "size difference" always agree.
  • Only 1 revision to a page has the "N" mark, it is the first revision.

This would fix two bugs:

  • Incorrect showing of "New page" mark when restoring a revision without its parent.
  • The comparison view can be bigger/smaller than the "size difference", if parent_id and previous revision by timestamp are different as result if deletion/undeletion.
Preservation of data and user intent

From the perspective of research, archiving and preservation, it would be preferable to see how an edit was originally made. In other words, to have in the database, whether this revision was originally submitted in a way that created a page, or else to have in the database the revision ID that this edit was based on.

The result would be:

  • The comparison view can remain as it currently by default (based on prevision revision by timestamp, matching the history page). But we could introduce a new way to view the edit that was originally submitted.
  • Only revisions that originally created a page will have an "N" mark. This means if the oldest revision is deleted, there will no longer be an "N" mark. And if older revisions are merged to a page, there may be revisions before the "N" marked revision, or even multiple revisions marked "N".

This would require:

  • Implement "size difference" based on timestamp, or omit from history page if parent ID is not visible to current user.

This would fix two bugs:

  • Incorrect showing of "New page" mark when restoring a revision without its parent.
  • Incorrect showing of "new page" for the oldest revision even if that contributor did not actually create the page.

Both of these can be shown in a sensible way to users, and both of these fix problems. The current implementation is broken because it tries to do both, which is not possible. We have to decide what we want to achieve, and then determine how to make it happen. If the decision (with support from product managers) is to support both of these, then it will require additional database fields to be introduced, and/or new query plans to be figured out.

Anomie added a comment.EditedMay 23 2018, 7:56 PM

From a reader and product perspective, I suspect it would make sense for the "size" and "new page" marker to be based on timestamps, the same way the rest of the history page is. This is currently not the case, and that could be considered a bug.

It's mostly currently the case. It only gets screwed up because the rev_parent_id field isn't updated properly for undeletions and imports (and maybe history merges?).

The really confusing part here is that, until recently, undeletions did half of the needed update. Enough to make the "N" work right, and size differences if there weren't any newer revisions on the page than the ones being undeleted.

whether this revision was originally submitted in a way that created a page,

That sounds like a job for changetags, IMO. Although back-populating it might be somewhat error-prone.

Edit: Also T12331: Introduce page creation log, which is in the process of being implemented.

or else to have in the database the revision ID that this edit was based on.

By "based on", do you mean before or after automatic edit conflict resolution?

or else to have in the database the revision ID that this edit was based on.

By "based on", do you mean before or after automatic edit conflict resolution?

I think after conflict resolution. Intent being to see a diff that resembles the effective change that was made by the user, equal to what one would see when reviewing it immediately after saving (without any other changes being made to the database).

This RFC is getting kinda ridiculous. The comments requested by the OP should trigger a discussion between the OP and others, not endless modifications of the proposal by the OP based on the comments and whatnot. As a result, it is impossible to tell which points were already addressed in the comments, which comments were the motivation for the new points in the proposal etc.

TechCom is hosting a Public IRC Discussion of this RFC next week 2018-06-06 in the #wikimedia-office channel at 2pm PST(22:00 UTC, 23:00 CET). The meeting will be used as an opportunity to review the different options and the pros/cons of each.

FYI, I may not be able to make that meeting. If that turns out to be the case, I hope someone can adequately represent my views based on my prior comments in this task, particularly my counterproposal in T193690#4178492. Thanks.

daniel added a comment.Jun 1 2018, 4:24 PM

Please note that the scope of the IRC discussion of 2018-06-06 is intended to me more general than what this RFC proposes.

The main questions of the meeting will be:

  • do we need rev_parent_id?
  • if yes, and what exactly should the semantics of rev_parent_id be?
  • if no, what do we replace it with (if anything)?

I am in the process of preparing a write-up to frame the discussion. Should be done by monday.

Anomie added a comment.Jun 1 2018, 5:10 PM
  • do we need rev_parent_id?
  • if yes, and what exactly should the semantics of rev_parent_id be?

I think the controlling factor is the size difference display on Special:Contributions, which people seem to expect to show the difference between the revision shown and the previous revision currently visible in the article's history. To calculate the difference, we need to know the sizes of those previous revisions.

We have the same size difference display on article history pages, but in that case we're fetching the page's revisions in order so we already have the previous revision for every revision displayed (except perhaps the oldest).

We could replace the 'N' indicator (based on rev_parent_id == 0) with a change tag, particularly if we really want the semantics to be "this (re-)created the page".

  1. Answer "yes" and "points to the currently-previous revision of the page by rev_timestamp".
  2. Special:Contributions will have to make a separate query for each row displayed to fetch the previous revision by timestamp.
  3. Someone come up with a single SQL query to get the previous revisions for an arbitrary set of rev_ids without using rev_parent_id.
  4. Add a "rev_len_diff" field, and update that (instead of rev_parent_id) on undeletions and so on.
  5. Change the expected behavior, so if revision B is inserted between A and C then C will still show the difference as C-A rather than C-B.
    1. Keep rev_parent_id but never update it after creation.
    2. Add a "rev_len_diff" field, without updating after creation.
  6. Remove the size difference from Special:Contributions.

I suspect on-wiki patrollers would object to #5 and #6.

I suspect on-wiki patrollers would object to #5 and #6.

I'm sure they would object to #6. I don't see why they would object to #5. For patrolling, you want to know what the *edit* did.

daniel added a comment.Jun 4 2018, 5:39 PM

@Anomie said:

I think the controlling factor is the size difference display on Special:Contributions, which people seem to expect to show the difference between the revision shown and the previous revision currently visible in the article's history.

Particularly in the case of Special:Contributions, I'm not sure that is correct. I at least would expect the number to indicate what the user did, that is how much of the page that edit by that user changed. That would require the size difference to be based on the parent revision in the moment of the edit.

daniel added a comment.EditedJun 4 2018, 6:15 PM

In preparation of the RFC/IRC discussion scheduled for 2018-06-06, here are the main open points that where identified as during discussion on this ticket and among TechCom. Though I agree with @Lahwaacz that this RFC has unfortunately drifted a lot, I still prefer to continue the discussion here, rather than creating a new ticket and spreading it out more. That would create more confusion, I'm afraid.

So it seems that:

  • rev_parent_id was originally created to represent a page's "visible" history as a linked list.
  • alternatively, it could be interpreted as a log of the author's action, allowing to see the edit as originally made. This however could also be done differently (using page props or revision tags or the page creation log).
  • rev_parent_id is mainly used to display a size difference, and show it on the history and the contributions page.
  • a secondary use case is to identify a page's first revision.
  • due to bugs in the undeletion code, we have a lot of confusing histories

It seems we have three options:

  1. consolidate the current behavior (and fix broken data) - that is, use rev_parent_id as a linked list consistent with the sorting order of (visible) revisions by timestamp (and id) used on the history page.
  2. change the behavior so that rev_parent_id preserves the author's action.
  3. Remove rev_parent_id entirely. A. We could compute size differences on the fly on history page, but not on the contributions page (at least not easily/efficiently). B. We could record the size difference when the revision is created, perhaps in a change tag parameter. It would even be possible to change this to record the size of the diff, instead of the difference in size, which is more meaningful.

Considerations:

  • Option 1 is the easiest: we don't have to do much (except perhaps for fixing bugs and corrupt data, as @Anomie suggested), and it provides consistent and "obvious" behavior for the history page.
  • The diff links shown along with the size difference (on the history page as well as the contributions page) use the diff=prev, which is implemented by Title::getRelativeRevisionID() to be based on the timestamp, not rev_parent_id. This would need to be changed to use rev_parent_id to work with option 2. Note that making diff=next also work with option 2 would need an index on rev_parent_id.
  • Options 1 however seems to be rather redundant, and produce misleading/inconsistent information in some cases.
  • We have no way to back-fill option 2 (or 3.B) work with old revisions. So if we go with option 2, we'd really end up with a mix of 1 and 2.
  • It's also unclear how to handle option 2 when importing revisions (possibly without their parent) from another wiki.
  • It would be really nice to record the original user action ("original parent" and "size of diff"), for research and perhaps also for legal/licensing reasons. Without this information, actions a user never made (e.g. blanking or completely overwriting a page) can be attributed to a user, with no way to tell that they did not actually make this edit.

So, the questions remain:

  • Can we do without rev_parent_id entirely? Can we live without the size info on Special:Contributions, or can we query it cheaply enough?
  • Do we want to record the original "edit as intended", and if yes, where?
  • On a side note, what's the best way to detect a page's first revision, and do we still need that if we have the page creation log (and recentchanges).
Anomie added a comment.Jun 4 2018, 7:13 PM

Particularly in the case of Special:Contributions, I'm not sure that is correct. I at least would expect the number to indicate what the user did, that is how much of the page that edit by that user changed.

Given how change-averse the on-wiki communities can be, perhaps we should actually ask the people who use it for patrolling which behavior they expect before we go changing things.

B. We could record the size difference when the revision is created, perhaps in a change tag parameter.

I note that discussion in T185355: Normalize change tag schema seems to have been mostly in favor of dropping ct_params (except you, who wanted to "abuse" it), but it wasn't done at the time because Translate was still populating it. Translate no longer does so.

I still think that if you want a "revision_props" table to act like page_props, you should create that instead of conflating it with the user-visible change tags.

It would even be possible to change this to record the size of the diff, instead of the difference in size, which is more meaningful.

Possible, but perhaps not easy to do well. Sometimes the diff engine gets confused and shows a large diff for a small change.

  • The diff links shown along with the size difference (on the history page as well as the contributions page) use the diff=prev, which is implemented by Title::getRelativeRevisionID() to be based on the timestamp, not rev_parent_id. This would need to be changed to use rev_parent_id to work with option 2. Note that making diff=next also work with option 2 would need an index on rev_parent_id.

That proposal seems even more likely to break people's workflows. When paging through diffs you couldn't rely anymore on it not skipping revisions.

  • Options 1 however seems to be rather redundant, and produce misleading/inconsistent information in some cases.

It's a performance optimization, not "redundant".

And only "misleading/inconsistent" if you're still using the wrong definition for the field (or what the "size difference" is indicating).

  • Do we want to record the original "edit as intended", and if yes, where?

That question probably needs wider input than just whoever is likely to show up to the meeting.

Bawolff added a subscriber: Bawolff.Jun 4 2018, 7:14 PM

 A. We could compute size differences on the fly on history page, but not on the contributions page (at least not easily/efficiently)

There are edges cases where history pages need this - e.g. if you are filtering by a tag.

daniel added a comment.Jun 4 2018, 7:27 PM

For the record, my intent is to explore options and their consequences. I'm not proposing to actually implement any of them without consulting the community.

It's a performance optimization, not "redundant".

Well, redundancy is not necessarily bad - it can be used to improve resilience or, as you say, performance.

And only "misleading/inconsistent" if you're still using the wrong definition for the field (or what the "size difference" is indicating).

Well, the fact that I have a size difference and a diff link next to each other, and they my show completely different, even opposite things, seems at least confusing to me. But ok, that would not be the case of rev_parent_id was completely consistent with the sorted history.

The diff may still show an edit the user never made, though. I find that rather disturbing. I'd love to have at least *some* way to detect this.

GeoffreyT2000 updated the task description. (Show Details)Jun 5 2018, 2:22 PM
Anomie added a comment.Jun 7 2018, 2:51 PM

Use page creation log for the N marks.

Another option would be to apply an appropriate change tag to revisions that (re-)create the page.

Doing a selective undeletion on a Wikidata item is pointless, as Wikidata items cannot be moved nor can deleted Wikidata items be "recreated". The only way to make a deleted Wikidata item become blue-linked again is to undelete it. Existing Wikidata items should never have any deleted revisions. If there are any existing Wikidata items that have deleted revisions remaining due to a previous selective undeletion, then a bot should undelete them all as soon as possible (ASAP) before the alternative proposal gets approved.

daniel added a subscriber: Tgr.Jun 14 2018, 1:01 PM

@Tgr The meeting minutes say you are going to write up a proposal for fixing the parent id: tgr proposes: add a new field to store diff size, recompute it when the (timewise) previous revision changes. Use page creation log for the N marks. Use rev_parent_id for logical parents (and do not erase the information we have from the time period when it was actually used as a logical parent.).

Are you planning to do that?

Tgr added a comment.Jun 14 2018, 7:13 PM

I intend to, yeah.

Change 441045 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Change PageUpdater::hasEditConflict to take a parameter.

https://gerrit.wikimedia.org/r/441045

Tgr added a comment.Jun 21 2018, 12:04 PM

Beyond the use cases already considered, we also need to look at how Revision[Record]::getParentId() is used in hooks at such. It seems at least FlaggedRevs would break if that changed to be the logical parent.

Change 441045 was initially targeted incorrectly to this task, but it has since been fixed to target T197685 instead. So I have removed the "Patch for Review" project and reviewer gerritbot, for now. Also, we actually already have a script named populateParentId for populating null rev_parent_id fields. We just needed that script to allow fixing existing rev_parent_id fields, as well as populating or fixing ar_parent_id fields.

Vvjjkkii renamed this task from RFC: Use ar_page_id to determine the parent IDs for undeleted revisions to 1qdaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 1qdaaaaaaa to RFC: Use ar_page_id to determine the parent IDs for undeleted revisions.Jul 1 2018, 8:02 AM
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage.Jul 3 2018, 2:02 AM
GeoffreyT2000 updated the task description. (Show Details)Jul 6 2018, 2:59 PM
GeoffreyT2000 renamed this task from RFC: Use ar_page_id to determine the parent IDs for undeleted revisions to RFC: Overhaul undelete feature with a "pagearchive" table.Jul 8 2018, 7:26 PM
GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 8:52 PM
GeoffreyT2000 renamed this task from RFC: Overhaul undelete feature with a "pagearchive" table to RFC: Overhaul the undelete feature with a "pagearchive" table.Jul 8 2018, 9:14 PM
GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 9:20 PM
GeoffreyT2000 updated the task description. (Show Details)Jul 8 2018, 9:37 PM

I have renamed the task and rearranged the description to make it more focused on the alternative proposal. The new interface for undeletion will distinguish page IDs more clearly than the current interface, which lists all deleted revisions regardless of the value of their ar_page_id fields. It would also be simpler and easier to use. Instead of selecting revisions, all one has to do is to select a page ID to be undeleted and then click the "Restore" button.

daniel added a comment.EditedJul 10 2018, 10:51 AM

@GeoffreyT2000 You are proposing a change to how the deletion/undeletion system works conceptually. This is not something that can be decided by a technical RFC alone. Your proposal sounds sensible enough to me to be seriously considered, but for it to get resourcing for development, you'd have to show considerable demand from the admin community (or rather, communities). My point is: you are proposing a change in behavior, introducing and changing user-side concepts and processes. That's a feature request. It needs to be decided by product owners and UX people, not by software engineers. So it'S not a good fir for a TechCOm RFC.

Once there is community demand, and a developer team has agreed to work on an implementation, then it is time to consider a technical RFC. But that's only necessary if the solution is strategic, cross-cutting, or hard to undo. Which may or may not be the case, depending on how exactly this is implemented.

GeoffreyT2000 updated the task description. (Show Details)
GeoffreyT2000 updated the task description. (Show Details)Aug 5 2018, 3:23 PM
GeoffreyT2000 renamed this task from RFC: Overhaul the undelete feature with a "pagearchive" table to RFC: How should we fix the undeletion system?.Oct 10 2018, 2:16 AM
Scott added a subscriber: Scott.Mar 21 2019, 12:12 PM