Unify various deletion systems
Open, Normal, Public

Description

Background

We currently have two systems for the deletion of revisions:

  1. Page deletion
  2. Revision delete

Page deletion

This is MediaWiki's original deletion system. It is exposed through the interface as "Delete page" (action=delete) and "Restore page" (Special:Undelete).

Database process:
Moves a page and its revisions to the "archive" database table.

Visibility:
Revisions from deleted (or "archived") pages are not shown in page history, or user contributions. Administrators may view them via Special:Undelete/<title> or Special:DeletedContributions/<user>.

Limitations:
The database process for page deletion is inefficient. This cannot be improved, because the problem is not how we do it but what we do: moving rows between tables, which is considered bad practice for database operations. To reduce its negative impact on database stability, replication lag, and performance, "Page deletion" can be limited via the $wgDeleteRevisionsLimit configuration. When limited, only users with the bigdelete right may use the feature on pages with more than this number of revisions.

On Wikimedia wikis, the limit has been set at 5,000 revisions, and the bigdelete right has mostly been reserved for Stewards and Developers. These users can, with caution, sometimes perform the deletion through a simple request procedure. However, even with this user right, the underlying process is highly inefficient and can have a lasting impact on database performance in the minutes or hours that follow. For this reason, all database transactions on Wikimedia wikis have additional limits that abort them when this is about to happen.

As a result, pages with many more than 5,000 revisions cannot be deleted through this process. The only way to do so without disrupting database performance would be to batch the deletion. However, it is unknown whether this can be done safely, given the possible database failure and rollback scenarios it would have to account for.

See also:

Revision delete

This is a newer mechanism introduced in 2009. It is exposed on the "View history" and "User contributions" views as "Change visibility of selected revisions", and works by first ticking the check boxes of the relevant revisions.

Database process:
Changes the numerical value in the rev_deleted field for the relevant revisions in the database. This can be done in batches.

Visibility:
Revisions that have been "deleted" (or "hidden") still have a placeholder row shown in the interface on "Page history" and "User contributions".

The "Revision delete" feature allows admins to decide which aspect(s) of a revision to hide, and from whom. In particular, it can separately control the visibility of the textual content, the edit summary, and the user's name/IP, and it can hide them either from non-admins only or from everyone (suppression, aka "oversight").
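
As an illustration of how a single integer field can carry these independent choices, here is a minimal Python sketch. The bit values mirror MediaWiki's DELETED_TEXT, DELETED_COMMENT, DELETED_USER and DELETED_RESTRICTED constants; the class, the helper names and the simplified permission model (one "admin" flag, one "oversighter" flag) are assumptions made for this example only.

```python
from enum import IntFlag


class RevDeleted(IntFlag):
    """Bits stored in the revision's visibility field."""
    TEXT = 1        # hide the revision text
    COMMENT = 2     # hide the edit summary
    USER = 4        # hide the user name / IP
    RESTRICTED = 8  # hidden aspects are also hidden from admins (suppression)


def hide(current: int, *aspects: RevDeleted) -> int:
    """Return the field value with the given bits switched on."""
    for aspect in aspects:
        current |= aspect
    return int(current)


def can_view(rev_deleted: int, aspect: RevDeleted,
             is_admin: bool, is_oversighter: bool) -> bool:
    """Simplified visibility check for one aspect of a revision."""
    if not rev_deleted & aspect:
        return True  # this aspect is not hidden at all
    if rev_deleted & RevDeleted.RESTRICTED:
        return is_oversighter  # suppressed: admins are locked out too
    return is_admin or is_oversighter
```

For example, `hide(0, RevDeleted.TEXT, RevDeleted.USER)` hides the text and the user name while leaving the edit summary visible to everyone.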

Limitations:
I couldn't find any limit in the code (which is concerning), but the interfaces (History page, Contributions page) do limit how many revisions they offer at once, and in any event the general transaction limits still apply. Regardless of whether this needs a limit, though, it could be batched internally if needed (either in-request or using the JobQueue). As a last fallback, users can also "batch" manually (e.g. increase the history view to show 500 rows at once, and shift-select them as one chunk), which could work in extreme cases when stewards/developers need to intervene.
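
A rough sketch of what internal batching could look like, using Python and an in-memory SQLite table as a stand-in for the real database. The two-column schema, the function names and the batch size of 500 are assumptions for illustration; this is not how MediaWiki currently implements the feature.

```python
import sqlite3
from typing import Iterable, Iterator, List

DELETED_TEXT = 1  # same value as MediaWiki's DELETED_TEXT bit


def make_demo_db(n_revisions: int) -> sqlite3.Connection:
    """Build an in-memory stand-in for the revision table (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " rev_id INTEGER PRIMARY KEY,"
        " rev_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO revision (rev_id) VALUES (?)",
        [(i,) for i in range(1, n_revisions + 1)],
    )
    conn.commit()
    return conn


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hide_in_batches(conn: sqlite3.Connection, rev_ids: Iterable[int],
                    bits: int, batch_size: int = 500) -> int:
    """Set visibility bits batch by batch; returns the number of batches."""
    batches = 0
    for batch in chunked(list(rev_ids), batch_size):
        placeholders = ",".join("?" for _ in batch)
        conn.execute(
            "UPDATE revision SET rev_deleted = rev_deleted | ? "
            f"WHERE rev_id IN ({placeholders})",
            [bits, *batch],
        )
        conn.commit()  # one short transaction per batch
        batches += 1
    return batches
```

Because each batch commits on its own, a failure part-way through leaves some revisions hidden and others not; a real implementation would need to decide how to resume or report that state.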

See also https://www.mediawiki.org/wiki/Help:RevisionDelete.

Problem

The "Revision delete" system seems to scale fairly well, and if/when it shows problems, there's a clear path for how to make it work for larger pages.

The "Page delete" system, on the other hand, has severe limitations. Even if we ignore the edge case of pages with 5000+ revisions, the underlying concept is still problematic. Database operations that move rows between tables, even for smaller pages, are something DBAs would prefer never happen, and should be migrated away from.

Issues:

Solution

Requirements
  • Administrators must still be able to delete entire pages in a way that is as easy as "Page deletion" is today.
  • The technical implementation of that action must not move rows between tables.
  • The viewing of "Page history" and "User contributions" (and related APIs) must not display revisions of deleted pages (by default), the same as today.

Proposal 1:

Nothing specific yet, but it seems I (@Krinkle) and others find it worth exploring to see if we can re-implement the logic behind "Page deletion" by using the same code and database logic that is used by "Revision delete". This would involve the following:

  • Add a bit-field value for revision.rev_deleted to represent "archived".
  • Update page/user revision views (Page history, User contributions) to make sure revisions with this flag are not shown by default.
  • Add a way to see them. (e.g. re-using Special:DeletedContributions, or through a switch on Special:Contribs itself, same for history).
  • TODO: Decide what to do with the page entity itself (meta data). E.g. a page_deleted flag (possibly including a state for "deletion in progress", to be batch-friendly).
  • TODO: Decide how/if to migrate archive into revision.rev_deleted=archived.
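
To make the first two bullets more concrete, here is a minimal Python/SQLite sketch. The DELETED_ARCHIVED bit and its value 16 are hypothetical (no such constant exists in MediaWiki today), and the three-column revision table and function names are simplified stand-ins chosen for this example.

```python
import sqlite3

DELETED_ARCHIVED = 16  # hypothetical new bit; not defined in MediaWiki


def make_db() -> sqlite3.Connection:
    """In-memory revision table with two pages (IDs 10 and 11)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " rev_id INTEGER PRIMARY KEY,"
        " rev_page INTEGER NOT NULL,"
        " rev_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO revision (rev_id, rev_page) VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 11)],
    )
    return conn


def archive_page(conn: sqlite3.Connection, page_id: int) -> None:
    """'Delete' a page by flagging its revisions in place; no rows move."""
    conn.execute(
        "UPDATE revision SET rev_deleted = rev_deleted | ? WHERE rev_page = ?",
        (DELETED_ARCHIVED, page_id),
    )


def restore_page(conn: sqlite3.Connection, page_id: int) -> None:
    """Clear the archived bit again, leaving other visibility bits intact."""
    conn.execute(
        "UPDATE revision SET rev_deleted = rev_deleted & ? WHERE rev_page = ?",
        (~DELETED_ARCHIVED, page_id),
    )


def history(conn: sqlite3.Connection, page_id: int,
            include_archived: bool = False) -> list:
    """Page history; archived revisions are left out unless requested."""
    sql = "SELECT rev_id FROM revision WHERE rev_page = ?"
    params = [page_id]
    if not include_archived:
        sql += " AND (rev_deleted & ?) = 0"
        params.append(DELETED_ARCHIVED)
    sql += " ORDER BY rev_id"
    return [row[0] for row in conn.execute(sql, params)]
```

The sketch deliberately leaves out the page entity itself, which is the open question in the TODO items above.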

Original task description from bugzilla.wikimedia.org user FT2.wiki:

At present we have 4 means of hiding material from either the public or from administrators. Material can be:

  • Deleted from the public with traditional deletion
  • Deleted from the public (part or full) with RevisionDeleted
  • Deleted from admin view with Oversight
  • Suppressed from admin view with RevisionDeleted

This collection means that any review of editor actions or conduct, or article matters on the wiki, now faces two big problems in evaluating the existence or seriousness of any issue:
  • It's incredibly easy to overlook some edits or actions in the review, which should be taken account of.
  • It's more complex and takes examination of several screens, to review a matter.
  • Each of these has different mechanisms for viewing edits they affect; there is no consistency of links, formats, access methods, etc.
  • A third issue at a technical level - it's a lot to maintain, and allows for inconsistent software behavior (or bugs fixed in one of these but not spotted in the other), and requires more developer time etc.

I would like to suggest that in fact, all we now need is RevisionDeleted, with the following options:

  • What to hide - revision text, edit summary, user name/IP
  • Whether admins can or can't access the hidden data
  • Whether admins or users who cannot access the hidden data, should nonetheless be able to see it exists even if they can't read it (there are cases when this is safe, and cases when it isn't).

This proposal is that RevisionDeleted is amended slightly to show the above options, and then both traditional deleted revisions and oversighted revisions are converted to RevisionDeleted entries as a background task (i.e. a script that achieves this in the job queue over time). Following this:

  • Delete and oversight both redirect to RevDel for their actions
  • Delete/undelete and oversight URLs both redirect to the appropriate lookup link for any historical URL used to view an old deleted/oversighted edit.

The issue here is not so much one of software development, as of a once-off conversion task of old data stored in one system to be moved to another.

Details

Reference
bz18493
bzimport raised the priority of this task from to Low.
bzimport set Reference to bz18493.
bzimport added a subscriber: Unknown Object (MLST).

FT2.wiki wrote:

(2, 3, 4, ... I can't count!)

FT2.wiki wrote:

(Note - this may be agreeable but not practical until admin use of RevDel is enabled, and any significant issues from the rollout of RevDel are addressed)

See bug 17444.

happy.melon.wiki wrote:

Would it be sensible to set this bug up as a tracking bug? Its general principle (phase out other forms of oversight in favour of a tweaked-up RevDel) is blocked by most of the other bugs out there. We certainly need *a* tracker for all this stuff; while compartmentalisation is good, we have a lot of plates spinning at the moment.

mike.lifeguard+bugs wrote:

(In reply to comment #4)

Would it be sensible to set this bug up as a tracking bug? Its general
principle (phase out other forms of oversight in favour of a tweaked-up RevDel)
is blocked by most of the other bugs out there. We certainly need *a* tracker
for all this stuff; while compartmentalisation is good, we have a lot of plates
spinning at the moment.

I've sort-of done that by shifting the depends to blocks on bug 18598.

FT2.wiki wrote:

See also bug 20290, covering a "hide placeholder" option in RevDel. Added the dependency.

thatcher131 wrote:

It's really about time to remove the Oversight extension entirely; that's under discussion. That would leave only 3 deletion options, which are really 2 (admin deletion and RevDel with optional suppression).

Restricted Application added a subscriber: Aklapper. (Oct 27 2015, 4:58 PM)
Meno25 removed a subscriber: Meno25. (Feb 22 2016, 5:45 PM)
Restricted Application added a subscriber: JEumerus. (Feb 22 2016, 5:45 PM)
Krinkle updated the task description.
Krinkle raised the priority of this task from Low to Needs Triage.
Krinkle added a subscriber: Krinkle.
Krinkle moved this task from Inbox to Under discussion on the TechCom-RFC board. (Jun 27 2018, 8:38 PM)
Krinkle moved this task from Under discussion to Request IRC meeting on the TechCom-RFC board.
Krinkle triaged this task as Normal priority.
Krenair updated the task description. (Jun 27 2018, 9:24 PM)
Tgr added a subscriber: Tgr. (Jul 8 2018, 5:19 PM)
Krinkle updated the task description. (Jul 8 2018, 9:51 PM)

TechCom is hosting a Public IRC Discussion of this RFC on 2018-07-11 in the #wikimedia-office channel at 2pm PST (22:00 UTC, 23:00 CET).

There is one small nitpick. It is said that "Database operations that move rows between tables, even for smaller pages, are something DBAs would prefer never happen, and should be migrated away from." Actually, from a pure DBA point of view, moving deleted rows to a separate table is good: it is basically a bad way of implementing partitioning, and it requires less optimization to avoid virtually deleted rows. It is when I put on the database engineer hat that I hate it: it is prone to cause data loss and inconsistencies, and it generates more traffic and writes than needed. This doesn't change the overall sentiment, but it at least highlights one of the few good things about moving rows around (instead of virtually deleting them with SET deleted = 1, or INSERTing the latest version with a deleted status, which is the standard model in most scenarios).
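
The difference in write volume can be sketched with Python and an in-memory SQLite database. The schema is a toy stand-in, the `deleted` column follows the "SET deleted = 1" wording above, and the function names are made up for this example; row counts are those reported by the database driver.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Page 10 has three revisions, page 11 has one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER NOT NULL,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE archive (
            ar_rev_id INTEGER PRIMARY KEY,
            ar_page INTEGER NOT NULL
        );
        INSERT INTO revision (rev_id, rev_page)
            VALUES (1, 10), (2, 10), (3, 10), (4, 11);
        """
    )
    return conn


def delete_by_moving(conn: sqlite3.Connection, page_id: int) -> int:
    """Move rows to the archive table; returns the number of row writes."""
    with conn:  # copy and delete must succeed or fail together
        copied = conn.execute(
            "INSERT INTO archive (ar_rev_id, ar_page) "
            "SELECT rev_id, rev_page FROM revision WHERE rev_page = ?",
            (page_id,),
        ).rowcount
        removed = conn.execute(
            "DELETE FROM revision WHERE rev_page = ?", (page_id,)
        ).rowcount
    return copied + removed


def delete_by_flag(conn: sqlite3.Connection, page_id: int) -> int:
    """Flag rows in place (soft delete); returns the number of row writes."""
    with conn:
        return conn.execute(
            "UPDATE revision SET deleted = 1 WHERE rev_page = ?", (page_id,)
        ).rowcount


def visible_revisions(conn: sqlite3.Connection, page_id: int) -> list:
    """Revisions a reader would still see for the page."""
    rows = conn.execute(
        "SELECT rev_id FROM revision WHERE rev_page = ? AND deleted = 0 "
        "ORDER BY rev_id",
        (page_id,),
    )
    return [row[0] for row in rows]
```

Both approaches leave the same visible history, but the move touches every row twice (one insert plus one delete), while the flag touches it once. The trade-off mentioned above is that every later read of the revision table has to filter on the flag.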

daniel added a subscriber: daniel. (Edited, Jul 11 2018, 9:14 PM)

A few thoughts:

We can't just do rev_deleted = archived and remove the page entry, since we would lose the page title that way. So I see two options:

  1. Have a page_archive table, and move rows between page and page_archive. Note that page_archive may have several entries for the same title. Also, there are two cases for undeletion (this is already the case now): the title exists, or does not exist, in the page table. If the title does exist, this is effectively a history merge, and should perhaps be handled as such. Either way, in that case rev_page of all the revisions being restored needs to be updated to the ID of the existing page.
  2. Have a page_deleted field that can be set to archived. Then the question is what should happen when a page with the same name is created. Perhaps page_deleted can just be cleared, but the archived revisions remain archived? That would be close to current behavior. And there would never be two different page IDs associated with a given title, deleted or not. Which may or may not be a good thing.

EDIT:

  • conceptually, option (1) means that deleted revisions stay bound to a page ID, and when a page with the same title is created (or a page is renamed to that title), the old revisions are not assigned to that page. They stay separate. New functionality would have to be added to allow users to view or undelete revisions of "deleted pages with the same title". Deleted revisions of an existing page will behave the same as "oversighted" revisions of an existing page, and follow renames.
  • option (2) on the other hand means deleted revisions are bound to a page ID, and stay bound to it across renames. Renaming a page will no longer "leave behind" its deleted revisions. Creating and deleting a page with the same title multiple times would result in one big history of deleted revisions (all bound to the same page ID), as opposed to multiple such histories (each with its own page ID).

Have a page_deleted field that can be set to archived. Then the question is what should happen when a page with the same name is created

Presenting the "title" table, as a foreign key of page (which will also solve numerous issues with the *links tables). Title is the equivalent of the normalization of comment, although this needs more thought.

Title is the equivalent of the normalization of comment

That's certainly possible, but I don't see the point. The archive table is the only one that repeats the same title over and over. If we get rid of that, a title can only exist once per namespace in the page table. If we have a page_archive table, it can also exist one more time per deletion of a page. If we go for page_deleted, the title can only exist once per namespace, and would typically only exist twice (for the page and corresponding talk page).

The archive table is the only one that repeats the same title over and over.

Please, please have a look at the pagelinks, templatelinks and categorylinks tables (we could reduce their size 10x)

If we get rid of that, a title can only exist once per namespace in the page table.

Why? We can have the same comment for several revisions (millions of times). We can have the same title for several pages. Title is a combination of namespace + text. You just have page (page_id, namespace, title, deleted) VALUES (1, 0, 36, 1), (2, 0, 36, 0), (3, 1, 37, 0) while title being (title_id, namespace, title) VALUES (36, 0, 'The adventures of Tom Sawyer' -- this would be the url /wiki/The_adventures_of_Tom_Sawyer),(37, 1, 'The adventures of Tom Sawyer' -- this would be the url /wiki/Talk:The_adventures_of_Tom_Sawyer). The details are not that important. Instead of "deleted" you could have a "page_version", a monotonically increasing value of pages, so you don't need to update the ones that are deleted, only get the latest one. Those are only options, we need to see the performance impact and which operations we want to favor over others.

Basically, the idea is that title and page are 2 entities that happen to be related, but one is a set of coherent text with revisions and history, and the other is an alias, which is shared by several pages as they are renamed, deleted and recreated.
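
The inline VALUES above can be turned into a small runnable sketch with Python and SQLite. The table layout follows the example rows, simplified so that the namespace lives only on the title row; the column names, the UNIQUE constraint and the helper function are assumptions for illustration, not a schema proposal.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Title rows 36/37 and page rows 1-3, as in the example above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE title (
            title_id INTEGER PRIMARY KEY,
            title_namespace INTEGER NOT NULL,
            title_text TEXT NOT NULL,
            UNIQUE (title_namespace, title_text)
        );
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_title_id INTEGER NOT NULL REFERENCES title (title_id),
            page_deleted INTEGER NOT NULL DEFAULT 0
        );
        INSERT INTO title VALUES
            (36, 0, 'The adventures of Tom Sawyer'),
            (37, 1, 'The adventures of Tom Sawyer');
        INSERT INTO page VALUES (1, 36, 1), (2, 36, 0), (3, 37, 0);
        """
    )
    return conn


def pages_for_title(conn: sqlite3.Connection, namespace: int, text: str,
                    include_deleted: bool = False) -> list:
    """Page IDs that share one title; deleted pages only on request."""
    sql = (
        "SELECT page_id FROM page "
        "JOIN title ON title_id = page_title_id "
        "WHERE title_namespace = ? AND title_text = ?"
    )
    if not include_deleted:
        sql += " AND page_deleted = 0"
    sql += " ORDER BY page_id"
    return [row[0] for row in conn.execute(sql, (namespace, text))]
```

Here the title is stored once and acts as the alias, while pages 1 (deleted) and 2 (live) are separate entities that happen to share it.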

The archive table is the only one that repeats the same title over and over.

Please, please have a look at the pagelinks, templatelinks and categorylinks tables (we could reduce their size 10x)

Yes, for that having normalized titles would make a lot of sense. And I'm all for doing that, but it doesn't seem to be relevant here. Except perhaps in that we could add page_title_id at the same time as adding page_deleted, if we go for that option.

If we get rid of that, a title can only exist once per namespace in the page table.

Why? We can have the same comment for several revisions (millions of times). We can have the same title for several pages. Title is a combination of namespace + text. You just have page (page_id, namespace, title, deleted) VALUES (1, 0, 36, 1), (2, 0, 36, 0), (3, 1, 37, 0) while title being (title_id, namespace, title) VALUES (36, 0, 'The adventures of Tom Sawyer' -- this would be the url /wiki/The_adventures_of_Tom_Sawyer),(37, 1, 'The adventures of Tom Sawyer' -- this would be the url /wiki/Talk:The_adventures_of_Tom_Sawyer).

Yes, the same title-text can occur once per namespace, as I said. The same title (namespace+text) can occur only once in the page table, it's a unique key.

The details are not that important. Instead of "deleted" you could have a "page_version", a monotonically increasing value of pages, so you don't need to update the ones that are deleted, only get the latest one. Those are only options, we need to see the performance impact and which operations we want to favor over others.

With page_version we'd always have to find the "newest" entry for a title in the page table, which is nasty in joins. And the page table would become much larger. And listing pages would become much more expensive. I don't think that's a good idea.

We are updating the page row for every edit anyway, to write the new value of page_latest. Updating page_deleted at the same time seems unproblematic.

Basically, the idea is that title and page are 2 entities that happen to be related, but one is a set of coherent text with revisions and history, and the other is an alias, which is shared by several pages as they are renamed, deleted and recreated.

Treating the title as an alias that can be re-assigned follows from both options I presented. For that, it does not matter whether the title is normalized or not. The key here is that deleted revisions stay bound to the page ID, while presently, they stay bound to the page title. This is a change in behavior that will break some existing workflows, and would need alternatives to be implemented.

The idea that a title can refer to multiple pages at once (one non-deleted, and multiple deleted) is what the page_archive option achieves.

Reminder: TechCom is hosting a Public IRC Discussion of this RFC on 2018-07-18 in the #wikimedia-office channel at 2pm PST (22:00 UTC, 23:00 CET).

Meeting minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.html

There is the question of how Special:Contributions and Special:DeletedContributions will work. Krinkle believes the community will require feature parity, i.e. the ability to view only deleted contributions, and the ability to view only non-deleted contributions. It should be feasible to add a new merged mode which shows both types of contributions sorted by timestamp. One possible query plan is to store the proposed "archived" flag in a separate boolean field rev_archived, then have only a (rev_user,rev_archived,rev_timestamp) index, and then implement the merged mode using "rev_archived IN (0,1)".

We could have two contributions indexes, (rev_user, rev_timestamp) and (rev_user, rev_archived, rev_timestamp). This is CPU efficient but requires more memory and disk space. Or we could have only (rev_user, rev_timestamp). This is memory efficient, but the Special:DeletedContributions replacement would require a lot of table scanning.
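
The three viewing modes and the single three-column index can be sketched as follows in Python with SQLite. The rev_archived column is the proposed (not yet existing) field; the index name, the mode names and the simplified revision table are assumptions for this example, and the sketch makes no claim about how MySQL/MariaDB would actually plan these queries.

```python
import sqlite3

# Which rev_archived values each viewing mode selects.
MODES = {"live": (0,), "archived": (1,), "merged": (0, 1)}


def make_db() -> sqlite3.Connection:
    """User 7 has two live edits and one archived edit; user 8 has one archived."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_user INTEGER NOT NULL,
            rev_archived INTEGER NOT NULL DEFAULT 0,
            rev_timestamp TEXT NOT NULL
        );
        CREATE INDEX rev_user_archived_timestamp
            ON revision (rev_user, rev_archived, rev_timestamp);
        INSERT INTO revision VALUES
            (1, 7, 0, '20180701000000'),
            (2, 7, 1, '20180702000000'),
            (3, 7, 0, '20180703000000'),
            (4, 8, 1, '20180704000000');
        """
    )
    return conn


def contributions(conn: sqlite3.Connection, user_id: int,
                  mode: str = "live") -> list:
    """Revision IDs for a user, newest first, in the requested mode."""
    flags = MODES[mode]
    placeholders = ",".join("?" for _ in flags)
    sql = (
        "SELECT rev_id FROM revision "
        f"WHERE rev_user = ? AND rev_archived IN ({placeholders}) "
        "ORDER BY rev_timestamp DESC"
    )
    return [row[0] for row in conn.execute(sql, (user_id, *flags))]
```

With this layout the "live" and "archived" modes each read one contiguous index range per user, while "merged" has to combine two ranges by timestamp.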

There is the question of what happens to the page table. A page_deleted field would require a non-unique index (page_deleted,page_title_id).

An improvement on the current "delete/selective undelete" workflow would be to provide a selective deletion feature as a kind of history splitting. The user would select the revisions to be archived, and then a new page row would be created for those revisions, and the revisions selected for archiving would be moved into the new page. That way, there would be no need to include rev_archived in the action=history index, it would implicitly be in rev_page. The new page could be moved and then undeleted under some other title, providing a full history splitting workflow.

For feature parity, a history merge feature needs to be provided. There is the question of whether to allow undeletion of an archived page when a non-deleted page has the same title: should this cause an implicit history merge? Or should this use case be handled entirely with Special:MergeHistory?

JJMC89 added a subscriber: JJMC89. (Thu, Jul 19, 12:43 AM)

There is the question of whether to allow undeletion of an archived page when a non-deleted page has the same title: should this cause an implicit history merge? Or should this use case be handled entirely with Special:MergeHistory?

Not allowing undeletion of deleted revisions (an archived page) would just add extra steps.

  1. Delete the undeleted revisions (the existing page)
  2. Undelete everything

Currently, does it cause an implicit history merge?

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

Not allowing undeletion of deleted revisions (an archived page) would just add extra steps.

  1. Delete the undeleted revisions (the existing page)
  2. Undelete everything

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Currently, does it cause an implicit history merge?

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

I can fix that in like one minute.

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Given that enwiki's most prolific history merger, @Anthony_Appleyard, has hardly used Special:MergeHistory (1), I question whether or not the tool is sufficient for editors' needs.

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

I can fix that in like one minute.

The Wikimedia cluster might be fine in this regard. (I haven't checked.) I was referring to other installations. I am a sysop on one where no groups have mergehistory, so I have to use the poor-man's version.

@JJMC89 It is part of the sysop grant by default in the MediaWiki software.

DefaultSettings.php
$wgGroupPermissions['sysop']['mergehistory'] = true;

I'm not sure, but do you mean that a third-party wiki has given you sysop but explicitly taken out the mergehistory right from said group? If so, that seems odd, given that, as you say, you can still do it via delete/undelete. Perhaps they disabled it by accident?

JJMC89 added a comment. (Edited, Fri, Jul 20, 4:53 AM)

@JJMC89 It is part of the sysop grant by default in the MediaWiki software.

I know.

I'm not sure, but do you mean that a third-party wiki has given you sysop but explicitly taken out the mergehistory right from said group? If so, that seems odd, given that, as you say, you can still do it via delete/undelete. Perhaps they disabled it by accident?

I don't have access to the configs, only what I can see on Special:ListGroupRights (no mergehistory). The wiki has been around since before mergehistory existed. Could that be it? (I don't manage any installs, so I'm ignorant on the impact of updates here.)

Given what @tstarling wrote in T20493#4436799, would I be able to do it with delete/undelete?

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Given that enwiki's most prolific history merger, @Anthony_Appleyard, has hardly used Special:MergeHistory (1), I question whether or not the tool is sufficient for editors' needs.

I can't speak to anyone else, but at least for me, I have been unable to use Special:MergeHistory because its documentation is completely useless. In the face of this, undeletion is the only realistic method for performing history merges.

(1) One long-term fault is that an admin can't delete some edits of a page, but he must delete all the edits and then undelete the edits that he wants to stay undeleted. That wastes his time and Wikipedia's system time. This need arises if he is history-merging X (older page) to Y (newer page), and first he must lose from the end of X any stray late edits (e.g. redirects and BattyBot edits) made after the cut-and-paste event. This process is liable to accidents if the page already has deleted edits at the start of this process.
(2) Currently, moving a page only moves the undeleted (visible) edits. It would be useful if it were also possible to move only the deleted edits, when fishing deleted edits out from under visible edits, to prevent the sort of accident described at the end of (1).

I think there is still no clear agreement on the abstractions: "what is a page", "what is a revision", "what does a deleted revision belong to", "how to move a page with deleted revisions". While those questions cannot be answered without taking into account the limitations imposed by reality, they should be answered first, with very detailed use cases, before proposing a specific implementation. Let's first document the non-trivial workflows that should be possible, and only later the storage model. Let's take into account readers, wiki admins and researchers (among many others) reconstructing the history of a page, who are also affected by the inconsistencies of the current model.

The problems with the current undeletion interface will be solved with T193690, which also deals with fixing problems with rev_parent_id and ar_parent_id fields. The same page ID will never be used for more than one deleted page title, nor for both a deleted and an existing page. Also, the ar_namespace, ar_title, and ar_page_id fields will all be moved to a new pagearchive table as pa_namespace, pa_title, and pa_page_id; and the archive table will get a new ar_pa_id column.

Also, in the context of T193690, we could add a "SplitHistory" class containing a function that does the following for a given title A, a given deleted page ID n (with pa_id p), and a given cut-off revision ID r:

  1. Add a new row to the page table with title A and page_id m, and immediately delete it.
  2. Add a new row to the pagearchive table with q for the pa_id field and m for the pa_page_id field.
  3. Change the ar_parent_id field for the row in the archive table with ar_rev_id r to zero (this must be done because parent IDs are now preserved on undeletions as of MW 1.31).
  4. Change the ar_pa_id field for revision ID r and all later deleted revisions with ar_pa_id p to q.
  5. Update the pa_rev_count fields (used to display the number of restored revisions in the log entry) for rows p and q in the pagearchive table.
  6. Generate a log entry for the history split.
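
The six steps above can be sketched in Python against an in-memory SQLite database. The pagearchive table and the ar_pa_id, pa_page_id and pa_rev_count columns are the ones proposed in this comment and do not exist in MediaWiki today. The log table, the function name and the rule that "later" means a higher revision ID are further assumptions made only for this example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Toy versions of the tables involved (illustrative columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- IDs are never reused
            page_title TEXT NOT NULL
        );
        CREATE TABLE pagearchive (
            pa_id INTEGER PRIMARY KEY,
            pa_page_id INTEGER NOT NULL,
            pa_title TEXT NOT NULL,
            pa_rev_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE archive (
            ar_rev_id INTEGER PRIMARY KEY,
            ar_pa_id INTEGER NOT NULL,
            ar_parent_id INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE log (log_action TEXT NOT NULL, log_title TEXT NOT NULL);
        """
    )
    return conn


def split_history(conn: sqlite3.Connection, title: str, p: int, r: int):
    """Split deleted page p at revision r; returns (m, q)."""
    with conn:  # all six steps in one transaction
        # 1. Create a page row to reserve page ID m, then delete it at once.
        m = conn.execute(
            "INSERT INTO page (page_title) VALUES (?)", (title,)
        ).lastrowid
        conn.execute("DELETE FROM page WHERE page_id = ?", (m,))
        # 2. New pagearchive row q that points at page ID m.
        q = conn.execute(
            "INSERT INTO pagearchive (pa_page_id, pa_title) VALUES (?, ?)",
            (m, title),
        ).lastrowid
        # 3. The cut-off revision becomes the first revision of its history.
        conn.execute(
            "UPDATE archive SET ar_parent_id = 0 WHERE ar_rev_id = ?", (r,)
        )
        # 4. Reassign r and all later revisions of p to q.
        conn.execute(
            "UPDATE archive SET ar_pa_id = ? "
            "WHERE ar_pa_id = ? AND ar_rev_id >= ?",
            (q, p, r),
        )
        # 5. Recount the revisions on both sides of the split.
        for pa_id in (p, q):
            conn.execute(
                "UPDATE pagearchive SET pa_rev_count = "
                "(SELECT COUNT(*) FROM archive WHERE ar_pa_id = ?) "
                "WHERE pa_id = ?",
                (pa_id, pa_id),
            )
        # 6. Record the split.
        conn.execute("INSERT INTO log VALUES ('split', ?)", (title,))
    return m, q
```

Running all steps in a single transaction means a failure leaves the deleted history unsplit rather than half-split; whether that is acceptable for very large histories is a separate question.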

Then, Special:SplitHistory would do the following for a given page A with ID n and a given cut-off revision ID r:

  1. Delete page A.
  2. Apply the steps above for page ID n and revision ID r.
  3. Undelete page A with either the original page ID n, or the new page ID m, depending on whether the user chooses to move earlier or later revisions.
  4. Move A to another title B without redirect.
  5. Undelete page A with the other page ID (n if page B has ID m; m if page B has ID n).

Finally, history merges would be done by using either Special:MergeHistory or Special:MergeAndMove.

Sub tasks are for tasks that represent required parts of a larger task. The RFC for unification of rev-delete and page-archive is expected to come to its own conclusion, and not blocked on T193690. If you think T193690 represents a subset of the problem and that it would be obsoleted by the unification, then we could merge the task instead. Or, if you mean that the unification should be done first, set it as parent instead of sub task, or if related only, use a textual reference in the task's description.