RFC: Unify the various deletion systems
Open, Medium, Public

Description

  • Affected components:
    • The Page deletion (archiving) feature of MediaWiki core.
    • The Revision delete (RevDel) feature of MediaWiki core.
  • Engineer for initial implementation: TBD.
  • Code steward: TBD.

Motivation

We currently have two systems that provide ways to make content no longer publicly accessible:

  1. Page deletion.
  2. Revision delete.

The "Revision delete" system seems to scale fairly well currently. It has a natural way to limit or divide its internal database interactions. If it were to show scale problems, we would have a clear path for how to make it scale further.

The "Page delete" system, on the other hand, has severe limitations. Even if we ignore the edge case of pages with 5000+ revisions, the underlying concept is still problematic. Database operations that move rows between tables are something DBAs would prefer never happen, even for smaller pages, and should be migrated away from as soon as possible.

The objective is to unify these two systems and end up with something that is as good as the best of both.

Issues:

Requirements
  • Administrators must still be able to delete entire pages in a way that is as easy as "Page deletion" is today.
  • Administrators must still be able to selectively hide revisions in a way that is as easy as "Revision deletion" offers today.
  • The technical implementation of that action must not move rows between tables.
  • The viewing of "Page history" and "User contributions" (and related APIs) must not display revisions of deleted pages (by default), the same as today.

Exploration

Status quo: Page deletion

This is MediaWiki's original deletion system. It is exposed through the interface as "Delete page" (action=delete) and "Restore page" (Special:Undelete).

Database process:
Moves a page and its revisions to the "archive" database table.

Visibility:
Revisions from deleted (or "archived") pages are not shown in page history, or user contributions. Administrators may view them via Special:Undelete/<title> or Special:DeletedContributions/<user>.

Limitations:
The database process for page deletion is inefficient. This cannot be improved, because the problem is not how we do it but what we do (moving rows between tables), which is considered bad practice for database operations. To reduce its negative impact on database stability, replication lag, and performance, "Page deletion" can be limited via the $wgDeleteRevisionsLimit configuration. When limited, only users with the bigdelete right may access the feature on pages with more than this number of revisions.
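
The gate described above can be sketched as follows. This is a minimal illustration in Python, not MediaWiki's actual (PHP) implementation; the function and parameter names are hypothetical, and treating a limit of 0 as "no limit configured" is an assumption.

```python
from typing import AbstractSet


def may_delete_page(revision_count: int,
                    revisions_limit: int,
                    user_rights: AbstractSet[str]) -> bool:
    """Decide whether page deletion is offered for a page of this size.

    revisions_limit stands in for $wgDeleteRevisionsLimit.
    Assumption: a limit of 0 (or less) means no limit is configured.
    """
    if revisions_limit <= 0:
        return True
    if revision_count <= revisions_limit:
        return True
    # Pages above the limit require the "bigdelete" right.
    return "bigdelete" in user_rights
```

Note that this check only restricts who may start the operation; it does not make the underlying row-moving any cheaper.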

On Wikimedia wikis, the limit has been set at 5,000 revisions, and the bigdelete right has mostly been reserved for Stewards and Developers. When used with caution, these users are sometimes able to perform the deletion through a simple request procedure. However, even with this user right, the underlying process is highly inefficient and can have a lasting impact on database performance in the minutes or hours that follow. As such, all database transactions on Wikimedia wikis have additional limits that abort them when this is about to happen.

As a result, pages with far more than 5,000 revisions cannot be deleted through this process. The only way to do so in a way that does not disrupt database performance would be to batch the deletion. However, it is unknown whether it is feasible to do this in a safe manner, given the possible database failure and rollback scenarios it would have to account for.

See also:

Status quo: Revision delete

This is a newer mechanism, introduced in 2009. It is exposed on the "View history" and "User contributions" views as "Change visibility of selected revisions", and works by first ticking the relevant check boxes.

Database process:
Changes the numerical value in the rev_delete field for the relevant revisions in the database. This can be done in batches.

Visibility:
Revisions that have been "deleted" (or "hidden") still have a placeholder row shown in the interface on "Page history" and "User contributions".

The "Revision delete" feature allows admins to decide which aspect(s) of a revision to hide, and from whom. In particular, it is capable of separately controlling the visibility of the textual content, the edit summary, or the user's name/IP. And it can hide it from either non-admins only, or from everyone (suppression, aka "oversight").
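
To illustrate how a single numeric field can carry these independent choices, here is a minimal Python sketch of a visibility bit-field. The flag names and values are illustrative assumptions modelled on the description above (one bit per hidden aspect, plus one bit for suppression); they are not taken from this task.

```python
from enum import IntFlag


class Visibility(IntFlag):
    """One bit per aspect of a revision that can be hidden (illustrative values)."""
    TEXT = 1        # textual content hidden
    COMMENT = 2     # edit summary hidden
    USER = 4        # user name / IP hidden
    RESTRICTED = 8  # hidden from admins as well (suppression)


def can_view(field: Visibility, rev_flags: int,
             is_admin: bool, is_suppressor: bool) -> bool:
    """Return True if a viewer may see the given aspect of a revision."""
    if not rev_flags & field:
        return True               # this aspect is not hidden at all
    if rev_flags & Visibility.RESTRICTED:
        return is_suppressor      # suppressed: hidden from admins too
    return is_admin               # ordinary hiding: admins can still see it
```

For example, a revision flagged `Visibility.COMMENT | Visibility.USER` keeps its text public while hiding the summary and user name from non-admins.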

Limitations:
I couldn't find any limitation in the code (which is concerning), but the interfaces (History page, Contributions page) do limit how many revisions they offer at once, and in any event the general transaction limits still apply. Regardless of whether this needs a limit, it could be batched internally if needed (either in-request or using the JobQueue). As a last fallback, users can also "batch" manually (e.g. increase history to show 500 rows at once, and shift-select them as one chunk), which could work in extreme cases when stewards/developers need to intervene.

See also https://www.mediawiki.org/wiki/Help:RevisionDelete.

Proposal

Nothing specific yet, but it seems I (@Krinkle) and others find it worth exploring to see if we can re-implement the logic behind "Page deletion" by using the same code and database logic that is used by "Revision delete". This would involve the following:

  • Add a bit-field value for revision.rev_delete to represent "archived".
  • Update page/user revision views (Page history, User contributions) to make sure revisions with this flag are not shown by default.
  • Add a way to see them. (e.g. re-using Special:DeletedContributions, or through a switch on Special:Contribs itself, same for history).
  • TODO: Decide what to do with the page entity itself (meta data). E.g. a page_deleted flag (possibly including a state for "deletion in progress", to be batch-friendly).
  • TODO: Decide how/if to migrate archive into revision.rev_delete=archived.
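
A rough sketch of the first two bullets, using Python's sqlite3 purely for illustration. The table is reduced to three columns, the value chosen for the "archived" bit is hypothetical, and the batch size is arbitrary; none of this is an agreed design.

```python
import sqlite3

ARCHIVED = 16  # hypothetical bit value for the proposed "archived" state


def archive_page(db, page_id, batch_size=2):
    """Flag every revision of a page as archived, one small batch at a time."""
    while True:
        rows = db.execute(
            "SELECT rev_id FROM revision "
            "WHERE rev_page = ? AND (rev_delete & ?) = 0 "
            "ORDER BY rev_id LIMIT ?",
            (page_id, ARCHIVED, batch_size),
        ).fetchall()
        if not rows:
            break
        db.executemany(
            "UPDATE revision SET rev_delete = rev_delete | ? WHERE rev_id = ?",
            [(ARCHIVED, rev_id) for (rev_id,) in rows],
        )
        db.commit()  # each batch is its own short transaction


def page_history(db, page_id, include_archived=False):
    """List revision IDs for a page, hiding archived ones by default."""
    sql = "SELECT rev_id FROM revision WHERE rev_page = ?"
    params = [page_id]
    if not include_archived:
        sql += " AND (rev_delete & ?) = 0"
        params.append(ARCHIVED)
    return [row[0] for row in db.execute(sql + " ORDER BY rev_id", params)]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_delete INTEGER DEFAULT 0)"
)
db.executemany("INSERT INTO revision (rev_page) VALUES (?)", [(1,)] * 5 + [(2,)] * 2)
archive_page(db, 1)
```

No row leaves the revision table: deletion becomes a series of in-place updates, and an interrupted run can simply be resumed because already-flagged rows are skipped.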

Original task description from bugzilla.wikimedia.org user FT2.wiki:

At present we now have 4 means of deleting material from either the public or from administrators. Material can either be

  • Deleted from the public with traditional deletion
  • Deleted from the public (part or full) with RevisionDeleted
  • Deleted from admin view with Oversight
  • Suppressed from admin view with RevisionDeleted

This collection means that any review of editor actions or conduct, or article matters on the wiki, now faces two big problems in evaluating the existence or seriousness of any issue:

  • It's incredibly easy to overlook some edits or actions in the review, which should be taken account of.
  • It's more complex and takes examination of several screens, to review a matter.
  • Each of these has different mechanisms for viewing edits they affect; there is no consistency of links, formats, access methods, etc.
  • A third issue at a technical level - it's a lot to maintain, and allows for inconsistent software behavior (or bugs fixed in one of these but not spotted in the other), and requires more developer time etc.

I would like to suggest that in fact, all we now need is RevisionDeleted, with the following options:

  • What to hide - revision text, edit summary, user name/IP
  • Whether admins can or can't access the hidden data
  • Whether admins or users who cannot access the hidden data, should nonetheless be able to see it exists even if they can't read it (there are cases when this is safe, and cases when it isn't).

This proposal is that RevisionDeleted is amended slightly to show the above options, and then both traditional deleted revisions and oversighted revisions are converted to RevisionDeleted entries as a background task (ie a script written that achieves this in the job queue over time). Following this:

  • Delete and oversight both redirect to RevDel for their actions
  • Delete/undelete and oversight URLs both redirect to the appropriate lookup link for any historical URL used to view an old deleted/oversighted edit.

The issue here is not so much one of software development, as of a once-off conversion task of old data stored in one system to be moved to another.

Details

Reference
bz18493

Event Timeline

The archive table is the only one that repeats the same title over and over.

Please, please have a look at the pagelinks, templatelinks and categorylinks tables (we could reduce their size 10x)

Yes, for that having normalized titles would make a lot of sense. And I'm all for doing that, but it doesn't seem to be relevant here. Except perhaps in that we could add page_title_id at the same time as adding page_deleted, if we go for that option.

If we get rid of that, a title can only exist once per namespace in the page table.

Why? We can have the same comment for several revisions (millions of times). We can have the same title for several pages. Title is a combination of namespace + text. You just have page (page_id, namespace, title, deleted) VALUES (1, 0, 36, 1), (2, 0, 36, 0), (3, 1, 37, 0) while title being (title_id, namespace, title) VALUES (36, 0, 'The adventures of Tom Sawyer' -- this would be the url /wiki/The_adventures_of_Tom_Sawyer),(37, 1, 'The adventures of Tom Sawyer' -- this would be the url /wiki/Talk:The_adventures_of_Tom_Sawyer).

Yes, the same title-text can occur once per namespace, as I said. The same title (namespace+text) can occur only once in the page table, it's a unique key.
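
The layout described in the "Why?" comment above can be written out as a small runnable sketch. It uses Python's sqlite3 for illustration only; the table and column names follow the inline example in that comment, while the column types and the unique constraint on the title table are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- A title is a (namespace, text) pair that is stored exactly once.
CREATE TABLE title (
    title_id  INTEGER PRIMARY KEY,
    namespace INTEGER NOT NULL,
    title     TEXT NOT NULL,
    UNIQUE (namespace, title)
);
-- Several page rows may point at the same title, e.g. one live and one deleted.
CREATE TABLE page (
    page_id   INTEGER PRIMARY KEY,
    namespace INTEGER NOT NULL,
    title     INTEGER NOT NULL REFERENCES title (title_id),
    deleted   INTEGER NOT NULL
);
INSERT INTO title VALUES
    (36, 0, 'The adventures of Tom Sawyer'),
    (37, 1, 'The adventures of Tom Sawyer');
INSERT INTO page VALUES (1, 0, 36, 1), (2, 0, 36, 0), (3, 1, 37, 0);
""")


def pages_for(db, namespace, text):
    """Return (page_id, deleted) for every page row sharing a title."""
    return db.execute(
        "SELECT p.page_id, p.deleted FROM page p "
        "JOIN title t ON p.title = t.title_id "
        "WHERE t.namespace = ? AND t.title = ? ORDER BY p.page_id",
        (namespace, text),
    ).fetchall()
```

Here the main-namespace title resolves to two page rows (one deleted, one live), which is exactly the case the current unique key on namespace plus title text does not allow.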

The details are not that important. Instead of "deleted" you could have a "page_version", a monotonically increasing value of pages, so you don't need to update the ones that are deleted, only get the latest one. Those are only options, we need to see the performance impact and which operations we want to favor over others.

With page_version we'd always have to find the "newest" entry for a title in the page table, which is nasty in joins. And the page table would become much larger. And listing pages would become much more expensive. I don't think that's a good idea.

We are updating the page row for every edit anyway, to write the new value of page_latest. Updating page_deleted at the same time seems unproblematic.

Basically, the idea is that title and page are 2 entities that happen to be related, but one is a set of coherent text with revisions and history, and the other is an alias, which is shared by several pages as they are renamed, deleted and recreated.

Treating the title as an alias that can be re-assigned follows from both options I presented. For that, it does not matter whether the title is normalized or not. The key here is that deleted revisions stay bound to the page ID, while presently, they stay bound to the page title. This is a change in behavior that will break some existing workflows, and would need alternatives to be implemented.

The idea that a title can refer to multiple pages at once (one non-deleted, and multiple deleted) is what the page_archive option achieves.

Reminder: TechCom is hosting a Public IRC Discussion of this RFC on 2018-07-18 in the #wikimedia-office channel at 2pm PST (22:00 UTC, 23:00 CET)

Meeting minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.html

There is the question of how Special:Contributions and Special:DeletedContributions will work. Krinkle believes the community will require feature parity, i.e. the ability to view only deleted contributions, and the ability to view only non-deleted contributions. It should be feasible to add a new merged mode which shows both types of contributions sorted by timestamp. One possible query plan is to store the proposed "archived" flag in a separate boolean field rev_archived, then have only a (rev_user,rev_archived,rev_timestamp) index, and then implement the merged mode using "rev_archived IN (0,1)".

We could have two contributions indexes, (rev_user, rev_timestamp) and (rev_user, rev_archived, rev_timestamp), this is CPU efficient but requires more memory and disk space. Or we could have only (rev_user,rev_timestamp), this is memory efficient but the Special:DeletedContributions replacement would require a lot of table scanning.
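
The single-index option can be sketched as follows, again with Python's sqlite3 for illustration. The schema is cut down to the columns named above, the sample data is invented, and the mode names ("live", "deleted", "merged") are hypothetical labels for the three views.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_user      INTEGER NOT NULL,
    rev_archived  INTEGER NOT NULL,
    rev_timestamp TEXT NOT NULL
);
-- The one contributions index discussed above.
CREATE INDEX rev_user_archived_timestamp
    ON revision (rev_user, rev_archived, rev_timestamp);
INSERT INTO revision VALUES
    (1, 7, 0, '20180701000000'),
    (2, 7, 1, '20180702000000'),
    (3, 7, 0, '20180703000000'),
    (4, 8, 0, '20180704000000');
""")

MODES = {"live": (0,), "deleted": (1,), "merged": (0, 1)}


def contributions(db, user, mode):
    """Return a user's revision IDs, newest first, for one of the three views."""
    archived = MODES[mode]
    marks = ",".join("?" * len(archived))
    return [row[0] for row in db.execute(
        "SELECT rev_id FROM revision "
        f"WHERE rev_user = ? AND rev_archived IN ({marks}) "
        "ORDER BY rev_timestamp DESC",
        (user, *archived),
    )]
```

The "merged" view is the same query with both flag values listed, so it can use the same index as the other two views; how well a given database engine orders the combined result is something that would need measuring.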

There is the question of what happens to the page table. A page_deleted field would require a non-unique index (page_deleted,page_title_id).

An improvement on the current "delete/selective undelete" workflow would be to provide a selective deletion feature as a kind of history splitting. The user would select the revisions to be archived, and then a new page row would be created for those revisions, and the revisions selected for archiving would be moved into the new page. That way, there would be no need to include rev_archived in the action=history index, it would implicitly be in rev_page. The new page could be moved and then undeleted under some other title, providing a full history splitting workflow.

For feature parity, a history merge feature needs to be provided. There is the question of whether to allow undeletion of an archived page when a non-deleted page has the same title, should this cause an implicit history merge? Or should this use case be handled entirely with Special:MergeHistory?

JJMC89 added a subscriber: JJMC89. Jul 19 2018, 12:43 AM

There is the question of whether to allow undeletion of an archived page when a non-deleted page has the same title, should this cause an implicit history merge? Or should this use case be handled entirely with Special:MergeHistory?

Not allowing undeletion of deleted revisions (an archived page) would just add extra steps.

  1. Delete the undeleted revisions (the existing page)
  2. Undelete everything

Currently, does it cause an implicit history merge?

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

Not allowing undeletion of deleted revisions (an archived page) would just add extra steps.

  1. Delete the undeleted revisions (the existing page)
  2. Undelete everything

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Currently, does it cause an implicit history merge?

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

I can fix that in like one minute.

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Given that enwiki's most prolific history merger, @Anthony_Appleyard, has hardly used Special:MergeHistory (1), I question whether or not the tool is sufficient for editor's needs.

Special:MergeHistory (mergehistory right) is not guaranteed to be available on all wikis.

I can fix that in like one minute.

The Wikimedia cluster might be fine in this regard. (I haven't checked.) I was referring to other installations. I am a sysop on one where no groups have mergehistory, so I have to use the poor-man's version.

@JJMC89 It is part of the sysop grant by default in the MediaWiki software.

DefaultSettings.php
$wgGroupPermissions['sysop']['mergehistory'] = true;

I'm not sure, but do you mean that a third-party wiki has given you sysop but explicitly taken out the mergehistory right from said group? If so, that seems odd, given that, as you say, you can still do it via delete/undelete. Perhaps they disabled it by accident?

JJMC89 added a comment. Edited Jul 20 2018, 4:53 AM

@JJMC89 It is part of the sysop grant by default in the MediaWiki software.

I know.

I'm not sure, but do you mean that a third-party wiki has given you sysop but explicitly taken out the mergehistory right from said group? If so, that seems odd, given that, as you say, you can still do it via delete/undelete. Perhaps they disabled it by accident?

I don't have access to the configs, only what I can see on Special:ListGroupRights (no mergehistory). The wiki has been around since before mergehistory existed. Could that be it? (I don't manage any installs, so I'm ignorant on the impact of updates here.)

Given what @tstarling wrote in T20493#4436799, would I be able to do it with delete/undelete?

I'm imagining that undeletion would be done on a page granularity. So if you delete a page, and then it is recreated with the same title, then deleted again, that would make two deleted pages, and you would not be able to undelete both in the same operation, you would have to merge their histories first.

Yes, undelete currently causes an implicit history merge, this implementation accident has been used as a poor-man's history merge tool since its inception.

Given that enwiki's most prolific history merger, @Anthony_Appleyard, has hardly used Special:MergeHistory (1), I question whether or not the tool is sufficient for editor's needs.

I can't speak to anyone else, but at least for me, I have been unable to use Special:MergeHistory because its documentation is completely useless. In the face of this, undeletion is the only realistic method for performing history merges.

(1) One long-term fault is that an admin can't delete some edits of a page, but he must delete all the edits and then undelete the edits that he wants to stay undeleted. That wastes his time and Wikipedia's system time. This need arises if he is history-merging X (older page) to Y (newer page), and first he must lose from the end of X any stray late edits (e.g. redirects and BattyBot edits) made after the cut-and-paste event. This process is liable to accidents if the page already has deleted edits at the start of this process.
(2) Currently, moving a page only moves the undeleted (visible) edits. It would be useful if it was also possible to move only the deleted edits, when fishing deleted edits out from under visible edits, to prevent the sort of accident described at the end of (1).

I think there is still no clear agreement on the abstractions: what is a page, what is a revision, what does a deleted revision belong to, and what it means to move a page with deleted revisions. While those questions cannot be answered without taking into account the limitations imposed by reality, they should be answered first, with very detailed use cases, before proposing a specific implementation. Let's first document the non-trivial workflows that should be possible, and only later the storage model. Let's also take into account readers, wiki admins and researchers (among many others) reconstructing the history of a page, who are also affected by the inconsistencies of the current model.

The problems with the current undeletion interface will be solved with T193690, which also deals with fixing problems with rev_parent_id and ar_parent_id fields. The same page ID will never be used for more than one deleted page title, nor for both a deleted and an existing page. Also, the ar_namespace, ar_title, and ar_page_id fields will all be moved to a new pagearchive table as pa_namespace, pa_title, and pa_page_id; and the archive table will get a new ar_pa_id column.

Also, in the context of T193690, we could add a "SplitHistory" class containing a function that does the following for a given title A, a given deleted page ID n (with pa_id p), and a given cut-off revision ID r:

  1. Add a new row to the page table with title A and page_id m, and immediately delete it.
  2. Add a new row to the pagearchive table with q for the pa_id field and m for the pa_page_id field.
  3. Change the ar_parent_id field for the row in the archive table with ar_rev_id r to zero (this must be done because parent IDs are now preserved on undeletions as of MW 1.31).
  4. Change the ar_pa_id field for revision ID r and all later deleted revisions with ar_pa_id p to q.
  5. Update the pa_rev_count fields (used to display the number of restored revisions in the log entry) for rows p and q in the pagearchive table.
  6. Generate a log entry for the history split.
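
The six steps above can be sketched over in-memory records to make the data movement concrete. This is Python for illustration only, not the proposed PHP class; the dataclasses carry just the fields named in the steps, step 1 is reduced to the caller supplying the new page ID m, and "later revisions" is assumed to mean revisions with an ID of r or higher.

```python
from dataclasses import dataclass


@dataclass
class ArchivedRevision:
    ar_rev_id: int
    ar_parent_id: int
    ar_pa_id: int


@dataclass
class PageArchive:
    pa_id: int
    pa_page_id: int
    pa_title: str
    pa_rev_count: int = 0


def split_history(pagearchive, archive, log, p, r, new_page_id, new_pa_id):
    """Move revision r and all later revisions from pagearchive row p to a new row q."""
    source = pagearchive[p]
    # Steps 1-2: page ID m (new_page_id) is assumed to be already reserved by
    # creating and immediately deleting a page row; add pagearchive row q for it.
    target = PageArchive(new_pa_id, new_page_id, source.pa_title)
    pagearchive[new_pa_id] = target
    for rev in archive:
        if rev.ar_pa_id != p or rev.ar_rev_id < r:
            continue
        if rev.ar_rev_id == r:
            rev.ar_parent_id = 0      # step 3: the cut-off revision loses its parent
        rev.ar_pa_id = new_pa_id      # step 4: reassign to the new pagearchive row
    # Step 5: recompute the revision counts of both rows.
    for row in (source, target):
        row.pa_rev_count = sum(1 for rev in archive if rev.ar_pa_id == row.pa_id)
    # Step 6: record the split.
    log.append(("splithistory", source.pa_title, r))
    return target
```

In a real implementation these steps would need to run inside one transaction so that a failure part-way through does not leave revisions attached to a half-created row.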

Then, Special:SplitHistory would do the following for a given page A with ID n and a given cut-off revision ID r:

  1. Delete page A.
  2. Apply the steps above for page ID n and revision ID r.
  3. Undelete page A with either the original page ID n, or the new page ID m, depending on whether the user chooses to move earlier or later revisions.
  4. Move A to another title B without redirect.
  5. Undelete page A with the other page ID (n if page B has ID m; m if page B has ID n).

Finally, history merges would be done by using either Special:MergeHistory or Special:MergeAndMove.

Sub tasks are for tasks that represent required parts of a larger task. The RFC for unification of rev-delete and page-archive is expected to come to its own conclusion, and not blocked on T193690. If you think T193690 represents a subset of the problem and that it would be obsoleted by the unification, then we could merge the task instead. Or, if you mean that the unification should be done first, set it as parent instead of sub task, or if related only, use a textual reference in the task's description.

dbarratt updated the task description. Aug 30 2018, 5:37 PM

As this is touching many questions about history merging and splitting, I think there is some connection to T113004 here. especially concerning restructuring the database for pages and revisions.

Rxy added a subscriber: Rxy. Oct 17 2018, 2:25 AM
Halfak added a subscriber: Halfak. Jan 4 2019, 10:04 PM
Lofhi added a subscriber: Lofhi. Jan 6 2019, 6:48 PM

...
Then, Special:SplitHistory would do the following for a given page A with ID n and a given cut-off revision ID r:

  1. Delete page A.
  2. Apply the steps above for page ID n and revision ID r.
  3. Undelete page A with either the original page ID n, or the new page ID m, depending on whether the user chooses to move earlier or later revisions.
  4. Move A to another title B without redirect.
  5. Undelete page A with the other page ID (n if page B has ID m; m if page B has ID n).

This seems to rely on there being no pre-existing deleted edits for ID n or ID m.

Scott added a subscriber: Scott. Mar 21 2019, 12:11 PM

Have any of the people in charge of producing each version of MediaWiki ever considered creating an extension that allows administrators to select revisions to delete through the page deletion system? This idea would obviously be different from Manual:RevisionDelete, which deletes revisions through the revision deletion system, as opposed to the page deletion system.

I have had this idea for years, and I remember how surprised I was to find that, to my knowledge, no extension allowed revisions to be deleted through the page deletion system. This matters because some sites disallow pages in specific namespaces from being restored once deleted, unless the user is in specific user groups that aren't available to the general community.

The said extension would be pretty much identical to Manual:Page restoration, only the exact opposite: it would be added to the page deletion system and would allow revisions to be deleted, instead of restored.

Krinkle updated the task description. Jun 27 2019, 9:18 PM
Krinkle renamed this task from Unify various deletion systems to Unify the various deletion systems. Jul 24 2019, 8:58 PM
C.Syde65 added a comment. Edited Jul 26 2019, 6:18 AM

Personally I think that an ability should be added to the traditional deletion system, allowing for selective deletions. So it would basically be the same as the traditional undeletion system, but it would use the same page as the traditional deletion system via ?action=delete.

Having an ability to selectively delete revisions through the traditional deletion system would save users the trouble of having to delete entire pages and then restore the wanted revisions.

From my experience, Revision delete is generally reserved for hiding sensitive information and other TOU breaking content that other users shouldn't be able to just stumble across. Unlike the more traditional deletion system that can just be used for whatever reason.

But then again, an issue with Revision delete is that it doesn't delete revisions from the page history. It just crosses them out. So I think a better alternative to this issue would be to make a separate user permission that would allow revisions to be deleted through the page history just like delete and partial undelete.

Another issue with Revision delete is that, because it is only reserved for sensitive or TOU-breaking content on some sites, those sites don't allow general access to the Revision delete system. So users with nothing more than general access are limited to just the traditional deletion system.

And a serious con that I've noticed is that there are few sites that have namespaces where undeletion isn't possible, and therefore deletion and partial undeletion would be out of the question. If there was a separate ability that would allow revision deletions through the page deletion system, then that would solve that problem.

daniel moved this task from Under discussion to Old on the TechCom-RFC board. Jul 31 2019, 5:23 AM

Putting this into the RFC backlog, pending product level input from Platform Engineering or Growth-Team.

Krinkle updated the task description. Aug 1 2019, 1:16 PM
Krinkle added a comment. Edited Aug 1 2019, 1:22 PM

@C.Syde65 Thanks. I've added the use case of "allow selective hiding of revisions" to the requirements section. This was unintentionally omitted by me due to my bias toward using the technical internals of "Revision delete" as the basis for the new unified system (which naturally has this ability already). I've added it now to make it more explicit.

As for how it would look for end-users, that is what this task is about. I think from a technical perspective, using the "traditional delete system" for anything, will not be an option long-term because of the very drastic performance and availability risks it has. It's time to let that go. However, understand that I'm referring to its technical internals - I'm not talking about the user interface of traditional page deletion, and not talking about the impact on "View history".

One of the open questions here is whether we need the ability for revisions to be entirely omitted from the history page pagination (which is currently possible by deleting the whole page and selectively undeleting all-but-one revision). My "Proposal 1" currently suggests that we do keep this ability, and that it would become one of the options of "Revision delete" - just like how we have several options already about visibility of user name, timestamp and content. For the user-interface, this could look like a checkbox on "Special:RevisionDelete", or it could look like "Special:Undelete" - that's a separate question.

One of the open questions here is whether we need the ability for revisions to be entirely omitted from the history page pagination (which is currently possible by deleting the whole page and selectively undeleting all-but-one revision). My "Proposal 1" currently suggests that we do keep this ability, and that it would become one of the options of "Revision delete" - just like how we have several options already about visibility of user name, timestamp and content. Another could be to omit the entry entirely. For the user-interface, this could look like a checkbox on "Special:RevisionDelete", or it could look like "Special:Undelete" - that's a separate question.

I think we absolutely do need the ability to hide revisions from pagination in the history (and therefore hide them from the count of total revisions). Sometimes people move articles by cut-and-paste over redirects (instead of using the page move button), and it's regular practice when fixing these cut-and-paste moves to completely obliterate these redirect revisions from the history. See:
https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_History_Merge#Special:MergeHistory_links_for_rapid-processing_of_predictable_bot-cut-pastes

Also see the logs here for a particularly painful case, which is made even worse by T45911:
https://en.wikipedia.org/w/index.php?title=Special:Log/Graham87&offset=20190725114337&limit=13&type=&user=Graham87

@Krinkle So for users with the (delete) permission. Would that automatically give them access to (deleterevision) or would that still require a separate permission? I'm asking because not every site would be chuffed with allowing all users with the ability to delete and undelete pages to be able to revision delete them as well. I have to admit that (deleterevision) doesn't really look user friendly, given that it just crosses the revisions out preventing them from being viewable, whereas if a similar ability was part of the traditional deletion system, it would delete the revisions the same way it deletes entire pages. I've had reasons to delete selective revisions rather than having to delete entire pages and restore the wanted revisions, especially since there are some pages in some namespaces on certain sites that don't allow undeletions. So naturally, not having an ability to selectively delete revisions through the traditional deletion system is quite frustrating. And I've always said to myself "If you can restore selective revisions through the traditional undeletion system, why can't you delete selective revisions through the traditional deletion system." Therefore it would balance out the deletion and undeletion systems, giving them the same number of options. And another thing is that entire pages cannot be deleted through (deleterevision) as the most recent revision cannot be partially deleted. Since deleting selective revisions through the traditional deletion system would only allow full deletions on each revision, it wouldn't be limited to working on earlier revisions, unlike the revision deletion system.

Izno added a subscriber: Izno. Sep 1 2019, 3:55 PM
Krinkle updated the task description. Apr 4 2020, 1:27 AM
Krinkle moved this task from Old to P2: Resource on the TechCom-RFC board.
Krinkle renamed this task from Unify the various deletion systems to RFC: Unify the various deletion systems. Apr 4 2020, 2:33 AM

Above is written:

(X) "The Page deletion (archiving) feature of MediaWiki core."
(Y) "The Revision delete (RevDel) feature of MediaWiki core."

As a Wikipedia admin I am familiar with two sorts of deletion:
The function invoked by clicking the tab "Delete" at the top of the page display :: I call this "deleting".
The function invoked by clicking the tab "History" at the top of the page display, then selecting on the small square boxes, then clicking on "Change visibility of selected revisions" :: I call this "hiding".

I assume that X is "deleting" and Y is "hiding".

One difference between these two is that when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

My main inconvenience is that I cannot "delete" only some edits of a page, but I must waste the system's time by deleting the whole page and all its edits, and then undelete some edits. One main need to delete only some edits is when history-merging, to temporarily delete irrelevant redirect edits at the start or end of an edit history.

@Anthony_Appleyard Yep, that's exactly the kind of problem I hope we can solve. With both kinds of deletion, revisions are hidden from most users, can still be opened by an administrator while deleted, and can be restored. And when restoring, both tools allow you to select which revisions to make public via checkboxes.

Technically there are many differences. But from a user perspective there are not so many, except that 1) RevDel leaves a grey marker in the history page whereas PageDel makes the revision invisible in the timeline, and 2) RevDel has checkboxes for deletion and for undeletion, whereas PageDel only has checkboxes for undeletion, not for deletion.

Today, RevDel has three options: hide text, hide summary, hide user. Imagine if something like "Hide in timeline" was an extra option, which would make the edit completely hidden on the History page. The same as today if you delete/undelete the whole page. Would that be useful?

Another possibility could be to make "Hide in timeline" a secret option that we do not show anywhere, but internally connect it to what "Delete page" is today. That way, everything would stay exactly as today for users: two separate systems. And behind the scenes the technology would be one system.

If "deleting" and "hiding" are amalgamated into one system, we will strongly need something to replace this difference :: when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

To put that another way: Complex history merges and history splits require the ability to move revisions from one title to another title.

Currently, that is achieved using a combination of page deletion, selective restoration, and page moves. Simple histmerges can be accomplished using Special:MergeHistory, but histsplits and complex histmerges can not. To continue supporting these actions, an option should be introduced to allow revisions to be moved to an arbitrary page. Without it, the only way to perform history splits and complex history merges would be to export the pages, manually edit the XML, and re-import and rewrite the history (generally a bad idea).

AntiCompositeNumber wrote ::

... Complex history merges and history splits ... using a combination of page deletion, selective restoration, and page moves. Simple histmerges can be accomplished using Special:MergeHistory, but histsplits and complex histmerges can not. ...

And by ability to move only the non-deleted edits of a page.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ability to move only the deleted edits of a page, would also be useful, to quickly shift pre-existing deleted edits to another (usually, similar) name out of the way of planned work.

If "deleting" and "hiding" are amalgamated into one system, we will strongly need something to replace this difference :: when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

One property that the new system should, in my mind, definitely have is retaining page identity across delete/undelete. That is, there can be multiple deleted pages with the same title. The deleted revisions would no longer be treated as all belonging to the same page. Moving a page will only move the current page, not deleted pages with the same title. Moving a page would carry along any hidden/suppressed revisions. The idea of "deleting" a revision is confusing.

To put that another way: Complex history merges and history splits require the ability to move revisions from one title to another title.
Currently, that is achieved using a combination of page deletion, selective restoration, and page moves.

That seems like a feature-by-accident. This functionality should be entirely independent of deletions. History splits in particular should be rare - ideally, they would really just consist of undoing a previous history merge (or import - importing can result in a history merge). Am I right about this? Is there any other situation where a history split is needed?

History splits in particular should be rare - ideally, they would really just consist of undoing a previous history merge (or import - importing can result in a history merge). Am I right about this? Is there any other situation where a history split is needed?

A history split is needed when there are redirect revisions at the start of a revision history that need to be moved out of the way before a history merge can be done. Like this log:
https://en.wikipedia.org/w/index.php?title=Special:Log&page=Kottarakkara

There are also cases of revisions with duplicate timestamps/other weirdness (due to history merges/imports) where it'd be nice to move individual revisions out of the way.

Anthony_Appleyard added a comment. Edited Apr 5 2020, 1:12 PM

The idea of "deleting" a revision is confusing.

It is not confusing to me. Very often in history merging I have had first to delete intruding redirect edits that were blocking the history-merging process.

I need history-splits quite often. For example, if someone has made a cut-and-paste move to a file where there was already an article about something else.

The existing system works, and it is best to leave well enough alone. It merely needs some improvements, such as:
(1) Ability to delete some edits of a page, instead of having to delete them all and then undelete some.
(2) Ability to selectively move some deleted edits of a page, to get them out of the way of planned work.

The technical implementation of that action must not move rows between tables.

I have heard much about these rows and tables. Where can I find a detailed explanation of what these rows and tables are and what information is stored in them?

These various moves and renamings that happen: when does making a move or a renaming need merely changing a pointer, and when does it need a big bulk copying of information?

These are the revision table (current page history) and the archive table (deleted page history). Deleting or restoring a page involves moving rows from one table to the other. The rev_page field contains the reference to the page (its page id), so renaming a page doesn't update it (the relation between page id and page title is in the page table). However, the archive table (deleted revisions) does not contain the page id, but the namespace and title of the page at the moment those revisions were deleted. That's why moving a page doesn't update the deleted revisions.
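To make the explanation above concrete, here is a minimal sketch using an in-memory SQLite database. The three tables are heavily simplified stand-ins for the real ones and keep only the columns discussed here (ar_page_id, for example, is left out); the helper names delete_page and move_page are made up for illustration and are not MediaWiki functions.

```python
import sqlite3

# Heavily simplified stand-ins for the real tables; only the columns
# discussed above are kept.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                       page_namespace INTEGER, page_title TEXT);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    CREATE TABLE archive (ar_rev_id INTEGER PRIMARY KEY,
                          ar_namespace INTEGER, ar_title TEXT);
""")


def delete_page(db, page_id):
    """Status-quo style deletion (hypothetical helper): one archive row per revision."""
    ns, title = db.execute(
        "SELECT page_namespace, page_title FROM page WHERE page_id = ?",
        (page_id,),
    ).fetchone()
    # Copy every revision into archive, keyed by namespace and title.
    db.execute(
        "INSERT INTO archive (ar_rev_id, ar_namespace, ar_title) "
        "SELECT rev_id, ?, ? FROM revision WHERE rev_page = ?",
        (ns, title, page_id),
    )
    db.execute("DELETE FROM revision WHERE rev_page = ?", (page_id,))
    db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))


def move_page(db, page_id, new_title):
    """A rename (hypothetical helper) updates one row in page; no revision row changes."""
    db.execute("UPDATE page SET page_title = ? WHERE page_id = ?",
               (new_title, page_id))


# Page 1 ("Foo") has two revisions and is deleted.
db.execute("INSERT INTO page VALUES (1, 0, 'Foo')")
db.executemany("INSERT INTO revision VALUES (?, 1)", [(10,), (11,)])
delete_page(db, 1)

# A new page is created under the same title, then renamed to "Bar".
db.execute("INSERT INTO page VALUES (2, 0, 'Foo')")
db.execute("INSERT INTO revision VALUES (12, 2)")
move_page(db, 2, "Bar")

archived_titles = [row[0] for row in db.execute(
    "SELECT ar_title FROM archive ORDER BY ar_rev_id")]
live_title = db.execute(
    "SELECT page_title FROM page WHERE page_id = 2").fetchone()[0]
```

In this sketch the cost of delete_page grows with the number of revisions, while move_page always touches a single row. After the rename, the archived rows still say "Foo", which is the behaviour described above: deleted edits stay behind under the old title.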

daniel added a comment. Edited Apr 5 2020, 3:10 PM

In my mind, the much simpler system (both in the database, as well as for users) would be:

  • when deleting a page, move the page record to a page_archive table. This is always just one row. (We could also use a flag on the page table, but that would be problematic for backwards compatibility - old code would still pick up deleted pages in existing queries that don't check the flag).
  • to remove a revision, use the existing rev_deleted bitfield. We can just add a flag for completely hiding the revision from the history.

In other words: there would be hidden revisions, and deleted pages, and revisions of deleted pages. No deleted revisions.
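As a rough illustration of the second bullet, the sketch below models rev_deleted as a bitfield with one extra bit. The first four values are the existing RevDel bits as I understand them from MediaWiki core; HIDDEN_FROM_HISTORY and the history_for helper are assumptions made up for this example and do not exist today.

```python
# Existing rev_deleted bits (as I understand them from MediaWiki core).
DELETED_TEXT = 1
DELETED_COMMENT = 2
DELETED_USER = 4
DELETED_RESTRICTED = 8
# Hypothetical new bit for "completely hide from history"; not in MediaWiki.
HIDDEN_FROM_HISTORY = 16


def history_for(revisions, can_see_hidden=False):
    """Return the rev_ids that would be listed on the history page.

    `revisions` is an iterable of (rev_id, rev_deleted) pairs.
    """
    return [
        rev_id
        for rev_id, rev_deleted in revisions
        if can_see_hidden or not rev_deleted & HIDDEN_FROM_HISTORY
    ]


revisions = [
    (1, 0),                                   # fully public
    (2, DELETED_TEXT),                        # listed, text struck out
    (3, DELETED_TEXT | HIDDEN_FROM_HISTORY),  # not listed at all
]
public_view = history_for(revisions)
admin_view = history_for(revisions, can_see_hidden=True)
```

Hiding a revision from the timeline is then a one-column update on an existing row, and the other three RevDel options keep working unchanged because each occupies its own bit.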

When undeleting a page, you may have to pick which of multiple deleted pages with the same title you want to undelete. Undeletion would affect all revisions associated with that page, not with other pages that might have had the same name in the past. It would include all hidden revisions, but they would retain their hidden status. Same with moving: moving affects all revisions of the page, hidden or not. The history of pages would be entirely unaffected by moving and deletion/undeletion.

I realize that this is contrary to how things are done at the moment. But it would resolve a lot of edge cases in behavior, nasty surprises (e.g. previously deleted revisions coming back after undeletion of the page), and would make operations on the database side much more efficient.

For context: as an engineer working for the wmf, I don't decide on how this is supposed to work. I can just point out different models and mechanisms, and their consequences in terms of technical efficiency and stability. As a community member and former wikipedia admin, I can tell you what behavior would make sense to me.

History splits in particular should be rare

They're fairly common on Commons because of COM:OVERWRITE. Someone will upload a useful derivative over an existing image, and they'll have to be split apart. That usually involves both the file revisions and the page revisions.

As a side note: this proposal only mentions page <-> archive deletions. Is it also going to apply to image <-> filearchive deletions for files?

when deleting a page, move the page record to a page_archive table. This is always just one row.

No - if you (re)move a record from the page table, the revision.rev_page reference would become invalid. Hence, you need to move the affected revisions somewhere else, which is exactly what happens now.

When undeleting a page, you may have to pick which of multiple deleted pages with the same title you want to undelete

Indeed, many times I have had to do that when disentangling the messes that users sometimes make. See https://en.wikipedia.org/wiki/Talk:Black_sheep/Archive_1#Moves,_merges,_history for a long, complicated job that I once had.

daniel added a comment. Edited Apr 5 2020, 5:27 PM

No - if you (re)move a record from the page table, the revision.rev_page reference would become invalid. Hence, you need to move the affected revisions somewhere else, which is exactly what happens now.

Why would it become invalid? It would still be the page's ID, which would not change. It would no longer refer to an entry in the page table. But MediaWiki doesn't use foreign key constraints, so that's not a concern.
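A small sketch of that argument, again with an in-memory SQLite database. The page_archive table is the hypothetical table from the proposal, and delete_page and history are illustrative helpers, not existing MediaWiki code. No foreign key is declared on rev_page, so the revision rows are never touched.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    -- Hypothetical table from the proposal; same shape as page.
    CREATE TABLE page_archive (page_id INTEGER PRIMARY KEY, page_title TEXT);
    -- rev_page is a plain integer: no FOREIGN KEY is declared.
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
""")


def delete_page(db, page_id):
    """Move exactly one row from page to page_archive (hypothetical helper)."""
    db.execute(
        "INSERT INTO page_archive "
        "SELECT page_id, page_title FROM page WHERE page_id = ?", (page_id,))
    db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))


def history(db, page_table, title):
    """List rev_ids for a title, resolved through the given page table."""
    if page_table not in ("page", "page_archive"):
        raise ValueError(page_table)
    return [row[0] for row in db.execute(
        f"SELECT rev_id FROM revision JOIN {page_table} p "
        "ON p.page_id = rev_page WHERE p.page_title = ? ORDER BY rev_id",
        (title,))]


db.execute("INSERT INTO page VALUES (7, 'Foo')")
db.executemany("INSERT INTO revision VALUES (?, 7)", [(1,), (2,), (3,)])
delete_page(db, 7)

public_history = history(db, "page", "Foo")
deleted_history = history(db, "page_archive", "Foo")
```

Queries that join revision to page stop returning the deleted page's revisions by default, while a join against page_archive still finds all three. Whether code that reads revision without joining page would need changes is a separate question that this sketch does not address.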

Note btw that ar_page_id exists even now.

daniel added a comment. Apr 5 2020, 5:30 PM

As a side note: this proposal only mentions page <-> archive deletions. Is it also going to apply to image <-> filearchive deletions for files?

If I Was King, yes, via T96384: Integrate file revisions with description page history. But I'm not king, just an engineer ;)

Lahwaacz added a comment. Edited Apr 5 2020, 5:35 PM

But MediaWiki doesn't use foreign key constraints

That is true only for the MySQL backend. MediaWiki still supports PostgreSQL which enforces foreign key constraints. In any case, intentionally breaking relations between tables at the design level most likely indicates bad database design.

daniel added a comment. Edited Apr 5 2020, 7:01 PM

That is true only for the MySQL backend. MediaWiki still supports PostgreSQL which enforces foreign key constraints. In any case, intentionally breaking relations between tables at the design level most likely indicates bad database design.

PostgreSQL doesn't have to enforce this constraint. Nothing forces us to even consider this a relationship between tables, rather than an abstract ID.

But I agree: the clean design would be a flag on the page table. But as stated above, making that backwards compatible is not trivial. We could have a new page_plus table, and keep pages as a view that filters out deleted pages. That would be clean by the textbook, but I'm not sure if it would be great for performance, how it would work with updates across db systems, upgrading from earlier versions, etc.
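For reference, the page_plus idea could look roughly like the sketch below, shown with SQLite for convenience. The page_deleted column is a made-up name for the flag, and the schema is cut down to two real columns.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- page_deleted is a hypothetical flag column.
    CREATE TABLE page_plus (
        page_id INTEGER PRIMARY KEY,
        page_title TEXT,
        page_deleted INTEGER NOT NULL DEFAULT 0
    );
    -- Old code keeps querying "page" and never sees deleted rows.
    CREATE VIEW page AS
        SELECT page_id, page_title FROM page_plus WHERE page_deleted = 0;
""")

db.executemany("INSERT INTO page_plus (page_id, page_title) VALUES (?, ?)",
               [(1, "Foo"), (2, "Bar")])

# Deleting a page flips one flag; no rows move between tables.
db.execute("UPDATE page_plus SET page_deleted = 1 WHERE page_id = 1")

legacy_titles = [row[0] for row in db.execute(
    "SELECT page_title FROM page ORDER BY page_id")]
all_titles = [row[0] for row in db.execute(
    "SELECT page_title FROM page_plus ORDER BY page_id")]
```

This only covers reads. In SQLite a plain view is read-only, so legacy code that writes to page would need extra work there, and other database systems have their own rules for updatable views, which is part of the cross-database concern raised above.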

My experience with working on MediaWiki over the last 15 years has taught me that due to requirements for scalability, backwards compatibility, and extensibility, the textbook solution often isn't the right choice.

Anthony_Appleyard added a comment. Edited Apr 6 2020, 4:42 AM

Moving rows about within or between tables would be easier if each table was not a single solid block of data, but with each row separate and the central body of the table as a list of n pointers, if the table has n rows. Then, for i = 1(1)n the i-th pointer points to the i-th row. See https://en.wikipedia.org/wiki/Iliffe_vector

History-splitting and other jobs would be easier with selective move :: select some of the edits of a page's history, and then move only the selected edits to some desired other name.

And also the same, but moving only the deleted edits which are listed under a page-name. That would make it easier to get pre-existing deleted edits out of the way before doing a job.

daniel added a comment. Edited Apr 6 2020, 9:35 AM

Moving rows about within or between tables would be easier if each table was not a single solid block of data, but with each row separate and a list of pointers; for i = 1(1)n the i-th pointer points to the i-th row. See https://en.wikipedia.org/wiki/Iliffe_vector

We are currently in the process of normalizing things like the content model, the user name, and the comment text. The actual content was moved out of the revision table more than a decade ago. The migration isn't entirely complete yet, but at this point, a row in the revision table is really little more than a set of pointers.

History-splitting and other jobs would be easier with selective move :: select some of the edits of a page's history, and then move only the selected edits to some desired other name.

Yes, in my opinion, splitting the history would simply be changing the rev_page field on a set of revisions.
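A minimal sketch of what that could mean at the database level, assuming a stripped-down revision table in SQLite. The split_history helper is hypothetical. A real implementation would presumably also have to update derived data, such as each page's latest-revision pointer, and write log entries, none of which is modelled here.

```python
import sqlite3


def split_history(db, rev_ids, target_page_id):
    """Reassign the given revisions to another page; returns rows changed."""
    rev_ids = list(rev_ids)
    if not rev_ids:
        return 0
    placeholders = ", ".join("?" for _ in rev_ids)
    cursor = db.execute(
        f"UPDATE revision SET rev_page = ? WHERE rev_id IN ({placeholders})",
        (target_page_id, *rev_ids),
    )
    return cursor.rowcount


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
db.executemany("INSERT INTO revision VALUES (?, 1)", [(1,), (2,), (3,)])

# Move the two oldest revisions (e.g. stray redirect edits) to page 2.
moved = split_history(db, [1, 2], target_page_id=2)
remaining = [r[0] for r in db.execute(
    "SELECT rev_id FROM revision WHERE rev_page = 1 ORDER BY rev_id")]
```

The same single UPDATE would serve the "selective move" request made earlier in the thread: the administrator picks revisions via checkboxes and the selected rows are pointed at another page, with no delete/undelete round trip.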