RFC: Unify the various deletion systems
Open, MediumPublic

Description

  • Affected components:
    • The Page deletion (archiving) feature of MediaWiki core.
    • The Revision delete (RevDel) feature of MediaWiki core.
  • Engineer for initial implementation: TBD.
  • Code steward: TBD.

Motivation

We currently have two systems that provide ways to make content no longer publicly accessible:

  1. Page deletion.
  2. Revision delete.

The "Revision delete" system seems to scale fairly well currently. It has a natural way to limit or divide its internal database interactions. If it were to show scale problems, we would have a clear path for how to make it scale further.

The "Page delete" system, on the other hand, has severe limitations. Even if we ignore the edge case of pages with 5000+ revisions, the underlying concept is still problematic. Database operations that move rows between tables are something DBAs would prefer never happen, even for smaller pages and at a small scale, and should be migrated away from as soon as possible.

The objective is to unify these two systems and end up with something that is as good as the best of both.

Issues:

Requirements
  • Administrators must still be able to delete entire pages in a way that is as easy as "Page deletion" is today.
  • Administrators must still be able to selectively hide revisions in a way that is as easy as "Revision deletion" offers today.
  • The technical implementation of that action must not move rows between tables.
  • The viewing of "Page history" and "User contributions" (and related APIs) must not display revisions of deleted pages (by default), the same as today.

Exploration

Status quo: Page deletion

This is MediaWiki's original deletion system. Exposed through the interface as "Delete page" (action=delete) and "Restore page" (Special:Undelete).

Database process:
Moves a page and its revisions to the "archive" database table.
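To make the cost of this pattern concrete, here is a minimal sketch of a row-moving deletion, using Python's sqlite3 module. The tables are heavily simplified and the function names (`make_db`, `delete_page_by_moving_rows`) are hypothetical; this is not MediaWiki's actual code, only an illustration of the copy-then-delete shape described above.

```python
import sqlite3

def make_db():
    """Create simplified stand-ins for the page, revision and archive tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
        CREATE TABLE archive (
            ar_rev_id INTEGER PRIMARY KEY, ar_page_id INTEGER, ar_title TEXT
        );
    """)
    return db

def delete_page_by_moving_rows(db, page_id):
    """Copy every revision row into archive, then delete the originals.

    All of this happens in one transaction, so the amount of work (and the
    time locks are held) grows with the number of revisions of the page.
    """
    with db:
        db.execute(
            "INSERT INTO archive (ar_rev_id, ar_page_id, ar_title) "
            "SELECT rev_id, rev_page, page_title FROM revision "
            "JOIN page ON page_id = rev_page WHERE rev_page = ?",
            (page_id,),
        )
        moved = db.execute(
            "DELETE FROM revision WHERE rev_page = ?", (page_id,)
        ).rowcount
        db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))
    return moved
```

Every revision is written once to `archive` and removed once from `revision`, which is why the work scales with page history length rather than being a single-row change.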

Visibility:
Revisions from deleted (or "archived") pages are not shown in page history, or user contributions. Administrators may view them via Special:Undelete/<title> or Special:DeletedContributions/<user>.

Limitations:
The database process for page deletion is inefficient. This cannot be improved because the problem is not how we do it but what we do (moving rows between tables), which is considered bad practice for database operations. To reduce its negative impact on database stability, replication lag, and performance, "Page deletion" can be limited via the $wgDeleteRevisionsLimit configuration. When limited, only users with the bigdelete right may use the feature on pages with more than this number of revisions.
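The access rule described above can be sketched as a small predicate. This is an illustration only: the function name `may_delete_page` and the way rights are passed in are hypothetical, and treating a limit of 0 as "no limit configured" is an assumption of this sketch rather than something stated in this task.

```python
def may_delete_page(revision_count, user_rights, delete_revisions_limit=0):
    """Return True if a user may use page deletion on a page.

    revision_count: number of revisions the page has.
    user_rights: set of right names held by the user, e.g. {"delete"}.
    delete_revisions_limit: stand-in for $wgDeleteRevisionsLimit;
        0 is assumed to mean that no limit is configured.
    """
    if "delete" not in user_rights:
        return False
    if not delete_revisions_limit:
        return True
    # Above the limit, the extra "bigdelete" right is required.
    return revision_count <= delete_revisions_limit or "bigdelete" in user_rights
```

Note that passing this check says nothing about whether the deletion will succeed: the transaction limits described below still apply, even to users holding the extra right.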

On Wikimedia wikis, the limit has been set at 5,000 revisions, and the right has mostly been reserved to Stewards and Developers. When used with caution, these users are then sometimes able to perform the deletion through a simple request procedure. However, even with this user right, the underlying process is highly inefficient and can cause a longer lasting impact on database performance in the minutes/hours that follow. As such, all database transactions on Wikimedia wikis have additional limits that abort them when this is about to happen.

Pages with far more than 5,000 revisions therefore cannot be deleted through this process. The only way to do so without disrupting database performance would be to batch the deletion. However, it is unknown whether it is feasible to do this in a safe manner, given the possible database failure and rollback scenarios it would have to account for.

See also:

Status quo: Revision delete

This is a newer mechanism introduced in 2009. It is exposed on the "View history" and "User contributions" views as "Change visibility of selected revisions", and works by ticking the relevant check boxes first.

Database process:
Changes the numerical value in the rev_deleted field for the relevant revisions in the database. This can be done in batches.
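A minimal sketch of such a batched in-place update, assuming a simplified `revision` table whose bit-field column is called `rev_deleted`. The helper name `set_visibility_in_batches` and the batch size are hypothetical; the point is only that each batch is a small, independent transaction and no rows change tables.

```python
import sqlite3

def set_visibility_in_batches(db, rev_ids, bits, batch_size=100):
    """Set the visibility bit-field on the given revisions, one batch at a time.

    Returns the number of batches (transactions) that were run.
    """
    batches = 0
    for start in range(0, len(rev_ids), batch_size):
        chunk = rev_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(chunk))
        with db:  # commit after every batch so no single transaction is large
            db.execute(
                f"UPDATE revision SET rev_deleted = ? "
                f"WHERE rev_id IN ({placeholders})",
                [bits, *chunk],
            )
        batches += 1
    return batches
```

If one batch fails, the batches that were already committed remain valid on their own, because each revision's state is self-contained in its own row.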

Visibility:
Revisions that have been "deleted" (or "hidden") still have a placeholder row shown in the interface on "Page history" and "User contributions".

The "Revision delete" feature allows admins to decide which aspect(s) of a revision to hide, and from whom. In particular, it is capable of separately controlling the visibility of the textual content, the edit summary, or the user's name/IP. And it can hide it from either non-admins only, or from everyone (suppression, aka "oversight").
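The separately controllable aspects map naturally onto a bit-field. The sketch below models that with Python's `enum.IntFlag`. The flag names and numeric values are assumptions chosen for illustration, and `can_view` is a hypothetical helper, not MediaWiki's permission logic.

```python
from enum import IntFlag

class Visibility(IntFlag):
    """Illustrative visibility bits; names and values are assumptions."""
    TEXT = 1        # hide the textual content
    COMMENT = 2     # hide the edit summary
    USER = 4        # hide the user's name/IP
    RESTRICTED = 8  # also hide from admins (suppression, aka "oversight")

def can_view(field, rev_deleted, is_admin, is_suppressor=False):
    """Decide whether one aspect of a revision is visible to a viewer."""
    bits = Visibility(rev_deleted)
    if field not in bits:
        return True  # this aspect is not hidden at all
    if Visibility.RESTRICTED in bits:
        return is_suppressor  # suppressed: hidden even from ordinary admins
    return is_admin

# Example: text and user hidden, summary left public.
flags = Visibility.TEXT | Visibility.USER
print(can_view(Visibility.COMMENT, flags, is_admin=False))
```

Because each aspect is an independent bit, adding a further aspect later only requires a new flag value, not a schema change.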

Limitations:
I couldn't find any limitation in the code (which is concerning), but the interfaces (History page, Contributions page) do limit how many revisions they offer at once. In any event, general transaction limits will still apply. Regardless of whether this needs a limit, though, it could be batched internally if needed (either in-request or using the JobQueue). As a last fallback, users themselves have the option to manually "batch" as well (e.g. increase history to show 500 rows at once, and shift-select it as one chunk), which could work in extreme cases when stewards/developers need to intervene.

See also https://www.mediawiki.org/wiki/Help:RevisionDelete.

Proposal

Nothing specific yet, but it seems I (@Krinkle) and others find it worth exploring to see if we can re-implement the logic behind "Page deletion" by using the same code and database logic that is used by "Revision delete". This would involve the following:

  • Add a bit-field value for revision.rev_deleted to represent "archived".
  • Update page/user revision views (Page history, User contributions) to make sure revisions with this flag are not shown by default.
  • Add a way to see them. (e.g. re-using Special:DeletedContributions, or through a switch on Special:Contribs itself, same for history).
  • TODO: Decide what to do with the page entity itself (meta data). E.g. a page_deleted flag (possibly including a state for "deletion in progress", to be batch-friendly).
  • TODO: Decide how/if to migrate archive into revision.rev_deleted=archived.
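
The first three bullets above could look roughly like the following sketch. Everything here is hypothetical: the `ARCHIVED` bit value, the function names, and the simplified `revision` table are assumptions made for illustration, and the open TODO items (page metadata, migration of the archive table) are deliberately left out.

```python
import sqlite3

ARCHIVED = 16  # hypothetical new bit in the revision visibility bit-field

def archive_page(db, page_id, batch_size=500):
    """Flag all revisions of a page as archived, in small batches."""
    total = 0
    while True:
        with db:  # one short transaction per batch
            cur = db.execute(
                "UPDATE revision SET rev_deleted = rev_deleted | ? "
                "WHERE rev_id IN ("
                "  SELECT rev_id FROM revision"
                "  WHERE rev_page = ? AND (rev_deleted & ?) = 0 LIMIT ?)",
                (ARCHIVED, page_id, ARCHIVED, batch_size),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

def page_history(db, page_id, include_archived=False):
    """List revision IDs for a page, hiding archived ones by default."""
    sql = "SELECT rev_id FROM revision WHERE rev_page = ?"
    params = [page_id]
    if not include_archived:
        sql += " AND (rev_deleted & ?) = 0"
        params.append(ARCHIVED)
    return [row[0] for row in db.execute(sql + " ORDER BY rev_id", params)]
```

The `include_archived` switch corresponds to the "add a way to see them" bullet; whether that lives on a separate special page or as a toggle is a user-interface question this sketch does not address.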
Original task description from bugzilla.wikimedia.org user FT2.wiki:

At present we now have 4 means of deleting material from either the public or from administrators. Material can either be

  • Deleted from the public with traditional deletion
  • Deleted from the public (part or full) with RevisionDeleted
  • Deleted from admin view with Oversight
  • Suppressed from admin view with RevisionDeleted

This collection means that any review of editor actions or conduct, or article matters on the wiki, now faces several big problems in evaluating the existence or seriousness of any issue:

  • It's incredibly easy to overlook some edits or actions in the review, which should be taken account of.
  • It's more complex and takes examination of several screens, to review a matter.
  • Each of these has different mechanisms for viewing edits they affect; there is no consistency of links, formats, access methods, etc.
  • A third issue at a technical level - it's a lot to maintain, and allows for inconsistent software behavior (or bugs fixed in one of these but not spotted in the other), and requires more developer time etc.

I would like to suggest that in fact, all we now need is RevisionDeleted, with the following options:

  • What to hide - revision text, edit summary, user name/IP
  • Whether admins can or can't access the hidden data
  • Whether admins or users who cannot access the hidden data, should nonetheless be able to see it exists even if they can't read it (there are cases when this is safe, and cases when it isn't).

This proposal is that RevisionDeleted is amended slightly to show the above options, and then both traditional deleted revisions and oversighted revisions are converted to RevisionDeleted entries as a background task (ie a script written that achieves this in the job queue over time). Following this:

  • Delete and oversight both redirect to RevDel for their actions
  • Delete/undelete and oversight URLs both redirect to the appropriate lookup link for any historical URL used to view an old deleted/oversighted edit.

The issue here is not so much one of software development, as of a once-off conversion task of old data stored in one system to be moved to another.

Details

Reference
bz18493

Event Timeline

In T20493#4446127, @GeoffreyT2000 wrote:

...
Then, Special:SplitHistory would do the following for a given page A with ID n and a given cut-off revision ID r:

  1. Delete page A.
  2. Apply the steps above for page ID n and revision ID r.
  3. Undelete page A with either the original page ID n, or the new page ID m, depending on whether the user chooses to move earlier or later revisions.
  4. Move A to another title B without redirect.
  5. Undelete page A with the other page ID (n if page B has ID m; m if page B has ID n).

This seems to rely on there being no pre-existing deleted edits for ID n or ID m.

Have any of the people who are in charge of producing each version of MediaWiki ever considered creating an extension that allows administrators to select revisions to delete through the page deletion system? This idea would obviously be different from Manual:RevisionDelete, which deletes revisions through the revision deletion system, as opposed to the page deletion system.

I remember having this idea for years, and I can remember how surprised I was to find that there was no extension, to my knowledge, that allowed for revisions to be deleted through the page deletion system. This matters because there are specific sites that disallow pages in specific namespaces to be restored once deleted, unless the users are in specific user-groups that aren't available to the general community.

The said extension would be pretty much identical to Manual:Page restoration only the exact opposite. As it would be added to the page deletion system and would allow for revisions to be deleted, instead of revisions to be restored.

Krinkle renamed this task from Unify various deletion systems to Unify the various deletion systems. Jul 24 2019, 8:58 PM

Personally I think that an ability should be added to the traditional deletion system, allowing for selective deletions. So it would basically be the same as the traditional undeletion system, but it would use the same page as the traditional deletion system via ?action=delete.

Having an ability to selectively delete revisions through the traditional deletion system would save users the trouble of having to delete entire pages and then restore the wanted revisions.

From my experience, Revision delete is generally reserved for hiding sensitive information and other TOU breaking content that other users shouldn't be able to just stumble across. Unlike the more traditional deletion system that can just be used for whatever reason.

But then again, an issue with Revision delete is that it doesn't delete revisions from the page history. It just crosses them out. So I think a better alternative to this issue would be to make a separate user permission that would allow revisions to be deleted through the page history just like delete and partial undelete.

Another issue with Revision delete is that because it is only reserved for sensitive or TOU breaking content on some sites, some sites don't allow general access to the Revision delete system. So users with nothing more than general access are limited to just the traditional deletion system.

And a serious con that I've noticed is that there are a few sites that have namespaces where undeletion isn't possible, and therefore deletion and partial undeletion would be out of the question. If there was a separate ability that would allow revision deletions through the page deletion system, then that would solve that problem.

Putting this into the RFC backlog, pending product level input from Platform Engineering or Growth-Team.

@C.Syde65 Thanks. I've added the use case of "allow selective hiding of revisions" to the requirements section. This was unintentionally omitted by me due to my bias toward using the technical internals of "Revision delete" as the basis for the new unified system (which naturally has this ability already). I've added it now to make it more explicit.

As for how it would look for end-users, that is what this task is about. I think from a technical perspective, using the "traditional delete system" for anything, will not be an option long-term because of the very drastic performance and availability risks it has. It's time to let that go. However, understand that I'm referring to its technical internals - I'm not talking about the user interface of traditional page deletion, and not talking about the impact on "View history".

One of the open questions here is whether we need the ability for revisions to be entirely omitted from the history page pagination (which is currently possible by deleting the whole page and selectively undeleting all-but-one revision). My "Proposal 1" currently suggests that we do keep this ability, and that it would become one of the options of "Revision delete" - just like how we have several options already about visibility of user name, timestamp and content. For the user-interface, this could look like a checkbox on "Special:RevisionDelete", or it could look like "Special:Undelete" - that's a separate question.

One of the open questions here is whether we need the ability for revisions to be entirely omitted from the history page pagination (which is currently possible by deleting the whole page and selectively undeleting all-but-one revision). My "Proposal 1" currently suggests that we do keep this ability, and that it would become one of the options of "Revision delete" - just like how we have several options already about visibility of user name, timestamp and content. Another could be to omit the entry entirely. For the user-interface, this could look like a checkbox on "Special:RevisionDelete", or it could look like "Special:Undelete" - that's a separate question.

I think we absolutely do need the ability to hide revisions from pagination in the history (and therefore hide them from the count of total revisions). Sometimes people move articles by cut-and-paste over redirects (instead of using the page move button), and it's regular practice when fixing these cut-and-paste moves to completely obliterate these redirect revisions from the history. See:
https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_History_Merge#Special:MergeHistory_links_for_rapid-processing_of_predictable_bot-cut-pastes

Also see the logs here for a particularly painful case, which is made even worse by T45911:
https://en.wikipedia.org/w/index.php?title=Special:Log/Graham87&offset=20190725114337&limit=13&type=&user=Graham87

@Krinkle So for users with the (delete) permission, would that automatically give them access to (deleterevision), or would that still require a separate permission? I'm asking because not every site would be chuffed with allowing all users with the ability to delete and undelete pages to be able to revision delete them as well.

I have to admit that (deleterevision) doesn't really look user friendly, given that it just crosses the revisions out, preventing them from being viewable, whereas if a similar ability was part of the traditional deletion system, it would delete the revisions the same way it deletes entire pages. I've had reasons to delete selective revisions rather than having to delete entire pages and restore the wanted revisions, especially since there are some pages in some namespaces on certain sites that don't allow undeletions. So naturally, not having an ability to selectively delete revisions through the traditional deletion system is quite frustrating.

And I've always said to myself "If you can restore selective revisions through the traditional undeletion system, why can't you delete selective revisions through the traditional deletion system?" It would therefore balance out the deletion and undeletion systems, giving them the same number of options. Another thing is that entire pages cannot be deleted through (deleterevision), as the most recent revision cannot be partially deleted. Since deleting selective revisions through the traditional deletion system would only allow full deletions on each revision, it wouldn't be limited to working on earlier revisions, unlike the revision deletion system.

Krinkle moved this task from Old to P2: Resource on the TechCom-RFC board.
Krinkle renamed this task from Unify the various deletion systems to RFC: Unify the various deletion systems. Apr 4 2020, 2:33 AM

Above is written:

(X) "The Page deletion (archiving) feature of MediaWiki core."
(Y) "The Revision delete (RevDel) feature of MediaWiki core."

As a Wikipedia admin I am familiar with two sorts of deletion:
The function invoked by clicking the tab "Delete" at the top of the page display :: I call this "deleting".
The function invoked by clicking the tab "History" at the top of the page display, then selecting on the small square boxes, then clicking on "Change visibility of selected revisions" :: I call this "hiding".

I assume that X is "deleting" and Y is "hiding".

One difference between these two is that when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

My main inconvenience is that I cannot "delete" only some edits of a page, but I must waste the system's time by deleting the whole page and all its edits, and then undelete some edits. One main need to delete only some edits is when history-merging, to temporarily delete irrelevant redirect edits at the start or end of an edit history.

@Anthony_Appleyard Yep, that's exactly the kind of problem I hope we can solve. Both kinds of deletion can: hide revisions from most users, can still be opened by an administrator while deleted, can be reversed. And when reversing, both tools allow you to select which revisions to make public via checkboxes.

Technically there are many differences. But, from a user perspective there are not so many differences. Except that 1) RevDel leaves a grey marker in the history page whereas PageDel makes it invisible in the timeline, 2) RevDel has checkboxes for deletion and for undeletion, whereas PageDel only has checkboxes for undeletion, not for deletion.

Today, RevDel has three options: hide text, hide summary, hide user. Imagine if something like "Hide in timeline" was an extra option, which would make the edit completely hidden on the History page. The same as today if you delete/undelete the whole page. Would that be useful?

Another possibility could be to make "Hide in timeline" a secret option that we do not show anywhere, but internally connect it to what "Delete page" is today. That way, everything would stay exactly as today for users: two separate systems. And behind the scenes the technology would be one system.

If "deleting" and "hiding" are amalgamated into one system, we will strongly need something to replace this difference :: when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

To put that another way: Complex history merges and history splits require the ability to move revisions from one title to another title.

Currently, that is achieved using a combination of page deletion, selective restoration, and page moves. Simple histmerges can be accomplished using Special:MergeHistory, but histsplits and complex histmerges can not. To continue supporting these actions, an option should be introduced to allow revisions to be moved to an arbitrary page. Without it, the only way to perform history splits and complex history merges would be to export the pages, manually edit the XML, and re-import and rewrite the history (generally a bad idea).

AntiCompositeNumber wrote ::

... Complex history merges and history splits ... using a combination of page deletion, selective restoration, and page moves. Simple histmerges can be accomplished using Special:MergeHistory, but histsplits and complex histmerges can not. ...

And by ability to move only the non-deleted edits of a page.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ability to move only the deleted edits of a page, would also be useful, to quickly shift pre-existing deleted edits to another (usually, similar) name out of the way of planned work.

If "deleting" and "hiding" are amalgamated into one system, we will strongly need something to replace this difference :: when a page is moved, a non-"deleted" edit is moved and a "deleted" edit is not moved; but moving is not affected by whether or not each edit is "hidden".

One property that the new system should, in my mind, definitely have is retaining page identity across delete/undelete. That is, there can be multiple deleted pages with the same title. The deleted revisions would no longer be treated as all belonging to the same page. Moving a page will only move the current page, not deleted pages with the same title. Moving a page would carry along any hidden/suppressed revisions. The idea of "deleting" a revision is confusing.

To put that another way: Complex history merges and history splits require the ability to move revisions from one title to another title.
Currently, that is achieved using a combination of page deletion, selective restoration, and page moves.

That seems like a feature-by-accident. This functionality should be entirely independent of deletions. History splits in particular should be rare - ideally, they would really just consist of undoing a previous history merge (or import - importing can result in a history merge). Am I right about this? Is there any other situation where a history split is needed?

History splits in particular should be rare - ideally, they would really just consist of undoing a previous history merge (or import - importing can result in a history merge). Am I right about this? Is there any other situation where a history split is needed?

A history split is needed when there are redirect revisions at the start of a revision history that need to be moved out of the way before a history merge can be done. Like this log:
https://en.wikipedia.org/w/index.php?title=Special:Log&page=Kottarakkara

There are also cases of revisions with duplicate timestamps/other weirdness (due to history merges/imports) where it'd be nice to move individual revisions out of the way.

The idea of "deleting" a revision is confusing.

It is not confusing to me. Very often in history merging I have had first to delete intruding redirect edits that were blocking the history-merging process.

I need history-splits quite often. For example, if someone has made a cut-and-paste move to a file where there was already an article about something else.

The existing system works and best leave-well-enough-alone. It merely needs some improvements, such as:-
(1) Ability to delete some edits of a page, instead of having to delete them all and then undelete some.
(2) Ability to selectively move some deleted edits of a page, to get them out of the way of planned work.

The technical implementation of that action must not move rows between tables.

I have heard much about these rows and tables. Please where is a detailed explanation of what these rows and tables are and what information is stored in them?

These various moves and renamings that happen: when does making a move or a renaming need merely changing a pointer, and when does it need a big bulk copying of information?

This is the revision table (current page history) and archive table (deleted page history). Deleting/restoring a page involves moving rows from one table to the other. The rev_page field contains the reference to the page title. Renaming a page doesn't update it (relation between page id and page title is in the page table). However, the archive table (deleted revisions) does not contain the page id, but the namespace and title of the page at the moment those revisions were deleted. That's why moving a page doesn't update the deleted revisions.
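The difference between the two kinds of reference can be shown with a small sketch. The tables are reduced to the columns mentioned above and the `demo` function is hypothetical; it only illustrates why renaming a page carries live revisions along but leaves deleted ones behind.

```python
import sqlite3

def demo():
    """Rename a page and count which revisions are found under each title."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT
        );
        -- Live revisions point at the page by ID.
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
        -- Deleted revisions are stored by namespace and title instead.
        CREATE TABLE archive (
            ar_rev_id INTEGER PRIMARY KEY, ar_namespace INTEGER, ar_title TEXT
        );
        INSERT INTO page VALUES (1, 0, 'Old_title');
        INSERT INTO revision VALUES (10, 1), (11, 1);
        INSERT INTO archive VALUES (5, 0, 'Old_title');
    """)
    # A page move only touches the single row in the page table.
    db.execute("UPDATE page SET page_title = 'New_title' WHERE page_id = 1")

    live_at_new = db.execute(
        "SELECT COUNT(*) FROM revision JOIN page ON page_id = rev_page "
        "WHERE page_title = 'New_title'").fetchone()[0]
    deleted_at_new = db.execute(
        "SELECT COUNT(*) FROM archive WHERE ar_title = 'New_title'").fetchone()[0]
    deleted_at_old = db.execute(
        "SELECT COUNT(*) FROM archive WHERE ar_title = 'Old_title'").fetchone()[0]
    return live_at_new, deleted_at_new, deleted_at_old
```

Calling `demo()` shows both live revisions under the new title while the deleted revision is still listed under the old one.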

In my mind, the much simpler system (both in the database, as well as for users) would be:

  • when deleting a page, move the page record to a page_archive table. This is always just one row. (We could also use a flag on the page table, but that would be problematic for backwards compatibility - old code would still pick up deleted pages in existing queries that don't check the flag).
  • to remove a revision, use the existing rev_deleted bitfield. We can just add a flag for completely hiding the revision from the history.

In other words: there would be hidden revisions, and deleted pages, and revisions of deleted pages. No deleted revisions.

When undeleting a page, you may have to pick which of multiple deleted pages with the same title you want to undelete. Undeletion would affect all revisions associated with that page, not with other pages that might have had the same name in the past. It would include all hidden revisions, but they would retain their hidden status. Same with moving: moving affects all revisions of the page, hidden or not. The history of pages would be entirely unaffected by moving and deletion/undeletion.
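A minimal sketch of this model, under the assumption that `page_archive` has the same columns as `page`. All function names are hypothetical, and questions such as title conflicts on undeletion are ignored here.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE page_archive (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    """)
    return db

def _move_page_row(db, page_id, src, dst):
    # src and dst are fixed table names chosen by the callers below.
    with db:
        db.execute(
            f"INSERT INTO {dst} SELECT * FROM {src} WHERE page_id = ?", (page_id,))
        db.execute(f"DELETE FROM {src} WHERE page_id = ?", (page_id,))

def delete_page(db, page_id):
    """Deleting moves exactly one row; revisions are not touched."""
    _move_page_row(db, page_id, "page", "page_archive")

def undelete_page(db, page_id):
    """Undeleting restores one specific page identity, chosen by ID."""
    _move_page_row(db, page_id, "page_archive", "page")

def deleted_pages_titled(db, title):
    """Several deleted pages may share a title; list them for the user to pick."""
    rows = db.execute(
        "SELECT page_id FROM page_archive WHERE page_title = ? ORDER BY page_id",
        (title,))
    return [row[0] for row in rows]

def history(db, page_id):
    """A page's history is keyed on its ID, so it survives delete/undelete."""
    rows = db.execute(
        "SELECT rev_id FROM revision WHERE rev_page = ? ORDER BY rev_id", (page_id,))
    return [row[0] for row in rows]
```

Because revisions keep pointing at their original page ID, undeleting one of two same-titled deleted pages brings back only that page's revisions.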

I realize that this is contrary to how things are done at the moment. But it would resolve a lot of edge cases in behavior, nasty surprises (e.g. previously deleted revisions coming back after undeletion of the page), and would make operations on the database side much more efficient.

For context: as an engineer working for the wmf, I don't decide on how this is supposed to work. I can just point out different models and mechanisms, and their consequences in terms of technical efficiency and stability. As a community member and former wikipedia admin, I can tell you what behavior would make sense to me.

History splits in particular should be rare

They're fairly common on Commons because of COM:OVERWRITE. Someone will upload a useful derivative over an existing image, and they'll have to be split apart. That usually involves both the file revisions and the page revisions.

As a side note: this proposal only mentions page <-> archive deletions. Is it also going to apply to image <-> filearchive deletions for files?

when deleting a page, move the page record to a page_archive table. This is always just one row.

No - if you (re)move a record from the page table, the revision.rev_page reference would become invalid. Hence, you need to move the affected revisions somewhere else, which is exactly what happens now.

When undeleting a page, you may have to pick which of multiple deleted pages with the same title you want to undelete

Indeed many times I have had to do that, when disentangling the messes that users make sometimes. See https://en.wikipedia.org/wiki/Talk:Black_sheep/Archive_1#Moves,_merges,_history for a long complicated job that I had once.

No - if you (re)move a record from the page table, the revision.rev_page reference would become invalid. Hence, you need to move the affected revisions somewhere else, which is exactly what happens now.

Why would it become invalid? It would still be the pages ID, which would not change. It would no longer refer to an entry in the page table. But MediaWiki doesn't use foreign key constraints, so that's not a concern.

Note btw that ar_page_id exists even now.

As a side note: this proposal only mentions page <-> archive deletions. Is it also going to apply to image <-> filearchive deletions for files?

If I Was King, yes, via T96384: Integrate file revisions with description page history. But I'm not king, just an engineer ;)

But MediaWiki doesn't use foreign key constraints

That is true only for the MySQL backend. MediaWiki still supports PostgreSQL which enforces foreign key constraints. In any case, intentionally breaking relations between tables at the design level most likely indicates bad database design.

That is true only for the MySQL backend. MediaWiki still supports PostgreSQL which enforces foreign key constraints. In any case, intentionally breaking relations between tables at the design level most likely indicates bad database design.

PostgreSQL doesn't have to enforce this constraint. Nothing forces us to even consider this a relationship between tables, rather than an abstract ID.

But I agree: the clean design would be a flag on the page table. But as stated above, making that backwards compatible is not trivial. We could have a new page_plus table, and keep pages as a view that filters out deleted pages. That would be clean by the textbook, but I'm not sure if it would be great for performance, how it would work with updates across db systems, upgrading from earlier versions, etc.
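The view-based idea can be sketched as follows, here in SQLite via Python. The `page_deleted` column and the helper names are hypothetical, and the sketch does not address the performance, cross-database, or upgrade concerns raised above.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page_plus (
            page_id INTEGER PRIMARY KEY,
            page_title TEXT NOT NULL,
            page_deleted INTEGER NOT NULL DEFAULT 0
        );
        -- Old code keeps querying "page" and never sees deleted pages.
        CREATE VIEW page AS
            SELECT page_id, page_title FROM page_plus WHERE page_deleted = 0;
    """)
    return db

def set_deleted(db, page_id, deleted):
    """Deleting or restoring a page is a single-row flag update."""
    with db:
        db.execute(
            "UPDATE page_plus SET page_deleted = ? WHERE page_id = ?",
            (int(deleted), page_id))

def legacy_titles(db):
    """Stands in for an existing query that does not know about the flag."""
    return [row[0] for row in db.execute(
        "SELECT page_title FROM page ORDER BY page_id")]
```

In SQLite a plain view like this is read-only, so code that writes to `page` would still have to be changed to write to `page_plus`, which is one concrete form of the compatibility problem mentioned above.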

My experience with working on MediaWiki over the last 15 years has taught me that due to requirements for scalability, backwards compatibility, and extensibility, the textbook solution often isn't the right choice.

Moving rows about within or between tables would be easier if each table was not a single solid block of data, but with each row separate and the central body of the table as a list of n pointers, if the table has n rows. Then, for i = 1(1)n the i-th pointer points to the i-th row. See https://en.wikipedia.org/wiki/Iliffe_vector

History-splitting and other jobs would be easier with selective move :: select some of the edits of a page's history, and then move only the selected edits to some desired other name.

And also the same, but moving only the deleted edits which are listed under a page-name. That would make it easier to get pre-existing deleted edits out of the way before doing a job.

Moving rows about within or between tables would be easier if each table was not a single solid block of data, but with each row separate and a list of pointers; for i = 1(1)n the i-th pointer points to the i-th row. See https://en.wikipedia.org/wiki/Iliffe_vector

We are currently in the process of normalizing things like the content model, the user name, and the comment text. The actual content has been moved out of the revision table more than a decade ago. The migration isn't entirely complete yet, but at this point, a row in the revision table is really little more than a set of pointers.

History-splitting and other jobs would be easier with selective move :: select some of the edits of a page's history, and then move only the selected edits to some desired other name.

Yes, in my opinion, splitting the history would simply be changing the rev_page field on a set of revisions.
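As a sketch of that idea, a history split reduces to a single in-place update of `rev_page`. The function name and the simplified table are hypothetical, and a real implementation would also need to handle things like the page's latest-revision pointer and logging, which are left out here.

```python
import sqlite3

def split_history(db, source_page, target_page, rev_ids):
    """Reassign the selected revisions of source_page to target_page.

    Returns the number of revisions that were actually moved.
    """
    if not rev_ids:
        return 0
    placeholders = ",".join("?" * len(rev_ids))
    with db:
        cur = db.execute(
            f"UPDATE revision SET rev_page = ? "
            f"WHERE rev_page = ? AND rev_id IN ({placeholders})",
            [target_page, source_page, *rev_ids],
        )
    return cur.rowcount
```

The `rev_page = ?` guard means revisions that do not belong to the source page are silently skipped rather than being pulled in from an unrelated page.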

A very old idea, T5843: Implement semi-deletion, may also be taken into account when we start to do this (it is not required, but a future schema change may make some reservation to make it easier to happen). Probably I will write a longer comment about this later.

A very old idea, T5843: Implement semi-deletion, may also be taken into account when we start to do this (it is not required, but a future schema change may make some reservation to make it easier to happen). Probably I will write a longer comment about this later.

Do I understand correctly that effectively, "semi-deletion" essentially means that the deleted page can be restored by anyone? The history would also be available. It would in a way be a page with no current revision.

"semi-deletion" essentially means that the deleted page can be restored by anyone

Or alternatively, it can be restored by users with a new right such as undelete-semi, which can be assigned to a non-admin group by the community.

A very old idea, T5843: Implement semi-deletion, may also be taken into account when we start to do this (it is not required, but a future schema change could reserve room to make it easier to add later). Probably I will write a longer comment about this later.

Sounds like a solution in search of a problem?

Schema-wise, I don't think there's an overlap with fixing the deletion system. You'd probably want a page_is_blank flag in the page table, and that's it.

Probably I will write a longer comment about this later.

Quoted from task description:

  • Add a bit-field value for revision.rev_delete to represent "archived".
  • Update page/user revision views (Page history, User contributions) to make sure revisions with this flag are not shown by default.
  • Add a way to see them. (e.g. re-using Special:DeletedContributions, or through a switch on Special:Contribs itself, same for history).
  • TODO: Decide what to do with the page entity itself (meta data). E.g. a page_deleted flag (possibly including a state for "deletion in progress", to be batch-friendly).
  • TODO: Decide how/if to migrate archive into revision.rev_delete=archived.

Once semi-deletion is a thing, the rev_delete field (and probably page_deleted) needs to be 3-state instead of boolean.
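As a rough sketch of what "3-state" could mean in practice: the enum below and the can_restore helper are hypothetical names for illustration. "undelete-semi" is the right suggested earlier in this thread, and the sketch assumes that holders of the normal undelete right can also restore semi-deleted pages.

```python
from enum import IntEnum


class DeletionState(IntEnum):
    """Hypothetical 3-state value replacing a boolean deleted flag."""
    VISIBLE = 0
    SEMI_DELETED = 1
    DELETED = 2


def can_restore(state, rights):
    """Return True if a user holding `rights` may restore content in `state`."""
    state = DeletionState(state)
    if state is DeletionState.SEMI_DELETED:
        # Assumption: full undelete implies the weaker semi-undelete.
        return "undelete-semi" in rights or "undelete" in rights
    if state is DeletionState.DELETED:
        return "undelete" in rights
    return False  # nothing to restore for visible content
```

A boolean column can only encode the first and last of these states, which is why adding the middle one later would need another schema change.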

Some of my ideas:

  • The page table will now include deleted pages. (Also note that in replicas we may want to suppress pages whose revisions are all oversighted, since they may contain PII.)
  • When moving a page you can choose whether to also move its deleted (including semi-deleted) revisions. When deleted revisions are not moved, a new page ID is assigned to the old title.
  • Then, if there are revisions under the target title, the rev_page of these revisions will be updated and the target title will be removed from the page table before the title of the origin page's row is updated to the target title.
  • The page ID will otherwise be stable across deletions and moves that do not involve history merging/splitting.
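The third point above can be sketched roughly as follows. This is only an illustration against an in-memory SQLite database: the simplified schema, the rev_archived column, and the move_page helper are all assumptions, and the sketch assumes the row occupying the target title is a deleted page.

```python
import sqlite3


def move_page(conn, origin_id, target_title):
    """Move a page onto a title, absorbing revisions already under that title."""
    with conn:  # one transaction for the whole move
        row = conn.execute(
            "SELECT page_id FROM page WHERE page_title = ?", (target_title,)
        ).fetchone()
        if row is not None and row[0] != origin_id:
            old_target_id = row[0]
            # Re-point the target title's revisions at the origin page ...
            conn.execute(
                "UPDATE revision SET rev_page = ? WHERE rev_page = ?",
                (origin_id, old_target_id),
            )
            # ... and drop the target's page row before reusing its title.
            conn.execute("DELETE FROM page WHERE page_id = ?", (old_target_id,))
        conn.execute(
            "UPDATE page SET page_title = ? WHERE page_id = ?",
            (target_title, origin_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT UNIQUE,"
    " page_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER NOT NULL,"
    " rev_archived INTEGER NOT NULL DEFAULT 0)"
)
# Page 10 ("A") is live; page 20 ("B") is a deleted page with one archived revision.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [(10, "A", 0), (20, "B", 1)])
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", [(1, 10, 0), (2, 20, 1)])

move_page(conn, 10, "B")
pages = conn.execute("SELECT page_id, page_title FROM page").fetchall()
revs = conn.execute(
    "SELECT rev_id, rev_page, rev_archived FROM revision ORDER BY rev_id"
).fetchall()
```

Note that the absorbed revision keeps its per-revision archived marker, so it would stay hidden from the default history view; the origin page keeps its page ID throughout.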

I also have a use case for a new page table that contains deleted pages, which will be written in another (new) task.

What problem is semi-deletion planning to solve?

What problem is semi-deletion planning to solve?

Please discuss it at T5843: Implement semi-deletion and https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(idea_lab)#Soft_Delete, not here. The only thing relevant to this task is that, if semi-deletion becomes a thing, the proposed revision.rev_delete field needs to be 3-state instead of a bit field, and if we have already added it as a boolean we will need another schema change, which is a pain.

By the way, I will soon create a task about a use case for including deleted pages in the page table, which is also relevant to this task. (Edit: I have security-protected that task for now.)

We should still keep the current revdel/suppression and deletion systems functionally separate. For example:

  1. Someone created an article for a BLP
  2. A user added disparaging information to the article; it got revdeled
  3. A user added PII to the article; it got oversighted
  4. The article is deleted for lack of notability
  5. The article is recreated
  6. The article is moved elsewhere; currently we do not move deleted revisions with it. (But we could let the user choose whether to move the deleted revisions together with it? In that case, the user would have to know about the existence of deleted revisions beforehand.)
  7. An older revision is restored, but we don't want the revisions hidden in steps 2 and 3 to become visible (i.e. they should be restored too but continue to be revdeled or suppressed)

The above is not related to where we store deleted revisions. I still believe the archive (and filearchive) tables should be killed, so that deleting a page will not involve moving rows between tables.

Might that problem be solvable by making the "is deleted or not" flag for each revision have different settings for revision status and page status?
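One way to picture that suggestion is a bit field where the page-level "archived" state gets its own bit, independent of the revision-level hiding bits. The flag names and values below are hypothetical and are not the existing rev_deleted constants; this is a sketch of the idea, not a proposed schema.

```python
from enum import IntFlag


class RevFlags(IntFlag):
    """Hypothetical per-revision bits; values are illustrative only."""
    NONE = 0
    REV_HIDDEN = 1      # revision-level hiding (revdel)
    REV_SUPPRESSED = 2  # revision-level suppression (oversight)
    PAGE_ARCHIVED = 4   # set on every revision when the page is deleted


def delete_page(history):
    """Page deletion only sets the page-level bit."""
    return [flags | RevFlags.PAGE_ARCHIVED for flags in history]


def restore_page(history):
    """Page restore only clears the page-level bit; revision-level bits survive."""
    return [flags & ~RevFlags.PAGE_ARCHIVED for flags in history]


def is_public(flags):
    return flags == RevFlags.NONE


# Steps 1-3 of the scenario: a clean revision, a revdeled one, an oversighted one.
history = [
    RevFlags.NONE,
    RevFlags.REV_HIDDEN,
    RevFlags.REV_HIDDEN | RevFlags.REV_SUPPRESSED,
]
archived = delete_page(history)    # step 4: nothing is public any more
restored = restore_page(archived)  # step 7: hidden revisions stay hidden
```

With this layout, restoring the page in step 7 brings all revisions back into the history while the ones hidden in steps 2 and 3 remain revdeled or suppressed.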