
Move page deletion to a RevDelete mechanism; kill archive table (fire optional)
Open, Normal, Public

Description

Implement the "new field" option at https://www.mediawiki.org/wiki/Requests_for_comment/Page_deletion . That is, add a new field to the page table, pg_deleted (analogous to rev_deleted in the revision table). Also add a new field to the revision table, rv_logid, to store the log_id of the deletion event. Upon restoring a page, only the revisions belonging to the relevant deletion event would be restored.

So, suppose you revision delete some revisions from a page (log_id 1). Then you delete the whole page (log_id 2). Then you undelete the page. You only restore the revisions that have rv_logid 2, and leave the rv_logid 1 revisions deleted.
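The scenario above can be sketched with an in-memory SQLite database. This is only an illustration under simplifying assumptions: the tables are reduced to a few columns, the deletion flag is a plain 0/1 value named rev_deleted here (in MediaWiki it is a bitfield), and the helper function names are hypothetical, not part of any existing code.

```python
import sqlite3


def make_db():
    """Create a toy schema with the two proposed fields and one page with four revisions."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (
            page_id    INTEGER PRIMARY KEY,
            page_title TEXT NOT NULL,
            pg_deleted INTEGER NOT NULL DEFAULT 0   -- proposed new field
        );
        CREATE TABLE revision (
            rev_id      INTEGER PRIMARY KEY,
            rev_page    INTEGER NOT NULL,
            rev_deleted INTEGER NOT NULL DEFAULT 0, -- simplified to 0/1 for this sketch
            rv_logid    INTEGER                     -- proposed new field
        );
    """)
    db.execute("INSERT INTO page (page_id, page_title) VALUES (1, 'Foo')")
    db.executemany("INSERT INTO revision (rev_id, rev_page) VALUES (?, 1)",
                   [(1,), (2,), (3,), (4,)])
    return db


def revision_delete(db, log_id, rev_ids):
    """Hide individual revisions and tag them with the log event that hid them."""
    for rev_id in rev_ids:
        db.execute("UPDATE revision SET rev_deleted = 1, rv_logid = ? "
                   "WHERE rev_id = ? AND rev_deleted = 0", (log_id, rev_id))


def delete_page(db, log_id, page_id):
    """Delete a page: only revisions that are still visible get the new log_id."""
    db.execute("UPDATE revision SET rev_deleted = 1, rv_logid = ? "
               "WHERE rev_page = ? AND rev_deleted = 0", (log_id, page_id))
    db.execute("UPDATE page SET pg_deleted = 1 WHERE page_id = ?", (page_id,))


def undelete_page(db, log_id, page_id):
    """Restore only the revisions that were hidden by the given deletion event."""
    db.execute("UPDATE revision SET rev_deleted = 0, rv_logid = NULL "
               "WHERE rev_page = ? AND rv_logid = ?", (page_id, log_id))
    db.execute("UPDATE page SET pg_deleted = 0 WHERE page_id = ?", (page_id,))


def visible_revisions(db, page_id):
    """Return the rev_ids of a page that are currently not deleted."""
    rows = db.execute("SELECT rev_id FROM revision "
                      "WHERE rev_page = ? AND rev_deleted = 0 ORDER BY rev_id",
                      (page_id,))
    return [row[0] for row in rows]
```

Calling `revision_delete(db, 1, [2])`, then `delete_page(db, 2, 1)`, then `undelete_page(db, 2, 1)` leaves revision 2 hidden with rv_logid 1 while the other three revisions become visible again.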

This will render the archive table obsolete, so it can be eliminated.


Version: 1.22.0
Severity: normal

Details

Reference
bz55398

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 2:11 AM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz55398.
bzimport added a subscriber: Unknown Object (MLST).

I just want to confirm, is there consensus for implementing this in the core in v1.22 or 1.23? I don't mind coding it, but I want to make sure it'll get merged. Thanks.

(In reply to comment #1)

> I just want to confirm, is there consensus for implementing this in the core in v1.22 or 1.23? I don't mind coding it, but I want to make sure it'll get merged. Thanks.

This will not go into MediaWiki 1.22 (which may be branched tomorrow, given the November 2013 release date), and I have seen no recent discussion on wikitech-l.

https://www.mediawiki.org/wiki/Requests_for_comment
https://www.mediawiki.org/wiki/Version_lifecycle
https://www.mediawiki.org/wiki/Project:Release_management/Release_timeline

(In reply to comment #2)

> This will not go into MediaWiki 1.22 (which may be branched tomorrow, given

Actually, a week from now (at R-5 weeks, not R-6), though still too soon to make major changes.

http://lists.wikimedia.org/pipermail/wikitech-l/2013-October/072538.html

Definitely won't make it into 1.22, but maybe 1.23.

My recommendation would be to get sign-offs from Sean P. (new Wikimedia DBA) because you're proposing a change to the revision table and Aaron S. because you're working on an area where he has the most experience, I believe.

Adding those two users.

Theoretically OK purely from a database maintenance perspective. The revision table has a primary key and can be altered online, plus disk space is both available and reclaimable now that innodb_file_per_table is mostly turned on.

The MW code would need to be able to handle a slow switch over (possibly days on enwiki).

As for the changed queries, we would first need to see and test examples offline on an altered slave with production data.
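One way to picture code that tolerates a slow switch-over is a read path that checks whether the new column exists yet and picks its query accordingly. The sketch below uses SQLite purely for illustration; the helper names and the simplified rule "a row with rv_logid set belongs to a deletion event" are assumptions, not anything taken from MediaWiki itself.

```python
import sqlite3


def has_column(db, table, column):
    """Return True if the table already has the column (SQLite-specific check)."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return any(row[1] == column for row in db.execute(f"PRAGMA table_info({table})"))


def live_revision_ids(db, page_id):
    """List a page's revisions that are not tied to a deletion event.

    Before the ALTER has reached this database, every row in revision is live,
    so the extra condition is simply left out.
    """
    if has_column(db, "revision", "rv_logid"):
        sql = ("SELECT rev_id FROM revision "
               "WHERE rev_page = ? AND rv_logid IS NULL ORDER BY rev_id")
    else:
        sql = "SELECT rev_id FROM revision WHERE rev_page = ? ORDER BY rev_id"
    return [row[0] for row in db.execute(sql, (page_id,))]
```

The same function then works on a server that has not been altered yet and on one that has, which is the property a multi-day migration would need.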

aaron added a comment. Oct 22 2013, 3:36 AM

From a design perspective, I'd rather have a new deleted_page table. It avoids title uniqueness annoyances and the addition of a bunch of WHERE clauses (which also makes it more backwards compatible with anything querying the table).
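One possible reading of the title uniqueness concern can be sketched as follows, assuming the page table keeps a unique index on namespace plus title. With a deleted flag the old row still occupies the title, so a naive insert of a new page with the same title fails; with a separate table the row is moved out of the way first. The deleted_page columns (dp_page_id, dp_namespace, dp_title) and the function names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE page (
    page_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    page_namespace INTEGER NOT NULL,
    page_title     TEXT NOT NULL,
    pg_deleted     INTEGER NOT NULL DEFAULT 0,
    UNIQUE (page_namespace, page_title)
);
CREATE TABLE deleted_page (
    dp_page_id   INTEGER PRIMARY KEY,
    dp_namespace INTEGER NOT NULL,
    dp_title     TEXT NOT NULL
);
"""


def fresh_db():
    """Return a new in-memory database holding a single page titled 'Foo'."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO page (page_namespace, page_title) VALUES (0, 'Foo')")
    return db


def recreate_under_new_field():
    """Flag the page as deleted, then try to insert a new row with the same title."""
    db = fresh_db()
    db.execute("UPDATE page SET pg_deleted = 1 WHERE page_title = 'Foo'")
    try:
        db.execute("INSERT INTO page (page_namespace, page_title) VALUES (0, 'Foo')")
    except sqlite3.IntegrityError:
        return False  # the flagged row still holds the unique title
    return True


def recreate_under_new_table():
    """Move the page row to deleted_page, then insert a new row with the same title."""
    db = fresh_db()
    db.execute("INSERT INTO deleted_page (dp_page_id, dp_namespace, dp_title) "
               "SELECT page_id, page_namespace, page_title FROM page "
               "WHERE page_title = 'Foo'")
    db.execute("DELETE FROM page WHERE page_title = 'Foo'")
    db.execute("INSERT INTO page (page_namespace, page_title) VALUES (0, 'Foo')")
    live = [row[0] for row in db.execute("SELECT page_id FROM page")]
    archived = [row[0] for row in db.execute("SELECT dp_page_id FROM deleted_page")]
    return live, archived
```

In the second variant, queries against page need no extra condition to skip deleted rows, which is the backwards-compatibility point made above.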

Aaron: Would you consider it a bug or a feature that:
(1) When a page is deleted and then restored, it gets a new page ID;
(2) When a page is deleted and then recreated (i.e. a new page with the same page title is created), the new page has a new page ID (rather than the same page ID as the deleted page); and
(3) When a page is deleted, and then a new page is created with the same page title, the two revision histories have different page IDs (in rev_page and ar_page_id)?

The "new field" proposal would change all three of the above, for good or bad.
Any page recreations or restorations would put the revisions under the same page ID as the deleted page with the same page title. Thus, once a revision has a certain page ID, it will have that page ID forever. In this way, revisions deleted from a page that remains active (i.e. a revision deletion event) will be treated the same way as revisions deleted along with all the other revisions in the page (i.e. a page deletion event).

Relevant questions would be, what inconveniences are posed by having (a) page and (b) revision page IDs for a page title change with recreations and restorations; and what inconveniences are posed by having those page IDs *not* change? For example, are references to those page IDs stored in other database tables (of the core or extensions), so that those fields would need to be updated too when creation, restoration and deletion events occur? Are there some bots or other third-party tools that store page IDs and make API queries using them, whose work would be easier if the page IDs stayed the same? It might sometimes be desirable to query by page ID rather than page title, since page titles can change when pages are moved.

Despite all these revisions having the same page ID, it would always be possible to undo a deletion or undeletion event easily, because the revision IDs of the group of revisions deleted/restored in a log event would be stored in a logging table field (e.g. log_params). If it were desired to split off some revisions from the page and move them to another page's revision history, that could be done too, using that same data; and it could be undone just as easily.

So, for instance, suppose a vandal moves "foo" to "bar" and then the page is deleted; then "bar" is recreated, so that the two revision histories share a page ID. The "foo" revisions could be moved back to "foo" using the data in log_params. Also, because the logid of the deletion event would be stored in the revision table (in the indexed field rev_logid), one could easily select just those rows.
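The bookkeeping described in the last two paragraphs might look roughly like this. It is a sketch only: storing the revision IDs as JSON in log_params is an assumption made for readability (it is not MediaWiki's actual log_params format), the tables are cut down to a few columns, and the function names are hypothetical.

```python
import json
import sqlite3


def setup():
    """Create minimal revision and logging tables with an index on rev_logid."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (
            rev_id      INTEGER PRIMARY KEY,
            rev_page    INTEGER NOT NULL,
            rev_deleted INTEGER NOT NULL DEFAULT 0,
            rev_logid   INTEGER
        );
        CREATE INDEX rev_logid_idx ON revision (rev_logid);
        CREATE TABLE logging (
            log_id     INTEGER PRIMARY KEY,
            log_type   TEXT NOT NULL,
            log_params TEXT NOT NULL
        );
    """)
    return db


def log_page_deletion(db, page_id):
    """Delete a page's visible revisions and record their IDs in log_params."""
    rev_ids = [row[0] for row in db.execute(
        "SELECT rev_id FROM revision WHERE rev_page = ? AND rev_deleted = 0 "
        "ORDER BY rev_id", (page_id,))]
    cur = db.execute(
        "INSERT INTO logging (log_type, log_params) VALUES ('delete', ?)",
        (json.dumps({"rev_ids": rev_ids}),))
    log_id = cur.lastrowid
    db.execute("UPDATE revision SET rev_deleted = 1, rev_logid = ? "
               "WHERE rev_page = ? AND rev_deleted = 0", (log_id, page_id))
    return log_id


def split_to_page(db, log_id, target_page_id):
    """Reassign the revisions recorded for one log event to another page ID."""
    (params,) = db.execute(
        "SELECT log_params FROM logging WHERE log_id = ?", (log_id,)).fetchone()
    rev_ids = json.loads(params)["rev_ids"]
    db.executemany("UPDATE revision SET rev_page = ? WHERE rev_id = ?",
                   [(target_page_id, rev_id) for rev_id in rev_ids])
    return rev_ids
```

Because the log entry keeps the exact list of revision IDs, the same data could be used to reverse the split by calling `split_to_page` again with the original page ID.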

What are some examples of scenarios that would involve "title uniqueness annoyances" if the new field proposal were implemented?

Correction to comment #8. I accidentally posed the same question in both (2) and (3). The first paragraph should read instead:

...
(3) When a page "foo" (page_id 1) is deleted, and then a new page with the same page title "foo" (page_id 2) is created and then deleted, and then a new page with the same page title "foo" (page_id 3) is created, these three revision histories have different page IDs (in rev_page and ar_page_id)?
...

It's already the case that “When a page is deleted and then restored, it gets a new page ID”. That the new page has a different page_id than the deleted one will be required (and is no problem). That the two threads of deleted pages have different page_ids actually looks like a feature (but note that they will need to be merged if restored).

How do we go about reaching a decision on what to implement? Do we have to have more discussion and reach a consensus here, or on wikitech-l, or at the RFC?

Or can someone (e.g. Aaron S., Sean P., or another lead developer) make the decision without further discussion? Was that already done in comment #7? I was just wondering, because that comment was phrased in terms of "I'd rather have..." instead of "The way it's going to be is..."

I would have preferred to go with the new field option, but if the new table option will make it possible to start coding without further delay and then get the change merged into the core, I'm ready to do that. On the other hand, if there's more to discuss, feel free; I was just wondering if we were ready to move on from discussion to implementation.

aaron added a comment. Oct 27 2013, 2:56 AM

I think the discussion with/after comment #8 should be moved to the RfC talk page.

I just wanted to remind everyone that this is on the agenda for today's architecture meeting at 10 PM UTC. https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2013-11-20

It was decided at the meeting that I would write SQL queries for the "new table" option, and possibly a prototype, for review at a later architecture meeting. I suppose the best place to discuss the prototype specifications would be at bug 11402, where Aaron's earlier patches implementing that option were posted.

Nathan Larson: This issue was assigned to you a year ago.
Could you please provide a status update and inform us whether you are still working (or still plan to work) on this issue?
If you do not plan to work on this issue anymore, should the assignee be set back to default? Thanks.

(In reply to Andre Klapper from comment #15)

> Nathan Larson: This issue was assigned to you a year ago.
> Could you please provide a status update and inform us whether you are still working (or still plan to work) on this issue?
> If you do not plan to work on this issue anymore, should the assignee be set back to default? Thanks.

Resetting assignee to default. I had asked a few questions about design decisions, and didn't get an answer, so I indefinitely suspended my work on it.