Page MenuHomePhabricator

Deleting of pages with high number of revisions makes server cry
Closed, DuplicatePublic


If a deletion is made on a page with a really high number of revisions (like [[w:en:Wikipedia:Sandbox]] for example) the server might stop responding for an indefinite (at lest a long) time, as the process deleting is filling up all resources.

Version: 1.20.x
Severity: normal



Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 9:56 PM
bzimport set Reference to bz11402.
bzimport added a subscriber: Unknown Object (MLST).

The 'quick fix' here would be a count-and-cutoff warning to discourage people from deleting the sandbox if it's not really needed.

A more serious fix would be to actually make such deletions faster...

Deleting a whole page with a lot of revisions is expensive because the revision records need to be moved around. Currently this means copying data from 'revision' to 'archive' table, then deleting the 'revision' rows. This second part can cause contention as other updates to the revision table get delayed by (?) index locks.

Using revision-flagged deletion might or might not help here -- you'd still have to update all those rows.

Some sort of queue where the page is locked at a higher level and the database is dealt with in pieces might make sense... or that might be awful... dunno yet. :)

(In reply to comment #1)

Using revision-flagged deletion might or might not help here -- you'd still
have to update all those rows.

Well I assume updating hundreds to thousands of rows is still a lot faster than inserting, then deleting the same number of rows: you're touching only half as many rows, you're not changing tables' sizes and the server doesn't have as much data to process (only one bitfield per row vs. the entire row)

Some sort of queue where the page is locked at a higher level and the database
is dealt with in pieces might make sense... or that might be awful... dunno
yet. :)

I recall having suggested the job queue for this somewhere. My idea was that you first delete the page from the page table, thus making it appear to be deleted, and putting the rev_deleted UPDATE queries (the same query a lot of times, with a sensible LIMIT) in the job queue. The only downside is that oldid and diff links will be accessible for a while after the page's deletion.

Deletion overhaul to reduce overhead

attachment deletionNew.patch ignored as obsolete

RevisionDelete fix

attachment deletionNew.patch ignored as obsolete

More minor tweaks

attachment deletionNew.patch ignored as obsolete

This seems a bit overengineered. :)

The bad:

  1. It leaves us with three separate ways in which something might be "deleted":
  • archive/text
  • page/revision/text with rev_deleted
  • deleted_page/revision/text

plus deleted revisions for files.

None of them seem to integrate very cleanly with one another. Leaving things in the 'revision' table seems like it will look pretty weird, and there seem to be weird hackarounds for the referential integrity checks in the postgres mode... it just feels a bit icky. :)

  1. Special:Restore looks like yet another interface that.... mostly duplicates Special:Undelete? Definitely should integrate these more...

The thing I do kind of like is the idea that distinct "deletion events" could be stored, and then looked at, separately -- each "incarnation" of a page which is created, then deleted can be stored in the table with its separate page ID and revision history. In theory it'd be nice to be _able_ to merge them back together, of course...

Also we have some funky hacks to prevent restoration of pages with restricted view as the top revision. Honestly we shouldn't have to special case things like that... we should make sure that's cleaned up, if there's still a reason for it.

There also seems to be duplication between this Title::isValidRestoreTarget() and existing things like isValidMoveTarget()... they look pretty much the same and should probably be sharing code.

So some general thoughts...

  • Tone down the UI code sprawl -- we should have a single clear interface for finding and managing deleted items.
  • Ensure there's clean separation between UI and backend code. I see a lot of stuff mixed together in Special:Restore and it's hard to tell what's going on.
  • Try to clean up the referential integrity issues -- either some nice clean way that doesn't make things get deleted, or just remove the foreign keys if they're not valid... but it'd be nice to have validity checks if they can apply!

'archive/text' stuff is there for B/C, not for new deletions. I don't mind leaving stuff in 'revision', plus, it is needed to kill the o(N) overhead.

There is no 'Special:Restore' page on the wiki. Only 'Special:Undelete'. The old class is kept for B/C and the fact that all of the queries there are different as well as a good deal of formatting (merging would be uglier than leaving separate).

Merging of different incarnations (when rarely needed) can be done done via MergeHistory.

Not sure what you mean about cleaning up top revs ;)

Some common function for isValidMoveTarget() and stuff can prolly eliminate some duplication.

Patch 2.0

Removed some title.php duplication. Separated out some db logic from UI.

attachment deletionNew.patch ignored as obsolete

I'll ask greg about removing those ref SQL clauses

Cleanup ref contraints on PG

attachment deletionNew.patch ignored as obsolete

Schema cleanup/undelete paging


*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

Aaron, you could now update your patch & commit it yourself, if it's still good & needed. :-)

This is a low priority item, we think. I'm unassigning from Aaron as he hasn't worked on this in a while, and won't have time to update the patch in the near future.

  • Bug 40174 has been marked as a duplicate of this bug. ***

I'm working on a prototype that implements the "new table" option. I got away from the "archive" terminology altogether and named the new table "deleted_page".

If we keep revision.rev_page the same when the page is deleted and then undeleted, that works fine, as long as *all* the revisions are undeleted. However, suppose we only want to undelete one of the revisions for page 1. That won't work because as soon as we delete the deleted_page row for page 1, and add the page row for page 1, then all the revisions with rev_page 1 will be restored. MediaWiki won't be able to tell them apart (although it would work if page 1 were deleted, then recreated as page 2, and then some of the revisions from the still-deleted page 1 were restored into page 2, because then rev_page on the still-deleted revisions would still be 1, and there would still be a page_deleted row for page 1.)

There needs to be some way of differentiating deleted revisions from undeleted revisions with the same rev_page. Should revision.rev_deleted be set to 7 for those, or should we have a new constant? E.g., should we set const DELETED_PAGE = 16; in Revision.php and use that to indicate that the whole page was deleted and only some deletions were restored?

Setting rev_deleted to 7 seems more aligned with the goals of backward compatibility and merging the deletion and revisiondeletion systems.

See cases #3, #4, and #5 at bug 58986 for examples of why revision.rev_page may sometimes need to be changed after this system is implemented.