Page MenuHomePhabricator

Mediawiki page deletions should happen in batches of revisions
Closed, ResolvedPublicPRODUCTION ERROR

Description

We sometimes encounter situations such as T198156: Server-side deletion of User:LorenzoMilano/sandbox that could be solved, according to @jcrespo and @Marostegui if deletions happened in batches of revisions. It'd also help reduce or suppress db lag once the deletion is complete. Thank you.

Batches implies multiple transactions and wait for slaves- which breaks the ability to quickly revert. This means that atomicity would have to be implemented on application side, or change the model of deletion completely, as suggested on T20493.

When attempting to delete or undelete a page with many revisions, the operation fails with an exception of type "Wikimedia\Rdbms\DBTransactionSizeError".

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ToBeFree updated the task description. (Show Details)Jul 1 2018, 2:42 AM
ToBeFree renamed this task from faaaaaaaaa to Deletions should happen in batches of revisions.Jul 1 2018, 2:44 AM
ToBeFree lowered the priority of this task from High to Medium.
jcrespo renamed this task from Deletions should happen in batches of revisions to Mediawiki page deletions should happen in batches of revisions.Jul 11 2018, 9:50 AM
jcrespo updated the task description. (Show Details)
Krinkle updated the task description. (Show Details)Jul 17 2018, 2:36 AM

It can't be done in the request because request threads can die at any time. But it could be done in the job queue, which has the means to retry jobs which do not complete successfully, so it could be reasonably reliable.

One way to do it would be to move the old revisions first, say in chronological order, leaving only the current revision, and then to finally delete the page and do the normal page deletion consistency updates. That way, it would be recoverable even if the job queue were lost, you would just start a new deletion job for the same page. The final page deletion could use existing internal interfaces.

Even with T20493 (unified deletion), we still need a way to reliably split up deletions over multiple transactions, since rev_archived will need to be updated, to remove the revisions from the filtered contributions view. It will be faster but still could break replication if you try to do it in a single transaction. So we could implement the job now, and after T20493, change it to update rev_archived instead of moving rows to the archive table.

@tstarling Thank you for your evaluation, it is very useful! My one comment is that, even if the jobqueue is to take care of it, the user would expect some immediate feedback (not sure how true that is?- maybe a UI message would be enough) to see a logical delete, and that may still need some storage design implications, even if not T20493 fully. I guess it could be done without it with a "Page X is scheduled for delation, please be patient"- and the user will be able to see revisions disappearing. Also, this may be so unusual it may not be necessary.

If jobqueue is too complicated in the current status, a dedicated maintenance script to delete a page could be enough to solve T198156, whose loop actions could be reused later for the queue implementation.

I'm imagining that we would do it without a progress bar, just a message like you say. It's a UX improvement compared to an exception message. But I wonder whether we should still require the bigdelete right before launching a job? Currently $wgDeleteRevisionsLimit is 5000, and above that number of revisions, the bigdelete right is required. Is 5000 also an appropriate threshold for queueing a job?

1000 rows is our usual standard for batching, but of course, it always depends on the transaction write timeout, which may change in the future. I would use that, but if it was me, I would make it independent of the right- assuming it works reliably well every time (it is not fragile and causes no confusion).

tstarling assigned this task to BPirkle.Aug 7 2018, 2:00 AM

For testing after the new feature reaches group0, I note that there are a few pages with a large number of revisions on test.wikipedia.org:

wikiadmin@10.64.16.191(testwiki)> select * from ( select page_namespace,page_title,count(*) as n from page,revision where page_id=rev_page group by rev_page ) as counts order by n desc limit 10;
+----------------+----------------------------------------+-------+
| page_namespace | page_title                             | n     |
+----------------+----------------------------------------+-------+
|              2 | Mike.lifeguard/04-edit.t               | 22153 |
|              0 | Main_Page                              | 11014 |
|              2 | Mike.lifeguard/05-revert.t             | 10584 |
|              2 | Mike.lifeguard/07-unicode.t/2          |  8596 |
|              2 | Mike.lifeguard/07-unicode.t/3          |  8529 |
|              2 | Mike.lifeguard/07-unicode.t/éółŽć      |  8490 |
|              2 | Mike.lifeguard/34-secure.t             |  7332 |
|              4 | Sandbox                                |  3455 |
|              0 | Sandbox/MediaWiki_perl_module          |  3065 |
|              0 | Test_page_for_app_testing/Section1     |  2766 |
+----------------+----------------------------------------+-------+
10 rows in set (0.44 sec)

Apparently there's a client library, MediaWiki::Bot in CPAN, written by Mike.lifeguard, which edits a page every time its tests are run. IIRC CPAN runs tests by default when you install a library. So those pages can be safely deleted.

During the period in which a page's revisions are being archived, but the page is not yet deleted, it seems possible to perform operations on a page. For example, I can edit (creating additional revisions) or delete (creating a parallel deletion operation with its own jobs) while the original deletion is being processed.

If so, here is my proposed order of operations. I would appreciate feedback on anything I may be missing:

  1. Page deletion is initiated
  2. WikiPage::doDeleteArticleReal performs sanity checks and invokes the ArticleDelete hook
  3. WikiPage::doDeleteArticleReal calls new function WikiPage::doBatchedDeleteArticle (naming suggestions welcome)
  4. WikiPage::doDeleteArticleBatched archives as many revisions as allowed, per new global $wgDeleteRevisionsBatchSize, from oldest to newest per rev_timestamp
  5. WikiPage::doDeleteArticleBatched determines if all revisions have been archived

5a) if yes, WikiPage::doDeleteArticleBatched finishes the page deletion, invokes hook ArticleDeleteComplete, and we are done
5b) if no, WikiPage::doDeleteArticleBatched queues one (and only one) job

  1. job executes and calls WikiPage::doDeleteArticleBatched. Goto #4.

Using one (and only one) job per deletion operation, rather than queuing multiple jobs at the beginning for archival of all revisions , fluidly handles changes that occur while the deletion operation is being performed. If the page is edited and additional revisions are added during the deletion, they will simply be processed by the final job. If an additional deletion operation is started while the first is still in progress, then multiple jobs will be in progress at once. However, they will interleave archive revisions from oldest to newest, each picking up where the other left off, which causes no harm. At the end, only one job will be able to perform the actual page delete. The other(s) will recognize that there is nothing further to do and silently no-op.

I have enough implemented locally to see that the above appears to work okay in a controlled test. Moving a page while it is being deleted still gets a little wonky. But before I sort that out I thought I'd post the idea to see if anyone wanted to tell me I was crazy. :-)

One consequence: each new job is queued by the previous job, and must then wait to be executed by the queue. If the job queue is large, this may take awhile compared to queuing many/all archival jobs at the beginning. Spreading things out this way may be desirable (by preventing archival jobs for pages with many revisions from dominating the queue) or undesirable (by making deletions take longer). If anyone has any feedback on that aspect, please let me know that as well. Hybrid approaches are possible if necessary.

Change 456035 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Use job queue for deletion of pages with many revisions

https://gerrit.wikimedia.org/r/456035

The above patchset is not ready to be merged (and I will review it accordingly in Gerrit). At a minimum, it does not yet communicate back to the user that page deletion is not immediate for cases where the job queue is invoked. There may be other shortcomings, as I have not yet tested to my own satisfaction.

I am posting the patchset in its current state so that anyone interested can see where I'm currently going with this and provide feedback.

Patchset updated, hopefully this is much closer.

Page moves while a deletion operation is in progress are now happy, with no change in the user experience from what people saw before.

If a page deletion does not involve the job queue, the user experience is unchanged from what is has been. If the job queue is involved, the user sees a message with the following (English) text (along with a link to return to the Main Page):

The page "DeleteMe" is scheduled for deletion. Please be patient.

I welcome any wording improvement suggestions.

Question: before these changes, if the delete successfully completed, then the ArticleDeleteAfterSuccess hook was called from Article::doDelete. After these changes, it is still called there if the job queue is not involved. But if the job queue is involved, I'm never calling it. Should I call it...

  1. ...immediately, during the initial web request, before the page is actually fully deleted?
  2. ...immediately, and add a parameter indicating whether the deletion was immediate or will be delayed?
  3. ...after the page is fully deleted, (which could be quite some time)?
  4. ...never?

The other similar hook is ArticleDeleteComplete. It was called from WikiPage::doDeleteArticleReal when the page was successfully deleted. It is now called from WikiPage::doDeleteArticleBatched when the page is fully deleted (either in the initial web request or by the final job). So code wanting to know about deletions from a data perspective can use that, and be informed when a page deletion is complete, no matter how the deletion was performed.

I see that ArticleDeleteAfterSuccess invokes \Wikibase\ClientHooks::onArticleDeleteAfterSuccess - maybe I should refactor that to be called via ArticleDeleteComplete instead, so that it runs after page deletion is complete?

I like the idea of the "this page is scheduled for deletion" message. Given that English is not my mothertongue I'll let others decide which could be the best wording. Once the message appears one should feel free to go do other stuff, close the page and let the JobQueue do their magic.

Probably editting and all other operations on the page during the deletion should be locked to avoid the page from being editted while the deletion is in progress, or protected or another deletion being scheduled at the same time.

As for the JobQueue questions, they're outside my knowledge. I'll let @aaron answer them.

In any case I'd like to thank you for this work.

Thank you for the feedback, @MarcoAurelio

From a technical/data perspective, editing or scheduling another deletion while the deletion is in progress is not a problem.

  • Edits: any new revisions will be archived before the final page deletion is complete, so nothing is lost
  • Deletes: jobs from additional scheduled deletions will happily interleave with jobs from the first scheduled deletion. At the end, one job will perform the final page deletion, while any others silently realize there is nothing to do anymore and stop.

But from a user perspective, it may not be nice to let people make edits on pages that are scheduled for deletion. I wonder how often this will happen. Will it be rare that a deletion is over the threshold and uses the job queue, or will it be frequent?

Right now the threshold to use the job queue is set to 1000 revisions (in DefaultSettings.php) but that is easily adjustable if we feel a different number is more appropriate.

Note that if edits will be disabled during a short period of time there will be stick vandalism problem.

Answering to patch 5 Tim's comments, but didn't want to affect the review process: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/456035/ It would be nice to have a cli mediawiki maintenance that does what you suggest, just calling the function in a loop without creating lag- while users should be able to delete after the patch arbitrarily large pages last time I checked there was only a "Delete all wiki pages" script.

Change 461210 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/BlueSpiceFoundation@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461210

Change 461211 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/Flow@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461211

Change 461212 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/GlobalCssJs@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461212

Change 461214 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/SplitPrivateWiki@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461214

Change 461216 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/Translate@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461216

Change 461219 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/UserMerge@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461219

@jcrespo , would the deleteBatch.php maintenance script meet your needs? I've been using it in my own testing to delete single pages or specified groups of pages. My last patchset modified it to use "immediate mode", which performs the deletion without use of the job queue.

If deleteBatch.php will be able to use to delete large pages in revision batches when the patch is merged, yes, it will fulfill fully my needs. I couldn't use it before because it batches pages, but not revisions within a page.

Change 456035 merged by jenkins-bot:
[mediawiki/core@master] Use job queue for deletion of pages with many revisions

https://gerrit.wikimedia.org/r/456035

Change 461216 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461216

Change 461212 merged by jenkins-bot:
[mediawiki/extensions/GlobalCssJs@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461212

Change 461211 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461211

Change 461214 merged by jenkins-bot:
[mediawiki/extensions/SplitPrivateWiki@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461214

Isn't this ready to be resolved now?

Krinkle added a subscriber: Krinkle.

I think it'd be good to see in production first before closing the task, given it's a fairly major change. But I'll move it on the prod-error dashboard for now, so that it doesn't look like it's (currently) waiting for new work to be done.

Change 461219 merged by jenkins-bot:
[mediawiki/extensions/UserMerge@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461219

Change 461210 merged by jenkins-bot:
[mediawiki/extensions/BlueSpiceFoundation@master] Compatibility with T198176 (use job queue for deletion of pages with many revisions)

https://gerrit.wikimedia.org/r/461210

Izno added a comment.Oct 20 2018, 8:38 PM

This works in production though apparently it was unexpected.

Oshwah added a subscriber: Oshwah.Oct 22 2018, 10:36 PM
TheDJ added a subscriber: TheDJ.Oct 23 2018, 8:16 PM

Ppl are reporting that even deletions with just 4 revisions can take up to 15 secs now. That doesnt seem normal and/or desirable to me. Is this expected ?

@TheDJ , no, it is not expected. The performance issue is being investigated under T207530.

I could successfully delete the page listed at T198156 after the patches by @BPirkle and @tstarling - The process was not very intuitive however, as I received a "success" message when deleting, but the page still sat there publicly viewable until the job queue finished to archive the page revisions. I somewhat expected this from happening and didn't retry the deletion, but maybe others won't expect this and would retry deletion over and over, or file bugs because they got a deletion successful-type message but the page was not deleted. As such, a warning such as those we see when we use the MediaWiki-extensions-Translate might help people know that the deletion of large pages is not going to be instant, asking them to be patient until the gnomes in the background JobQueue finishes its job. Just a suggestion. Otherwise I would like to thank the two people I mentioned above for expeditiously implementing this batched-deletion feature on MW core and extensions. Thank you.

MarcoAurelio- great suggestion. I would also add to check the effect if a page is retried to be deleted several times (a common occurence in the past due to the way deletion requests are handled)- I know there was work on preventing issues- but it would be nice to recheck on production.

revi added a subscriber: revi.Oct 27 2018, 6:19 AM

@MarcoAurelio and @jcrespo : retrying deletion of a page several times *should* be fine, and was tested locally (as was editing a page that was in the process of being deleted). With that said, rechecking on production is prudent. Please let me know if you see any issues.

If the current messaging is insufficient, we can certainly consider additional messaging.

One additional possibility: the number of pages per batch is controlled by $wgDeleteRevisionsBatchSize, which defaults to 1000. Pages with fewer than than many revisions will be deleted immediately, without involving the job queue. The value 1000 was chosen simply because that what's we were using for other batching operations. I don't know the statistics on how many revisions deleted pages tend to have, but if changing that number has a significant impact on how often batching occurs (vs immediate deletion), and therefore how often people potentially get confused by deletion over time, then that value could be (carefully) adjusted.

If we do a messaging change and/or a batch size adjustment, I suggest we do it under another task focused on that purpose.

I'm going to mark the current task as resolved, but please feel free to open an additional task or tasks related to various improvement possibilities.

BPirkle closed this task as Resolved.Nov 6 2018, 4:05 PM
mmodell added a subscriber: mmodell.

This is a really great change and I just wanted to give some props for a job well done! Thanks @BPirkle

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM