
Server-side deletion of User:LorenzoMilano/sandbox
Closed, ResolvedPublic

Description

Please delete en:User:LorenzoMilano/sandbox per admin request. There are too many revisions for the stewards to delete per @Ajraddatz.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jun 26 2018, 2:56 AM
JJMC89 renamed this task from Server-site deletion of User:LorenzoMilano/sandbox to Server-side deletion of User:LorenzoMilano/sandbox. Jun 26 2018, 3:19 AM
Dcljr updated the task description. Jun 26 2018, 3:31 AM

Does this mean there's a limit to what stewards can delete too? Seems it was never documented.

revi added a subscriber: revi. Jun 26 2018, 7:58 AM

Does this mean there's a limit to what stewards can delete too? Seems it was never documented.

Technically we can delete it, but I think we are more likely to hit the execution limit (or whatever it is called) of 3 seconds.

revi added a project: DBA. Jun 26 2018, 8:01 AM

Hi DBA, adding DBA per our last interaction (around April). I think ajr didn’t bother to attempt deletion because the page has more than 40,000 revisions; do you want me to try deletion first?

I guess you'd need to delete it in small batches

revi added a comment. Jun 26 2018, 8:06 AM

One page with 40,000 revisions - I've never heard of a way to delete one page in small batches. Does it use some sort of API?

Samtar added a subscriber: Samtar. Jun 26 2018, 8:11 AM

One page with 40,000 revisions - I've never heard of a way to delete one page in small batches. Does it use some sort of API?

I don't know how that can be done. What I meant is that I guess the only possible way of doing it without hitting a timeout would be doing it in batches. Unfortunately, I do not know the details on the "how" :(

revi added a subscriber: jcrespo. Jun 26 2018, 8:27 AM

Hmm, OK, then I think it's fair to attempt to delete it via ?action=delete, because I don't know how to batch it, and you don't know either. I guess it will fail, but I think it's worth trying (just in case it succeeds). @jcrespo told me in April that the process should fail instead of causing a 2010-ish site crash.

Just try, we will see what happens.

revi added a comment. Jun 26 2018, 8:28 AM

Deleting...

revi added a comment. Jun 26 2018, 8:29 AM
To avoid creating high replication lag, this transaction was aborted because the write duration (3.9338247776031) exceeded the 3 second limit. If you are changing many items at once, try doing multiple smaller operations instead.

[WzH5NwpAMFgAAHkA@kEAAABT] 2018-06-26 08:29:08: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"

As predicted :-) but it is doing what it is supposed to do, instead of crashing the site for all users. Batching is the key here, but it may need code and functionality that I am not sure exists.

revi added a comment. Jun 26 2018, 8:40 AM

AFAIK such a tool (batching for deletion or similar) does not exist, and everyone prefers shiny new tools over (unseen) maintenance work, so I dunno if such a tool will ever exist.

Nope. API requests are also raising red flags here.

You may wish to use deleteBatch.php to do this, although just for a single page.

@MarcoAurelio that deletes several pages in batches; what we need is a way to delete the revisions of a single page in batches. I am not sure there is existing code allowing that, and it is not simple: we cannot leave a page up with partial revisions or metadata still pointing to it, but we also cannot use a single transaction or it will create lag when sent over the network.
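To make the single-page batching idea concrete, here is a minimal sketch under loud assumptions: it uses SQLite and a toy schema (`revision` and `archive` tables with only `rev_id` and `page_id` columns), and the function names are hypothetical. It is not MediaWiki code and does not touch the other metadata mentioned above; it only shows each batch being moved in its own short transaction instead of one large one.

```python
import sqlite3


def make_demo_db(n_revisions, page_id=1):
    """Build an in-memory toy database (assumed schema, not MediaWiki's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page_id INTEGER)")
    conn.execute("CREATE TABLE archive (rev_id INTEGER PRIMARY KEY, page_id INTEGER)")
    conn.executemany(
        "INSERT INTO revision (rev_id, page_id) VALUES (?, ?)",
        [(i, page_id) for i in range(1, n_revisions + 1)],
    )
    conn.commit()
    return conn


def archive_page_in_batches(conn, page_id, batch_size=1000):
    """Move one page's revisions to `archive`, one short transaction per batch.

    Returns the number of batches committed.
    """
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT rev_id FROM revision WHERE page_id = ? ORDER BY rev_id LIMIT ?",
            (page_id, batch_size),
        ).fetchall()
        if not rows:
            return batches
        ids = [(row[0],) for row in rows]
        # `with conn` commits on success and rolls back this batch on error.
        with conn:
            conn.executemany(
                "INSERT INTO archive (rev_id, page_id) "
                "SELECT rev_id, page_id FROM revision WHERE rev_id = ?",
                ids,
            )
            conn.executemany("DELETE FROM revision WHERE rev_id = ?", ids)
        batches += 1
```

Between batches the page is only partly moved, which is exactly the consistency problem raised in the comment; the sketch does not solve that.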

@jcrespo Unless operations knows a way to do that, deleting the page at once is the only option for now even if it creates lag. I've created T198176 to request that batch-deletion thing.

deleting the page at once is the only option

That is not really an option- as you saw, it fails to work. I don't think we have an option with existing code.

Maybe not via the wiki interface or API, but using deleteBatch for that single page will delete it. It'll create lag though, and I understand that DBAs hate lag, but for now I don't see any other option with the current state of things :-(

Ping @Reedy just in case he knows something else.

deleteBatch will fail too, sorry, it just runs doDeleteArticle, which is the same thing you run on api or browser.

1997kB added a subscriber: 1997kB. Jun 26 2018, 9:40 AM

So probably T198176 would be the only way of solving this at this point.

So @MarcoAurelio, to clarify, this is not a case of "we think things should be done in X way, so we block proceeding", this is a case of "I think we don't have the functionality to do so, it needs to be implemented" (T198176). I 100% agree this should be doable without needing a sysadmin overseeing it, I think by implementing something similar to the current user rename process. I am also surprised this didn't exist already (or maybe it does but I don't know about it).

Thank you @jcrespo - I understand that this is blocked because apparently nobody can delete that page, not because of not being able to perform the deletion by a particular choice. There's no choice here, apparently.

Maybe if we dive into mediawiki/maintenance or rEWMA we could find something? I don't know. Is this possible to be done via eval.php? If my memory doesn't fail, I think @Krinkle once deleted something for me using that script (I think it was a batch of outdated MediaWiki messages, definitely not pages with large edit histories, but I wonder if that could work for this case?).

Regards.

There's deleteBatch.php. You need to store the page title to a file, then run mwscript deleteBatch.php --wiki=wiki -r "reason". This should work, I think.

There's deleteBatch.php. You need to store the page title to a file, then run mwscript deleteBatch.php --wiki=wiki -r "reason". This should work, I think.

According to T198156#4314572 it'd fail :-(

Did somebody try it? I don't see all the details, but the API has an execution limit while a CLI script can run for 10 hours, if necessary.

jcrespo added a comment. Edited Jun 27 2018, 11:26 AM

@Urbanecm It is my understanding that that is for batch deleting of pages- multiple pages in batches. The problem here is a single page- revisions for each page in several batches. Of course, try (as I could be mistaken), but I checked the code and it doesn't do the batching we want and it will also fail.

CLI script can run for 10 hours

The timeout here is not on the API call, but the write transaction limit of 3 seconds. The idea of requesting T198176 is so that such a script can exist, or the deletion could even still be an API call/browser action if, for example, it was handled by the job queue.

Zoranzoki21 added a subscriber: Zoranzoki21. Edited Jun 27 2018, 11:40 AM

Can this be done with the deleteOldRevisions.php script, and then the page deleted in the standard way?

I also checked deleteOldRevisions.php and it purges revisions (without batching) but does not actually remove the page, implying it needs more work. I don't think that is ever supposed to be used on WMF production, but rather as a generic "delete wiki" tool, maybe for wikitech-static or other uses, as it will leave lots of other inconsistent stuff there. I think modifying the DeleteArticle method to do it in small chunks is the best bet (and fastest). Again, I am not an expert on this, so don't trust much what I am saying.

Again, deleting the rows of the revision table is something that I can easily do; it is the other dependencies that I don't trust doing manually (because I could break other things). I hope someone has a good script to use. I don't think this is the first time this need has appeared, is it?

The deleteOldRevisions.php script technically does something similar to page deletion (in that it removes stuff from the revision table), but conceptually for very different reasons. It's mainly intended to be used by third-party wikis that want to reduce database size by deleting all non-current revisions, trimming the history to only each page's latest revision.

As for deleteBatch.php, @MarcoAurelio is right. This script is about batching in the sense of deleting multiple pages, not about deleting an individual page's history in chunks. All ways to delete pages in MediaWiki use the same methods, and are subject to transaction limits. It would be undesirable to bypass those limits, given they exist for the purposes of stable database replication, and low lag.

We could look for similar requests in the past, but afaik there is no way for this right now. It would have to be created and engineered as a new maintenance script.

@JJMC89 @Ajraddatz Regarding the user request, I think renaming the page to an archive sub page is perhaps the better option until another solution is available.

In previous cases, the page in question has been blanked instead of deleted. That could probably be done as an interim solution here as well.

@Krinkle Do you think some eval.php magic could do it or even if we do that via that script transaction limits would still apply? Regards.

@Krinkle Do you think some eval.php magic could do it or even if we do that via that script transaction limits would still apply? Regards.

The method some people use via eval.php would use the same methods as action=delete or deleteBatch.php would, and is also subject to transaction limits. While one could manually make changes in the database, these limits exist for a reason.

The only way to solve this properly is through T198176, which would involve new code for splitting the list of a page's revisions in smaller groups, and deleting each group separately. However, I don't think that work can be prioritised. There are a number of problems that need to be solved and thought about to do that. For example, whether this can be done atomically, and what to do if something fails half-way through.
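The "what to do if something fails half-way through" question can be illustrated with a small sketch. Everything here is hypothetical (the `BatchError` class, the `delete_batch` callback, the function name); it assumes each batch is committed independently, so a failure stops the run and hands back the unprocessed IDs for a later retry. It says nothing about atomicity, which remains the open problem.

```python
class BatchError(Exception):
    """Hypothetical error raised when one batch cannot be deleted."""


def delete_in_batches(pending_ids, delete_batch, batch_size=500):
    """Delete IDs in groups; return the IDs that are still pending.

    `delete_batch` is an assumed callback that deletes one group and
    raises BatchError on failure. On failure we stop, so the caller
    can resume later with the returned remainder.
    """
    remaining = list(pending_ids)
    while remaining:
        batch = remaining[:batch_size]
        try:
            delete_batch(batch)
        except BatchError:
            break  # earlier batches stay deleted; the rest is returned
        remaining = remaining[batch_size:]
    return remaining
```

A caller would rerun `delete_in_batches` with the returned list until it comes back empty; in the meantime the page is in a partially deleted state.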

This deletion-size problem stems from the bigger problem which is that MediaWiki implements its archiving mechanism by moving rows from one table ("revision") to another ("archive") which is very inefficient and (as we now know) does not scale well for very large pages. There is no solution to this, which likely means that T198176 is impossible to solve in a good way.

Instead, I think it would make sense to invest effort to make "revision delete" better. For example, we could introduce a new level (level 4?) of revision deletion that behaves exactly like an archived revision. At which point the "Page delete" and "Restore" interfaces could stay the same as today, but internally do a revdel action, possibly without limits (for stewards)?

Izno added a subscriber: Izno. Jun 28 2018, 12:55 PM
Vvjjkkii renamed this task from Server-side deletion of User:LorenzoMilano/sandbox to zaaaaaaaaa. Jul 1 2018, 1:02 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: MarcoAurelio, Aklapper.
Restricted Application added a subscriber: Dereckson. Jul 1 2018, 1:02 AM
Wong128hk renamed this task from zaaaaaaaaa to Server-side deletion of User:LorenzoMilano/sandbox. Jul 1 2018, 1:37 AM
Wong128hk raised the priority of this task from High to Needs Triage.
Wong128hk updated the task description.
Wong128hk edited subscribers, added: MarcoAurelio, Aklapper; removed: Dereckson.
Daimona added a subscriber: Daimona. Jul 1 2018, 4:55 PM

I came across this task while reverting today's vandal. If I understood correctly the last lines of @Krinkle's proposal, then I support it. It would be to add a new option on Special:RevisionDelete to allow a "stronger" deletion, like one would get by deleting the whole page and not restoring some versions, right? Such versions would be completely deleted and won't appear in page history. If so, I've also been thinking of a feature like that in the past.

aaron added a subscriber: aaron. Jul 4 2018, 4:30 PM

@Krinkle Do you think some eval.php magic could do it or even if we do that via that script transaction limits would still apply? Regards.

The method some people use via eval.php would use the same methods as action=delete or deleteBatch.php would, and is also subject to transaction limits. While one could manually make changes in the database, these limits exist for a reason.

There are two settings for the transaction timeout, one for jobs and one for web requests (index/api). I never implemented one for CLI scripts though.

@aaron So, in theory, would it be possible to run deleteBatch.php or whichever other script deletes a page, just for that page, and there won't be any timeout? Or am I not understanding correctly all of the above? Thank you.

There is a scheduled read-only window for enwiki on the 18th (T197134). It would be nice to run such a script at that same time, so that enwiki does not have to be set read-only again at a different time.

It'd be good if we could take that opportunity to do so, yeah.

@aaron So, in theory, would it be possible to run deleteBatch.php or whichever other script deletes a page, just for that page, and there won't be any timeout? Or am I not understanding correctly all of the above? Thank you.

It would seem, from Aaron's comment, that our maintenance scripts are indeed not protected by TransactionProfiler restrictions. However, just because the protection isn't there by default, does not make it a good idea to do. It has always been trivial in one or two lines of PHP code to bypass the TransactionProfiler restriction even in a web request from administrators.

The problem is not the restriction, the problem is what the restriction is there for. Doing a big delete larger than a certain size without any form of batching, causes replication lag, which means potentially hundreds of wikis go into read-only mode for a period of time, which can cause edits and other changes to be rejected during that time. That is very bad, and that is why these restrictions exist.

The only way to do a big delete is:

  1. It is changed to not be a big delete. In other words, engineers would develop a new feature that internally does the deletion in a batched way. This requires non-trivial work and is unlikely to be prioritised over other work we are doing, given how small the impact is of simply leaving the page where it is (possibly with a blank revision on top, with any sensitive content hidden through the selective RevisionDelete process).
  2. Or, a system administrator bypasses the transaction restriction (e.g. via a maintenance script) during a time when the databases are already in read-only mode. A few times a year, databases are put in read-only mode for maintenance. It is not always practical to complicate these maintenance windows with site requests, but I suppose it is possible to make an exception, if it has approval from a DBA.
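As a rough illustration of the first option and of the replication-lag concern behind the restriction, the sketch below paces batches by checking lag before each one. All names are assumptions for illustration: `run_batch` and `get_lag_seconds` are hypothetical callbacks, and the 1-second threshold is arbitrary rather than a value from this task.

```python
import time


def run_batches_with_lag_check(batches, run_batch, get_lag_seconds,
                               max_lag=1.0, poll_interval=0.5,
                               sleep=time.sleep):
    """Run each batch only once reported replication lag is acceptable.

    `run_batch` performs one small write; `get_lag_seconds` reports the
    current replica lag. Both are assumed to be supplied by the caller.
    Returns how many times we had to wait for replicas to catch up.
    """
    waits = 0
    for batch in batches:
        # Hold off while replicas are behind, so writes do not pile up.
        while get_lag_seconds() > max_lag:
            sleep(poll_interval)
            waits += 1
        run_batch(batch)
    return waits
```

The design choice is that no single write is large and the loop backs off when replicas fall behind, so the total work is spread out instead of arriving as one big transaction.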

The enwiki read-only window is scheduled for the 18th at 6 am (T197134). If someone has an already tested script that will take less than 5 minutes to run, and can be there to run it and check its execution at 6, we can do it there and then. If not, it will have to wait until the next time we go read-only (DBAs cannot do the planned maintenance and this at the same time, but maybe someone else can).

@jcrespo Could you ask in ops@lists... if there's anyone that wishes to do it? Thanks.

@jcrespo Could you ask in ops@lists... if there's anyone that wishes to do it? Thanks.

I think it was already too late to get someone to run it.
I believe we don't only need someone who runs it, but someone who is comfortable running and possibly debugging a script that can do pretty harmful things :-)
It probably requires some dry-run and tests before going for a full run.
There will be more read-only times in the future for sure and we can probably be better prepared to run it :-)

API:Mergehistory says that part of a page's history can be merged to another page. Could the revisions be merged to a new page in batches, deleting and restoring one edit each time, or does that also fail?

MarcoAurelio closed this task as Resolved. Oct 25 2018, 1:28 PM
MarcoAurelio claimed this task.

Done.

Restricted Application added a project: User-MarcoAurelio. Oct 25 2018, 1:28 PM