- Mentioned In
- T200534: Delete "User:Zyksnowy/sandbox", a page with more than 5000 edits on Wikidata
T198176: Mediawiki page deletions should happen in batches of revisions
- Mentioned Here
- T197134: Announce 30 minutes read-only time for enwiki 18th July 06:00AM UTC
T198176: Mediawiki page deletions should happen in batches of revisions
I don't know how that can be done. What I meant is that I guess the only possible way of doing it without hitting a timeout would be doing it in batches. Unfortunately, I do not know the details on the "how" :(
Hmm, OK, then I think it's fair to attempt to delete it via ?action=delete, because I don't know how to do it, and you don't know either. I guess it will fail, but I think it's worth trying (just in case it succeeds). @jcrespo told me back in April that the process should now fail instead of causing a 2010-ish site crash.
To avoid creating high replication lag, this transaction was aborted because the write duration (3.9338247776031) exceeded the 3 second limit. If you are changing many items at once, try doing multiple smaller operations instead. [WzH5NwpAMFgAAHkA@kEAAABT] 2018-06-26 08:29:08: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"
As predicted :-) But it is doing what it is supposed to do, instead of crashing the site for all users. Batching is the key here, but it may need code and functionality that I am not sure exists.
@MarcoAurelio that deletes several pages in batches; what we need is a way to delete the revisions of a single page in batches. I am not sure there is existing code that allows that, and it is not simple: we cannot leave a page up with partial revisions or metadata still pointing to it, but we also cannot use a single transaction, or it will create lag when sent over the network.
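To illustrate the behaviour seen in the error above, here is a minimal Python sketch of a guard that rolls back a write transaction once it exceeds a time budget, rather than letting it go through and lag the replicas. It is not MediaWiki's code: `run_guarded_write`, `TransactionSizeError` and the `clock` parameter are hypothetical names, and SQLite stands in for the real database.

```python
import sqlite3
import time


class TransactionSizeError(Exception):
    """Hypothetical error for a write transaction that ran too long."""


def run_guarded_write(conn, statements, max_write_seconds=3.0, clock=time.monotonic):
    """Run all statements in one transaction; roll back if they took too long.

    `statements` is a list of (sql, params) pairs. Returns the elapsed time.
    """
    start = clock()
    try:
        for sql, params in statements:
            conn.execute(sql, params)
        elapsed = clock() - start
        if elapsed > max_write_seconds:
            raise TransactionSizeError(
                f"write duration ({elapsed}) exceeded the "
                f"{max_write_seconds} second limit"
            )
        conn.commit()
    except Exception:
        # Nothing is applied: the caller sees an error, other users see no change.
        conn.rollback()
        raise
    return elapsed
```

The point of the design is that one oversized operation fails on its own, and the caller is pushed towards several smaller operations.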
Maybe not via the wiki interface or API, but using deleteBatch for that single page will delete it. It'll create lag though, and I understand that DBAs hate lag, but for now I don't see any other option with the current state of things :-(
Ping @Reedy just in case he knows something else.
So @MarcoAurelio, to clarify, this is not a case of "we think things should be done in X way, so we block proceeding"; this is "I think we don't have the functionality to do so, it needs to be implemented" (T198176). I 100% agree this should be doable without needing a sysadmin to oversee it, I think by implementing something similar to the current user rename process. I am also surprised this didn't exist already (or maybe it does but I don't know about it).
Thank you @jcrespo - I understand that this is blocked because apparently nobody can delete that page, not because of not being able to perform the deletion by a particular choice. There's no choice here, apparently.
Maybe if we dive into mediawiki/maintenance or rEWMA we could find something? I don't know. Could this be done via eval.php? If my memory doesn't fail me, I think @Krinkle once deleted something for me using that script (I think it was a batch of outdated MediaWiki messages, definitely not pages with large edit histories, but I wonder if that could work for this case?).
@Urbanecm It is my understanding that that is for batch deletion of pages: multiple pages in batches. The problem here is a single page, whose revisions need to be deleted in several batches. Of course, try it (I could be mistaken), but I checked the code and it doesn't do the batching we want, so it will also fail.
CLI script can run for 10 hours
The timeout here is not the API call, but the write transaction limit being 3 seconds. The idea of requesting T198176 is so that such a script can exist, or so that it could even remain an API call/browser action, for example if it was handled by the job queue.
I also checked deleteOldRevisions.php: it purges revisions (without batching) but does not actually remove the page, so it would need more work. I don't think it is ever supposed to be used on WMF production; it is more of a generic "delete wiki" tool, maybe for wikitech-static or other uses, as it will leave lots of other inconsistent stuff behind. I think modifying the DeleteArticle method to do it in small chunks is the best bet (and the fastest). Again, I am not an expert on this, so don't put too much trust in what I am saying.
Again, deleting the rows of the revision table is something that I can easily do; it is the other dependencies that I don't trust doing manually (because I could break other things). I hope someone has a good script to use. I don't think this is the first time this need has appeared, is it?
The deleteOldRevisions.php script technically does something similar to page deletion (in that it removes stuff from the revision table), but conceptually for very different reasons. It's mainly intended to be used by third-party wikis that want to reduce database size by deleting all non-current revisions, trimming the history to only each page's latest revision.
As for deleteBatch.php, @MarcoAurelio is right. This script is about batching in the sense of deleting multiple pages, not about deleting an individual page's history in chunks. All ways to delete pages in MediaWiki use the same methods, and are subject to transaction limits. It would be undesirable to bypass those limits, given they exist for the purposes of stable database replication, and low lag.
We could look for similar requests in the past, but afaik there is no way for this right now. It would have to be created and engineered as a new maintenance script.
The method some people use via eval.php would use the same methods as action=delete or deleteBatch.php would, and is also subject to transaction limits. While one could manually make changes in the database, these limits exist for a reason.
The only way to solve this properly is through T198176, which would involve new code for splitting the list of a page's revisions in smaller groups, and deleting each group separately. However, I don't think that work can be prioritised. There are a number of problems that need to be solved and thought about to do that. For example, whether this can be done atomically, and what to do if something fails half-way through.
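As a rough illustration of what "splitting the list of a page's revisions in smaller groups" could look like, here is a Python sketch against SQLite. It is not existing MediaWiki code. The three-table schema (`page`, `revision`, `archive`, with a `rev_text` column) is a deliberately simplified, hypothetical stand-in, and `archive_page_in_batches` and `after_batch` are made-up names. Each batch gets its own short transaction, and the page row is removed only after every revision has been moved.

```python
import sqlite3


def archive_page_in_batches(conn, page_id, batch_size=500, after_batch=None):
    """Move one page's revisions to the archive table in small transactions.

    Returns the number of revisions moved. If a batch fails, only that batch
    is rolled back; earlier batches stay archived and the page row is kept,
    so the call can be repeated to continue.
    """
    moved = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT rev_id FROM revision WHERE rev_page = ? "
            "ORDER BY rev_id LIMIT ?",
            (page_id, batch_size),
        )]
        if not ids:
            break
        marks = ",".join("?" * len(ids))
        try:
            conn.execute(
                "INSERT INTO archive (ar_rev_id, ar_page_id, ar_text) "
                "SELECT rev_id, rev_page, rev_text FROM revision "
                f"WHERE rev_id IN ({marks})",
                ids,
            )
            conn.execute(f"DELETE FROM revision WHERE rev_id IN ({marks})", ids)
            conn.commit()  # one short write transaction per batch
        except Exception:
            conn.rollback()
            raise
        moved += len(ids)
        if after_batch is not None:
            after_batch(moved)  # e.g. pause here until replicas catch up
    # Remove the page row only once no revisions point to it any more.
    conn.execute("DELETE FROM page WHERE page_id = ?", (page_id,))
    conn.commit()
    return moved
```

The sketch does not answer the atomicity question raised above: between batches the page exists with only part of its history, which is exactly the intermediate state that would need a design decision.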
This deletion-size problem stems from the bigger problem which is that MediaWiki implements its archiving mechanism by moving rows from one table ("revision") to another ("archive") which is very inefficient and (as we now know) does not scale well for very large pages. There is no solution to this, which likely means that T198176 is impossible to solve in a good way.
Instead, I think it would make sense to invest effort to make "revision delete" better. For example, we could introduce a new level (level 4?) of revision deletion that behaves exactly like an archived revision. At that point the "Page delete" and "Restore" interfaces could stay the same as today, but internally do a revdel action, possibly without limits (for stewards)?
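A small Python sketch of this idea, assuming that revision visibility is stored as a bit field on each revision: "deleting" a page then means setting one extra bit in place instead of moving rows to another table. The flag names and values below are illustrative only, and `ARCHIVED` is a hypothetical new level, not an existing MediaWiki constant.

```python
from dataclasses import dataclass

# Illustrative visibility bits (assumed values, not taken from MediaWiki).
TEXT_HIDDEN = 1
COMMENT_HIDDEN = 2
USER_HIDDEN = 4
ARCHIVED = 16  # hypothetical level: behaves like an archived revision


@dataclass
class Revision:
    rev_id: int
    flags: int = 0


def delete_page(revisions):
    """'Page delete': mark every revision as archived, in place."""
    for rev in revisions:
        rev.flags |= ARCHIVED


def restore_page(revisions):
    """'Restore': clear the archived bit and keep any other hidden bits."""
    for rev in revisions:
        rev.flags &= ~ARCHIVED


def visible_history(revisions):
    """Revision ids that would still show up in the page history."""
    return [rev.rev_id for rev in revisions if not rev.flags & ARCHIVED]
```

Because each revision is changed in place, the flag updates for a very large page could be applied a few rows at a time, and restoring keeps whatever finer-grained hiding was already set.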
I came across this task while reverting today's vandal. If I understood correctly the last lines of @Krinkle's proposal, then I support it. It would be to add a new option on Special:RevisionDelete to allow a "stronger" deletion, like one would get by deleting the whole page and not restoring some versions, right? Such versions would be completely deleted and won't appear in page history. If so, I've also been thinking of a feature like that in the past.
It would seem, from Aaron's comment, that our maintenance scripts are indeed not protected by TransactionProfiler restrictions. However, just because the protection isn't there by default, does not make it a good idea to do. It has always been trivial in one or two lines of PHP code to bypass the TransactionProfiler restriction even in a web request from administrators.
The problem is not the restriction, the problem is what the restriction is there for. Doing a big delete larger than a certain size without any form of batching, causes replication lag, which means potentially hundreds of wikis go into read-only mode for a period of time, which can cause edits and other changes to be rejected during that time. That is very bad, and that is why these restrictions exist.
The only way to do a big delete is:
- It is changed to not be a big delete. In other words, engineers would develop a new feature that internally does the deletion in a batched way. This requires non-trivial work and is unlikely to be prioritised over other work we are doing, given how small the impact is of simply leaving the page where it is (possibly with a blank revision on top, with any sensitive content hidden through the selective RevisionDelete process).
- Or a system administrator bypasses the transaction restriction (e.g. via a maintenance script) during a time when the databases are already in read-only mode. A few times a year, databases are put in read-only mode for maintenance. It is not always practical to complicate these maintenance windows with site requests, but I suppose it is possible to make an exception, if it has approval from a DBA.
enwiki read-only is scheduled for the 18th at 6 am (T197134). If someone has an already tested script that takes less than 5 minutes to run, and can be there to run it and check its execution at 6, we can do it there and then. If not, it will have to wait until the next time we go read-only (DBAs cannot do the planned maintenance and this at the same time, but maybe someone else can).
I think it was already too late to get someone to run it.
I believe we don't only need someone who runs it, but someone who is comfortable running and possibly debugging a script that can do pretty harmful things :-)
It probably requires some dry-run and tests before going for a full run.
There will be more read-only times in the future for sure and we can probably be better prepared to run it :-)
API:Mergehistory says that part of a page's history can be merged to another page. Could the revisions be merged to a new page in batches, deleting and restoring one edit each time, or does that also fail?