Page MenuHomePhabricator

Deleting large watchlist takes > 4 seconds causing rollback due to write time limit
Closed, ResolvedPublicPRODUCTION ERROR

Description

Hi, I mentioned this on https://commons.wikimedia.org/wiki/Commons:Village_pump#Watchlist:Database_error

Since a couple of weeks I get an error message like this one when I attempt to check my watchlist:
**A database query error has occurred. This may indicate a bug in the software.[WXhEMgpAMFAAAL@zCgEAAAAO] 2017-07-26 07:27:42: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

**This occurs 9 times of 10, both on Internet Explorer and Mozilla.

As suggested on Commons Village pump I tried to clear the watchlist with https://commons.wikimedia.org/wiki/Special:EditWatchlist/clear
The result of it was:

Database error To avoid creating high replication lag, this transaction was aborted because the write duration (4.9122259616852) exceeded the 3 second limit. If you are changing many items at once, try doing multiple smaller operations instead. [WXpB5wpAADoAAJE2SnYAAACW] 2017-07-27 19:42:31: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"

Thanks for checking this.

Event Timeline

Framawiki subscribed.

the write duration exceeded the 3 second limit: @Jwh what size does your whitelist ? How many pages ?

Transaction spent 4.9122259616852 second(s) in writes, exceeding the 3 limit.

More or less same reason as T142329: Can't access large watchlist on nlwiki: 503 error ?

Aklapper renamed this task from Commons Watchlist: Database error to User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error.Jul 28 2017, 4:42 PM

Workaround: You could also go to https://commons.wikimedia.org/wiki/Special:EditWatchlist/raw and copy the items into a text editor and then paste the updated list.

zhuyifei1999 added a project: DBA.
zhuyifei1999 subscribed.

@zhuyifei1999 added a project: DBA.

@zhuyifei1999: What do you expect DBAs to do here exactly? In my understanding this can be solved by the user and does not need any DBA help.

@zhuyifei1999: What do you expect DBAs to do here exactly? In my understanding this can be solved by the user and does not need any DBA help.

I assume if there is something wrong with the queries (Fatal exception of type ... DBQueryError) / performance issues (... exceeded the 3 second limit), DBA-s are the go-to people? (Quote "Solving performance issues specific to SQL queries"). If there is nothing that can do about it on MediaWiki-side feel free to revert.

Do you happen to have the query handy?
I am guessing that probably the query needs to be split in smaller batches

Do you happen to have the query handy?
I am guessing that probably the query needs to be split in smaller batches

Here are the DBPerformance log entries for OP's request ID:

Expectation (writeQueryTime <= 1) by MediaWiki::main not met (actual: 4.9364352226257):
query-m: DELETE FROM `watchlist` WHERE wl_user = 'X' [TRX#1ab5f2]

Query affected 80704 row(s):
query-m: DELETE FROM `watchlist` WHERE wl_user = 'X' [TRX#1ab5f2]

[WXpB5wpAADoAAJE2SnYAAACW] /wiki/Special:EditWatchlist/clear   Wikimedia\Rdbms\DBTransactionSizeError from line 1177 of /srv/mediawiki/php-1.30.0-wmf.11/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 4.9122259616852 second(s) in writes, exceeding the 3 limit.

Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 54.670217990875):
query-m: ROLLBACK [TRX#1ab5f2]

Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#1ab5f2)]: 
0	4.936435	query-m: DELETE FROM `watchlist` WHERE wl_user = 'X' [TRX#1ab5f2]
1	54.670218	query-m: ROLLBACK [TRX#1ab5f2]

Note that the number of rows is 80k and the user said their watchlist was ~41k pages, but those two things are consistent with each other: watching a page also watches its talk page, and IIRC this isn't shown directly in the UI, so you'd expect the number of rows to be double the number of displayed pages.

As for the rollback, I looked at the stack trace that causes the rollback and it looks suboptimal. After any kind of DB error, we log the error then rollback the transaction, which seems like a sensible thing to do. However, in the case of DBTransactionSizeError, it appears that the error is about a write taking too long but having finished ; is it wise to abort the transaction in that case? Should we abort it out of fear of more slow writes, or should we not abort it because the rollback itself will also be slow? (cc @aaron )

Workaround: You could also go to https://commons.wikimedia.org/wiki/Special:EditWatchlist/raw and copy the items into a text editor and then paste the updated list.

As suggested on [https://commons.wikimedia.org/wiki/Commons:Village_pump#Watchlist:Database_error] I went to the raw list, >Clear the watchlist >update watchlist
I got the error "...transaction aborted..."
Would it help if I'd upload the list in .DOC here?

Although 40 791 pages aren't peanuts, there are certainly users here with much more.

In Preferences >Watchlist I've ticked:

Hide minor edits from the watchlist
Hide bot edits from the watchlist
Hide categorization of pages
Add pages and files I edit to my watchlist
Add pages and files I move to my watchlist
Add pages I create and files I upload to my watchlist
Add new files I upload to my watchlist

Show Wikidata edits in your watchlist was also on the list, but I unticked it this morning. Since then I got no errors with the Watchlist (w/in 15 minutes: ten queries I guess)

There is not much we can do here on persistenace layer- Mediawiki takes 3+ seconds to run the delete, and apparently it is aborted if it takes more than that- that is query logic- there is nothing hardcoding 3+ updates on the db (although it makes total sense to limit that). The way to move forward is for mediawiki to create several transactions for the clear operation with DELETE ... LIMIT 1000, then wait for replica lag- similarly to the way we do maintenance. The overall time will be higher, but each transaction size will be smaller. Probably on a <100K watchlist that can be done fully synchronously- and people will understand that clearing a large watchlist will take a few seconds. If things are worse, that should be moved to an async job- but a simple for loop should be easy to implement?

I could delete the rows manually or pre-warmup them so it goes faster, but I would perfer if we solve this for the general problem rather than just 1 user.

Many of these queries will get better when we abandon the inefficient way of storing page titles, but that is much more long-term.

Based on logs, this is getting worse on commonswiki, but unlike ruwiki, I cannot just pool a larger server (T171027) to minimize the issue.

It was also reported on rowiki for users with as few as 2K followed pages: https://ro.wikipedia.org/wiki/Wikipedia:Cafenea#Pagini_urmarite

Not sure if the same bug, but I have also seen the error on Recent Changes when pushing for damaging edits 30 days back (URL was: https://ro.wikipedia.org/wiki/Special:Schimb%C4%83ri_recente?hidebots=1&hidecategorization=1&hideWikibase=1&limit=500&days=30&urlversion=2&damaging=verylikelybad )

Bawolff subscribed.

Im going to reopen.

I think these are highly related but separate problems (although i see how they could look the same at first glance). In particular i think different solutions will be required for large writes to watchlist and reads timing out. I think this should track the writes > 4 seconds and T171027 should track reads > 60s

7

Bawolff renamed this task from User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error to Deleting large watchlist takes > 4 seconds causing rollback due to write time limit.Sep 10 2017, 3:21 PM

Leaving this in To Triage for now, as we continue to decide how the Growth Team will do triage.

Krinkle subscribed.

In last 30 days of WMF Logstash data, only found 1 single hit for the query (DBQueryError OR DBTransactionSizeError) AND (SpecialEditWatchlist OR SpecialWatchlist). And that hit was unrelated (EditWatchlist/raw, not EditWatchlist/clear).

This issue seems important and might just be very infrequent or maybe the error message changed, as such not closing yet, but leaving for the team to confirm/decide.

kostajh subscribed.

This appears to happen pretty frequently (see for example here), we will try to get this scheduled for Q2. Jaime’s suggestion earlier in the comments seems like the correct approach.

Looking more closely at the logs I linked to above, I believe the only problem here is with adding more than 1k records to one's Watchlist via Special:EditWatchlist/raw. Example log. And here, it's a warning that we've gone over the maximum number of rows (1000) affected per query. I'll submit a patch for this.

Otherwise, using Special:EditWatchlist/clear enqueues jobs to handle removing items. Manually removing large numbers of items (say, removing 2490 out of 2500 items in one's watchlist at Special:EditWatchlist/raw) results in individual queries (one per title) to remove from watchlist. If someone sees otherwise in the logs please add your observations here.

Change 464029 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Watchlist: Commit after each batch of watchlist items is inserted

https://gerrit.wikimedia.org/r/464029

Manually removing large numbers of items (say, removing 2490 out of 2500 items in one's watchlist at Special:EditWatchlist/raw) results in individual queries (one per title) to remove from watchlist. If someone sees otherwise in the logs please add your observations here.

That seems like a bug not a feature: a batched DELETE with IN () would probably be a bit better.

In any case, though, do those DELETE queries all happen in the same transaction? If they do, you'd still get the error from the original report, because it'd take more than 3 seconds to issue thousands or tens of thousands of individual deletes.

That seems like a bug not a feature: a batched DELETE with IN () would probably be a bit better.

Done in latest patchset.

Change 465472 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/Wikibase@master] Simplify WatchAction invocation

https://gerrit.wikimedia.org/r/465472

Change 465472 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Simplify WatchAction invocation

https://gerrit.wikimedia.org/r/465472

Change 464029 merged by jenkins-bot:
[mediawiki/core@master] Watchlist: Commit after each batch watchlist insertion and removal

https://gerrit.wikimedia.org/r/464029

Etonkovidova subscribed.

The last error was on Oct24 - logstash DBTransactionSizeError.

I've tested editing watchlist up to ~ 10,000 items - no problems were found.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM