Page MenuHomePhabricator

Wikibase\Repo\Store\WikiPageEntityStore::updateWatchlist: Automatic transaction with writes in progress (from DatabaseBase::query (LinkCache::addLinkObj)), performing implicit commit!
Closed, ResolvedPublicPRODUCTION ERROR

Description

This is new in wmf.11: https://logstash.wikimedia.org/goto/17a8c23e6242da9e8b783b7a434e197b (last 2 days, relative, for the error)

Setting to UBN! not because I know it's causing problems for users (I don't), but because it is new logspam that should be dealt with.

Event Timeline

Similar errors also showing up frequently now:

DatabaseBase::deadlockLoop: Automatic transaction with writes in progress (from DatabaseBase::query (LinksUpdate::acquirePageLock)), performing implicit commit!

and

DatabaseBase::deadlockLoop: Automatic transaction with writes in progress (from DatabaseBase::query (Wikibase\Lib\Store\SiteLinkTable::deleteLinksOfItem)), performing implicit commit!

and

DatabaseBase::deadlockLoop: Automatic transaction with writes in progress (from DatabaseBase::query (LinkCache::addLinkObj)), performing implicit commit!

The four of these messages account for 633 (the one in the title), 283 (the first in this comment), 194 (the middle one), and 84 (the last one) log messages in the past 4 hours, which, according to logstash, are the top 3 and #6 log message culprits.

Since 10:30 the number of log errors has increased to 2000/hour, probably related to Wikidata API requests by bots.

This (just the one in the task title) now caused 3,643 fatals in the last hour. Please evaluate ASAP. cc @aude @daniel @Lydia_Pintscher

Lydia_Pintscher moved this task from Proposed to Backlog on the Wikidata-Sprint-2016-07-19 board.
Lydia_Pintscher added a subscriber: hoo.

Adding to current sprint for investigation.

Last hour:

Selection_130.png (210×1 px, 14 KB)

What changed at ~19:26?

Sadly, deadlockLoop doesn't tell us what called it or what it is trying to do. Knowing which transaction was still open is nice, but knowing what tried to open another one would be helpfull.

Oh! This warning is logged when an automatic transaction is flushed because an explicit transaction is started. This used to be a debug log entry, not a production warning. Aaron changed this to a warning in https://gerrit.wikimedia.org/r/#/c/298240/1 which got merged a week ago, on July 15.

So the situation isn't new, the logging is.

I'm not sure why Aaron changed this. As far as I understand, it is in the nature of automatic transactions to be flushed in such a situation. This doesn't usually indicate a problem.

hashar lowered the priority of this task from Unbreak Now! to Medium.Jul 22 2016, 11:22 AM
hashar subscribed.

Lowering priority per @daniel . Namely the message has been around for a while, it is not user facing issue and the priority of the message to recently promoted from debug to warnings.

@greg why do you think this caused fatal errors? I don't see how it could.

It was on the MediaWiki fatals dashboard in logstash :)

Premature commits like this make multi-DB transactions less safe. Only Job/DeferrableUpdate/Maintenance code should be flushing transactions. If something needs its own transaction it should use some sort of deferred update.

@aaron If I interpret what you say correctly, that means that we should not use automatic transactions in web requests at all, at least none that write.

I suggest to have a larger discussion about the appropriate use of transactions, on wikitech-l or as an RFC. In the past, automatic transactions were (ab)used as a kind of write buffer, while use of transactions to achieve atomic updates was rare. It seems like you want to change that. I'm all for saner use of transactions, but it seems we should examine the current usage patterns more closely before deprecating them.

It was on the MediaWiki fatals dashboard in logstash :)

Interesting - why, if these were warnings? Or are DB warnings considered fatal for some reason?

I was inspired by T119736. I'm not sure what the "write buffer" thing is, but premature commits are just broken and make the code less rollback-safe.

Current use is mostly fine. For the last 24hr, <<+channel:wfLogDBError +message:*implicit* -message:*Wikibase* -message:*deadlockLoop*>> just shows a few Banner class notices. So there is little to fix here.

Interesting - why, if these were warnings? Or are DB warnings considered fatal for some reason?

See the dashboard for what it has/doesn't have.

The watchlist one is likely fixed by d484555db6b734ef56edf2d521dbcfb54170c7a6 in core . The extension code looks OK.

The other one is caused by using deadlockLoop(), which I'd suggest removing.

This (although it is not the only one) is causing problems with huge sizes on logs. I would suggest to either disable the logging or sample them or all logs will be broken!

Then you can continue fixing it in the same way before put it back to error level.

Change 301538 had a related patch set uploaded (by Aaron Schulz):
Remove deadlockLoop() from places with few errors

https://gerrit.wikimedia.org/r/301538

Change 301551 had a related patch set uploaded (by Aaron Schulz):
Avoid use of deadlockLoop() in EntityPerPageTable

https://gerrit.wikimedia.org/r/301551

Change 301538 merged by jenkins-bot:
Remove deadlockLoop() from places with few errors

https://gerrit.wikimedia.org/r/301538

Change 301551 merged by jenkins-bot:
Avoid use of deadlockLoop() in EntityPerPageTable

https://gerrit.wikimedia.org/r/301551

All of @aaron's patches have been merged, can we consider this resolved?

I will check the logs.

We will deploy the fixes this week (with wmf13). The improvements should take effect once the branch hits Wikidata (tomorrow).

Indeed, it continues happening on WMF-servers. As we have the tag Wikimedia-production-error (a WMF issue rather than a mediawiki or wikidata-only), I would like to keep this open if you are ok with that to check, for example, it does not happen by other unrelated reasons (both me and Greg can do that).

What I do in such cases is to move it to "Done" on my own board if my part is done, if that is an alternative for cross-team issues? I hope that is ok for you.

I cannot see this anymore since 23 UTC yesterday. If you can confirm its successful deployment, I will mark this as resolved.

jcrespo claimed this task.

I see this and T140968 resolved since Aug 4.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM