Page MenuHomePhabricator

Enable innodb_rollback_on_timeout for wmflabs databases
Open, MediumPublic

Description

Hi, I am debugging some odd behavior on Montage, what, in short, should not be possible given the code. It looked like a partially-applied transaction.

Poking around, I seem to have found a culprit. I've noticed from our error logs, Montage gets a fair amount of this error: 'Lock wait timeout exceeded; try restarting transaction'. (Performance has been an issue in the past with wmflabs, but that's a separate story.)

This led me to this great in-depth article. In short, though:

The InnoDB documentation clearly says "InnoDB rolls back only the last statement on a transaction timeout by default". In this case, we do not get the transaction atomicity offered by InnoDB.

And I checked the setting:

MariaDB [s53490__montage]> SHOW GLOBAL VARIABLES LIKE 'innodb_rollback_on_timeout';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| innodb_rollback_on_timeout | OFF   |
+----------------------------+-------+

This isn't a setting I can set on a transaction-by-transaction basis. The article I mentioned above talks about how to enable/test the setting, however.

In short, could we please look at getting innodb_rollback_on_timeout enabled for wmflabs?

Event Timeline

Marostegui added a subscriber: Marostegui.

How are you handling those errors in code?

Well, because it's an interactive page, if we can't get a lock within a reasonable amount of time, we don't have much choice but to roll back the transaction and request that the user try again. A partial commit really isn't an option for us.

Hi all, just checking in to see if there are any updates. Let me know if there's something I can help with.

Hey all, circling back again. Any news? Hope you're well!

bd808 moved this task from Backlog to ToolsDB on the Data-Services board.
bd808 added a subscriber: bd808.

@mahmoud "wmflabs databases" is vague. Based on the s53490__montage database name in your paste I guess you are talking about ToolsDB?

Ah, sorry, yes, ToolsDB, provided ToolsDB is what the developer-writable databases provided on toolforge are called. It may affect other databases, but we only write to ToolsDB, so we couldn't ask for any more. :)

@Marostegui, as my best source of real world DBA knowledge, do you have a counter argument for why setting innodb_rollback_on_timeout=ON would be a bad idea for the ToolsDB server?

I'm sure you remember, but for the sake of completeness this is the multi-tenant read-write MariaDB instance (technically active-passive pair) that is hosted inside the Cloud VPS environment. It uses role/templates/mariadb/mysqld_config/tools.my.cnf.erb to provision /etc/my.cnf, so we don't have to worry about configuration changes for it affecting prod db clusters.

Bstorm triaged this task as Medium priority.Jun 2 2020, 4:21 PM

Hey there, been about a month since the last update. Was curious if @Marostegui or anyone else had given some more thought to maybe increasing the ACIDity here? We're still getting pretty regular integrity errors on those Wiki Loves campaigns.

I don't have any strong opinions on this to be honest, I don't have enough data or context on why this given tool needs this, or if the tool is what might be self inducing the errors or not.
As long as you aware that if you enable this and you have big transactions, rolling back them entirely (instead of the last statement of the given transaction) could cause locking/contention issues, I am fine with it.

Hmm, so I guess I could see some users relying on this behavior to get very rudimentary batch processing. I.e., refresh their browser until all the processing is complete. Is it maybe possible to enable it for one app's db at a time? Or does it really have to be all of ToolsDB?

It is a global variable, unfortunately it cannot be enabled on a per-db level.