
Add support for setting individual query timeout in wikimedia/rdbms
Closed, ResolvedPublic

Description

This comes out of T195520, plus talking with people and some investigation.

In #wikimedia-operations 28/05/2018

9:51 AM <addshore> marostegui: another thought was, is there a way to force certain queries to have a shorter timeout that whatever is configured on the db hosts?

9:52 AM <marostegui> addshore: we have a general query killer that kills queries that take more than 60 seconds in production, the problem is that it gets overloaded and cannot kill as fast as possible

9:52 AM <marostegui> addshore: It should be done from the code side

9:52 AM <addshore> ack

Both MariaDB and MySQL currently allow a per-query timeout to be passed to the server along with the query: since 10.1.1 for MariaDB (SET STATEMENT max_statement_time=N FOR ...) and since 5.7.4 for MySQL (as the MAX_STATEMENT_TIME hint, renamed MAX_EXECUTION_TIME in 5.7.8).

In WMF production we appear to be using MariaDB 10.1.32 (according to db1109), which would allow us to use this feature.

Example use on the command line:

mysql:wikiadmin@db1109 [wikidatawiki]> SET STATEMENT max_statement_time=1 FOR select count(*) from wbc_entity_usage;

ERROR 1969 (70100): Query execution was interrupted (max_statement_time exceeded)

mysql:wikiadmin@db1109 [wikidatawiki]> Ctrl-C -- exit!

Adding this to the MediaWiki DB abstraction would allow MediaWiki core and extensions to define optional per-query timeouts.

Relating to T195520, this could be used for selects against the wb_terms table, and would probably stop a change in the query execution plan from killing the whole shard and causing various issues on other sites.

This would probably be implemented as a new option, for example in $options for the IDatabase::select method.
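As a rough illustration of that $options plumbing (sketched in Python; the function name and the "MAX_EXECUTION_TIME" option name, taken in milliseconds, are assumptions for this sketch, not the settled rdbms API):

```python
# Illustrative sketch only: build_select() and the "MAX_EXECUTION_TIME"
# option name (milliseconds) are invented for this example.

def build_select(table, fields, options=None):
    """Build a SELECT, optionally wrapped in a MariaDB per-statement timeout."""
    options = options or {}
    sql = "SELECT {} FROM {}".format(", ".join(fields), table)
    timeout_ms = options.get("MAX_EXECUTION_TIME")
    if timeout_ms is not None:
        # MariaDB 10.1.1+: max_statement_time is in seconds, fractions allowed.
        sql = "SET STATEMENT max_statement_time={} FOR {}".format(timeout_ms / 1000, sql)
    return sql

print(build_select("wbc_entity_usage", ["count(*)"], {"MAX_EXECUTION_TIME": 1000}))
```

A caller that passes no timeout gets the plain SELECT back unchanged, so the option stays strictly opt-in.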

Is this something that DBAs would approve of?

Is limiting queries in this way 'allowed'?

I also saw somewhere that either MySQL or MariaDB has a similar option for the maximum number of scanned rows. That could possibly also be added to the abstraction, but would likely not be as valuable as a general timeout.

See also: T198755: Log the query that caused a lock timeout

Event Timeline


This is (sort of) a duplicate of T149421, T64615, T60985, and T113552, which got a lot of pushback in the past.

In WMF production we appear to be using mariadb 10.1.32 (according to db1109) which would allow us to use this function.

The migration to MariaDB 10.1 is not complete; it takes six months to a year to prepare and deploy a new major database version. We are approximately 1/2 to 2/3 of the way there. https://dbtree.wikimedia.org/

We need to support both MariaDB and MySQL; here is how that can be done (the implementation differs between the two): https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/WMFMariaDB.py;5d0fa1fcf46f4a0089ff4555fa97ab1739790147$177
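Sketched in Python (mirroring the flavor-dependent idea in WMFMariaDB.py; the function name and flavor strings here are invented for illustration), the two syntaxes differ roughly like this:

```python
def with_timeout(sql, flavor, timeout_ms):
    """Attach a per-query timeout to a SELECT, varying by server flavor.

    Illustrative only -- the real handling lives in WMFMariaDB.py and
    would live in the rdbms Database subclasses.
    """
    if flavor == "mariadb":
        # MariaDB 10.1.1+: SET STATEMENT takes seconds (fractions allowed).
        return "SET STATEMENT max_statement_time={} FOR {}".format(timeout_ms / 1000, sql)
    if flavor == "mysql":
        # MySQL 5.7.8+: optimizer hint in milliseconds, SELECT statements only.
        return sql.replace("SELECT", "SELECT /*+ MAX_EXECUTION_TIME({}) */".format(timeout_ms), 1)
    return sql  # no per-statement timeout available (e.g. SQLite)
```

This is where a getSoftwareLink-style MariaDB-vs-MySQL check in DatabaseMysqlBase could branch.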

I would love to see this being solved from the code side, because, as I said, I think it is the correct place to get it solved.
The query killer is a good workaround but unfortunately, during high load situations (like the wikidata outage) it gets overloaded too and cannot kill as fast as we'd need.

As Jaime said, we've had similar situations already and this has not been addressed, so these kinds of scenarios are still bringing us down even though we've done as much as we can from the server side.

I haven't looked at the TransactionProfiler class to see how it's implemented, but in MediaWiki core we currently have a default of $wgTrxProfilerLimits['GET']['readQueryTime'] = 5; set.
An implementation in MariaDB/MySQL itself sounds nice. DatabaseMysqlBase already varies on MariaDB vs. MySQL in getSoftwareLink, so we could probably abstract that into a separate method if we also need to handle these two implementations differently.

The migration to MariaDB 10.1 is not complete; it takes six months to a year to prepare and deploy a new major database version. We are approximately 1/2 to 2/3 of the way there. https://dbtree.wikimedia.org/

An implementation in MariaDB/MySQL itself sounds nice. DatabaseMysqlBase already varies on MariaDB vs. MySQL in getSoftwareLink, so we could probably abstract that into a separate method if we also need to handle these two implementations differently.

Do we also abstract in some way between MySQL/MariaDB versions too?

As for implementing something like this in other DBs:

  • I don't see an equivalent for SQLite. The only timeout it seems to have is for waiting for a lock.
  • In PostgreSQL you can do SET SESSION statement_timeout = $milliseconds; (as a separate statement), after which the limit remains in effect until the next set. If a transaction is open you can also do SET LOCAL statement_timeout = $milliseconds; to remain in effect until the transaction ends. There doesn't seem to be any "just for this statement" limiter.
  • I don't see an equivalent in MSSQL, unless the PHP mssql.timeout config setting actually works.
  • I didn't check Oracle.
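Summarising the per-engine findings above as a sketch (the helper and engine labels are invented; the PostgreSQL SQL is the SET LOCAL form described above):

```python
def timeout_setup(engine, timeout_ms):
    """Return the statement needed to arm a query timeout, or None.

    Illustrative helper; engine names are arbitrary labels for this sketch.
    """
    if engine == "postgres":
        # Only meaningful inside a transaction; the setting expires at
        # COMMIT/ROLLBACK, so no explicit reset is needed.
        return "SET LOCAL statement_timeout = {}".format(timeout_ms)
    if engine == "mariadb":
        # Handled per statement via SET STATEMENT ... FOR instead,
        # so there is no separate setup statement.
        return None
    if engine in ("sqlite", "mssql"):
        # No server-side per-statement timeout found for these.
        return None
    raise ValueError("unknown engine: {}".format(engine))
```

The asymmetry (statement-scoped in MariaDB, transaction-scoped in PostgreSQL, absent elsewhere) is why the abstraction would likely have to treat the option as best-effort.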
Vvjjkkii renamed this task from Create Mediawiki DB abstraction for individual query timeouts to n4baaaaaaa.Jul 1 2018, 1:07 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from n4baaaaaaa to Create Mediawiki DB abstraction for individual query timeouts.Jul 1 2018, 7:57 PM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.
Marostegui added a subscriber: Aklapper.

Poking this ticket and updating the status from T195792#4237259:
The upgrade across the fleet to MariaDB 10.1 is now done, and upgrades are progressing to the next version and OS.

Krinkle renamed this task from Create Mediawiki DB abstraction for individual query timeouts to Add support for setting individual query timeout in wikimedia/rdbms.Nov 10 2021, 4:00 AM

Change 534268 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/534268

Ladsgroup added a project: DBA.

I will be doing this as part of T293530: Improve slow read query handling and risk mitigation of outages

Trying to understand what timeout is good, I looked at the response times of half a million requests (for five different special pages on the three largest wikis). We can't go higher than 60 seconds: beyond that the query killer triggers, and MediaWiki also terminates any GET request longer than sixty seconds. Given the data below, I suggest going with 30 seconds; after that we should focus on improving schemas to reduce errors caused by this limit.
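The percentile exercise above can be reproduced with standard-library Python along these lines (the sample data here is made up; the real numbers are in P17991):

```python
import statistics

def response_time_percentiles(times_s, points=(50, 75, 90, 95, 99)):
    """Map each requested percentile to its cut point, in seconds."""
    cuts = statistics.quantiles(times_s, n=100)  # 1st..99th percentile cut points
    return {p: cuts[p - 1] for p in points}

# Made-up sample; a chosen timeout (e.g. 30 s) should cover the vast
# majority of requests while staying well under the 60 s hard limit.
sample = [0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 2.5, 5.0, 12.0, 28.0]
pct = response_time_percentiles(sample)
```

Picking the timeout from a high percentile keeps the error budget explicit: everything slower than the cut point is expected to fail fast instead of piling up on the server.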

Here are the raw data (the left column is the percentile and the right one the percentile value in seconds): P17991

Change 534268 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/534268

Change 747692 had a related patch set uploaded (by Ladsgroup; author: Aaron Schulz):

[mediawiki/core@wmf/1.38.0-wmf.12] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/747692

Change 747692 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.12] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/747692

Mentioned in SAL (#wikimedia-operations) [2021-12-16T15:03:25Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.12/includes/libs/rdbms/database/: Backport: [[gerrit:747692|rdbms: add query timeout support to Database::select() (T129093 T195792)]] (duration: 01m 11s)