Page MenuHomePhabricator

High database load from MediaWiki\Extension\OAuth\Backend\Hooks::getUsedConsumerTags
Closed, ResolvedPublic

Description

After upgrading FactGrid from Debian 10 (Buster) to 11 (Bullseye), I am seeing very high database load from this query (according to SHOW PROCESSLIST):

SELECT /* MediaWiki\Extension\OAuth\Backend\Hooks::getUsedConsumerTags  */ DISTINCT ct_tag_id  FROM 

I’ve temporarily worked around it with the following patch (on REL1_39 of mediawiki/extensions/OAuth, specifically based on e0c7ffe159):

diff --git a/src/Backend/Hooks.php b/src/Backend/Hooks.php
index f0f4660c..7a6d0a40 100644
--- a/src/Backend/Hooks.php
+++ b/src/Backend/Hooks.php
@@ -155,6 +155,12 @@ EOK;
 			$allTags[] = Utils::getTagName( $row->oarc_id );
 		}
 
+		// Lucas hack: directly return all tags, skip the below query which is apparently very expensive
+		foreach ( $allTags as $tag ) {
+			$tags[] = $tag;
+		}
+		return true;
+
 		// Step 2: Return only those that are in use.
 		$changeTagDefStore = MediaWikiServices::getInstance()->getChangeTagDefStore();
 		$tagIds = [];

I’ve already tried to optimize the table and it hasn’t helped; I’m still able to reproduce the issue by just git stashing the above patch away.

This is especially weird given that there’s only a single OAuth consumer on this wiki, so if I’m not mistaken, the slow query is literally just:

MariaDB [factgridwikidata]> SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1);
+-----------+
| ct_tag_id |
+-----------+
|         1 |
+-----------+
1 row in set (2.426 sec)

MariaDB [factgridwikidata]> EXPLAIN SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1);
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
| id   | select_type | table              | type | possible_keys | key          | key_len | ref   | rows    | Extra       |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
|    1 | SIMPLE      | factgridchange_tag | ref  | ct_tag_id_id  | ct_tag_id_id | 4       | const | 3379541 | Using index |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
1 row in set (0.001 sec)

Why is it so slow, even though it should only have to find one row and there’s an index on the ct_tag_id? (2.4 seconds isn’t an eternity, but the query seems to run fairly frequently, so that this is still enough to cause issues for the server – there were even some “unable to connect” errors while the database was overloaded.)

Tentatively tagging mariadb-optimizer-bug, since this seemed to happen since the upgrade from MariaDB 10.3 (specifically 10.3.39-0+deb10u2, if I’m not mistaken) to 10.5 (10.5.23-0+deb11u1) – the MediaWiki code wasn’t changed during this upgrade. (Though I’m also surprised that the result of this DB query, cheap or not, isn’t cached somewhere in OAuth, given how often this hook handler seems to be called.)

(Note: Earlier today I had upgraded FactGrid from Debian 9 / Stretch / MariaDB 10.1 to Debian 10 / Buster / MariaDB 10.3, but as far as I could tell, the query only became slow with the Debian 11 / MariaDB 10.5 upgrade.)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Alternatively, the query seems to also become fast again if I just add a LIMIT, like so:

diff --git a/src/Backend/Hooks.php b/src/Backend/Hooks.php
index f0f4660c..ea93646f 100644
--- a/src/Backend/Hooks.php
+++ b/src/Backend/Hooks.php
@@ -179,7 +179,7 @@ EOK;
 				[ $field ],
 				$conditions,
 				__METHOD__,
-				[ 'DISTINCT' ]
+				[ 'DISTINCT', 'LIMIT' => count( $tagIds ) ] // Lucas hack: LIMIT added for T370578
 			);
 			foreach ( $res as $row ) {
 				$tags[] = $changeTagDefStore->getName( intval( $row->ct_tag_id ) );

(This should be correct in cases where there’s more than one OAuth tag ID, though if not all of them are used I guess it might still be slow because MariaDB won’t hit the limit. But that kinda depends on whether this slowness is even affecting anyone else in the first place, I suppose.)

Change #1055523 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/OAuth@master] Hooks::getUsedConsumerTags: Add LIMIT

https://gerrit.wikimedia.org/r/1055523

Can you reproduce the optimizer bug? Those usually result in a bad query plan, and this query plan looks fine - it's a single index lookup.

I’m not sure what you mean… I can reliably reproduce the behavior that the query is being slow (ca. 2 seconds, sometimes a bit less) when to me it seems like it shouldn’t be:

MariaDB [factgridwikidata]> SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1)\G
*************************** 1. row ***************************
ct_tag_id: 1
1 row in set (2.238 sec)

MariaDB [factgridwikidata]> SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1)\G
*************************** 1. row ***************************
ct_tag_id: 1
1 row in set (2.420 sec)

MariaDB [factgridwikidata]> SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1)\G
*************************** 1. row ***************************
ct_tag_id: 1
1 row in set (2.254 sec)

MariaDB [factgridwikidata]> SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1)\G
*************************** 1. row ***************************
ct_tag_id: 1
1 row in set (2.063 sec)

But I don’t know how to peek into the optimizer beyond that. SHOW EXPLAIN looks the same:

MariaDB [(none)]> SHOW EXPLAIN FOR 44979;
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
| id   | select_type | table              | type | possible_keys | key          | key_len | ref   | rows    | Extra       |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
|    1 | SIMPLE      | factgridchange_tag | ref  | ct_tag_id_id  | ct_tag_id_id | 4       | const | 3379541 | Using index |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+-------------+
1 row in set, 1 warning (0.000 sec)

MariaDB [(none)]> SHOW WARNINGS;
+-------+------+--------------------------------------------------------------------------+
| Level | Code | Message                                                                  |
+-------+------+--------------------------------------------------------------------------+
| Note  | 1003 | SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1) |
+-------+------+--------------------------------------------------------------------------+
1 row in set (0.000 sec)

Hm, in ANALYZE, there’s a huge mismatch between filtered (0%) and r_filtered (100%):

MariaDB [factgridwikidata]> ANALYZE SELECT DISTINCT ct_tag_id FROM factgridchange_tag WHERE ct_tag_id IN (1);
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+------------+----------+------------+-------------+
| id   | select_type | table              | type | possible_keys | key          | key_len | ref   | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | factgridchange_tag | ref  | ct_tag_id_id  | ct_tag_id_id | 4       | const | 3379541 | 7613186.00 |     0.00 |     100.00 | Using index |
+------+-------------+--------------------+------+---------------+--------------+---------+-------+---------+------------+----------+------------+-------------+
1 row in set (2.752 sec)

I wonder if this is because filtering by ct_tag_id=1 is much less selective than MariaDB thinks it will be? (Though it’s expecting to read 3.4 million rows, so it doesn’t sound like it expects it to be hugely selective either.) It’s the most common tag in the table by an order of magnitude:

MariaDB [factgridwikidata]> SELECT ct_tag_id, COUNT(*) FROM factgridchange_tag GROUP BY ct_tag_id;
+-----------+----------+
| ct_tag_id | COUNT(*) |
+-----------+----------+
|         1 |  7613186 |
|         2 |      390 |
|         3 |     5356 |
|         4 |     6375 |
|         5 |      138 |
|         6 |        2 |
|         7 |       47 |
|         8 |      214 |
|         9 |     3662 |
|        10 |    51892 |
|        11 |   108638 |
|        12 |      916 |
|        13 |      109 |
+-----------+----------+
13 rows in set (1.889 sec)

And according to r_rows, MariaDB actually read all those rows from the index, rather than skipping to the next ct_tag_id because of the DISTINCT?

Right, sorry, I misunderstood, I thought the query is reading from change_tag_def. Yeah that looks pretty bad.

I don't think we really need this query. We can just check change_tag_def.ctd_count instead. Just replace all the changeTagDefStore->getId() calls with a call to ChangeTagsStore::tagUsageStatistics() and intersect that with the result from the previous query.

(That will add a delay to new tags showing up on Special:Tags but IMO that's fine.)

I don't think we really need this query. We can just check change_tag_def.ctd_count instead. Just replace all the changeTagDefStore->getId() calls with a call to ChangeTagsStore::tagUsageStatistics() and intersect that with the result from the previous query.

That sounds like a good idea – I’ll see if I can put a patch together later. Thanks!

I’ve also just uploaded an SQL dump of the table in question (I don’t think there’s anything in it that shouldn’t be public), and it looks like I can reproduce the issue in the official Docker images:

terminal A
sudo docker run --rm --env MARIADB_ROOT_PASSWORD=root --name=T370578 mariadb:10.5.23
terminal B
$ sudo docker exec -i T370578 mariadb -uroot -proot <<< 'CREATE DATABASE IF NOT EXISTS T370578; USE T370578;'
$ curl -sL 'https://tmp.lucaswerkmeister.de/factgridchange_tag-T370578.sql.gz' | zcat | sudo docker exec -i T370578 mariadb -uroot -proot
$ time sudo docker exec -i T370578 mariadb -uroot -proot <<< 'SELECT DISTINCT ct_tag_id FROM T370578.factgridchange_tag WHERE ct_tag_id IN (1);'
ct_tag_id
1

real    0m0,840s
user    0m0,007s
sys     0m0,005s
$ time sudo docker exec -i T370578 mariadb -uroot -proot <<< 'SELECT DISTINCT ct_tag_id FROM T370578.factgridchange_tag WHERE ct_tag_id IN (1) LIMIT 1;'
ct_tag_id
1

real    0m0,092s
user    0m0,000s
sys     0m0,011s

(0.8 seconds rather than 2 seconds because my desktop is a bit faster than the server in question. But you can still see the big difference between the versions with and without LIMIT.) I’ll try out later if it’s still reproducible with the latest images.

It looks like the problem is already fixed since MariaDB 10.6.0, but remains reproducible in 10.5.25 (the latest patch release in the 10.5 series so far).

I don't think we really need this query. We can just check change_tag_def.ctd_count instead. Just replace all the changeTagDefStore->getId() calls with a call to ChangeTagsStore::tagUsageStatistics() and intersect that with the result from the previous query.

That sounds like a good idea – I’ll see if I can put a patch together later. Thanks!

Actually, I don’t think we even need the ctd_count. As far as I can tell, change_tag_def already only contains the tags that have been used at least once on the wiki.

lucaswerkmeister@tools-bastion-13:~$ for wiki in aewikimedia btmwiki enwiki wikidatawiki; do sql -N "$wiki" "SELECT COUNT(*) FROM change_tag_def WHERE ctd_name LIKE 'OAuth CID: %'\G" | tail -1; done
0
1
98
166

So effectively, the change_tag_def migration made step 2 of this method obsolete years ago. (And it won’t even lag behind!)

Change #1055545 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/OAuth@master] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055545

Change #1055545 merged by jenkins-bot:

[mediawiki/extensions/OAuth@master] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055545

Change #1055618 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/OAuth@REL1_42] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055618

Change #1055619 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/OAuth@REL1_41] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055619

Change #1055620 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/OAuth@REL1_39] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055620

Change #1055523 abandoned by Lucas Werkmeister:

[mediawiki/extensions/OAuth@master] Hooks::getUsedConsumerTags: Add LIMIT

Reason:

Obsoleted by the even better patch I186b75a434.

https://gerrit.wikimedia.org/r/1055523

It looks like the problem is already fixed since MariaDB 10.6.0, but remains reproducible in 10.5.25 (the latest patch release in the 10.5 series so far).

Also, let’s untag mariadb-optimizer-bug, I don’t think it’s relevant if it was already fixed and we also have a workaround we can backport.

Change #1055618 merged by jenkins-bot:

[mediawiki/extensions/OAuth@REL1_42] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055618

Change #1055620 merged by jenkins-bot:

[mediawiki/extensions/OAuth@REL1_39] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055620

Change #1055619 merged by jenkins-bot:

[mediawiki/extensions/OAuth@REL1_41] Hooks::getUsedConsumerTags: Remove second query

https://gerrit.wikimedia.org/r/1055619

LucasWerkmeister claimed this task.

The REL1_39 backport seems to be working fine on FactGrid. Thanks @Tgr for the suggestion and review! \o/