Page MenuHomePhabricator

1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves
Closed, ResolvedPublic

Description

Note: I am not sure who to tag here, so please feel free to change the project tags accordingly.
Also, please change the "assigned to" to whoever you feel it should, just assigning it to @thcipriani for now as per SAL.

While investigating T223952: Increased instability in MediaWiki backends (according to load balancers) (even though it looks unrelated, still merits a task) I found that looks like that the 2nd of May something was released that has triggered lots of temporary tables being generated, mainly on s1 (enwiki) and s8 (wikidata in a smaller scale).
Which has never stopped ever since:
This is an s1 slave:
db1089
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=1556791578243&to=1556877978243

s1temp.png (500×1 px, 53 KB)

This is another s1 slave
db1080
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&from=1556791578243&to=1556877978243&var-dc=eqiad%20prometheus%2Fops&var-server=db1080&var-port=9104

s1temp2.png (500×1 px, 62 KB)

This is a s8 slave
db1109
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&from=1556791578243&to=1556877978243&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104

s8temp.png (500×1 px, 46 KB)

The timing matches:

19:27 thcipriani@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3

sys schema isn't providing great insights of what queries could be generating that. I might take a look at performance_schema but meanwhile, if someone can check what was released on that version that might have potentially introduced this, that'd be great.

Event Timeline

Marostegui added a subscriber: Ladsgroup.

This is list of changes that went live with wmf.3: https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.3/Changelog It's more than 500 git patches :( Finding what's at fault here seems a little bit complicated.

Marostegui added a subscriber: aaron.

By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:

SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G

Maybe @aaron knows if there were some changes released that day?

thcipriani added subscribers: Smalyshev, Fomafix.

hrm, well I can say nothing changed with that specific function:

$ git log -p origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/libs/rdbms/database/DatabaseMysqlBase.php

There were 3 folks who made changes anywhere in includes/libs/rdbms for the 1.34.0-wmf.3 branch:

$ git log --format=short --no-merges  origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/libs/rdbms/
commit 0c91901454502b3ae2b7228b398aacd34f1a3d12
Author: Fomafix <fomafix@googlemail.com>

    rdbms,linker: Simplify implode() with empty array

commit c53cf78f792a5d7b1b16b952e38d556cd5316bc2
Author: Stanislav Malyshev <smalyshev@gmail.com>

    rdbms: add getChronologyProtectorClientId() to ILBFactory

commit 8107fddd8fe3d5e6e0411f3df0a62f55e8b0c999
Author: Aaron Schulz <aschulz@wikimedia.org>

    rdbms: add "secret" parameter to ChronologyProtector to use HMAC client IDs

commit e4ffa0c780586c00137d479144d9bab760311030
Author: Aaron Schulz <aschulz@wikimedia.org>

    rdbms: add missing return value to LoadBalancer::waitForReplication()

Adding those folks to this ticket in the hope that they may have some deeper insight: @Fomafix @Smalyshev and @aaron (who's already on here).

The other paths I see to that particular function are in includes/objectcache (which had no changes in this branch) and in includes/jobqueue

$ git log --format=short --no-merges  origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/jobqueue/
commit aa4f4d6b53ae3551678a5ba7e1a021492f090287
Author: Aaron Schulz <aschulz@wikimedia.org>

    jobqueue: make JobQueueDB stricter about broken job_params fields

commit 4dce6445966a1fdeb1e635eca534af91188b74bf
Author: Timo Tijhof <krinklemail@gmail.com>

    jobqueue: Follow-up for fc5d51f12936ed (added GenericParameterJob)

I'm unassigning myself from this task as I am not familiar with this code in any detail, but hopefully that narrows the changelog a bit.

By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:

SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G

I'm not sure I follow. Are you saying that running this select query results in MariaDB creating a temporary table?

Krinkle raised the priority of this task from Medium to High.May 22 2019, 3:52 PM
Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.

By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:

SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G

I'm not sure I follow. Are you saying that running this select query results in MariaDB creating a temporary table?

That's what I am seeing so far (I haven't had time to keep digging into this).
So, when playing around with log_slow_filter and filtering just to log all queries that generate tmp_table_on_disk that is the query that gets logged.

root@db1089.eqiad.wmnet[enwiki]> set global slow_query_log=ON; select sleep(5); set global slow_query_log=OFF;
Query OK, 0 rows affected (0.00 sec)

+----------+
| sleep(5) |
+----------+
|        0 |
+----------+
1 row in set (5.00 sec)

Query OK, 0 rows affected (0.00 sec)

root@db1089.eqiad.wmnet[enwiki]>

So in 5 seconds there are 3500 logged queries that are supposedly generating tmp tables

root@db1089:/srv/sqldata# cat db1089-slow.log | grep Query_time | wc -l
3512

However, a quick run of pt-query-digest over the log, doesn't report them as generating tmp_tables.
Not sure if the SHOW generates them by default or gets the show global status counters increased for each run. Maybe the LIKE '%'?

Change 511951 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

Change 511951 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

Change 512514 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Change 512514 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Mentioned in SAL (#wikimedia-operations) [2019-05-27T18:41:16Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: rMW66556bf37e88 / T223310, T223978 (duration: 00m 50s)

Krinkle closed this task as Resolved.EditedMay 27 2019, 6:49 PM
Krinkle assigned this task to aaron.
This comment has been deleted.

Recovery:

Screenshot 2019-05-27 at 19.48.30.png (1×2 px, 305 KB)
Screenshot 2019-05-27 at 19.48.44.png (1×2 px, 292 KB)

(Untagging from release, given the problematic commit didn't make it into a release branch.)