1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Marostegui
	May 21 2019, 10:56 AM

Description

Note: I am not sure who to tag here, so please feel free to change the project tags accordingly.
Also, please change the "assigned to" to whoever you feel it should, just assigning it to @thcipriani for now as per SAL.

While investigating T223952: Increased instability in MediaWiki backends (according to load balancers) (even though it looks unrelated, still merits a task) I found that looks like that the 2nd of May something was released that has triggered lots of temporary tables being generated, mainly on s1 (enwiki) and s8 (wikidata in a smaller scale).
Which has never stopped ever since:
This is an s1 slave:
db1089
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=1556791578243&to=1556877978243

This is another s1 slave
db1080
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&from=1556791578243&to=1556877978243&var-dc=eqiad%20prometheus%2Fops&var-server=db1080&var-port=9104

This is a s8 slave
db1109
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&from=1556791578243&to=1556877978243&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104

The timing matches:

19:27 thcipriani@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3

sys schema isn't providing great insights of what queries could be generating that. I might take a look at performance_schema but meanwhile, if someone can check what was released on that version that might have potentially introduced this, that'd be great.

Details

	Subject	Repo	Branch	Lines +/-
	rdbms: fix ChronologyProtector client IDs to not be empty	mediawiki/core	master	+127 -34
	rdbms: fix ChronologyProtector client IDs to not be empty	mediawiki/core	wmf/1.34.0-wmf.6	+127 -34

Customize query in gerrit

Related Objects

Mentioned In: T223310: Investigate increase in tx bandwidth usage for mc1033
Mentioned Here: T223310: Investigate increase in tx bandwidth usage for mc1033
rMW66556bf37e88: rdbms: fix ChronologyProtector client IDs to not be empty
T223952: Increased instability in MediaWiki backends (according to load balancers)

Event Timeline

• Marostegui created this task.May 21 2019, 10:56 AM

Restricted Application edited projects, added Release-Engineering-Team (Kanban); removed Release-Engineering-Team (Backlog). · View Herald TranscriptMay 21 2019, 10:56 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

• Marostegui triaged this task as Medium priority.May 21 2019, 10:58 AM

• Marostegui added a subscriber: Ladsgroup.

• Marostegui added a project: MW-1.34-release.May 21 2019, 11:40 AM

This is list of changes that went live with wmf.3: https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.3/Changelog It's more than 500 git patches :( Finding what's at fault here seems a little bit complicated.

By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:

SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G

Maybe @aaron knows if there were some changes released that day?

Krinkle edited projects, added Performance-Team; removed MW-1.34-notes.May 22 2019, 2:27 AM

ArielGlenn subscribed.May 22 2019, 5:11 AM

hrm, well I can say nothing changed with that specific function:

$ git log -p origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/libs/rdbms/database/DatabaseMysqlBase.php

There were 3 folks who made changes anywhere in includes/libs/rdbms for the 1.34.0-wmf.3 branch:

$ git log --format=short --no-merges  origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/libs/rdbms/
commit 0c91901454502b3ae2b7228b398aacd34f1a3d12
Author: Fomafix <fomafix@googlemail.com>

    rdbms,linker: Simplify implode() with empty array

commit c53cf78f792a5d7b1b16b952e38d556cd5316bc2
Author: Stanislav Malyshev <smalyshev@gmail.com>

    rdbms: add getChronologyProtectorClientId() to ILBFactory

commit 8107fddd8fe3d5e6e0411f3df0a62f55e8b0c999
Author: Aaron Schulz <aschulz@wikimedia.org>

    rdbms: add "secret" parameter to ChronologyProtector to use HMAC client IDs

commit e4ffa0c780586c00137d479144d9bab760311030
Author: Aaron Schulz <aschulz@wikimedia.org>

    rdbms: add missing return value to LoadBalancer::waitForReplication()

Adding those folks to this ticket in the hope that they may have some deeper insight: @Fomafix @Smalyshev and @aaron (who's already on here).

The other paths I see to that particular function are in includes/objectcache (which had no changes in this branch) and in includes/jobqueue

$ git log --format=short --no-merges  origin/wmf/1.34.0-wmf.1..wmf/1.34.0-wmf.3 -- includes/jobqueue/
commit aa4f4d6b53ae3551678a5ba7e1a021492f090287
Author: Aaron Schulz <aschulz@wikimedia.org>

    jobqueue: make JobQueueDB stricter about broken job_params fields

commit 4dce6445966a1fdeb1e635eca534af91188b74bf
Author: Timo Tijhof <krinklemail@gmail.com>

    jobqueue: Follow-up for fc5d51f12936ed (added GenericParameterJob)

I'm unassigning myself from this task as I am not familiar with this code in any detail, but hopefully that narrows the changelog a bit.

In T223978#5200381, @Marostegui wrote:
By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:
SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G

I'm not sure I follow. Are you saying that running this select query results in MariaDB creating a temporary table?

Krinkle raised the priority of this task from Medium to High.May 22 2019, 3:52 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.

In T223978#5205182, @Krinkle wrote:
In T223978#5200381, @Marostegui wrote:
By doing some checks with the slow query log and set global log_slow_filter='tmp_table_on_disk' filter enabled for a few seconds it is definitely:
SHOW /* Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos 10.64.49.16 */ SESSION VARIABLES LIKE 'gtid_%'\G
I'm not sure I follow. Are you saying that running this select query results in MariaDB creating a temporary table?

That's what I am seeing so far (I haven't had time to keep digging into this).
So, when playing around with log_slow_filter and filtering just to log all queries that generate tmp_table_on_disk that is the query that gets logged.

root@db1089.eqiad.wmnet[enwiki]> set global slow_query_log=ON; select sleep(5); set global slow_query_log=OFF;
Query OK, 0 rows affected (0.00 sec)

+----------+
| sleep(5) |
+----------+
|        0 |
+----------+
1 row in set (5.00 sec)

Query OK, 0 rows affected (0.00 sec)

root@db1089.eqiad.wmnet[enwiki]>

So in 5 seconds there are 3500 logged queries that are supposedly generating tmp tables

root@db1089:/srv/sqldata# cat db1089-slow.log | grep Query_time | wc -l
3512

However, a quick run of pt-query-digest over the log, doesn't report them as generating tmp_tables.
Not sure if the SHOW generates them by default or gets the show global status counters increased for each run. Maybe the LIKE '%'?

Change 511951 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

gerritbot added a project: Patch-For-Review.May 22 2019, 7:04 PM

Change 511951 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.7; 2019-05-28).May 25 2019, 10:00 PM

Change 512514 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Change 512514 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Mentioned in SAL (#wikimedia-operations) [2019-05-27T18:41:16Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: rMW66556bf37e88 / T223310, T223978 (duration: 00m 50s)

This comment has been deleted.

Recovery:

Screenshot 2019-05-27 at 19.48.30.png (1×2 px, 305 KB)

Screenshot 2019-05-27 at 19.48.44.png (1×2 px, 292 KB)

(Untagging from release, given the problematic commit didn't make it into a release branch.)

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.6; 2019-05-21); removed MW-1.34-notes (1.34.0-wmf.7; 2019-05-28).May 27 2019, 7:00 PM

Thank you!

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:40 PM

	F29260040: Screenshot 2019-05-27 at 19.48.44.png
	May 27 2019, 6:49 PM

	F29260039: Screenshot 2019-05-27 at 19.48.30.png
	May 27 2019, 6:49 PM

1.34.0-wmf.3 generating lots of temporary tables on MySQL slavesClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves
Closed, ResolvedPublic
Actions