1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
Closed, ResolvedPublic

Description

During the time that 1.31.0-wmf.27 was deployed to group1, implicit temporary tables shot up to ~4x the baseline, as can be seen in this grafana view

The error is coming from line 258 of RefreshLinksJob.php where the code calls commitAndWaitForReplication:

$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );

stack trace

#0 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(639): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, double)
#1 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(534): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, double)
#2 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(367): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, double)
#3 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(419): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(290): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#5 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#6 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): RefreshLinksJob->run()
#7 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#8 {main}

Timeline

19:24twentyafterfour@tin:Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s)
19:22twentyafterfour@tin:rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26
19:20twentyafterfour:Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
19:19twentyafterfour:rolling back to wmf.26
19:18icinga-wmPROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537
19:17twentyafterfour:I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError: Replication wait failed: Lost connection to MySQL server during query
19:12milimetric@tin:Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s)
19:09milimetric@tin:Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug
19:09milimetric@tin:Started deploy [analytics/refinery@c22fd1e]: (no justification provided)
19:06twentyafterfour@tin:Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s)
19:05twentyafterfour@tin:rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27

Incident report

https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27

mmodell created this task.Mar 28 2018, 7:59 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 28 2018, 7:59 PM
mmodell updated the task description. (Show Details)Mar 28 2018, 8:03 PM
mmodell updated the task description. (Show Details)
mmodell triaged this task as Unbreak Now! priority.Mar 28 2018, 8:06 PM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptMar 28 2018, 8:06 PM
greg added subscribers: aaron, greg.

@aaron as our DBAs are out, can you help here or not?

Change 422547 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422547

Change 422565 had a related patch set uploaded (by 20after4; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.27] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422565

Change 422547 merged by jenkins-bot:
[mediawiki/core@master] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422547

@aaron: Can you explain how your patch addresses the issue? I'm not confident that it'll fix the root cause, even if it does work around the connection issues.

Change 422565 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.27] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422565

Mentioned in SAL (#wikimedia-operations) [2018-03-28T22:08:49Z] <twentyafterfour> rolling forward group1 to 1.31.0-wmf.27 refs T183966 T190960

Mentioned in SAL (#wikimedia-operations) [2018-03-28T22:09:21Z] <twentyafterfour@tin> rebuilt and synchronized wikiversions files: sync https://gerrit.wikimedia.org/r/#/c/422563/ group1 wikis to 1.31.0-wmf.27 refs T183966 T190960

aaron added a comment.Mar 29 2018, 2:19 AM

The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'. E.g.:

mysql> show status like 'Created_%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 6     |
| Created_tmp_tables      | 2     |
+-------------------------+-------+
3 rows in set (0.00 sec)

mysql> show variables like 'gtid%';
+----------------------------------+------------------------------------------+
| Variable_name                    | Value                                    |
+----------------------------------+------------------------------------------+
| gtid_executed_compression_period | 1000                                     |
| gtid_mode                        | OFF                                      |
| gtid_next                        | AUTOMATIC                                |
| gtid_owned                       |                                          |
| gtid_purged                      | fae78223-0aae-11e8-ba79-340286975770:1-2 |
+----------------------------------+------------------------------------------+
5 rows in set (0.01 sec)

mysql> show status like 'Created_%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 6     |
| Created_tmp_tables      | 3     |
+-------------------------+-------+
3 rows in set (0.00 sec)
Volans added a subscriber: Volans.Mar 29 2018, 8:20 AM

@aaron: can we revert whatever patch made this change (from SELECT to SHOW) so we can roll forward? The DBAs are not around this week and it's a short week.

mmodell added a comment.EditedMar 29 2018, 6:37 PM

@aaron I think the temp tables are caused by holding open transactions for longer which is what would happen if we were waiting a long time for a slave to catch up before committing the transaction. Setting the timeout lower would make the situation better but not entirely resolved it.

Can we revert rMWceb7d61ee7ef: rdbms: make getMasterPos() ignore GTIDs outside of gtid_domain_id for now and merge it again next week? I'd be more comfortable with doing it when we have a DBA around to help verify that we aren't killing the database.

The reason I don't think the temp tables are noise is because they correlate with the timing of a large jump in failed connections to mysql. So correlation isn't causation but nonetheless something is causing the connection errors. If it's not this change then we have to figure out what is causing it. If we still have the errors without the gtid change then at least we've proved what isn't causing it.

Just to be explicit: we're now passed the point of no return (for this week). We can't safely do group1 and group 2 before the end of the day; especially as tomorrow is a WMF Holiday.

The train is now halted until next week.

aaron added a comment.Mar 29 2018, 9:38 PM

I don't mean "noise" as "unrelated to deploy", rather "expected, but doesn't matter".

I'm curious why 60 second timeouts were still showing last go. That should only be possible in CLI mode, which JobExecuter should not be in. Something is wrong there.

@aaron: What I saw with the second deployment was a ton of errors like the following:

SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...)

I apologize for not updating the ticket with that last detail. I logged that in the SAL but I forgot to mention T190960 so it didn't get logged here.

SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...)

That is HHVM complaining a query took more than 10 seconds when MASTER_GTID_WAIT() is being passed a 10 seconds timeout. That is/has been the cause of spam log previously but I could not find a reference to an existing task :(

The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'.

Then I guess those queries should be moved back to use a SELECT. Adding a large baseline of noise to the mysql internal monitoring metrics is probably going to be a pain to the DBA. If later on there is an increase of temporary tables creations which hidden by that noise, it is going to hurt surely?

Not sure whether all SHOW can be converted back to SELECT but surely that should be doable.

For reference this is the max replication lag between all eqiad DBs in that time frame (Mar. 28th, ~18:30-20:30), from where seems pretty clear that there was no sensible lag at all in that time frame.

Change 423069 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423069

Change 423069 merged by jenkins-bot:
[mediawiki/core@master] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423069

Change 423488 had a related patch set uploaded (by 20after4; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.27] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423488

Change 423488 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.27] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423488

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:06:49Z] <twentyafterfour> sync rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots Bug: T190960 Change-Id: I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:09:49Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/: sync I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd refs T183966 T190960 (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:41:54Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync 779e7fd3de08101ab626136c07045c98ea162c5b refs T190960 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T20:38:11Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync 36c52351487aa70a651dbc6157e0a5882dfd9e7f refs T190960 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T20:59:16Z] <twentyafterfour> MediaWiki Train: rolling back to 1.31.0-wmf.26 refs T183966, T190960

Mentioned in SAL (#wikimedia-operations) [2018-04-02T21:22:17Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/: Revert ceb7d61ee7ef3edc6705abd41ec86b3afcd9c491 refs T183966 T190960 (duration: 00m 59s)

The original patch that caused this has been reverted, unblocking the train.

I am not sure if this is related but in the last 12h, starting around 7pm, the following message has caused almost 1.2M errors:

message	       	Using cached lag value for 10.64.16.84:3311 due to active transaction
t  method	       	Wikimedia\Rdbms\DatabaseMysqlBase::getLagFromPtHeartbeat
t  mwversion	       	1.31.0-wmf.27
t  normalized_message	       	Using cached lag value for {db_server} due to active transaction

https://logstash.wikimedia.org/goto/e2a6f0cc6b65ccb88081e09216d1c7d7

@Marostegui: AFAIK that's a debugging log message put in place by @aaron

mmodell updated the task description. (Show Details)Apr 3 2018, 5:43 AM

@Marostegui: AFAIK that's a debugging log message put in place by @aaron

ok - thanks for the clarification, I thought it was better to report it :-)

I've started working on an incident report for this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27 feel free to add to it, it's just a basic outline so far.

I think there's another problem related to wmf.27 rollout, see T191282 - the timings for those log entries correlate with the train deployments precisely

jcrespo added a subscriber: jcrespo.Apr 3 2018, 2:05 PM

temp tables as part of SHOW are a non-problem, those are complusory tiny tables on memory- they only would be problematic if they created significant latency issues or they were created on disk (io increase, with means performance impact).

@aaron - if you don't like the current model, should we think an alternative, simpler one based on the heartbeat table- or is this still ok for you?

mmodell updated the task description. (Show Details)
aaron added a comment.Apr 4 2018, 9:08 PM

@aaron - if you don't like the current model, should we think an alternative, simpler one based on the heartbeat table- or is this still ok for you?

It's OK for now. I made https://gerrit.wikimedia.org/r/#/c/423651/ to have another go at that patch.

mmodell added a parent task: Restricted Task.Apr 10 2018, 3:02 AM

By reading the IR I guess this task can be closed as the problem was found @mmodell ?

mmodell closed this task as Resolved.Apr 12 2018, 8:33 PM
mmodell claimed this task.

@Marostegui We still have a retrospective meeting scheduled for this one, not sure if we should keep it open for that purpose though.