While 1.31.0-wmf.27 was deployed to group1, implicit temporary tables rose to roughly 4x the baseline, as can be seen in this Grafana view.
The error comes from line 258 of RefreshLinksJob.php, where the code calls commitAndWaitForReplication:
$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
Stack trace
#0 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(639): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, double)
#1 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(534): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, double)
#2 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(367): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, double)
#3 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(419): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(290): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#5 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#6 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): RefreshLinksJob->run()
#7 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#8 {main}
Timeline
| 19:24 | twentyafterfour@tin: | Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s) |
| 19:22 | twentyafterfour@tin: | rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 |
| 19:20 | twentyafterfour: | Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" |
| 19:19 | twentyafterfour: | rolling back to wmf.26 |
| 19:18 | icinga-wm | PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537 |
| 19:17 | twentyafterfour: | I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError: Replication wait failed: Lost connection to MySQL server during query" |
| 19:12 | milimetric@tin: | Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s) |
| 19:09 | milimetric@tin: | Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug |
| 19:09 | milimetric@tin: | Started deploy [analytics/refinery@c22fd1e]: (no justification provided) |
| 19:06 | twentyafterfour@tin: | Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s) |
| 19:05 | twentyafterfour@tin: | rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 |
Incident report
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27