Logs indicate a transaction was rolled back and the database had to crash recover on db1078 (s3).
The server was depooled: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474417/
Thank you for letting us know
Thanks also @Joe for calling me up.
We will take it from here :-)
Change 474447 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Disable notifications
Change 474447 merged by Marostegui:
[operations/puppet@production] db1078: Disable notifications
Change 474620 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled
Change 474620 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled
Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:20:05Z] <marostegui@deploy1001> sync-file aborted: Add db1078 line back to config file but depooled T209754 (duration: 00m 02s)
Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:21:04Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Add db1078 line back to config file but depooled T209754 (duration: 00m 51s)
I rebooted this host to see whether any HW errors would show up on boot, but it came back fine: no storage, memory, or other errors were reported.
Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:
db1078.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/201811190631_marostegui_255462_db1078_eqiad_wmnet.log.
This host also crashed a bit over a year ago: T173365
Although I didn't find any trace of a real storage crash, this is what syslog shows 10 minutes before the crash:
Nov 17 05:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
That appears to be quite common on this host, though; it had already been happening before:
Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
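To check whether these smartd messages are just periodic noise rather than something that started shortly before the crash, the syslog lines can be parsed and the gaps between them compared. The following is only a minimal sketch: the function names and the hard-coded year are assumptions for illustration (syslog timestamps carry no year), and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines such as:
# "Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature"
SMARTD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) smartd\[\d+\]: "
    r"Device: (?P<dev>\S+), failed to read Temperature"
)


def temperature_failures(lines, year=2018):
    """Return a (timestamp, device) tuple for each smartd temperature-read failure.

    The year is an assumption supplied by the caller, since syslog omits it.
    """
    hits = []
    for line in lines:
        m = SMARTD_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            hits.append((ts, m["dev"]))
    return hits


def intervals(hits):
    """Return the time gaps between consecutive failures."""
    return [later[0] - earlier[0] for earlier, later in zip(hits, hits[1:])]


sample = [
    "Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 16 07:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
]
hits = temperature_failures(sample)
print(len(hits), set(intervals(hits)))
```

If every gap comes out the same (30 minutes in the lines above), the message is most likely emitted on each smartd polling cycle and is not specific to the crash window.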
And then just InnoDB warnings prior to the crash:
Nov 17 05:39:29 db1078 mysqld[14906]: InnoDB: Warning: Index PRIMARY points to table itwiktionary/image_comment_temp and ib_table itwiktionary/image_comment_temp statistics is initialized 1 but index table itwiktionary/image_comment_temp initialized 0 mysql table is image_comment_temp. Have you mixed up .frm files from different installations? See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html <snip> Nov 17 05:39:29 db1078 mysqld[14906]: len 257; hex 0e0046722d436f6e636572742e6f676700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000; asc Fr-Concert.ogg
Those warnings continue, and in the end MySQL dies.
Change 474621 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow re-image db1078
Change 474621 merged by Marostegui:
[operations/puppet@production] install_server: Allow re-image db1078
Completed auto-reimage of hosts:
['db1078.eqiad.wmnet']
Of which those FAILED:
['db1078.eqiad.wmnet']
Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:
db1078.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_5908_db1078_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['db1078.eqiad.wmnet']
Of which those FAILED:
['db1078.eqiad.wmnet']
Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:
db1078.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6014_db1078_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['db1078.eqiad.wmnet']
Of which those FAILED:
['db1078.eqiad.wmnet']
Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:
db1078.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6126_db1078_eqiad_wmnet.log.
Change 474627 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123
Change 474627 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123
Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:56:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1123 to clone db1078 T209754 (duration: 00m 47s)
Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:57:14Z] <marostegui> Stop MySQL on db1123 - T209754
Change 474638 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078
Change 474640 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Enable notifications
Change 474640 merged by Marostegui:
[operations/puppet@production] db1078: Enable notifications
Change 474638 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078
Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:43:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1123 and db1078 T209754 (duration: 00m 46s)
Change 474656 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123
Change 474656 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123
Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:57:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1123 and increase weight for db1078 T209754 (duration: 00m 46s)
Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:11:03Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)
Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:21:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)
Change 474663 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078
Change 474663 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078
Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:41:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1078 T209754 (duration: 00m 46s)
db1078 is now fully repooled after cloning it.
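The SAL entries above show the host going from depooled to a low weight, then through several weight increases, before being fully repooled. A gradual ramp like that can be sketched as below. This is purely illustrative: `repool_schedule`, the full weight of 100, and the step count are hypothetical, and the actual weights used in db-eqiad.php are not stated in this task.

```python
def repool_schedule(full_weight, steps):
    """Return evenly spaced weights ending at full_weight.

    Hypothetical helper: each returned value would correspond to one
    config change followed by a sync, with time to watch the host in between.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(full_weight * i / steps) for i in range(1, steps + 1)]


# Example with assumed values: four increments up to a full weight of 100.
schedule = repool_schedule(100, 4)
print(schedule)
```

Ramping in steps gives a freshly cloned replica time to warm its buffer pool before it takes its full share of traffic.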
This is all done.
As a follow-up with DCOps, I have created T209815: Upgrade firmware on db1078, so that everything is up to date. If this happens again, we should contact the vendor.