db1078 (s3 candidate master) crashed
Closed, Resolved · Public

Description

Logs indicate a transaction was rolled back and the database had to crash recover on db1078 (s3).

Restricted Application added a subscriber: Aklapper. Sat, Nov 17, 6:14 AM
colewhite updated the task description. Sat, Nov 17, 6:18 AM

Thank you for letting us know
Thanks also @Joe for calling me up.

We will take it from here :-)

Marostegui renamed this task from "MariaDB killed by systemd with ABRT6" to "db1078 (s3 candidate master) crashed". Sat, Nov 17, 6:37 AM
Marostegui triaged this task as High priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board. Sat, Nov 17, 6:42 AM

MySQL got corrupted - this host needs to be rebuilt.

Change 474447 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Disable notifications

https://gerrit.wikimedia.org/r/474447

Change 474447 merged by Marostegui:
[operations/puppet@production] db1078: Disable notifications

https://gerrit.wikimedia.org/r/474447

I haven't found anything in the HW logs that might indicate a HW malfunction.

jijiki added a subscriber: jijiki. Sat, Nov 17, 9:41 AM

Change 474620 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled

https://gerrit.wikimedia.org/r/474620

Change 474620 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled

https://gerrit.wikimedia.org/r/474620

Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:20:05Z] <marostegui@deploy1001> sync-file aborted: Add db1078 line back to config file but depooled T209754 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:21:04Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Add db1078 line back to config file but depooled T209754 (duration: 00m 51s)

I have rebooted this host to see if there were any HW errors on boot-up, but it came back fine: no storage, memory, or any other kind of error was reported.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190631_marostegui_255462_db1078_eqiad_wmnet.log.

This host also crashed a bit over a year ago: T173365
Even though I didn't find any trace of a real storage crash, this is what syslog shows about 10 minutes before the crash:

Nov 17 05:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature

That message appears to be quite common on this host, though; it was already being logged well before the crash:

Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
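
To check whether the smartd message is routine background noise rather than a precursor of the crash, the syslog lines can be counted per day. The following is only a minimal sketch: the function names are made up for illustration, the year is passed in explicitly because syslog timestamps omit it (2018 is assumed from the SAL entries on this task), and the month parsing assumes an English/C locale.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# "Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature"
SMARTD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ smartd\[\d+\]: "
    r"Device: (?P<dev>\S+), failed to read Temperature"
)


def temperature_failures(lines, year=2018):
    """Return (timestamp, device) tuples for every matching smartd line.

    `year` is an assumption supplied by the caller; syslog does not record it.
    """
    events = []
    for line in lines:
        match = SMARTD_RE.match(line)
        if match:
            ts = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            events.append((ts, match.group("dev")))
    return events


def failures_per_day(events):
    """Count matching events per calendar day (ISO date string)."""
    return Counter(ts.date().isoformat() for ts, _ in events)


# Small sample: three smartd lines copied from this task, plus one
# shortened non-matching line to show that other daemons are ignored.
SAMPLE = [
    "Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 17 05:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 17 05:39:29 db1078 mysqld[14906]: InnoDB: Warning: (truncated)",
]

EVENTS = temperature_failures(SAMPLE)
PER_DAY = failures_per_day(EVENTS)
```

A roughly constant count on days without a crash would support treating the message as noise; on a real host the lines would be read from the syslog file instead of `SAMPLE`.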

And then there are just InnoDB warnings prior to the crash:

Nov 17 05:39:29 db1078 mysqld[14906]: InnoDB: Warning: Index PRIMARY points to table itwiktionary/image_comment_temp and ib_table itwiktionary/image_comment_temp statistics is initialized 1  but index table itwiktionary/image_comment_temp initialized 0  mysql table is image_comment_temp. Have you mixed up .frm files from different installations? See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
<snip>
Nov 17 05:39:29 db1078 mysqld[14906]:  len 257; hex 0e0046722d436f6e636572742e6f676700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000; asc   Fr-Concert.ogg

That continues until, in the end, MySQL dies.

Change 474621 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow re-image db1078

https://gerrit.wikimedia.org/r/474621

Change 474621 merged by Marostegui:
[operations/puppet@production] install_server: Allow re-image db1078

https://gerrit.wikimedia.org/r/474621

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_5908_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6014_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6126_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

and were ALL successful.

Change 474627 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123

https://gerrit.wikimedia.org/r/474627

Change 474627 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123

https://gerrit.wikimedia.org/r/474627

Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:56:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1123 to clone db1078 T209754 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:57:14Z] <marostegui> Stop MySQL on db1123 - T209754

Change 474638 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078

https://gerrit.wikimedia.org/r/474638

Change 474640 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Enable notifications

https://gerrit.wikimedia.org/r/474640

Change 474640 merged by Marostegui:
[operations/puppet@production] db1078: Enable notifications

https://gerrit.wikimedia.org/r/474640

Change 474638 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078

https://gerrit.wikimedia.org/r/474638

Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:43:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1123 and db1078 T209754 (duration: 00m 46s)

Change 474656 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123

https://gerrit.wikimedia.org/r/474656

Change 474656 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123

https://gerrit.wikimedia.org/r/474656

Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:57:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1123 and increase weight for db1078 T209754 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:11:03Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:21:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)

Change 474663 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078

https://gerrit.wikimedia.org/r/474663

Change 474663 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078

https://gerrit.wikimedia.org/r/474663

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:41:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1078 T209754 (duration: 00m 46s)

Marostegui closed this task as Resolved. Mon, Nov 19, 10:47 AM

db1078 has been cloned and is now fully repooled.
This is all done.
As a follow-up with DCOps, I have created T209815: Upgrade firmware on db1078, so we can have everything up to date. If this happens again, we should contact the vendor.