Maniphest T209754

db1078 (s3 candidate master) crashed
Closed, Resolved (Public)

Assigned To

Authored By

	colewhite
	Nov 17 2018, 6:14 AM

Description

Logs indicate a transaction was rolled back and the database had to crash recover on db1078 (s3).

ABRT6.log (582 KB)

Details

Subject	Repo	Branch	Lines +/-
db-eqiad.php: Fully repool db1078	operations/mediawiki-config	master	+1 -1
db-eqiad.php: Increase weight for db1078 and db1123	operations/mediawiki-config	master	+4 -4
install_server: Allow re-image db1078	operations/puppet	production	+1 -1
db-eqiad.php: Repool db1123 and db1078	operations/mediawiki-config	master	+2 -2
db1078: Enable notifications	operations/puppet	production	+0 -1
db-eqiad.php: Depool db1123	operations/mediawiki-config	master	+4 -4
db-eqiad.php: Add db1078 to the file, but depooled	operations/mediawiki-config	master	+1 -0
db1078: Disable notifications	operations/puppet	production	+1 -0


Related Objects

Mentioned In: T209815: Upgrade firmware on db1078
T209757: Notifications disablement via puppet not working on icinga
Mentioned Here: T209815: Upgrade firmware on db1078
T173365: RAID crashed on db1078

Event Timeline

colewhite created this task. Nov 17 2018, 6:14 AM

Restricted Application added a subscriber: Aklapper. Nov 17 2018, 6:14 AM

colewhite updated the task description. Nov 17 2018, 6:18 AM

Joe added a project: SRE. Nov 17 2018, 6:25 AM

The server was depooled: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474417/

Thank you for letting us know.
Thanks also @Joe for calling me up.

We will take it from here :-)

• Marostegui renamed this task from "MariaDB killed by systemd with ABRT6" to "db1078 (s3 candidate master) crashed". Nov 17 2018, 6:37 AM

• Marostegui triaged this task as High priority.

• Marostegui updated the task description.

MySQL got corrupted - this host needs to be rebuilt.

Change 474447 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Disable notifications

https://gerrit.wikimedia.org/r/474447

Change 474447 merged by Marostegui:
[operations/puppet@production] db1078: Disable notifications

https://gerrit.wikimedia.org/r/474447

I haven't found anything in the HW logs that would indicate a HW malfunction.

• Marostegui mentioned this in T209757: Notifications disablement via puppet not working on icinga. Nov 17 2018, 7:12 AM

jijiki subscribed. Nov 17 2018, 9:41 AM

Change 474620 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled

https://gerrit.wikimedia.org/r/474620

Change 474620 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1078 to the file, but depooled

https://gerrit.wikimedia.org/r/474620

Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:20:05Z] <marostegui@deploy1001> sync-file aborted: Add db1078 line back to config file but depooled T209754 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T06:21:04Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Add db1078 line back to config file but depooled T209754 (duration: 00m 51s)

I rebooted this host to check for HW errors on boot-up, but it came back fine: no storage, memory, or any other kind of error was reported.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190631_marostegui_255462_db1078_eqiad_wmnet.log.

This host also crashed a bit over a year ago: T173365
Even though I didn't find any trace of a real storage crash, this is what syslog shows about 10 minutes before the crash:

Nov 17 05:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature

That message is apparently quite common on this host, though; it had already been showing up every 30 minutes the day before:

Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 07:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 08:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 09:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 10:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 11:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 12:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 13:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 14:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
Nov 16 15:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature
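
To judge how routine that smartd message is, something like the following could be run over a syslog extract. It is only a sketch: it assumes the plain syslog line format shown above, takes the year as a parameter because syslog timestamps omit it, and the function name `summarize_smartd` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

MESSAGE = "failed to read Temperature"


def summarize_smartd(lines, year=2018):
    """Return (hits per day, set of gaps in minutes between consecutive hits)."""
    stamps = []
    for line in lines:
        if "smartd[" not in line or MESSAGE not in line:
            continue
        # The first 15 characters of a syslog line are "Mon DD HH:MM:SS".
        stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
    per_day = Counter(s.strftime("%b %d") for s in stamps)
    gaps = {int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])}
    return per_day, gaps


# Three lines copied from the excerpt above, used as sample input.
sample = [
    "Nov 16 06:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 16 07:25:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
    "Nov 16 07:55:04 db1078 smartd[567]: Device: /dev/sda, failed to read Temperature",
]
per_day, gaps = summarize_smartd(sample)
print(dict(per_day), gaps)
```

A single gap value across the whole file would suggest a periodic poll failing every time rather than an event tied to the crash.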

After that, there are only InnoDB warnings prior to the crash:

Nov 17 05:39:29 db1078 mysqld[14906]: InnoDB: Warning: Index PRIMARY points to table itwiktionary/image_comment_temp and ib_table itwiktionary/image_comment_temp statistics is initialized 1  but index table itwiktionary/image_comment_temp initialized 0  mysql table is image_comment_temp. Have you mixed up .frm files from different installations? See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
<snip>
Nov 17 05:39:29 db1078 mysqld[14906]:  len 257; hex 0e0046722d436f6e636572742e6f676700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000; asc   Fr-Concert.ogg

Those warnings continue until MySQL finally dies.
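
The hex field in the InnoDB warning above can be decoded to check it against the `asc` rendering. This is a minimal sketch using only the leading non-zero bytes of the dump (everything after them is zero padding). Reading the first two bytes as a little-endian length prefix is an assumption on my side, not something the log states, and `decode_field` is an illustrative name.

```python
def decode_field(hex_prefix):
    """Decode a field, assuming a 2-byte little-endian length prefix (assumption)."""
    raw = bytes.fromhex(hex_prefix)
    length = int.from_bytes(raw[:2], "little")
    text = raw[2:2 + length].decode("ascii")
    return length, text


# Leading bytes copied from the "len 257; hex ..." line above.
length, text = decode_field("0e0046722d436f6e636572742e6f6767")
print(length, text)
```

Under that assumption the prefix gives 14, which matches the length of the file name shown in the `asc` part.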

Change 474621 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow re-image db1078

https://gerrit.wikimedia.org/r/474621

Change 474621 merged by Marostegui:
[operations/puppet@production] install_server: Allow re-image db1078

https://gerrit.wikimedia.org/r/474621

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_5908_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6014_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

Of which those FAILED:

['db1078.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

db1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811190712_marostegui_6126_db1078_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1078.eqiad.wmnet']

and were ALL successful.

Change 474627 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123

https://gerrit.wikimedia.org/r/474627

Change 474627 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1123

https://gerrit.wikimedia.org/r/474627

Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:56:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1123 to clone db1078 T209754 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T07:57:14Z] <marostegui> Stop MySQL on db1123 - T209754

Change 474638 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078

https://gerrit.wikimedia.org/r/474638

Change 474640 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Enable notifications

https://gerrit.wikimedia.org/r/474640

Change 474640 merged by Marostegui:
[operations/puppet@production] db1078: Enable notifications

https://gerrit.wikimedia.org/r/474640

• Marostegui mentioned this in T209815: Upgrade firmware on db1078. Nov 19 2018, 9:34 AM

Change 474638 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1123 and db1078

https://gerrit.wikimedia.org/r/474638

Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:43:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1123 and db1078 T209754 (duration: 00m 46s)

• Marostegui claimed this task. Nov 19 2018, 9:49 AM

Change 474656 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123

https://gerrit.wikimedia.org/r/474656

Change 474656 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1078 and db1123

https://gerrit.wikimedia.org/r/474656

Mentioned in SAL (#wikimedia-operations) [2018-11-19T09:57:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1123 and increase weight for db1078 T209754 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:11:03Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:21:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s)

Change 474663 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078

https://gerrit.wikimedia.org/r/474663

Change 474663 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1078

https://gerrit.wikimedia.org/r/474663

Mentioned in SAL (#wikimedia-operations) [2018-11-19T10:41:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool db1078 T209754 (duration: 00m 46s)
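
For a rough idea of how long the clone and the gradual repool took, the SAL timestamps above can be subtracted from each other. A small sketch: the timestamps are copied from the SAL entries in this task, while the dictionary keys are just shorthand labels chosen here.

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Timestamps copied from the SAL entries in this task; keys are shorthand labels.
events = {
    "depool db1123 to clone db1078": "2018-11-19T07:56:37Z",
    "slowly repool db1123 and db1078": "2018-11-19T09:43:59Z",
    "fully repool db1078": "2018-11-19T10:41:50Z",
}


def elapsed(start_label, end_label):
    """Return the time between two SAL entries as a timedelta."""
    start = datetime.strptime(events[start_label], SAL_FORMAT)
    end = datetime.strptime(events[end_label], SAL_FORMAT)
    return end - start


print("depool to full repool:", elapsed("depool db1123 to clone db1078", "fully repool db1078"))
print("repool ramp-up:", elapsed("slowly repool db1123 and db1078", "fully repool db1078"))
```

These are wall-clock gaps between log entries only; they say nothing about how long the data copy itself ran.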

db1078 has been recloned and is now fully repooled.
This is all done.
As a follow-up with DCOps, I have created T209815: Upgrade firmware on db1078 so that everything is up to date; if this happens again, we should contact the vendor.
