db1065 storage crash
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1065. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 5
			Other Error Count: 9
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 44C (111.20 F)

=== RaidStatus completed
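
The snapshot above can also be read mechanically. The following is a minimal sketch, not the Icinga handler's own code: it assumes the plugin output keeps the "Key: value" layout shown in this task, and the function name `failed_drives` and the rule "anything not starting with Online needs attention" are illustrative assumptions.

```python
import re

# Trimmed excerpt of the snapshot attached to this task.
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
State: =====> Degraded <=====
    PD: 0 Information
    Enclosure Device ID: 32
    Slot Number: 0
    Media Error Count: 5
    Other Error Count: 9
    Firmware state: =====> Failed <=====
=== RaidStatus completed
"""


def failed_drives(report):
    """Return one dict per physical drive whose firmware state is not Online."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if re.match(r"PD: \d+ Information", line):
            # A new physical-drive section starts here.
            current = {}
            drives.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            # Drop the "=====> ... <=====" emphasis the plugin adds.
            current[key.strip()] = value.strip().strip("=<> ")
    # Assumption: a healthy member drive reports a state starting with "Online".
    return [d for d in drives
            if not d.get("Firmware state", "").startswith("Online")]


for drive in failed_drives(SAMPLE):
    print(drive["Slot Number"], drive["Firmware state"],
          drive["Media Error Count"])
```

Run against the excerpt, this reports the drive in slot 0 with firmware state "Failed" and 5 media errors, matching the snapshot.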

Event Timeline

Marostegui added a project: DBA.
Marostegui added a subscriber: Cmjohnson.

@Cmjohnson let's get this disk replaced

Thanks!

Storage crashed:

root@db1065:~# df -hT
-bash: /bin/df: Input/output error
root@db1065:~# dmesg
-bash: /bin/dmesg: Input/output error
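
"Input/output error" is the message for the `EIO` errno, which here means the shell could not even read the binaries from disk. A minimal sketch of how a check script might tell that case apart from ordinary failures such as a missing file; the names `classify` and `probe` are hypothetical and are not part of any tooling mentioned in this task.

```python
import errno
import os


def classify(exc):
    """Map an OSError to 'io-error' (EIO) or 'other-error'."""
    return "io-error" if exc.errno == errno.EIO else "other-error"


def probe(path):
    """Stat the path and read its first block; report what happened."""
    try:
        os.stat(path)
        with open(path, "rb") as fh:
            fh.read(4096)
    except OSError as exc:
        return classify(exc)
    return "ok"


print(probe("/bin/df"))
```

On a healthy host this would normally print "ok"; on a host in the state shown above it would be expected to report "io-error", although a script stored on the same failed volume might not start at all.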

@Cmjohnson can you visually check whether more than one disk is broken?
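
Why the number of broken disks matters: the snapshot describes a RAID 10 layout (Primary-1, Secondary-0) with 6 spans of 2 drives each, so the array survives one failed drive per span but not two failures in the same span. A minimal sketch of that rule follows. The slot-to-span mapping (`slot // 2`) is an assumption extrapolated from "Slot 0 is Span 0, Arm 0" and is not confirmed by the task for the other slots, and nothing here establishes what actually caused the crash.

```python
def array_state(failed_slots, spans=6, drives_per_span=2):
    """Classify a RAID 10 array as optimal, degraded, or failed."""
    failed_per_span = [0] * spans
    for slot in set(failed_slots):
        # Assumed mapping: slots 0-1 form span 0, slots 2-3 span 1, and so on.
        failed_per_span[slot // drives_per_span] += 1
    if any(n >= drives_per_span for n in failed_per_span):
        return "failed"      # a whole mirror pair is gone
    if any(failed_per_span):
        return "degraded"    # every span still has one working arm
    return "optimal"


print(array_state([0]))     # one failed drive
print(array_state([0, 1]))  # both arms of the same span
print(array_state([0, 2]))  # two drives in different spans
```

Under these assumptions, a single failure or two failures in different spans leave the array degraded, while losing both arms of one span takes the whole volume down.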

Change 434933 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable db1065 notifications

https://gerrit.wikimedia.org/r/434933

Change 434933 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable db1065 notifications

https://gerrit.wikimedia.org/r/434933

jcrespo renamed this task from Degraded RAID on db1065 to db1065 storage crash. May 24 2018, 4:03 PM

Change 434946 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] auto_install: Allow full reimage of db1065, disallow most others

https://gerrit.wikimedia.org/r/434946

Change 434946 merged by Jcrespo:
[operations/puppet@production] auto_install: Allow full reimage of db1065, disallow most others

https://gerrit.wikimedia.org/r/434946

db1065 storage has been rebuilt and the data cloned to it again. However, there is a SMART error on the second disk (I think it is #1, as numbering starts from 0). We need a replacement there. @Cmjohnson, please replace it; I can tell you where to get good disks if you don't have any available right now.

Change 435130 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable db1065 notifications after crash and rebuilding

https://gerrit.wikimedia.org/r/435130

Change 435130 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable db1065 notifications after crash and rebuilding

https://gerrit.wikimedia.org/r/435130

Is the scope of this task finished?

db1065 storage has been rebuilt and the data cloned to it again. However, there is a SMART error on the second disk (I think it is #1, as numbering starts from 0). We need a replacement there. @Cmjohnson, please replace it; I can tell you where to get good disks if you don't have any available right now.

Do you have disks available for this @Cmjohnson?

After replacing disk #1, this is all good now.

root@db1065:~# megacli -LDPDInfo -aAll | grep -i flagged
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
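
The same check can be tallied instead of read line by line. This is a minimal sketch that assumes the `megacli -LDPDInfo -aAll` output contains one "Drive has flagged a S.M.A.R.T alert" line per drive, as in the listing above; the function name `smart_alerts` is hypothetical.

```python
def smart_alerts(output):
    """Count S.M.A.R.T alert lines by their reported value."""
    counts = {"Yes": 0, "No": 0, "unknown": 0}
    for line in output.splitlines():
        if "S.M.A.R.T alert" not in line:
            continue
        value = line.rpartition(":")[2].strip()
        # Anything other than Yes/No (for example a cut-off line) is unknown.
        counts[value if value in ("Yes", "No") else "unknown"] += 1
    return counts


# Twelve drives, none flagged, as in the listing above.
sample = "\n".join(["Drive has flagged a S.M.A.R.T alert : No"] * 12)
counts = smart_alerts(sample)
print(counts)
if counts["Yes"] or counts["unknown"]:
    print("at least one drive needs attention")
```

Twelve lines is consistent with the 6 spans of 2 drives reported in the task description.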
Vvjjkkii renamed this task from db1065 storage crash to becaaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from becaaaaaaa to db1065 storage crash. Jul 1 2018, 6:56 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.