Degraded RAID on db1063
Closed, Resolved, Public

Authored By: ops-monitoring-bot, Dec 10 2018, 3:44 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1063. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 4
			Other Error Count: 8
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 36C (96.80 F)

		Span: 3 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 3, Arm: 1
			Media Error Count: 2
			Other Error Count: 0
			Predictive Failure Count: =====> 8 <=====
			Last Predictive Failure Event Seq Number: 2788

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 34C (93.20 F)

=== RaidStatus completed
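
The snapshot above flags two different problems: one drive whose firmware state is Failed and another that is still online but reports predictive failures. As a rough illustration of how such a report could be triaged automatically, here is a minimal Python sketch. The function names (`clean`, `flag_drives`) and the trimmed `SAMPLE` text are hypothetical and are not part of the Nagios plugin; the sketch assumes only the field labels visible in the snapshot.

```python
import re

# Fields copied from the snapshot above, trimmed to the ones the sketch reads.
SAMPLE = """\
Slot Number: 0
Predictive Failure Count: 0
Firmware state: =====> Failed <=====
Slot Number: 7
Predictive Failure Count: =====> 8 <=====
Firmware state: Online, Spun Up
"""


def clean(value):
    """Strip the '=====> ... <=====' highlight placed around abnormal values."""
    return re.sub(r"=+>|<=+", "", value).strip()


def flag_drives(report):
    """Return (slot, reason) pairs for drives that look unhealthy.

    Assumption: every physical drive block contains a 'Slot Number' line
    before its 'Firmware state' and 'Predictive Failure Count' lines.
    """
    drives, current = [], None
    for line in report.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        value = clean(value)
        if key == "Slot Number":
            current = {"slot": int(value), "state": None, "predictive": 0}
            drives.append(current)
        elif current is None:
            continue  # adapter or virtual-drive header, no drive seen yet
        elif key == "Firmware state":
            current["state"] = value
        elif key == "Predictive Failure Count":
            current["predictive"] = int(value)

    flagged = []
    for drive in drives:
        if drive["state"] == "Failed":
            flagged.append((drive["slot"], "failed"))
        elif drive["predictive"] > 0:
            flagged.append(
                (drive["slot"], "predictive failures: %d" % drive["predictive"])
            )
    return flagged


print(flag_drives(SAMPLE))
```

Separating "failed" from "predictive failures" mirrors the distinction drawn in this task: the failed drive must be replaced, while the other is only a warning.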

Related Objects

Mentioned In: T212969: Degraded RAID alert not acking notifications
T208323: Predictive failures on disk S.M.A.R.T. status

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Dec 10 2018, 3:44 AM

ops-monitoring-bot subscribed.

Restricted Application added subscribers: Banyek, Marostegui. Dec 10 2018, 3:44 AM

@Cmjohnson I am setting this to high priority because there is one failed disk and another one with SMART errors (on a different span).
Let's replace only the failed one.
This is the m1 master, so let's do this as soon as possible.

The failed disk is in slot 0; so far, that is the only one we need to replace:

root@db1063:~# megacli -PDList -aall | egrep -i "Slot|Firmw"
Slot Number: 0
Firmware state: Failed
Device Firmware Level: ES66

Marostegui moved this task from Triage to In progress on the DBA board. Dec 10 2018, 6:32 AM

Marostegui mentioned this in T208323: Predictive failures on disk S.M.A.R.T. status. Dec 10 2018, 6:35 AM

Swapped the disk in slot 0.

Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Dec 11 2018, 6:38 PM

root@db1063:~# megacli -PDList -aall | egrep -i "Slot|Firmw"
Slot Number: 0
Firmware state: Rebuild
Device Firmware Level: 0008

awesome, thanks!

The sync finished, thank you @Cmjohnson

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: Optimal
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

banyek@db1063:~ $ sudo megacli -PDList -aall | egrep -i "Slot|Firmw"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0008
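
The three `megacli -PDList -aall | egrep -i "Slot|Firmw"` checks in this thread show slot 0 going from Failed to Rebuild to Online, Spun Up. A minimal Python sketch of how that filtered output might be turned into a status summary follows. The helper names (`slot_states`, `summarize`) and the status labels are hypothetical, and the sample strings contain only the lines quoted in this task, not the full output for all slots.

```python
# Filtered output as pasted in this task (slot 0 only).
BEFORE = "Slot Number: 0\nFirmware state: Failed\nDevice Firmware Level: ES66\n"
DURING = "Slot Number: 0\nFirmware state: Rebuild\nDevice Firmware Level: 0008\n"
AFTER = "Slot Number: 0\nFirmware state: Online, Spun Up\nDevice Firmware Level: 0008\n"


def slot_states(pdlist_output):
    """Map slot number -> firmware state.

    The egrep pattern "Firmw" also matches 'Device Firmware Level',
    so only lines whose key is exactly 'Firmware state' are used.
    """
    states, slot = {}, None
    for line in pdlist_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "Slot Number":
            slot = int(value)
        elif key == "Firmware state" and slot is not None:
            states[slot] = value.strip()
    return states


def summarize(states):
    """Collapse per-slot states into one label (labels are illustrative)."""
    values = list(states.values())
    if not values:
        return "unknown"
    if any(v == "Failed" for v in values):
        return "needs replacement"
    if any(v.startswith("Rebuild") for v in values):
        return "rebuilding"
    if all(v.startswith("Online") for v in values):
        return "healthy"
    return "unknown"


for label, text in (("before", BEFORE), ("during", DURING), ("after", AFTER)):
    print(label, slot_states(text), summarize(slot_states(text)))
```

Any state the sketch does not recognize is reported as "unknown" rather than guessed, since this task only shows three firmware states.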

Marostegui mentioned this in T212969: Degraded RAID alert not acking notifications. Jan 4 2019, 8:02 PM
