Degraded RAID on db1052
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1052. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 37
			Other Error Count: 0
			Predictive Failure Count: =====> 49 <=====
			Last Predictive Failure Event Seq Number: 30374

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 32C (89.60 F)

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 35
			Other Error Count: 2
			Predictive Failure Count: =====> 63 <=====
			Last Predictive Failure Event Seq Number: 30375

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

		Span: 4 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 4, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: =====> 157 <=====
			Last Predictive Failure Event Seq Number: 30376

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 29C (84.20 F)

=== RaidStatus completed
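
A snapshot in this format can be triaged mechanically. The following is a minimal sketch, not part of the Icinga handler: `parse_drives` and `SAMPLE` are hypothetical names, and the parser assumes only the field labels visible in the snapshot above (`Span:`, `Slot Number:`, `Predictive Failure Count:`, `Firmware state:`) and the `=====> ... <=====` highlighting.

```python
import re

# Abbreviated copy of the snapshot above (only the fields the sketch reads).
SAMPLE = """\
Span: 0 - Number of PDs: 2
    Slot Number: 0
    Drive's position: DiskGroup: 0, Span: 0, Arm: 0
    Predictive Failure Count: =====> 49 <=====
    Firmware state: =====> Failed <=====
Span: 1 - Number of PDs: 2
    Slot Number: 2
    Drive's position: DiskGroup: 0, Span: 1, Arm: 0
    Predictive Failure Count: =====> 63 <=====
    Firmware state: Online, Spun Up
Span: 4 - Number of PDs: 2
    Slot Number: 8
    Drive's position: DiskGroup: 0, Span: 4, Arm: 0
    Predictive Failure Count: =====> 157 <=====
    Firmware state: Online, Spun Up
"""


def parse_drives(text):
    """Return one dict per physical drive found in a status snapshot."""
    drives = []
    span = None
    current = None
    for raw in text.splitlines():
        # Drop the highlighting arrows so values parse cleanly.
        line = raw.replace("=====>", "").replace("<=====", "").strip()
        header = re.match(r"Span: (\d+) - Number of PDs", line)
        if header:
            span = int(header.group(1))
            continue
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Slot Number":
            # A slot number starts a new drive record.
            current = {"span": span, "slot": int(value)}
            drives.append(current)
        elif current is not None and key == "Predictive Failure Count":
            current["predictive_failures"] = int(value)
        elif current is not None and key == "Firmware state":
            current["state"] = value
    return drives


drives = parse_drives(SAMPLE)
failed = [d["slot"] for d in drives if d["state"] == "Failed"]
at_risk = [d["slot"] for d in drives
           if d["state"] != "Failed" and d["predictive_failures"] > 0]
print("failed slots:", failed)
print("online slots with predictive failures:", at_risk)
```

Separating drives that have already failed from drives that are still online but reporting predictive failures mirrors the distinction that matters here: the first group needs a replacement now, the second needs spares on hand.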

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 30 2017, 5:36 PM
Volans added subscribers: Marostegui, jcrespo, Volans.

FYI: This is s1 master! Adding @Marostegui @jcrespo directly too for visibility.
At least the other 2 disks with predictive failure are in different spans.
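
Why the span matters can be sketched in a few lines. This assumes the array is RAID 10 (the usual reading of "Primary-1, Secondary-0" with 6 spans of 2 drives), so each span is a mirror pair that only loses data if both of its drives fail. The function name and the `AT_RISK` list are illustrative, with span and slot values taken from the snapshot above.

```python
from collections import defaultdict

# (span, slot) of every drive that is failed or reports predictive failures.
AT_RISK = [(0, 0), (1, 2), (4, 8)]


def spans_without_redundancy(at_risk, drives_per_span=2):
    """Return spans in which every drive is failed or at risk.

    Assumption: each span is a mirror of `drives_per_span` drives, so a span
    is only in danger when all of its members are suspect.
    """
    by_span = defaultdict(list)
    for span, slot in at_risk:
        by_span[span].append(slot)
    return {span: sorted(slots)
            for span, slots in by_span.items()
            if len(slots) >= drives_per_span}


print(spans_without_redundancy(AT_RISK))
```

With the drives listed above, no span has both members affected, so the result is empty. Two suspect drives in the same span, for example `[(1, 2), (1, 3)]`, would be reported as `{1: [2, 3]}`.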

@Cmjohnson you still in the DC?

@Marostegui: I need to check whether I have any spare 600GB disks left. We bought a few, but we went through them pretty quickly.

@Cmjohnson if it helps, some hosts that are ready to be decommissioned have 600GB disks, though they are probably old: T166486, T164702

jcrespo triaged this task as High priority. · Jun 30 2017, 6:03 PM

@Marostegui the disk has been swapped with the last new spare disk on-site.

Currently rebuilding

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C500437173D8
Sequence Number: 11
Media Error Count: 0
Other Error Count: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size: 0
Firmware state: Rebuild
Device Firmware Level: 0008
Shield Counter: 0
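
The sizes above line up with a "600GB" disk once the units are spelled out. The sketch below assumes 512-byte sectors (the controller reports `Sector Size: 0`, so this is an assumption, not something the output states) and that megacli's "GB" means binary gigabytes (2^30 bytes). The helper names are illustrative.

```python
SECTOR_BYTES = 512  # assumption: not reported by the controller output


def sectors_to_gib(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Convert a hexadecimal sector count to binary gigabytes (2**30 bytes)."""
    return int(hex_sectors, 16) * sector_bytes / 2**30


def sectors_to_gb(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Convert a hexadecimal sector count to decimal gigabytes (10**9 bytes)."""
    return int(hex_sectors, 16) * sector_bytes / 10**9


for label, sectors in [("Raw", "0x45dd2fb0"),
                       ("Non Coerced", "0x45cd2fb0"),
                       ("Coerced", "0x45cc0000")]:
    print(f"{label}: {sectors_to_gib(sectors):.3f} GiB, "
          f"{sectors_to_gb(sectors):.3f} GB")
```

Under these assumptions the raw size works out to roughly 600 decimal GB, which is why a drive sold as 600GB shows up as about 559 "GB" here, and the coerced size is exactly 558.375 GiB.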

Thanks Chris!
Should we order more spares, or how is this usually handled? /cc @mark

Rebuild completed, RAID back to optimal. There are still 2 disks with predictive failures that might fail sooner or later:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: Optimal
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 35
			Other Error Count: 2
			Predictive Failure Count: =====> 63 <=====
			Last Predictive Failure Event Seq Number: 30375

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

		Span: 4 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 4, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: =====> 157 <=====
			Last Predictive Failure Event Seq Number: 30376

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 30C (86.00 F)

=== RaidStatus completed
Marostegui closed this task as Resolved. · Jun 30 2017, 10:10 PM
Marostegui assigned this task to Cmjohnson.

Great!! Thanks!
I will close this for now, and next week we will check whether we need to buy more disks!
Thanks a lot Chris!