
Degraded RAID on db1059
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1059. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 29
			Other Error Count: 23
			Predictive Failure Count: =====> 2 <=====
			Last Predictive Failure Event Seq Number: 3221

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 43C (109.40 F)

=== RaidStatus completed
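
The plugin marks problem fields with `=====> ... <=====`. As a rough sketch (not part of the Icinga handler; the function name `flagged_fields` and the sample text are made up for illustration), those markers can be pulled out of a pasted snapshot like this:

```python
import re

# Matches lines such as "State: =====> Degraded <=====" in the snapshot.
# The marker format is taken from the output above; everything else is an assumption.
FLAGGED = re.compile(
    r"^\s*(?P<field>[^:]+):\s*=====>\s*(?P<value>.*?)\s*<=====\s*$"
)


def flagged_fields(report: str) -> list[tuple[str, str]]:
    """Return (field, value) pairs for every highlighted line."""
    pairs = []
    for line in report.splitlines():
        match = FLAGGED.match(line)
        if match:
            pairs.append((match.group("field").strip(), match.group("value")))
    return pairs


# Hypothetical excerpt, shortened from the snapshot above.
sample = """\
\tState: =====> Degraded <=====
\t\t\tSlot Number: 0
\t\t\tPredictive Failure Count: =====> 2 <=====
\t\t\t\tFirmware state: =====> Failed <=====
"""

for field, value in flagged_fields(sample):
    print(f"{field}: {value}")
```

Unflagged lines such as `Slot Number: 0` are skipped, so the result lists only what the plugin considered abnormal.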

Event Timeline

Marostegui added a project: DBA.
Marostegui added subscribers: Cmjohnson, Marostegui.

@Cmjohnson, can we get the disk replaced?
Thanks!

Please note this is out of warranty.

Chris has spares, plus we have spare disks from decom task T178162.

@Cmjohnson if you have some time today, could we get the failed disk swapped?
Thank you!

Thank you Chris!

root@db1059:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 10% in 5 Minutes.

Exit Code: 0x00
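
For a rough idea of how long a rebuild has left, the progress line can be parsed and extrapolated. This is only a sketch: the function names are hypothetical, and the linear extrapolation is an assumption, since rebuild speed varies with load and a rebuild can still abort.

```python
import re

# Pattern follows the "Rebuild Progress" line printed by megacli above.
PROGRESS = re.compile(
    r"Enclosure (?P<enclosure>\d+), Slot (?P<slot>\d+) "
    r"Completed (?P<percent>\d+)% in (?P<minutes>\d+) Minutes"
)


def parse_rebuild_progress(line: str):
    """Return enclosure, slot, percent and minutes as ints, or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def naive_minutes_remaining(percent: int, minutes: int):
    """Linear extrapolation (assumption): the remaining work runs at the same rate."""
    if percent <= 0:
        return None  # nothing completed yet, so there is no rate to extrapolate
    return minutes * (100 - percent) / percent


line = ("Rebuild Progress on Device at Enclosure 32, Slot 0 "
        "Completed 10% in 5 Minutes.")
progress = parse_rebuild_progress(line)
print(progress)
print(naive_minutes_remaining(progress["percent"], progress["minutes"]))
```

With 10% done in 5 minutes, this estimate comes to about 45 more minutes, assuming the rate holds.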

It failed again. Was this a brand new disk, @Cmjohnson?

root@db1059:~# megacli -pdlist -a0

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C50023CDAAF4
Sequence Number: 12
Media Error Count: 2
Other Error Count: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Failed
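
The size lines give both a GB figure and a hexadecimal sector count. A small worked check of how the two relate, assuming 512-byte sectors (an assumption: the output above reports `Sector Size:  0`, so the real value is not shown) and binary gigabytes; the function name is illustrative only:

```python
import re

SECTOR_BYTES = 512  # assumption: not confirmed by the output ("Sector Size:  0")

SIZE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<gb>[\d.]+) GB \[0x(?P<sectors>[0-9a-fA-F]+) Sectors\]"
)


def sectors_to_gib(sectors: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert a sector count to binary gigabytes (2**30 bytes)."""
    return sectors * sector_bytes / 2**30


# Size lines copied from the pdlist output above.
lines = [
    "Raw Size: 558.911 GB [0x45dd2fb0 Sectors]",
    "Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]",
    "Coerced Size: 558.375 GB [0x45cc0000 Sectors]",
]

for line in lines:
    match = SIZE.match(line)
    sectors = int(match.group("sectors"), 16)
    # Print the reported figure next to the recomputed one for comparison.
    print(match.group("label"), match.group("gb"), round(sectors_to_gib(sectors), 3))
```

Under that assumption the recomputed values agree with the reported ones to within about 0.001 GB, which suggests megacli's "GB" here is 2**30 bytes.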

Looks like the rebuild is complete and all disks are back online:

root@db1059:~# megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
root@db1059:~#
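
Rather than eyeballing twelve identical lines, the same check can be tallied. A minimal sketch, assuming the `-PDList` output has been captured as text; the helper names are hypothetical and this is not the check Icinga itself runs:

```python
from collections import Counter


def firmware_states(pdlist_output: str) -> Counter:
    """Count each distinct "Firmware state" value in megacli -PDList output."""
    states = Counter()
    for line in pdlist_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware state":
            states[value.strip()] += 1
    return states


def all_online(states: Counter) -> bool:
    """True only if at least one drive was seen and every state starts with "Online"."""
    return bool(states) and all(state.startswith("Online") for state in states)


# Stand-in for the grep output above: twelve drives, all online.
sample = "Firmware state: Online, Spun Up\n" * 12

states = firmware_states(sample)
print(dict(states))
print(all_online(states))
```

An empty result is treated as a failure, so a missing or truncated capture does not read as "all healthy".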