
Degraded RAID on db1073
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1073. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 0, Span: 0, Arm: 1
			Media Error Count: 7
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

		Span: 1 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 3
			Drive's position: DiskGroup: 0, Span: 1, Arm: 1
			Media Error Count: 0
			Other Error Count: 1
			Predictive Failure Count: =====> 115 <=====
			Last Predictive Failure Event Seq Number: 31123

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 33C (91.40 F)

		Span: 3 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 0, Span: 3, Arm: 0
			Media Error Count: 40
			Other Error Count: 19124
			Predictive Failure Count: =====> 42 <=====
			Last Predictive Failure Event Seq Number: 31124

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 33C (91.40 F)

=== RaidStatus completed
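
The snapshot above can be scanned mechanically for drives that need attention. What follows is a minimal sketch, not the actual Icinga event handler: it assumes the summarized text format printed above, and the helper name `drives_needing_attention` and its return shape are hypothetical.

```python
def drives_needing_attention(status_text):
    """Return drives that are Failed or report predictive failures.

    Assumes the summarized format shown in the snapshot above; this is an
    illustrative helper, not part of get_raid_status_megacli.
    """
    drives = []
    current = None
    for raw in status_text.splitlines():
        # Drop the "=====> ... <=====" highlight markers around values.
        line = raw.replace("=====>", "").replace("<=====", "").strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            # Each physical drive block starts with its slot number.
            current = {"slot": int(value), "state": None, "predictive": 0}
            drives.append(current)
        elif current is not None and key == "Firmware state":
            current["state"] = value
        elif current is not None and key == "Predictive Failure Count":
            current["predictive"] = int(value)
    return [d for d in drives
            if d["state"] == "Failed" or d["predictive"] > 0]
```

Applied to the snapshot above, this would flag slots 1, 3, and 6. Slot 3 is still online but has a non-zero predictive failure count, so it may be worth watching as well.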

Event Timeline

Restricted Application added subscribers: Banyek, Marostegui. · Feb 1 2019, 5:18 AM
Marostegui triaged this task as Normal priority.
Marostegui added a subscriber: jcrespo.

Let's get it replaced sooner rather than later, as it is a master on m5.

Marostegui moved this task from Triage to In progress on the DBA board. · Feb 1 2019, 11:18 PM

The disk has been replaced, but I also see a bad disk on slot 6. Leaving this open until tomorrow, when I will replace it.

@Cmjohnson you can proceed with the one on slot 6.
The one on slot #1 finished correctly

Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 1
WWN: 5000C5008D7E8474
Sequence Number: 12
Media Error Count: 0
Other Error Count: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: 0008
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c5008d7e8475
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     00086SLAZC71
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :32C (89.60 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
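
As a sanity check on the sizes reported above, the hexadecimal sector counts can be converted back to capacities. This is a rough sketch under two assumptions that the output does not confirm: sectors are 512 bytes (the output reports `Sector Size: 0`), and the "GB" figures mean 2^30 bytes. `sectors_to_gib` is a hypothetical helper.

```python
SECTOR_BYTES = 512  # assumption: the real sector size is not stated above


def sectors_to_gib(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Convert a hex sector count such as '0x45dd2fb0' to GiB (2**30 bytes)."""
    return int(hex_sectors, 16) * sector_bytes / 2**30


for label, hex_sectors in [
    ("Raw", "0x45dd2fb0"),
    ("Non Coerced", "0x45cd2fb0"),
    ("Coerced", "0x45cc0000"),
]:
    print(f"{label}: {sectors_to_gib(hex_sectors):.3f} GiB")
```

Under these assumptions the coerced size comes out at exactly 558.375, matching the reported value, and the other two agree with the reported figures to within 0.001.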

Thank you

Marostegui closed this task as Resolved. · Feb 7 2019, 5:34 PM

Thanks @Cmjohnson for replacing disk #6!

17:31 <+icinga-wm> RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy