
Degraded RAID on analytics1055
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1055. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 32C (89.60 F)

=== RaidStatus completed
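
For anyone triaging a snapshot like this by script, here is a minimal sketch of pulling the key fields out of the text. It assumes the snapshot contains exactly one virtual drive and one physical disk, as this one does. The names `SNAPSHOT`, `parse_fields` and `summarize` are made up for illustration and are not part of the RAID event handler.

```python
import re

# A few lines copied from the snapshot above.
SNAPSHOT = """\
	Virtual Drive: 2 (Target Id: 2)
	State: =====> Offline <=====
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
				Firmware state: =====> Failed <=====
"""


def parse_fields(text):
    """Return {key: value} for each 'Key: Value' line.

    The '=====>' and '<=====' highlight markers are stripped from values.
    Later keys overwrite earlier ones, so this only suits a snapshot that
    describes a single virtual drive and a single physical disk.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a 'Key: Value' line
        fields[key.strip()] = re.sub(r"=+>|<=+", "", value).strip()
    return fields


def summarize(text):
    """Build a one-line summary: which VD is down and which disk backs it."""
    f = parse_fields(text)
    vd = f["Virtual Drive"].split()[0]  # '2 (Target Id: 2)' -> '2'
    disk = f"[{f['Enclosure Device ID']}:{f['Slot Number']}]"
    return f"VD {vd} is {f['State']}: disk {disk} is {f['Firmware state']}"


print(summarize(SNAPSHOT))
```

The `[enclosure:slot]` pair is the part that matters when asking someone on site to pull a disk, since the virtual drive number alone does not identify the physical slot.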

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

The RAID 0 virtual drive numbers do not map 1:1 to physical slot numbers, so although VD2 (a single-disk RAID 0) has failed, the failed HDD is actually the one in slot 1:

Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 1
WWN: 500003961ba820e5
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA


Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Failed
Device Firmware Level: FL1H
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b31234abc1
Connected Port Number: 0(path0) 
Inquiry Data: ATA     TOSHIBA MG03ACA4FL1H           25I8K1QWF
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :32C (89.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No
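
As a sanity check when ordering a replacement, the hex sector counts above can be converted to bytes. This is a small sketch using the 512-byte sector size reported in the output. The helper name `sectors_to_bytes` is hypothetical. The sizes megacli labels "TB" appear to be binary units (TiB), so the raw size works out to roughly 4.0 × 10^12 bytes.

```python
SECTOR_BYTES = 512  # from "Sector Size:  512" in the output above


def sectors_to_bytes(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Convert a hex sector count such as '0x1d1c0beb0' to a size in bytes."""
    return int(hex_sectors, 16) * sector_bytes


raw_bytes = sectors_to_bytes("0x1d1c0beb0")      # "Raw Size" line
coerced_bytes = sectors_to_bytes("0x1d1b00000")  # "Coerced Size" line

for label, size in (("raw", raw_bytes), ("coerced", coerced_bytes)):
    # Show both binary (TiB) and decimal (TB) units side by side.
    print(f"{label:8s} {size} bytes = {size / 2**40:.4f} TiB = {size / 10**12:.4f} TB")
```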

I've submitted the above to Dell via the self dispatch tool, having the replacement disk ship directly to eqiad. Dell self dispatch # SR952244652.

RobH moved this task from Backlog to Up next on the ops-eqiad board.
RobH created subtask Unknown Object (Task). Aug 11 2017, 9:25 PM
Volans triaged this task as Medium priority. Aug 22 2017, 8:01 AM

The new disk arrived today and will be swapped on 08/30.

@elukey I swapped the disk in slot 1. megacli still shows the drive as failed and does not show the serial number of the new disk; this could be due to preserved cache. Please try to add the disk back, and let me know if you have any problems.
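
If preserved cache is the culprit, the checks could look roughly like the sketch below. This is an assumption about the procedure, not a record of what was actually run on analytics1055. The adapter number (0) and virtual drive number (2) come from the snapshot in the description, and `preserved_cache_commands` is a hypothetical helper. It only builds and prints the command lines so they can be reviewed first; discarding preserved cache throws away pending writes, so the flags should be confirmed against the installed megacli before running anything.

```python
import shlex


def preserved_cache_commands(adapter, virtual_drive):
    """Return megacli command lines (as argv lists) for inspecting and
    clearing preserved cache. Nothing is executed here."""
    a = f"-a{adapter}"
    return [
        # List virtual drives that still hold preserved (pinned) cache.
        ["megacli", "-GetPreservedCacheList", a],
        # Discard the preserved cache of the failed virtual drive.
        ["megacli", "-DiscardPreservedCache", f"-L{virtual_drive}", a],
        # Re-list physical disks to confirm the new disk is now visible.
        ["megacli", "-PDList", a],
    ]


for argv in preserved_cache_commands(adapter=0, virtual_drive=2):
    print(shlex.join(argv))
```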

Done! Host back to working, thanks Chris!

RobH closed subtask Unknown Object (Task) as Resolved. Feb 14 2018, 11:56 PM