
Degraded RAID on dbstore1003
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host dbstore1003. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 6
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 6

			PD: 4 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed

Event Timeline

Restricted Application added a subscriber: Marostegui. · Nov 26 2019, 10:50 AM
Marostegui added a subscriber: elukey.
jbond triaged this task as Medium priority. · Nov 26 2019, 11:46 AM
Cmjohnson added subscribers: Jclark-ctr, Cmjohnson.

I created a self-dispatch ticket. You have successfully submitted request SR1004377941. Assigning to @Jclark-ctr since I will be out of the area.

Thanks a lot!

Any update on this? Thanks!

Jclark-ctr added a comment. · Edited · Dec 2 2019, 9:29 PM

@Marostegui I just received the drive from the warehouse. Can the drive be swapped now? Can you confirm the slot?

It looks like disk #4:

root@dbstore1003:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 4.364 TB
State               : Degraded
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 6

PD: 0 Information
Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 500080d910eafe59
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DAC9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b8dd1c0
Connected Port Number: 0(path0)
Inquiry Data:         183S103PTBWTTHNSF81D60CSE                               DAC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :25C (77.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No




PD: 1 Information
Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 1
WWN: 500080d910eafee1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DAC9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b8dd1c1
Connected Port Number: 0(path0)
Inquiry Data:         183S104TTBWTTHNSF81D60CSE                               DAC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :25C (77.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No




PD: 2 Information
Enclosure Device ID: 32
Slot Number: 2
Drive's position: DiskGroup: 0, Span: 0, Arm: 2
Enclosure position: 1
Device Id: 2
WWN: 500080d910eafe7b
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DAC9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b8dd1c2
Connected Port Number: 0(path0)
Inquiry Data:         183S103MTBWTTHNSF81D60CSE                               DAC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :24C (75.20 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No




PD: 3 Information
Enclosure Device ID: 32
Slot Number: 3
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: 1
Device Id: 3
WWN: 500080d910eafec3
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DAC9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b8dd1c3
Connected Port Number: 0(path0)
Inquiry Data:         183S103RTBWTTHNSF81D60CSE                               DAC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :25C (77.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No




PD: 4 Information




PD: 5 Information
Enclosure Device ID: 32
Slot Number: 5
Drive's position: DiskGroup: 0, Span: 0, Arm: 5
Enclosure position: 1
Device Id: 5
WWN: 500080d910eafebf
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DAC9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b8dd1c5
Connected Port Number: 0(path0)
Inquiry Data:         183S1044TBWTTHNSF81D60CSE                               DAC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :25C (77.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No




Exit Code: 0x00
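
For reference, the failed slot shows up in the output above as a `PD: 4 Information` header with no detail lines under it. A minimal sketch of how that check could be scripted, assuming the `-LDPDInfo` output has already been captured as text; the function name `find_missing_pds` and the trimmed `sample` are hypothetical and only illustrate the idea:

```python
import re

def find_missing_pds(ldpdinfo_text):
    """Return PD indexes whose 'PD: N Information' header has no 'Slot Number' line."""
    missing = []
    current = None        # PD index of the block currently being read
    has_details = False   # True once a 'Slot Number:' line is seen for that block
    for line in ldpdinfo_text.splitlines():
        header = re.match(r"\s*PD:\s*(\d+)\s+Information", line)
        if header:
            # A new block starts: record the previous one if it was empty.
            if current is not None and not has_details:
                missing.append(current)
            current = int(header.group(1))
            has_details = False
        elif current is not None and re.match(r"\s*Slot Number:", line):
            has_details = True
    # Do not forget the last block in the output.
    if current is not None and not has_details:
        missing.append(current)
    return missing

# Trimmed-down sample in the same shape as the output above (assumption:
# a healthy PD block always contains a 'Slot Number:' line).
sample = """\
PD: 3 Information
Enclosure Device ID: 32
Slot Number: 3

PD: 4 Information

PD: 5 Information
Enclosure Device ID: 32
Slot Number: 5

Exit Code: 0x00
"""

print(find_missing_pds(sample))  # -> [4]
```

This only flags blocks that are completely empty; a drive that is present but in a bad firmware state would need a separate check on the `Firmware state` line.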

And confirming with the controller's log:

seqNum: 0x000002d4
Time: Tue Nov 26 10:18:58 2019

Code: 0x0000010b
Class: 1
Locale: 0x02
Event Description: Command timeout on PD 04(e0x20/s4) Path 500056b37b8dd1c4, CDB: 2a 00 88 b6 72 00 00 02 00 00
Event Data:
===========
Device ID: 4
Enclosure Index: 32
Slot Number: 4
CDB Length: 10
CDB Data:
002a 0000 0088 00b6 0072 0000 0000 0002 0000 0000 0000 0000 0000 0000 0000 0000
Sense Length: 0
Sense Data:
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Time: Tue Nov 26 10:19:07 2019

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 04(e0x20/s4) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 4
Enclosure Index: 32
Slot Number: 4
Previous state: 24
New state: 17

Time: Tue Nov 26 10:19:08 2019

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 04(e0x20/s4) from FAILED(11) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 4
Enclosure Index: 32
Slot Number: 4
Previous state: 17
New state: 1
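
The state transitions in the log excerpt above can be pulled out mechanically. The sketch below is only an illustration under stated assumptions: the regular expression is modelled on the two `Event Description` lines shown here, not on any documented log format, and `parse_state_changes` is a hypothetical helper name. It also assumes the codes in parentheses are hexadecimal, which fits this excerpt (0x18 = 24 and 0x11 = 17 match the decimal `Previous state` and `New state` values).

```python
import re

# Pattern modelled on the 'State change on PD ...' lines in this ticket (assumption).
STATE_CHANGE = re.compile(
    r"State change on PD (?P<pd>[0-9a-fA-F]+)"
    r"\(e(?P<encl>0x[0-9a-fA-F]+)/s(?P<slot>\d+)\) "
    r"from (?P<old>\w+)\((?P<old_code>[0-9a-fA-F]+)\) "
    r"to (?P<new>\w+)\((?P<new_code>[0-9a-fA-F]+)\)"
)

def parse_state_changes(log_text):
    """Return one dict per state-change event found in the log text."""
    events = []
    for m in STATE_CHANGE.finditer(log_text):
        events.append({
            "enclosure": int(m.group("encl"), 16),   # e0x20 -> 32
            "slot": int(m.group("slot")),            # assumed to be decimal
            "old": m.group("old"),
            "new": m.group("new"),
            "old_code": int(m.group("old_code"), 16),
            "new_code": int(m.group("new_code"), 16),
        })
    return events

sample_log = """\
Event Description: State change on PD 04(e0x20/s4) from ONLINE(18) to FAILED(11)
Event Description: State change on PD 04(e0x20/s4) from FAILED(11) to UNCONFIGURED_BAD(1)
"""

for event in parse_state_changes(sample_log):
    print(event["enclosure"], event["slot"], event["old"], "->", event["new"])
# 32 4 ONLINE -> FAILED
# 32 4 FAILED -> UNCONFIGURED_BAD
```

Both events resolve to enclosure 32, slot 4, which agrees with the `[32:4]` address of the empty PD block.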

@Jclark-ctr can you confirm if the LED for slot 4 is blinking differently on the server chassis?

Disk replaced by John and I can see it rebuilding:

root@dbstore1003:~# megacli -PDRbld -ShowProg -physdrv[32:4] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 12% in 11 Minutes.
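
As a rough guide to when the rebuild might finish, the progress line above can be extrapolated. This is a sketch that assumes the rebuild rate stays constant, which may well not hold under database load; `estimate_remaining_minutes` is a hypothetical helper and the pattern is modelled only on the single line shown here.

```python
import re

# Pattern modelled on the 'Rebuild Progress ...' line in this ticket (assumption).
PROGRESS = re.compile(
    r"Enclosure (\d+), Slot (\d+) Completed (\d+)% in (\d+) Minutes"
)

def estimate_remaining_minutes(line):
    """Linearly extrapolate the remaining rebuild time; None if it cannot be estimated."""
    m = PROGRESS.search(line)
    if not m:
        return None
    percent = int(m.group(3))
    elapsed = int(m.group(4))
    if percent == 0:
        return None  # no progress yet, nothing to extrapolate from
    total = elapsed * 100 / percent  # projected total duration in minutes
    return total - elapsed

line = ("Rebuild Progress on Device at Enclosure 32, Slot 4 "
        "Completed 12% in 11 Minutes.")
print(round(estimate_remaining_minutes(line)))  # -> 81
```

At 12% in 11 minutes, a constant rate would put the whole rebuild at about 92 minutes, so roughly 81 minutes remaining.
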
Jclark-ctr closed this task as Resolved. · Dec 10 2019, 2:14 PM

Replaced Failed Drive