Degraded RAID on cloudvirt1018
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 0
			Other Error Count: 25290
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 27C (80.60 F)

=== RaidStatus completed
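
For anyone triaging similar snapshots, the `Key: value` lines above can be pulled into a small summary. This is only a minimal sketch, not part of the Icinga handler: the names `parse_fields` and `summarize` are hypothetical, and the sample text is an excerpt of the snapshot above with the leading tabs removed.

```python
import re

# Excerpt of the snapshot above (leading tabs removed).
SNAPSHOT = """\
PD: 4 Information
Enclosure Device ID: 32
Slot Number: 8
Drive's position: DiskGroup: 0, Span: 0, Arm: 4
Media Error Count: 0
Other Error Count: 25290
Predictive Failure Count: 0
Firmware state: =====> Rebuild <=====
Media Type: Solid State Device
"""

# The status script wraps non-optimal states in "=====> ... <=====".
HIGHLIGHT = re.compile(r"=====>\s*(.*?)\s*<=====")


def parse_fields(text):
    """Split 'Key: value' lines into a dict, dropping the highlight markers."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank lines and banners such as '=== RaidStatus completed'
        value = value.strip()
        match = HIGHLIGHT.search(value)
        fields[key.strip()] = match.group(1) if match else value
    return fields


def summarize(fields):
    """Hypothetical helper: keep only the fields useful for triage."""
    return {
        "slot": int(fields["Slot Number"]),
        "state": fields["Firmware state"],
        "other_errors": int(fields["Other Error Count"]),
    }


summary = summarize(parse_fields(SNAPSHOT))
print(summary)
```

The parser splits on the first colon only, so values that contain colons themselves (such as the `Drive's position` line) stay intact.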

Event Timeline

Bstorm added a subscriber: Bstorm. (Oct 23 2019, 11:42 PM)
Enclosure Device ID: 32
Slot Number: 4
Enclosure position: 1
Device Id: 4
WWN: 55cd2e414dae9475
Sequence Number: 4
Media Error Count: 0
Other Error Count: 98287
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Non Coerced Size: 1.454 TB [0xba3d4ab0 Sectors]
Coerced Size: 1.454 TB [0xba3c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  4096
Firmware state: Unconfigured(bad)
Device Firmware Level: DL2D
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b3f5bd95c4
Connected Port Number: 0(path0) 
Inquiry Data:   BTHC7112043Q1P6PGNINTEL SSDSC2BX016T4R                    G201DL2D
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Solid State Device
Drive Temperature :26C (78.80 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

This one is Dell's expected size.
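
To double-check the sizes, the bracketed sector counts in the dump above can be converted to bytes. The sketch below assumes the counts are in 512-byte logical sectors (per the `Logical Sector Size` line). It also assumes that the tool's "TB" means 2^40 bytes truncated to three decimals. That second assumption is inferred from the numbers here, not taken from megacli documentation. The function names are hypothetical.

```python
import math

LOGICAL_SECTOR_BYTES = 512  # "Logical Sector Size: 512" in the dump above


def sectors_to_bytes(hex_sectors, sector_bytes=LOGICAL_SECTOR_BYTES):
    """Convert a hex sector count such as '0xba4d4ab0' to bytes."""
    return int(hex_sectors, 16) * sector_bytes


def as_binary_tb(num_bytes):
    """Express bytes in units of 2**40, truncated (not rounded) to 3 decimals."""
    return math.floor(num_bytes / 2**40 * 1000) / 1000


raw = sectors_to_bytes("0xba4d4ab0")      # Raw Size
coerced = sectors_to_bytes("0xba3c0000")  # Coerced Size

print(raw, as_binary_tb(raw))          # 1600321314816 1.455
print(coerced, as_binary_tb(coerced))  # 1599741100032 1.454
```

Under those assumptions the raw size comes to about 1.6 × 10^12 bytes, and both computed figures match the 1.455 TB and 1.454 TB shown in the dump.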

The funny thing is that the disk that failed in T230575 (Degraded RAID on cloudvirt1018) isn't listed as having a problem now. Only this one is listed as failed.

wiki_willy added subscribers: Jclark-ctr, wiki_willy.

@Bstorm - that's really weird. If it's just the drive size that Dell has on file for us, I'll just shoot this over to @Jclark-ctr to have that RMA'd. Thanks, Willy

@Bstorm Is this host in use? When can we schedule a good time for me to troubleshoot?

It is in use. It can handle a couple of disk failures without evacuating. If it needs to be taken offline, it will need some work on our end to get it ready.

To ask more directly, do you need us to evacuate this host for troubleshooting? @Jclark-ctr

Looks like this was approved this time around. @Jclark-ctr, please keep an eye out for the disk in receiving.

Replaced the failed drive.

JHedden closed this task as Resolved. (Nov 19 2019, 10:25 PM)
JHedden added a subscriber: JHedden.

@Jclark-ctr replaced this today with a new 1.9TB drive. No host errors were seen and the megaraid card looks clean.