
Degraded RAID on cloudvirt1018
Open, Needs Triage, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 1
			Media Error Count: 0
			Other Error Count: 13411
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 27C (80.60 F)

=== RaidStatus completed
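The wrapper script flags anything non-optimal with `=====> ... <=====` markers, as seen in the snapshot above. A minimal sketch of how such output can be scanned for those markers (the field names come from the snapshot; the parsing logic is illustrative, not the actual Nagios plugin):

```python
import re

# Abridged lines copied verbatim from the status snapshot above
SAMPLE = """\
State: =====> Degraded <=====
Number Of Drives: 8
Firmware state: =====> Rebuild <=====
Media Type: Solid State Device
"""

def find_alerts(megacli_output):
    """Collect the fields the wrapper flags with '=====> ... <====='."""
    alerts = []
    for line in megacli_output.splitlines():
        m = re.search(r"=====> (.+?) <=====", line)
        if m:
            field = line.split(":", 1)[0].strip()
            alerts.append((field, m.group(1)))
    return alerts

print(find_alerts(SAMPLE))
# [('State', 'Degraded'), ('Firmware state', 'Rebuild')]
```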

Event Timeline

Bstorm added a subscriber: Bstorm. Aug 15 2019, 8:32 PM

Looks like a bad disk here:

Enclosure Device ID: 32
Slot Number: 1
Enclosure position: 1
Device Id: 1
WWN: 55cd2e41505091c6
Sequence Number: 4
Media Error Count: 31
Other Error Count: 12965
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors]
Coerced Size: 1.745 TB [0xdf7c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  4096
Firmware state: Unconfigured(bad)
Device Firmware Level: DL61
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b3f5bd95c1
Connected Port Number: 0(path0) 
Inquiry Data:   PHYG8450001W1P9DGNSSDSC2KG019T8R                          XCV1DL61
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Solid State Device
Drive Temperature :24C (75.20 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

The rebuilding disk is the hot spare replacing it. Disk in position 1 would be the bad one.

Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.

That's not great.

Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Aug 20 2019, 2:24 PM

Another ticket has been placed with Dell.

The ticket was declined by Dell, stating that the disks we have installed are not original to the server. This requires me to investigate.

Cmjohnson reassigned this task from Cmjohnson to wiki_willy. Aug 27 2019, 7:51 PM
Cmjohnson added subscribers: wiki_willy, Cmjohnson.

Regarding the reason the ticket was declined: I verified that the failed disk is indeed 1.9TB, but it is an SSD. The original order, and the label on the disk caddy, show an Intel 1.6TB S3610 SSD. Assigning to @wiki_willy.

Denial Notes
We are unable to proceed with your request as the requested part(1.9TB HDD) is not on the original order. If this part was purchased separately, please resubmit the request with the Dell Order number on which this part was purchased.

@Bstorm - I was able to confirm we originally ordered this machine with 1.6TB drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks showing when/how they were replaced with 1.9TB drives (which Dell won't support). Do you have any details from previous records on where these 1.9TB disks came from? (ie swapped from another server, ordered separately, etc) Thanks, Willy
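For what it's worth, the apparent gap between the ordered sizes ("1.6TB", "1.9TB") and megacli's reported sizes ("1.455 TB", "1.746 TB") is just decimal vendor terabytes versus the binary units megacli prints. Working it out from the raw sector counts in the snapshots above (a quick illustrative calculation, not part of the original ticket):

```python
SECTOR_BYTES = 512  # "Logical Sector Size: 512" per the megacli output

# Raw sector counts copied from the megacli output in this task
drives = {
    "rebuilding spare": 0xBA4D4AB0,  # shown as "Raw Size: 1.455 TB"
    "failed disk":      0xDF8FE2B0,  # shown as "Raw Size: 1.746 TB"
}

for name, sectors in drives.items():
    size = sectors * SECTOR_BYTES
    print(f"{name}: {size / 2**40:.3f} TiB binary = {size / 10**12:.2f} TB decimal")
```

The spare works out to a 1.60 TB (decimal) drive and the failed disk to 1.92 TB, matching the "1.6TB" original order and Dell's "1.9TB" part description respectively. So the spare appears to be an original-spec drive, while the failed disk is genuinely the larger part.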

Andrew added a parent task: Unknown Object (Task). Sep 4 2019, 9:01 PM
wiki_willy reassigned this task from wiki_willy to Bstorm. Sep 4 2019, 9:17 PM

Assigning to @Bstorm to follow up on the previous comment.

According to T229156, this is the disk that came from Dell for that ticket.

There are two disks in there with the larger size.

Note that the size matches T229156#5399581, which shows this same disk being replaced in the last ticket. That suggests the disk was already the larger size when it was replaced then, and that the larger disk may have come from the earlier ticket, T216004.

In the February 2019 maintenance, where two disks failed at the same time, all disks matched in size (see T216004#4950812), which makes me wonder. However, that comment lists the case numbers under which Dell sent replacement disks. The only place I can imagine the bigger disks came from is one or both of those tickets.

I'm not sure if I can access the service requests. I can try.

Service request 986376069 does not show anything terribly useful.
Since T229156 shows the disk at its current size, I have to imagine that Dell sent us larger disks during that request. I thought the failed drives in that ticket were in bays 2 and 3, while the larger disks here are in bays 1 and 2. However, that level of detail is not in our ticket, and the sizes and disk serial numbers are not in the service request. Dell should be able to provide this information. I don't have any additional information on how they began sending us larger disks, but T229156 clearly shows the current size and that Dell replaced it.

@wiki_willy I think we need to follow up with Dell about that. They should have some kind of tracking on the disk serial numbers, etc. that they have been sending us, right?
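If it helps the follow-up with Dell, the identifiers they would track are already in the megacli listings: the WWN and the Inquiry Data string (which embeds the vendor serial and firmware revision). A small sketch for pulling them out of `PDList`-style output (field names are taken from the listing above; the parsing is illustrative, under the assumption that each drive block starts with an "Enclosure Device ID" line as in the output here):

```python
WANTED = ("Enclosure Device ID", "Slot Number", "WWN", "Inquiry Data")

def drive_identifiers(pd_output):
    """Group the identifying fields of each physical drive in megacli output."""
    drives, current = [], {}
    for line in pd_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID" and current:  # a new drive block starts
            drives.append(current)
            current = {}
        if key in WANTED:
            current[key] = value
    if current:
        drives.append(current)
    return drives

# Fields copied from the failed drive's listing above
sample = """\
Enclosure Device ID: 32
Slot Number: 1
WWN: 55cd2e41505091c6
Inquiry Data:   PHYG8450001W1P9DGNSSDSC2KG019T8R                          XCV1DL61
"""
print(drive_identifiers(sample))
```

Sending Dell the WWN plus the serial from Inquiry Data for the two larger drives should let them match the parts against their shipment records.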