
Degraded RAID on cloudvirt1018
Open, Needs Triage, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 1
			Media Error Count: 0
			Other Error Count: 13411
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 27C (80.60 F)

=== RaidStatus completed
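The wrapper script flags anything non-optimal with `=====> ... <=====` markers, as seen in the snapshot above. A minimal sketch of how such output can be scanned for those markers (the field names come from the snapshot; the parsing logic is illustrative, not the actual Nagios plugin):

```python
import re

# Abridged lines copied verbatim from the status snapshot above
SAMPLE = """\
State: =====> Degraded <=====
Number Of Drives: 8
Firmware state: =====> Rebuild <=====
Media Type: Solid State Device
"""

def find_alerts(megacli_output):
    """Collect the fields the wrapper flags with '=====> ... <====='."""
    alerts = []
    for line in megacli_output.splitlines():
        m = re.search(r"=====> (.+?) <=====", line)
        if m:
            field = line.split(":", 1)[0].strip()
            alerts.append((field, m.group(1)))
    return alerts

print(find_alerts(SAMPLE))
# [('State', 'Degraded'), ('Firmware state', 'Rebuild')]
```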

Event Timeline

Bstorm added a subscriber: Bstorm. Aug 15 2019, 8:32 PM

Looks like a bad disk here:

Enclosure Device ID: 32
Slot Number: 1
Enclosure position: 1
Device Id: 1
WWN: 55cd2e41505091c6
Sequence Number: 4
Media Error Count: 31
Other Error Count: 12965
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors]
Coerced Size: 1.745 TB [0xdf7c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  4096
Firmware state: Unconfigured(bad)
Device Firmware Level: DL61
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b3f5bd95c1
Connected Port Number: 0(path0) 
Inquiry Data:   PHYG8450001W1P9DGNSSDSC2KG019T8R                          XCV1DL61
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Solid State Device
Drive Temperature :24C (75.20 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

The rebuilding disk is the hot spare replacing it. Disk in position 1 would be the bad one.

Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.

That's not great.

Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Aug 20 2019, 2:24 PM

Another ticket has been placed with Dell.

The ticket was declined by Dell, stating that the disks we have installed are not original to the server. This requires me to investigate.

Cmjohnson reassigned this task from Cmjohnson to wiki_willy. Aug 27 2019, 7:51 PM
Cmjohnson added subscribers: wiki_willy, Cmjohnson.

Regarding the reason the ticket was declined: I verified that the failed disk is indeed 1.9TB, but it is an SSD. The original order, and the label on the disk caddy, show an Intel 1.6TB S3610 SSD. Assigning to @wiki_willy.

Denial Notes
We are unable to proceed with your request as the requested part(1.9TB HDD) is not on the original order. If this part was purchased separately, please resubmit the request with the Dell Order number on which this part was purchased.

@Bstorm - I was able to confirm we originally ordered this machine with 1.6TB drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks showing when/how they were replaced with 1.9TB drives (which Dell won't support). Do you have any details from previous records on where these 1.9TB disks came from? (ie swapped from another server, ordered separately, etc) Thanks, Willy
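For what it's worth, the apparent gap between the ordered sizes ("1.6TB", "1.9TB") and megacli's reported sizes ("1.455 TB", "1.746 TB") is just decimal vendor terabytes versus the binary units megacli prints. Working it out from the raw sector counts in the snapshots above (a quick illustrative calculation, not part of the original ticket):

```python
SECTOR_BYTES = 512  # "Logical Sector Size: 512" per the megacli output

# Raw sector counts copied from the megacli output in this task
drives = {
    "rebuilding spare": 0xBA4D4AB0,  # shown as "Raw Size: 1.455 TB"
    "failed disk":      0xDF8FE2B0,  # shown as "Raw Size: 1.746 TB"
}

for name, sectors in drives.items():
    size = sectors * SECTOR_BYTES
    print(f"{name}: {size / 2**40:.3f} TiB binary = {size / 10**12:.2f} TB decimal")
```

The spare works out to a 1.60 TB (decimal) drive and the failed disk to 1.92 TB, matching the "1.6TB" original order and Dell's "1.9TB" part description respectively. So the spare appears to be an original-spec drive, while the failed disk is genuinely the larger part.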

Andrew added a parent task: Unknown Object (Task). Sep 4 2019, 9:01 PM
wiki_willy reassigned this task from wiki_willy to Bstorm. Sep 4 2019, 9:17 PM

Assigning to @Bstorm to follow up on the previous comment.

According to T229156, this is the disk that came from Dell for that ticket.

There are two disks in there with the larger size.

Note that the size matches T229156#5399581, which shows this same disk being replaced in the last ticket. That suggests the disk was already the larger size when it was replaced then, and that the larger disk may have come from the earlier ticket, T216004.

In the February 2019 maintenance, where two disks failed at the same time, all disks matched in size (see T216004#4950812), which makes me wonder. However, that comment lists the case numbers under which Dell sent replacement disks. The only place I can imagine the bigger disks came from is one or both of those tickets.

I'm not sure if I can access the service requests. I can try.

Service request 986376069 does not show anything terribly useful.
Since T229156 shows the disk at its current size, I have to imagine that Dell sent us larger disks during that request. I thought the failed drives in that ticket were in bays 2 and 3, while the larger disks here are in bays 1 and 2. However, that level of detail is not in our ticket, and the sizes and disk serial numbers are not in the service request. Dell should be able to provide this information. I don't have any additional information on how they began sending us larger disks, but T229156 clearly shows the current size and that Dell replaced it.

@wiki_willy I think we need to follow up with Dell about that. They should have some kind of tracking on the disk serial numbers, etc. that they have been sending us, right?
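If it helps the follow-up with Dell, the identifiers they would track are already in the megacli listings: the WWN and the Inquiry Data string (which embeds the vendor serial and firmware revision). A small sketch for pulling them out of `PDList`-style output (field names are taken from the listing above; the parsing is illustrative, under the assumption that each drive block starts with an "Enclosure Device ID" line as in the output here):

```python
WANTED = ("Enclosure Device ID", "Slot Number", "WWN", "Inquiry Data")

def drive_identifiers(pd_output):
    """Group the identifying fields of each physical drive in megacli output."""
    drives, current = [], {}
    for line in pd_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID" and current:  # a new drive block starts
            drives.append(current)
            current = {}
        if key in WANTED:
            current[key] = value
    if current:
        drives.append(current)
    return drives

# Fields copied from the failed drive's listing above
sample = """\
Enclosure Device ID: 32
Slot Number: 1
WWN: 55cd2e41505091c6
Inquiry Data:   PHYG8450001W1P9DGNSSDSC2KG019T8R                          XCV1DL61
"""
print(drive_identifiers(sample))
```

Sending Dell the WWN plus the serial from Inquiry Data for the two larger drives should let them match the parts against their shipment records.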