
Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only
Open, Needs Triage (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 0
			Other Error Count: 518
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 24C (75.20 F)

=== RaidStatus completed
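
For anyone triaging snapshots like the one above by script, here is a minimal sketch that pulls out the flagged states. It assumes the `=====> ... <=====` markers and the `Virtual Drive:` / `PD: N Information` headings appear exactly as in this snapshot; the function name `parse_raid_status` is hypothetical and not part of the Nagios plugin.

```python
import re

def parse_raid_status(text):
    """Return (context, field, value) for every state flagged with =====> <=====.

    Assumption: headings and markers look like the snapshot in this task;
    other megacli wrappers may format their output differently.
    """
    flagged = []
    context = None
    for raw in text.splitlines():
        line = raw.strip()
        # Remember which virtual drive or physical disk we are inside.
        if line.startswith("Virtual Drive:") or re.match(r"PD: \d+ Information", line):
            context = line
        m = re.match(r"(State|Firmware state): =====> (.+?) <=====", line)
        if m:
            flagged.append((context, m.group(1), m.group(2)))
    return flagged

# Abbreviated copy of the snapshot attached to this task.
SAMPLE = """\
	Virtual Drive: 0 (Target Id: 0)
	State: =====> Degraded <=====
			PD: 4 Information
			Slot Number: 8
				Firmware state: =====> Rebuild <=====
"""

for context, field, value in parse_raid_status(SAMPLE):
    print(f"{context}: {field} = {value}")
```

On this snapshot it reports the virtual drive as Degraded and PD 4 (slot 8) as Rebuild, which matches the report above.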

Event Timeline

wiki_willy added a subscriber: wiki_willy.

Just a heads up Chris, the system is under warranty thru June 2021. Thanks, Willy

There are no workloads on this host now. We're good to have this replaced anytime. Thanks!

Bstorm added a subscriber: Bstorm. Aug 14 2019, 1:31 AM

Per T230442, something strange appears to be going on, possibly a controller freaking out. It lost 4 disks in a very short time and the volume is now read-only. Feel free to reboot or whatever, @Cmjohnson. I included some troubleshooting info on the other ticket.

Bstorm renamed this task from Degraded RAID on cloudvirt1024 to Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only. Aug 14 2019, 1:31 AM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Aug 15 2019, 3:03 PM

A ticket has been placed with Dell

@Bstorm, can you try rebooting the server and see if the disks get back to the correct order? I know that works for analytics. Please try that. I do have a disk, but I'm not sure which disk is bad.

Yup, I can do that. I'm not sure which either, per T230442#5429068
It dropped the failures from the list, and I'm not even entirely convinced the disks are bad with how it behaved. It's not accepting ssh connections anymore, so I'll have to do it via mgmt.

The disk was replaced, but from what I can tell the RAID configuration is not accepting the new disk. In the RAID utility, all the disks show as good, but the RAID is missing a disk. This may need the RAID config manually updated and a re-install. Let me know.

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:17:44Z] <bstorm_> reboot cloudvirt1024 to try and reset raid T230289

It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and logged tickets, but it would have to have lost two more to go read-only (and I seem to recall this was a 10 disk machine)--would need to check to be sure.

Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.

copied from T230442#5413070

                    Versions
                ================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005

                    Mfg. Data
                ================
Mfg. Date       : 08/04/18
Rework Date     : 08/04/18
Revision No     : A04
Battery FRU     : N/A

                Image Versions in Flash:
                ================
BIOS Version       : 6.33.01.0_4.16.07.00_0x06120301
Ctrl-R Version     : 5.18-0700
FW Version         : 4.270.00-8178
NVDATA Version     : 3.1511.00-0014
Boot Block Version : 3.07.00.00-0003

                Pending Images in Flash
                ================
None
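
To compare the firmware build above against a candidate update by script, a rough sketch follows. It assumes the `key : value` layout shown in this output; the helper names and `HYPOTHETICAL_TARGET` are placeholders for illustration, not a real Dell release number.

```python
def parse_adapter_info(text):
    """Parse 'key : value' lines into a dict, skipping headings and rulers."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info

def version_tuple(version):
    """Turn a dotted build string such as '25.5.3.0005' into (25, 5, 3, 5)."""
    return tuple(int(part) for part in version.split("."))

# Abbreviated copy of the controller info pasted in this task.
SAMPLE = """\
                    Versions
                ================
Product Name    : PERC H730P Adapter
FW Package Build: 25.5.3.0005
                Image Versions in Flash:
                ================
FW Version         : 4.270.00-8178
"""

# Placeholder only: substitute the build number from the vendor's release notes.
HYPOTHETICAL_TARGET = "25.5.9.0001"

info = parse_adapter_info(SAMPLE)
installed = info["FW Package Build"]
if version_tuple(installed) < version_tuple(HYPOTHETICAL_TARGET):
    print(f"{info['Product Name']}: {installed} is older than {HYPOTHETICAL_TARGET}")
```

Only the dotted `FW Package Build` string is compared here; fields such as `FW Version` contain a dash and would need separate handling.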

Is there any FW version to update? I don't want to put this back in service if it is marking disks bad at such a rate (especially if they just are marked ok later even if not changed).

https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y
Looks like there's a number of fixes in this update of the controller firmware, but I don't see any very specific to our issue (lots of INTERNAL_DEVICE_RESET, etc). Can we try that before putting it back in service? I can reimage it if that is required to update the firmware (I'm sure we'll need to at this point anyway).

Updated the iDRAC and RAID f/w.