Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 0
			Other Error Count: 518
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 24C (75.20 F)

=== RaidStatus completed
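
Since the snapshot shows the drive in enclosure 32, slot 8 rebuilding, here is a hedged example of how the rebuild and the virtual drive state could be watched from the host with the same megacli tool:

# Rebuild progress for the drive flagged above (enclosure 32, slot 8)
cloudvirt1024:~# megacli -PDRbld -ShowProg -PhysDrv [32:8] -a0

# Overall state of virtual drive 0 (should return to Optimal once the rebuild completes)
cloudvirt1024:~# megacli -LDInfo -L0 -a0 | grep State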

Event Timeline

wiki_willy added a subscriber: wiki_willy.

Just a heads up Chris, the system is under warranty thru June 2021. Thanks, Willy

There are no workloads on this host now. We're good to have this replaced anytime. Thanks!

Per T230442, there appears to be something strange going on here, possibly the controller freaking out. It lost 4 disks in a very short time and the volume is now mounted read-only. Feel free to reboot or whatever @Cmjohnson. I included some troubleshooting info on the other ticket.
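
For anyone picking this up, a minimal sketch (assuming a standard Linux host; exact messages will vary) of how to confirm which filesystem got remounted read-only and see the I/O errors behind it:

# Show filesystems currently mounted read-only
cloudvirt1024:~# mount | grep ' ro,'
# Look for the remount / I/O error messages in the kernel log
cloudvirt1024:~# dmesg -T | grep -iE 'read-only|i/o error' | tail -n 20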

Bstorm renamed this task from Degraded RAID on cloudvirt1024 to Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.Aug 14 2019, 1:31 AM

A ticket has been placed with Dell

@Bstorm can you try rebooting the server and see if the disks come back in the correct order? I know that works for analytics. Please try that... I do have a disk, but I'm not sure which disk is bad.

Yup, I can do that. I'm not sure which either, per T230442#5429068.
It dropped the failures from the list, and I'm not even entirely convinced the disks are bad given how it behaved. It's not accepting ssh connections anymore, so I'll have to do this via mgmt.
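
A hedged sketch of what "via mgmt" looks like in practice; the mgmt hostname is a placeholder and the exact iDRAC commands depend on the iDRAC generation:

# Hypothetical: connect to the host's iDRAC (mgmt) interface over SSH
$ ssh root@<cloudvirt1024-mgmt-host>
# Power cycle the box from the iDRAC prompt
racadm serveraction powercycle
# Attach to the serial-over-LAN console to watch it boot
console com2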

The disk was replaced, but from what I can tell the RAID configuration is not accepting the new disk. When I am in the RAID utility it shows that all the disks are good, but the RAID is missing a disk. This may need the RAID config manually updated and a re-install. Let me know.
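
Before going to a manual config rewrite and re-install, a couple of hedged megacli checks that sometimes get a replacement disk accepted; the [E:S] enclosure:slot value is a placeholder for the replaced disk:

# Check whether the controller is holding a foreign config from the old disk
cloudvirt1024:~# megacli -CfgForeign -Scan -a0
# If so, clear it (this only removes the foreign config, not the live array)
cloudvirt1024:~# megacli -CfgForeign -Clear -a0
# If the new disk shows as Unconfigured(bad), mark it good so it can be used for rebuild
cloudvirt1024:~# megacli -PDMakeGood -PhysDrv [E:S] -a0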

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:17:44Z] <bstorm_> reboot cloudvirt1024 to try and reset raid T230289

It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and logged tickets, but it would have to have lost two more to go read-only (and I seem to recall this is a 10-disk machine); I'd need to check to be sure.

The reboot sent it into a re-image (it stalled at the confirmation about writing the partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.

copied from T230442#5413070

                    Versions
                ================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005

                    Mfg. Data
                ================
Mfg. Date       : 08/04/18
Rework Date     : 08/04/18
Revision No     : A04
Battery FRU     : N/A

                Image Versions in Flash:
                ================
BIOS Version       : 6.33.01.0_4.16.07.00_0x06120301
Ctrl-R Version     : 5.18-0700
FW Version         : 4.270.00-8178
NVDATA Version     : 3.1511.00-0014
Boot Block Version : 3.07.00.00-0003

                Pending Images in Flash
                ================
None

Is there any newer FW version to update to? I don't want to put this back in service if it is marking disks bad at this rate (especially if they are just marked OK again later without having been changed).

https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y
Looks like there are a number of fixes in this update of the controller firmware, but I don't see any that are very specific to our issue (lots of INTERNAL_DEVICE_RESET fixes, etc.). Can we try that before putting it back in service? I can reimage it if that is required to update the firmware (I'm sure we'll need to at this point anyway).
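
If we do the update, a quick hedged check of the running controller firmware before and after; the string comes from the controller dump above, so it should read 25.5.3.0005 beforehand:

cloudvirt1024:~# megacli -AdpAllInfo -a0 | grep 'FW Package Build'
FW Package Build: 25.5.3.0005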

Bstorm triaged this task as High priority.Sep 30 2019, 5:40 PM
Bstorm added subscribers: Cmjohnson, Jclark-ctr.

Apparently, unfortunately, this is still misbehaving. Will gather more details shortly.

T234018 <-- if this turns out to be a normal failed disk, that'd be great.

Nope. It is only showing 6 disks instead of the 10 it has on board. It is definitely malfunctioning.
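
A hedged way to double-check the drive count the controller reports, by counting physical drives in the megacli listing (10 would be expected here):

cloudvirt1024:~# megacli -PDList -a0 | grep -c 'Slot Number'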

The pattern I'm seeing is that the controller complains that a disk isn't functioning correctly, resets it, and then the disk is logged as removed. Notably, I think it's the same disks each time.
From the LC logs:

Log Sequence Number:
1756
Detailed Description:
A physical disk has been removed from the disk group. This alert can also be caused by loose or defective cables or by problems with the enclosure.
Recommended Action:
Do one of the following:
1) If a physical disk was removed from the disk group, either replace the disk or restore the original disk. Identify the disk that was removed by locating the disk that has a red "X" for its status.
2) Perform a rescan after replacing or restoring the disk.
3) If a disk was not removed from the disk group, then check for cable problems. Refer to product documentation for more information on checking the cables.
4) Make sure that the enclosure is powered on.
5) If the problem persists, check the enclosure documentation for further diagnostic information.

The ones that have been removed do not appear to be the same ones as last time. Checking that to be sure, in case there's some record in all this.

Screen Shot 2019-09-30 at 4.36.50 PM.png (798×2 px, 247 KB)

I don't see a record of whether they are the same four disks that were removed by the controller. However, I did record that it removed four disks in the last event. We reproduced this by stress testing the system, so we can probably reproduce it again on any further actions.
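
If we want a record next time, the controller's own event log can be dumped from the OS with megacli; a hedged example, and the output filename is arbitrary:

# Dump the adapter event log to a file, then look for removal/failure events
cloudvirt1024:~# megacli -AdpEventLog -GetEvents -f cloudvirt1024-raid-events.log -a0
cloudvirt1024:~# grep -iE 'removed|fail' cloudvirt1024-raid-events.log | tail -n 20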

@wiki_willy, is this currently in your court or ours?

@Andrew or @Bstorm - are you ok with us taking the machine down to troubleshoot? Thanks, Willy

Yep, that's fine. Nothing on there but test VMs.

Cleared the foreign state on the offline drives. The offline drives now list as ready.

Set up the RAID as RAID 10 with 2 spare disks.
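
For the record, a hedged sketch of how that RAID 10 creation looks with megacli if it wasn't done in the Ctrl-R BIOS utility; the slot layout below is an assumption rather than the layout actually used, and the two spares are added separately with -PDHSP as shown further down:

# Hypothetical RAID 10 over eight disks as four 2-disk spans on adapter 0
cloudvirt1024:~# megacli -CfgSpanAdd -r10 -Array0[32:0,32:1] -Array1[32:2,32:3] -Array2[32:4,32:5] -Array3[32:6,32:7] -a0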

I've stress-tested this box quite a bit; now I'm building a couple of VMs for the 'video' project (encoding04 and encoding05) over there for a real-world test.
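
The exact stress test isn't recorded here; a minimal sketch of the sort of disk load that could be used to try to reproduce the controller resets, assuming fio is installed and /var/lib/nova/instances sits on the array:

cloudvirt1024:~# fio --name=raidstress --directory=/var/lib/nova/instances \
    --ioengine=libaio --direct=1 --rw=randrw --bs=4k --size=4G \
    --numjobs=8 --time_based --runtime=600 --group_reporting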

The drives in slots 32:8 and 32:9 are marked as hot spares now.

# Check that the drive is unconfigured
cloudvirt1024:~# megacli -PDInfo -PhysDrv [32:9] -a0 | grep "Firmware state"
Firmware state: Unconfigured(good), Spun Up

# Mark as a global hot spare
cloudvirt1024:~# megacli -PDHSP -Set -PhysDrv [32:9] -a0

Adapter: 0: Set Physical Drive at EnclId-32 SlotId-9 as Hot Spare Success.

Exit Code: 0x00

# Confirm the change
cloudvirt1024:~# megacli -PDInfo -PhysDrv [32:9] -a0 | grep "Firmware state"
Firmware state: Hotspare, Spun Up
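
Two hedged follow-ups to the above: confirming both spares at once, and the inverse command in case a spare ever needs to be released again.

# Count drives currently flagged as hot spares (should be 2)
cloudvirt1024:~# megacli -PDList -a0 | grep -c 'Firmware state: Hotspare'

# Remove the hot spare designation again if ever needed
cloudvirt1024:~# megacli -PDHSP -Rmv -PhysDrv [32:9] -a0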

Change 554671 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova cloudvirt pool: update cloudvirt1024 comment to reflect hardware fixes

https://gerrit.wikimedia.org/r/554671

Change 554671 merged by Andrew Bogott:
[operations/puppet@production] nova cloudvirt pool: update cloudvirt1024 comment to reflect hardware fixes

https://gerrit.wikimedia.org/r/554671