
Degraded RAID on mw2442
Closed, Resolved; Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2442. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0]
      936738816 blocks super 1.2 [2/1] [U_]
      bitmap: 3/7 pages [12KB], 65536KB chunk

unused devices: <none>
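The `[2/1] [U_]` part of the snapshot above is what marks the array as degraded: two members are expected, one is active, and the `_` stands for the missing one. A minimal sketch of that check follows. The function name `degraded_md_arrays` is hypothetical and this is not the actual `get-raid-status-md` plugin.

```shell
#!/usr/bin/env bash
# Sketch only (not the Icinga plugin): list md arrays whose member-status
# string, e.g. [U_], contains "_", meaning at least one member is missing.
# The function name is an assumption made for illustration.
degraded_md_arrays() {
    # $1: path to an mdstat-formatted file (defaults to /proc/mdstat)
    awk '
        /^md[0-9]+ :/ { name = $1 }                 # remember the array name
        /blocks/ && match($0, /\[[U_]+\]/) {        # find the [UU] / [U_] field
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name, status
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `degraded_md_arrays` with no argument reads `/proc/mdstat` on the host. A path to a saved snapshot can be passed instead.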

Event Timeline

This is Dell's recommended fix for this error. Is this possible? If not, I will submit a dispatch for a new disk. (The physical disk was not removed, so disregard option 1.)

Do one of the following:
1) If the Physical Drive (PD) was removed at the time of event, this event is expected. Else, replace the PD.
2) If it is a non-redundant Virtual Drive (VD), delete the VD, reinsert the PD, recreate the VD, and then restore the data from backup.
3) If it is a redundant VD, the data rebuild operation will automatically start if a hot-spare is already configured. If it does not start, assign a PD as a global hot-spare so the rebuild operation is automatically started.

Sorry for the late reply. I'm not sure what you're asking, tbh. As I understand it, the disk most likely broke, so the "replace the PD" option would be the way to go here.

From the kern.log:

Feb 13 05:26:40 mw2442 kernel: [3489985.628538] megaraid_sas 0000:65:00.0: scanning for scsi0...
Feb 13 05:26:40 mw2442 kernel: [3489985.630288] megaraid_sas 0000:65:00.0: 957 (761117192s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 01/1
Feb 13 05:26:40 mw2442 kernel: [3489985.630345] megaraid_sas 0000:65:00.0: 958 (761117192s/0x0001/FATAL) - VD 01/1 is now OFFLINE
Feb 13 05:26:42 mw2442 kernel: [3489988.205310] sd 0:2:1:0: [sdb] tag#3207 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=2s
Feb 13 05:26:42 mw2442 kernel: [3489988.205329] sd 0:2:1:0: [sdb] tag#3207 CDB: Write(10) 2a 00 00 08 f0 10 00 00 04 00
Feb 13 05:26:42 mw2442 kernel: [3489988.205337] blk_update_request: I/O error, dev sdb, sector 585744 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 13 05:26:42 mw2442 kernel: [3489988.216037] md: super_written gets error=-5
Feb 13 05:26:42 mw2442 kernel: [3489988.220410] md/raid1:md0: Disk failure on sdb2, disabling device.
Feb 13 05:26:42 mw2442 kernel: [3489988.220410] md/raid1:md0: Operation continuing on 1 devices.
Feb 13 05:26:43 mw2442 kernel: [3489989.051118] sd 0:2:1:0: SCSI device is removed
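The line that matters in the excerpt above is the `md/raid1:md0: Disk failure on sdb2` message. A small sketch for pulling such events out of a kernel log follows. The function name `md_failures` is hypothetical, and the pattern is based only on the message format shown in this task.

```shell
#!/usr/bin/env bash
# Sketch only: print "<array> <member>" for every md disk-failure message
# in a kernel log. The function name is an assumption, and the pattern is
# derived from the log lines quoted in this task.
md_failures() {
    # $1: path to a kernel log file, e.g. /var/log/kern.log
    sed -n -E 's#.*md/raid[0-9]+:(md[0-9]+): Disk failure on ([a-z0-9]+),.*#\1 \2#p' "$1"
}
```

For example, `md_failures /var/log/kern.log` prints one line per failure event and nothing if no such message is present.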

Mentioned in SAL (#wikimedia-operations) [2024-02-19T09:49:11Z] <claime> Draining mw2442 - failed RAID - T357380

SR185570210: requested a replacement disk from Dell.

The drive has been replaced. Physically I don't see any alarms, but let me know if you are still having issues with the RAID.

Host rebooted by jayme@cumin1002 with reason: hopefully detect new disk

Host rebooted by jayme@cumin1002 with reason: hopefully detect new disk

The new disk was not detected by the host, even after scsi scan (maybe that's not a thing anymore? ;))
Anyhow, I rebooted the node and it did not come back up. Power-cycling again with the console attached showed the following prompt:

There are offline or missing virtual drives with preserved cache.  
Please check the cables and ensure that all drives are present.  
Press F to import any foreign disks or press D to discard preserved cache.

I optimistically tried to preserve the cache and import the "foreign disk" (which probably is the replacement), which failed without a visible error message. The prompt was repeated, so I decided to ditch the cache, which made the server come back up.
As the new drive still did not show up in the OS after booting, I checked the iDRAC GUI and saw that the storage controller is in RAID mode with one virtual disk configured per physical disk. Discarding the cache discarded the physical disk as well. Other servers of the same model are configured this way too, so I recreated the physical disk manually to restore operation, copied the partition table, and added it back to the md.
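For reference, the last step (copy the partition table, add the partition back to the md) can be sketched as below. The exact commands used on mw2442 are not recorded in this task, so the use of `sfdisk` and `mdadm`, the helper names, and the device names (`/dev/sda` healthy, `/dev/sdb` replaced, `/dev/md0`, partition 2, taken from the logs above) are all assumptions. The sketch only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: clone the partition layout from the healthy disk to the
# replacement and re-add the partition to the md array.
# Assumptions: sfdisk/mdadm are the tools used; helper names are hypothetical.
# Commands are only printed unless DRY_RUN=0 is set.
maybe_run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

readd_md_member() {
    # $1 healthy disk, $2 replacement disk, $3 md array, $4 partition number
    local good=$1 new=$2 array=$3 part=$4
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ sfdisk -d $good | sfdisk $new"
    else
        # Dump the layout of the healthy disk and write it to the new one.
        sfdisk -d "$good" | sfdisk "$new"
    fi
    # Add the matching partition back; md then starts the resync.
    # Note: this naming works for sdX devices, NVMe devices need a "p" infix.
    maybe_run mdadm --manage "$array" --add "${new}${part}"
}
```

A dry run would be `readd_md_member /dev/sda /dev/sdb /dev/md0 2`. Resync progress can then be followed in `/proc/mdstat`.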

@MatthewVernon pointed out (thanks) that this could have helped (if done before the reboot obviously):

https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings

megacli -GetPreservedCacheList -a0
megacli -DiscardPreservedCache -L'disk_number' -a0
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0

After the reboot, you could still have made the new virtual drive with the last of those lines:

megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0

...if it comes back up and is not stuck during initialization of the controller :)

JMeybohm claimed this task.

T358489 as follow-up for the strange RAID config, resolving this one.