Degraded RAID on ms-be1013
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be1013. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdm2[0] sdn2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sdm1[0](F) sdn1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
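
For readers unfamiliar with the snapshot above: `[2/1] [_U]` on md0 means two members are expected but only one is up, and the `(F)` suffix marks sdm1 as the failed member. The following is a minimal sketch, not the actual get-raid-status-md plugin, of how such output could be parsed to spot degraded arrays. The function names `parse_mdstat` and `degraded` are hypothetical, and the regular expressions assume the simple layout shown in this snapshot.

```python
import re

# Sample text copied from the snapshot in this task.
MDSTAT_SAMPLE = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sdm2[0] sdn2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdm1[0](F) sdn1[1]
      29279232 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

# Hypothetical patterns, assuming the layout seen above.
ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")           # "md0 : active raid1 ..."
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\([A-Z]\))?")  # "sdm1[0](F)" or "sdn1[1]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")  # "[2/1] [_U]"


def parse_mdstat(text):
    """Return {array: {"failed": [...], "expected": int, "up": int}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            members = MEMBER_RE.findall(header.group(2))
            arrays[current] = {
                # Members flagged (F) are marked faulty by md.
                "failed": [name for name, flag in members if flag == "(F)"],
                "expected": None,
                "up": None,
            }
            continue
        status = STATUS_RE.search(line)
        if status and current:
            arrays[current]["expected"] = int(status.group(1))
            arrays[current]["up"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Names of arrays with fewer members up than expected."""
    return sorted(
        name
        for name, info in arrays.items()
        if info["up"] is not None and info["up"] < info["expected"]
    )


if __name__ == "__main__":
    parsed = parse_mdstat(MDSTAT_SAMPLE)
    for name in degraded(parsed):
        print(name, "is degraded; failed members:", parsed[name]["failed"])
```

On a live host the same functions could be fed the contents of /proc/mdstat instead of the sample string.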

Event Timeline

I tried rebooting the host in the hope that the controller had freaked out and a reboot would "fix" it, or at least reset it. However, the host isn't coming back, and the console says "No more sessions are available for this type of connection!". An mc reset cold as per https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Troubleshooting also didn't seem to help.

At any rate, I've started the decom of this host as per T220590: Decom ms-be101[345] as we need to do that regardless.

@fgiunchedi do you want to power off, unplug, and power on? That will clear the issue.

Yes please drain the power! Thanks

Before draining the power, please unplug the production network or disable this host's interface on the switch.

This host has been offline for a while now and we are decommissioning it anyway. I suspect that if we bring it back on the network now, it will start pushing/replicating its outdated versions of objects. Best to keep it disconnected from the network for now.

Cmjohnson changed the task status from Open to Stalled. May 7 2019, 2:47 PM
Cmjohnson moved this task from Hardware Failure / Troubleshoot to Stalled on the ops-eqiad board.

I drained the flea power, so you should not have any issues with the iDRAC. The server is out of warranty, so there is not much I can do about the RAID at this point in time. I disabled the network port on the switch. If you need access again, please let me know and I will enable the port.

Stalling on my end for now.

fgiunchedi claimed this task.

I'm resolving this since we're going to decom this host in T220907: Degraded RAID on ms-be1013. Thanks, @Cmjohnson!