
Degraded RAID on mw2382
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2382. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0] sdb2[1](F)
      937267200 blocks super 1.2 [2/1] [U_]
      bitmap: 4/7 pages [16KB], 65536KB chunk

unused devices: <none>
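
For reference, a quick way to double-check which array member mdadm considers failed before any hardware is touched (a sketch; the device names follow the mdstat snapshot above):

# per-member state of the array; the failed device (sdb2 here) is flagged as "faulty"
sudo mdadm --detail /dev/md0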

Event Timeline

Dzahn subscribed.
mw2382 is a Kubernetes worker node (kubernetes::worker).
Bare-metal host in codfw, rack A3.

I suppose that can be hot-swapped? If it can't, let us know and we'll drain and cordon the host for the disk swap.

Apologies for the wait on this one. I checked out the server and the drives look to be working physically, but when I logged into the iDRAC it sees zero disks. I checked the warranty and it expired in February. I do have a pair of decommissioned 960GB drives that could replace it. However, I cannot tell which drive needs to be replaced. Please let me know if this still needs attention and how I can help.
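
Since the iDRAC no longer reports any disks, one way to tell the physical drives apart from the OS side is by serial number (a sketch, assuming smartmontools is installed; the drive whose serial does not match the healthy /dev/sda is the one to pull):

# serial of the healthy, still-active member (the drive to keep)
sudo smartctl -i /dev/sda | grep -i serial
# all detected disks, with model and serial encoded in the symlink names
ls -l /dev/disk/by-id/ | grep -v -- '-part'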

Mentioned in SAL (#wikimedia-operations) [2024-04-30T09:26:39Z] <jayme> draining mw2382.codfw.wmnet - T362938

Icinga downtime and Alertmanager silence (ID=b2b315a7-d925-49a5-80d5-19849b998b72) set by jayme@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Degraded RAID/storage controller issues

mw2382.codfw.wmnet

@Jhancock.wm I've tried power-cycling the system and restarting the iDRAC to see if the storage controller "comes back", but no luck. During boot I did see 2 SATA drives listed, though.
Of course, /dev/sdb is no longer seen by mdadm, so it should be without I/O (if that helps with identification). Not really sure how to proceed here, as it seems odd that the storage controller fully disappeared from the iDRAC.

@Jhancock.wm I did shut down the server for now. Could you please try to drain flea power and see if the controller comes back afterwards? If not, please open a case with Dell.

Host is set to pooled=inactive, cordoned in k8s, removed from BGP and shut down, so it's all yours.
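
For reference, the k8s side of that typically looks something like the following (a sketch; the pooled=inactive and BGP steps go through conftool/calico and are not shown here):

# mark the node unschedulable and evict its pods onto other workers
kubectl cordon mw2382.codfw.wmnet
kubectl drain mw2382.codfw.wmnet --ignore-daemonsets --delete-emptydir-data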

Draining flea power didn't fix it. I'm going to update the firmware and BIOS and then see where things stand.

iDRAC upgraded to 7.0.0; it won't go any higher. BIOS is already at 2.9.3. Reset to factory defaults, tried rebooting the iDRAC and reseated the backplane. None of these have fixed the issue. Going to look into getting a replacement part; it might need to be salvaged from decommissioned servers. Will update when we have a solution.

Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image:
15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

Would it be possible to remove it temporarily from the list of K8s workers while work is done on it?

Change #1026446 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Will do...but I think the right thing to do here is to fix scap (T363971: scap should not run mediawiki-image-download on pooled=inactive servers).

Change #1026446 merged by JMeybohm:

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Thanks for removing the host @JMeybohm. If there's an easy way for scap to check the pooled state of the hosts, that would definitely be a good improvement to add.
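
For what it's worth, a minimal sketch of such a check with conftool (assuming scap can shell out to confctl; the exact selector syntax may differ):

# query the pooled state for the host; the preload could be skipped unless it reports pooled=yes
confctl select 'name=mw2382.codfw.wmnet' get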

Icinga downtime and Alertmanager silence (ID=e3dd1140-411c-45b4-a1c6-3961f47c4f12) set by jayme@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Degraded RAID/storage controller issues

mw2382.codfw.wmnet

@JMeybohm Papaul helped me identify the missing disk. I replaced it with a compatible drive. Please let me know if that fixed the issue. Thanks.

Thanks! I see the server is still sitting in some BIOS screen that only shows one connected disk. I did not touch anything for now.
Please boot the server normally when things are done on your end and I'll happily check what the OS thinks about the new disk.

Forgot I left it there. All yours now!

JMeybohm claimed this task.

Thanks. There was mdadm metadata still on the "new" disk, so I had to:

mdadm --stop /dev/md127                  # stop the stale array auto-assembled from the old disk's metadata
mdadm --zero-superblock /dev/sdb2        # wipe the leftover md superblock on the replacement partition
mdadm --manage /dev/md0 --add /dev/sdb2  # add it back into md0, which kicks off the resync

resync is running now.
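
Rebuild progress can be followed from the host, e.g.:

# live progress bar and ETA while the rebuild runs
watch -n 60 cat /proc/mdstat
# or, one-shot: array state and rebuild percentage
sudo mdadm --detail /dev/md0 | grep -iE 'state|rebuild'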

Server has been added back to the list of k8s nodes, BGP re-enabled, set to pooled=yes and uncordoned.
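
For completeness, the repool side roughly mirrors the earlier drain (a sketch; BGP re-enablement is not shown, and the confctl selector syntax may differ):

# allow pods to be scheduled on the node again
kubectl uncordon mw2382.codfw.wmnet
# put the host back in service
confctl select 'name=mw2382.codfw.wmnet' set/pooled=yes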