
Degraded RAID on mw2382
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2382. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0] sdb2[1](F)
      937267200 blocks super 1.2 [2/1] [U_]
      bitmap: 4/7 pages [16KB], 65536KB chunk

unused devices: <none>
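
For reference, a quick way to double-check which array member mdadm considers failed before any hardware is touched (a sketch; the device names follow the mdstat snapshot above):

# per-member state of the array; the failed device (sdb2 here) is flagged as "faulty"
sudo mdadm --detail /dev/md0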

Event Timeline

Dzahn subscribed.
mw2382 is a Kubernetes worker node (kubernetes::worker).
Bare-metal host in codfw, rack A3.

I suppose that can be hot-swapped? If it can't, let us know and we'll drain and cordon the host for the disk swap.

Apologies for the wait on this one. I checked out the server and the drives look to be working physically, but when I logged into the iDRAC it sees zero disks. I checked the warranty and it expired in February. I do have a pair of decommissioned 960GB drives that could replace it. However, I cannot tell which drive needs to be replaced. Please let me know if this still needs attention and how I can help.
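
Since the iDRAC no longer reports any disks, one way to tell the physical drives apart from the OS side is by serial number (a sketch, assuming smartmontools is installed; the drive whose serial does not match the healthy /dev/sda is the one to pull):

# serial of the healthy, still-active member (the drive to keep)
sudo smartctl -i /dev/sda | grep -i serial
# all detected disks, with model and serial encoded in the symlink names
ls -l /dev/disk/by-id/ | grep -v -- '-part'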

Mentioned in SAL (#wikimedia-operations) [2024-04-30T09:26:39Z] <jayme> draining mw2382.codfw.wmnet - T362938

Icinga downtime and Alertmanager silence (ID=b2b315a7-d925-49a5-80d5-19849b998b72) set by jayme@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Degraded RAID/storage controller issues

mw2382.codfw.wmnet

@Jhancock.wm I've tried power-cycling the system and restarting the iDRAC to see if the storage controller "comes back", but no luck. During boot I did see 2 SATA drives listed, though.
Of course, /dev/sdb is no longer seen by mdadm, so it should be without I/O (if that helps with identification). Not really sure how to proceed here, as it seems odd that the storage controller fully disappeared from the iDRAC.

@Jhancock.wm I did shut down the server for now. Could you please try to drain flea power and see if the controller comes back afterwards? If not, please open a case with Dell.

Host is set to pooled=inactive, cordoned in k8s, removed from BGP and shut down, so it's all yours.
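
For reference, the k8s side of that typically looks something like the following (a sketch; the pooled=inactive and BGP steps go through conftool/calico and are not shown here):

# mark the node unschedulable and evict its pods onto other workers
kubectl cordon mw2382.codfw.wmnet
kubectl drain mw2382.codfw.wmnet --ignore-daemonsets --delete-emptydir-data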

Draining flea power didn't fix it. I'm going to update the firmware and BIOS and then see where things stand.

iDRAC upgraded to 7.0.0; it won't go any higher. BIOS is already at 2.9.3. Reset to factory defaults, tried rebooting the iDRAC and reseated the backplane. None of these have fixed the issue. Going to look into getting a replacement part; it might need to be salvaged from decommissioned servers. Will update when we have a solution.

Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image:
15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

Would it be possible to remove it temporarily from the list of K8s workers while work is done on it?

Change #1026446 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Will do...but I think the right thing to do here is to fix scap (T363971: scap should not run mediawiki-image-download on pooled=inactive servers).

Change #1026446 merged by JMeybohm:

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Thanks for removing the host @JMeybohm. If there's an easy way for scap to check the pooled state of the hosts, that would definitely be a good improvement to add.
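
For what it's worth, a minimal sketch of such a check with conftool (assuming scap can shell out to confctl; the exact selector syntax may differ):

# query the pooled state for the host; the preload could be skipped unless it reports pooled=yes
confctl select 'name=mw2382.codfw.wmnet' get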

Icinga downtime and Alertmanager silence (ID=e3dd1140-411c-45b4-a1c6-3961f47c4f12) set by jayme@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Degraded RAID/storage controller issues

mw2382.codfw.wmnet

@JMeybohm Papaul helped me identify the missing disk. I replaced it with a compatible drive. Please let me know if that fixed the issue. Thanks.

Thanks! I see the server is still sitting in some BIOS screen that only shows one connected disk. I did not touch anything for now.
Please boot the server normally when things are done on your end and I'll happily check what the OS thinks about the new disk.

Forgot I left it there. All yours now!

JMeybohm claimed this task.

Thanks. There was mdadm metadata still on the "new" disk, so I had to:

mdadm --stop /dev/md127                  # stop the stale array auto-assembled from the old disk's metadata
mdadm --zero-superblock /dev/sdb2        # wipe the leftover md superblock on the replacement partition
mdadm --manage /dev/md0 --add /dev/sdb2  # add it back into md0, which kicks off the resync

resync is running now.
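
Rebuild progress can be followed from the host, e.g.:

# live progress bar and ETA while the rebuild runs
watch -n 60 cat /proc/mdstat
# or, one-shot: array state and rebuild percentage
sudo mdadm --detail /dev/md0 | grep -iE 'state|rebuild'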

Server has been added back to the list of k8s nodes, BGP re-enabled, set to pooled=yes and uncordoned.
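
For completeness, the repool side roughly mirrors the earlier drain (a sketch; BGP re-enablement is not shown, and the confctl selector syntax may differ):

# allow pods to be scheduled on the node again
kubectl uncordon mw2382.codfw.wmnet
# put the host back in service
confctl select 'name=mw2382.codfw.wmnet' set/pooled=yes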