Degraded RAID on elastic2052
Closed, ResolvedPublic2 Estimated Story Points

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2052. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdb3[1] sda3[0]
      999424 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda2[0](F) sdb2[1]
      78058496 blocks super 1.2 [2/1] [_U]
      
md2 : active raid0 sdb4[1] sda4[0]
      2966525952 blocks super 1.2 512k chunks
      
unused devices: <none>
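
For reference (not part of the auto-generated report): the (F) marker and the [_U] pattern on md0 above mean that sda2 has been kicked out of the mirror and only sdb2 is still active. A minimal sketch of how to inspect this by hand with standard md tooling:

$ cat /proc/mdstat                 # look for (F) markers and [_U]-style gaps
$ sudo mdadm --detail /dev/md0     # "State : clean, degraded" plus the faulty member listed at the bottom
$ sudo mdadm --examine /dev/sda2   # per-member superblock view of the failed partition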

Details

Other Assignee
RKemper

Event Timeline

bking updated Other Assignee, added: bking.
bking added a project: Discovery-Search.

Mentioned in SAL (#wikimedia-operations) [2022-10-11T16:54:38Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482

Mentioned in SAL (#wikimedia-operations) [2022-10-11T16:55:03Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482
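
For context: the two SAL entries above are the start/end of an sre.hosts.downtime cookbook run from a cumin host. A rough sketch of the invocation; the exact flag names are an assumption from memory and may not match the current cookbook:

$ sudo cookbook sre.hosts.downtime --days 14 --reason "T320482" 'elastic2052.codfw.wmnet'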

Looks like there is a documented procedure for DC Ops to follow.

@Papaul I've downtimed the host for the next 2 weeks; feel free to replace the disk at your convenience. If a reimage makes the procedure easier, go ahead and do that as well.

bking updated Other Assignee, removed: bking.
bking added a subscriber: RKemper.

@bking this host is out of warranty. If it is a critical host, you will have to let us know and request the purchase of a disk. Another option is to check whether we have a similar disk from the decommissioned nodes that we can use.
Thanks.

MPhamWMF updated Other Assignee, added: RKemper.
RKemper set the point value for this task to 2. (Oct 24 2022, 3:29 PM)

@Papaul We very recently decom'd these hosts: https://phabricator.wikimedia.org/T321243

Is one of their disks available to grab? If not, we could just decommission this host since I believe it only has about a year of lifespan left until refresh.

I will check when I am on site tomorrow.

@RKemper can you please double-check this alert? Looking at the disk here, it looks good to me: no sign of failure, the LED is green and showing activity.
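
Not part of the task itself, but a common way to cross-check a disk that looks physically healthy against what the kernel and md layer report is standard smartmontools plus the kernel log:

$ sudo smartctl -H /dev/sda        # overall SMART health verdict
$ sudo smartctl -A /dev/sda        # attributes: reallocated / pending sector counts etc.
$ sudo dmesg | grep -i 'sda'       # I/O errors that may have caused md to fail the member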

jbond added a subscriber: jbond.

FYI we are still receiving alerts for this disk, sent to root:

From: mdadm monitoring <root@elastic2052.codfw.wmnet> via wikimedia.org
To: root
Received: 7:25 AM (5 hours ago)
This is an automatically generated mail message from mdadm
running on elastic2052

A DegradedArray event had been detected on md device /dev/md/0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[1]
      78058496 blocks super 1.2 [2/1] [_U]

md1 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      999424 blocks super 1.2 [2/2] [UU]
        resync=PENDING

md2 : active raid0 sda4[0] sdb4[1]
      2966525952 blocks super 1.2 512k chunks

unused devices: <none>
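
Background on where this mail comes from (standard Debian mdadm behaviour, not specific to this host): mdadm's monitor mode watches the arrays and mails the address configured in /etc/mdadm/mdadm.conf when it sees events such as DegradedArray. A minimal sketch for checking and testing that setup:

$ grep MAILADDR /etc/mdadm/mdadm.conf            # typically "MAILADDR root"
$ sudo mdadm --monitor --scan --oneshot --test   # send a TestMessage event for each array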

@Papaul Yup per jbond's comment above we're still seeing the RAID issue. Could we try either rebuilding raid with the current disk, or swapping in a new one and rebuilding? (I suspect the latter is necessary but I'm not totally sure)
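
For reference, a rough sketch of what "rebuilding with the current disk" would look like with standard mdadm commands, assuming the device layout from the mdstat output above; a replacement disk would additionally need the partition table copied from the surviving disk first:

$ sudo mdadm --manage /dev/md0 --remove /dev/sda2   # drop the member already marked (F)
$ sudo mdadm --manage /dev/md0 --add /dev/sda2      # re-add it (or the new disk's partition) to trigger a resync
$ cat /proc/mdstat                                  # watch the recovery progress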

Note that re-imaging this server from scratch is trivial. It might be easier than rebuilding the RAID array.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2052.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2052.codfw.wmnet with OS bullseye completed:

  • elastic2052 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211171954_ryankemper_1739297_elastic2052.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
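
For reference, the log above is the output of the sre.hosts.reimage cookbook run from a cumin host; a rough sketch of the invocation, with flag names being an assumption that may not exactly match the current cookbook:

$ sudo cookbook sre.hosts.reimage --os bullseye elastic2052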