Degraded RAID on mw2380
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2380. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0](F) sdb1[1]
      937559040 blocks super 1.2 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk

unused devices: <none>
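
For readers unfamiliar with the snapshot above: `[2/1]` means the array expects 2 members but only 1 is active, `[_U]` shows slot 0 down and slot 1 up, and the `(F)` after `sda1[0]` marks that member as failed. The following is a minimal sketch of how such output could be parsed; it is not the code of `get-raid-status-md`, the function name `parse_mdstat` and the result layout are hypothetical, and it only handles plain `active` arrays formatted like the one in this task.

```python
import re

# Header line, e.g. "md0 : active raid1 sda1[0](F) sdb1[1]".
# Assumption: only the simple "<name> : <state> <level> <members>" form is handled.
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
# One member, e.g. "sda1[0](F)": device name, slot number, optional flags like (F).
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")
# Status on the following line, e.g. "[2/1] [_U]": expected/active counts and slot map.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return a dict of array name -> summary parsed from mdstat-style text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            failed = [m.group(1) for m in MEMBER_RE.finditer(members)
                      if "(F)" in m.group(3)]
            current = {"state": state, "level": level, "failed": failed}
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["slots"] = status.group(3)
            # Fewer active members than expected means the array is degraded.
            current["degraded"] = current["active"] < current["expected"]
    return arrays


# Sample input copied from the snapshot in this task.
SAMPLE = """Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0](F) sdb1[1]
      937559040 blocks super 1.2 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk

unused devices: <none>
"""

arrays = parse_mdstat(SAMPLE)
```

With the sample from this task, `arrays["md0"]` reports `sda1` as the failed member and `degraded` as true, which matches the disk that was later replaced.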

Event Timeline

Papaul triaged this task as Medium priority.

Create Dispatch: Success
You have successfully submitted request SR1063712714.

@Dzahn @jijiki @Joe I received the disk today and will replace it tomorrow, Thursday, at 10:00 am CT. If you need to do anything on this server before I replace the disk, please let me know; otherwise you can just depool it and shut it down for me.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2021-07-01T14:53:06Z] <effie> depool mw2380 for disk repair - T285603

@Papaul sorry for the delay, the server can be turned off any time

Disk replaced. Please go ahead and re-image the server.

thanks

It appears that the host gets stuck at the screen shown in the attached screenshot:

image.png (120×560 px, 18 KB)

Probably something got messed up with the boot order; we will take a closer look later.

Mentioned in SAL (#wikimedia-operations) [2021-07-02T12:06:24Z] <mutante> mw2380 /puppetmaster: reimaged, revoking old cert, signing new cert, initial puppet run T285603

I reimaged mw2380 and it is booting again now.