Degraded RAID on an-coord1002
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host an-coord1002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0](F) sdb2[1]
      194936832 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
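
For readers unfamiliar with the `/proc/mdstat` format, the snapshot above can be read mechanically. The following is only a minimal sketch, not the logic of the actual `get-raid-status-md` plugin: the function name `parse_mdstat` and the returned dictionary layout are assumptions made for illustration. It treats a `(F)` flag after a member as "failed" and the `[2/1]` pair as "expected/active members".

```python
import re

# Snapshot copied verbatim from the task description above.
SNAPSHOT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0](F) sdb2[1]
      194936832 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
"""


def parse_mdstat(text):
    """Hypothetical helper: summarise each md array found in mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md0 : active raid1 sda2[0](F) sdb2[1]"
        header = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if header:
            name, state, level, devs = header.groups()
            members = {}
            for dev in devs.split():
                m = re.match(r"^(\w+)\[\d+\]((?:\(\w\))*)$", dev)
                if m:
                    # True means the member carries the (F) "failed" flag.
                    members[m.group(1)] = "(F)" in m.group(2)
            current = {"state": state, "level": level, "members": members}
            arrays[name] = current
            continue
        # Status line, e.g. "... [2/1] [_U]": expected vs. active members.
        counts = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if counts and current is not None:
            current["expected"] = int(counts.group(1))
            current["active"] = int(counts.group(2))
            current["degraded"] = current["active"] < current["expected"]
    return arrays


arrays = parse_mdstat(SNAPSHOT)
failed = [dev for dev, bad in arrays["md0"]["members"].items() if bad]
```

With the snapshot from this task, `failed` names `sda2` as the failed member, which matches the `[_U]` marker: the first slot is down and the second is up.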

Event Timeline

Volans triaged this task as Medium priority. Dec 23 2020, 2:36 PM
Volans added a project: Analytics.
Volans added subscribers: elukey, Ottomata.

Change 651786 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Change 651786 merged by Ottomata:
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Mentioned in SAL (#wikimedia-analytics) [2020-12-23T15:53:00Z] <ottomata> point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768

This node should now be in standby mode and should be safe to take offline at any time.

As it is in standby, I believe it should be fine to wait until after the holidays to proceed.

Please ping Analytics before shutting down the host (if needed): there is a database running on it, and I'd prefer to do things gracefully and stop replication from an-coord1001 before proceeding! :)

+1 to do it after holidays, not super urgent!

Puppet was stuck in D state, so I attempted a graceful reboot to see if the OS could boot from its remaining disk. During boot the disk/RAID controller apparently was not picked up, so PXE boot started. I left the host in d-i (interrupted manually) for further investigation, and downtimed it.

My great ignorance of sw-RAID setups made me step on a mine, namely T215183. The failed disk is the one containing the grub partition table, which was not mirrored correctly during install, so the host now cannot boot. In theory a solution could be to boot Debian into rescue mode, but I have never done that on our hosts.

The host boots, see T215183#6718961, but we still need to get the new disk :)

The server is out of warranty but there should be some disks on-site I can use. In the past, any time /dev/sda went bad a re-image was needed. Let me know if you want me to replace it.

If you could replace it I'd be really grateful; if the RAID1 cannot be rebuilt I'll reimage. Thanks!

I am assuming hot swapping, of course. If you need to turn off the server, please let me know beforehand so I can gracefully stop things.

Cmjohnson claimed this task.

@elukey I swapped the SSD. The only spare I had is 300GB. It's new. Feel free to do what you need. I am resolving this task since the on-site portion has been completed. If you have any issues please ping me and re-open the task.

elukey added a subscriber: wiki_willy.

@Cmjohnson sorry it took me so long to answer, I only noticed this update now. The two disks that I now have on an-coord1002 may not be suitable for RAID 1:

elukey@an-coord1002:~$ sudo fdisk -l
Disk /dev/sdb: 186.3 GiB, 200049647616 bytes, 390721968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 023A5E8E-0DE7-4FDA-9294-6F9AF3FC3E19

[..]

Disk /dev/sda: 279.5 GiB, 300069052416 bytes, 586072368 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

I know that the host is OOW, but would it be possible to buy a disk like the one on /dev/sdb instead of the 300G spare? @wiki_willy adding you as well to see whether it is possible :)
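
As a rough sanity check on the size mismatch, the figures quoted in this task can be compared directly. This is only a back-of-the-envelope sketch under stated assumptions: the helper names `can_host_member` and `unused_bytes` are made up for illustration, the `blocks` value from `/proc/mdstat` is taken to be in 1 KiB units, and md metadata plus any other partitions on the disk are ignored. It does not address the differing physical sector sizes (4096 vs 512 bytes) shown by `fdisk`.

```python
KIB = 1024
GIB = 1024 ** 3

# Figures copied from the outputs in this task.
MD0_BLOCKS_KIB = 194936832   # md0 size from the mdstat snapshot (assumed 1 KiB blocks)
SDB_BYTES = 200049647616     # existing disk, /dev/sdb
SDA_BYTES = 300069052416     # replacement spare, /dev/sda


def can_host_member(disk_bytes, array_blocks_kib):
    """Rough check: the disk must be at least as large as the array's data area."""
    return disk_bytes >= array_blocks_kib * KIB


def unused_bytes(disk_bytes, array_blocks_kib):
    """Space left on the disk once a member the size of the array is carved out."""
    return disk_bytes - array_blocks_kib * KIB


fits = can_host_member(SDA_BYTES, MD0_BLOCKS_KIB)
leftover_gib = unused_bytes(SDA_BYTES, MD0_BLOCKS_KIB) / GIB
print(f"new disk fits md0 member: {fits}, unused space: {leftover_gib:.1f} GiB")
```

Under these assumptions the larger spare has room for an md0 member, and the extra capacity simply stays unused by the mirror.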

I am going to attempt to add the new disk to the existing md array, let's see how it goes :)

It seems to have worked, thanks!