Degraded RAID on an-coord1002
Closed, Resolved, Public

Assigned To

Authored By

	ops-monitoring-bot
	Dec 23 2020, 2:29 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host an-coord1002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0](F) sdb2[1]
      194936832 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
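
For readers less familiar with the /proc/mdstat format, the snapshot above can be read mechanically. The following is a minimal Python sketch, not the actual get-raid-status-md plugin (whose implementation is not shown in this task); the helper name `parse_mdstat` and the dictionary keys it returns are hypothetical.

```python
import re

# Snapshot copied from the task description above.
MDSTAT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0](F) sdb2[1]
      194936832 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array_name: info} for each md array (hypothetical helper)."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"(md\d+) : (\S+) (\S+) (.*)", line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "state": state,
                "level": level,
                # Members flagged (F) have been marked faulty by the kernel.
                "failed": re.findall(r"(\w+)\[\d+\]\(F\)", members),
                "degraded": False,
            }
            arrays[name] = current
            continue
        # "[2/1] [_U]": 2 slots configured, 1 in sync; "_" marks the missing slot.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current is not None:
            configured, in_sync = int(status.group(1)), int(status.group(2))
            current["slots"] = status.group(3)
            current["degraded"] = in_sync < configured
    return arrays


for name, info in parse_mdstat(MDSTAT).items():
    print(name, info)
```

Here `sda2` carries the `(F)` flag and the first slot shows `_`, which matches the "Failed: 1" reported in the CRITICAL line.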

Details

	Subject	Repo	Branch	Lines +/-
	Failover analytics-hive.eqiad.wmnet to an-coord1001	operations/dns	master	+3 -4


Related Objects

Mentioned In: T271098: Degraded RAID on an-coord1002
T268028: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet
Mentioned Here: T215183: Redundant bootloaders for software RAID
T268028: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet

Event Timeline

ops-monitoring-bot created this task. Dec 23 2020, 2:29 PM

Volans triaged this task as Medium priority. Dec 23 2020, 2:36 PM

Volans added a project: Analytics.

Volans added subscribers: elukey, Ottomata.

Volans added a subscriber: razzi. Dec 23 2020, 3:31 PM

Change 651786 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Change 651786 merged by Ottomata:
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Mentioned in SAL (#wikimedia-analytics) [2020-12-23T15:53:00Z] <ottomata> point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768

Stashbot mentioned this in T268028: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet. Dec 23 2020, 3:53 PM

This node should now be in standby mode and should be safe to take offline at any time.

As it is in standby, I believe it should be fine to wait until after the holidays to proceed.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Dec 23 2020, 4:58 PM

Please ping Analytics before shutting down the host (if needed) since there is a database running on it, so I'd prefer to do things gracefully and stop replication from an-coord1001 before proceeding! :)

+1 to do it after holidays, not super urgent!

Puppet was stuck in D state, so I attempted a graceful reboot to see if the OS could boot from its remaining disk. During boot the disk/RAID controller apparently was not picked up, so the host fell through to PXE boot. I interrupted it manually and left the host in d-i for further investigation, and downtimed it.

My ignorance of software RAID setups made me step on a mine, namely T215183. The failed disk is the one containing the GRUB partition, which was not mirrored correctly during install, so the host now cannot boot. In theory a solution could be to boot Debian in rescue mode, but I have never done that on our hosts.

RhinosF1 subscribed. Jan 1 2021, 6:03 PM

The host boots, see T215183#6718961, but we still need to get the new disk :)

wiki_willy mentioned this in T271098: Degraded RAID on an-coord1002. Jan 4 2021, 9:40 PM

The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-image needs to happen. Let me know if you want me to replace it.

In T270768#6729455, @Cmjohnson wrote:

The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-image needs to happen. Let me know if you want me to replace it.

If you could replace it I'd be really grateful. If the RAID1 cannot be rebuilt, I'll reimage. Thanks!

This assumes hot swapping, of course. If you need to turn off the server, please let me know beforehand so I can gracefully stop things first.

razzi edited projects, added Analytics-Radar; removed Analytics. Jan 14 2021, 5:40 PM

@elukey I swapped the SSD. The only spare I had is 300GB. It's new. Feel free to do what you need. I am resolving this task since the on-site portion has been completed. If you have any issues please ping me and re-open the task.

@Cmjohnson sorry it took me so long to answer, I noticed this update only now. The two disks now on an-coord1002 may not be suitable for RAID 1:

elukey@an-coord1002:~$ sudo fdisk -l
Disk /dev/sdb: 186.3 GiB, 200049647616 bytes, 390721968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 023A5E8E-0DE7-4FDA-9294-6F9AF3FC3E19

[..]

Disk /dev/sda: 279.5 GiB, 300069052416 bytes, 586072368 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

I know that the host is OOW, but would it be possible to buy a disk like the one on /dev/sdb instead of the 300G spare? @wiki_willy adding you as well to see whether that is possible :)
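
On the sizing question: a RAID 1 member only needs to be at least as large as the space the array uses, so a larger replacement can in principle join the mirror with the extra capacity left unused. Below is a rough Python check using the figures from the fdisk and mdstat output above. It is only a sketch: the helper names are hypothetical, it assumes mdstat reports blocks in 1 KiB units, and it ignores partition-table and md metadata overhead as well as the difference in physical sector size between the two disks.

```python
# Figures taken from the fdisk and mdstat output in this task.
SDB_BYTES = 200049647616       # existing member disk (/dev/sdb)
SDA_BYTES = 300069052416       # replacement spare (/dev/sda)
MD0_BLOCKS_KIB = 194936832     # md0 size; assumed to be in 1 KiB blocks


def gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes), the unit fdisk prints."""
    return num_bytes / 2**30


def fits_in_mirror(new_disk_bytes, array_bytes):
    """Hypothetical check: the new disk must hold at least the array's data."""
    return new_disk_bytes >= array_bytes


md0_bytes = MD0_BLOCKS_KIB * 1024
unused_bytes = SDA_BYTES - SDB_BYTES

print(f"sdb: {gib(SDB_BYTES):.1f} GiB, sda: {gib(SDA_BYTES):.1f} GiB")
print(f"md0: {gib(md0_bytes):.1f} GiB")
print(f"replacement large enough: {fits_in_mirror(SDA_BYTES, md0_bytes)}")
print(f"capacity left outside the mirror: about {gib(unused_bytes):.1f} GiB")
```

A smaller replacement would fail this check, whereas a larger one such as the 300G spare passes it.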

I am going to attempt to add the new disk to the existing md array, let's see how it goes :)

It seems to have worked, thanks!
