Degraded RAID on cloudcephosd1008
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host cloudcephosd1008. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0] sdb2[1](F)
      234005504 blocks super 1.2 [2/1] [U_]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
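
For reference, a minimal sketch of inspecting the degraded array in more detail, assuming mdadm and smartmontools are installed on the host (device names taken from the output above):

$ sudo mdadm --detail /dev/md0   # array state and which member has failed
$ sudo smartctl -H /dev/sdb      # SMART health summary for the suspect disk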

Event Timeline

Andrew added subscribers: Jclark-ctr, Andrew.

The failed drive is an OS drive, not one containing ceph storage. So neither this failure nor a replacement should cause Ceph thrashing.

@Jclark-ctr, if that drive can be hot-swapped then let's just do it. Please let me know beforehand so I can keep an eye out. Best if we can wait until later in the week, when I won't be traveling.

Thanks!

Cmjohnson added a subscriber: Cmjohnson.

A disk has been ordered through Dell; hopefully they don't push back, since the disk does not show as failed in the h/w log I sent them.

You have successfully submitted request SR1066679639.

These disks are not hot swappable. It appears they're in software RAID 1; the disk was swapped but will need to be manually added back to the RAID configuration.

Let me know when the drive shows up and I'll take that host out of service so you can power it down.

Chris replaced this drive (apparently possible without power-down) but now we need to rebuild the RAID. DC-ops will probably do this, but for reference the instructions are here:

https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions

Please note that 'hot swappable' just means 'the disk can be swapped without powering down the host'; it doesn't have anything to do with automatic failing or rebuilding of RAID arrays. This is a pretty common misconception, so I just wanted to call it out here!

All our hosts with externally accessible disk bays support hot swapping. You may need to fail a disk out of an array manually before you remove it, but the chassis and host OS do not require a power-off to accomplish a disk swap. We may still have some hosts floating around in service with internal-only disk bays, and those do require powering off to swap disks.
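
As a minimal sketch of that manual fail/remove step, using the device names from this task (here md has already flagged sdb2 as failed, so the --fail call may be a no-op):

# Mark the member as faulty (if md hasn't already), then remove it so the disk can be pulled
$ sudo mdadm --manage /dev/md0 --fail /dev/sdb2
$ sudo mdadm --manage /dev/md0 --remove /dev/sdb2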

Please note this appears ready for DC ops to rebuild the RAID array, but since I wrote the directions on how to do this, someone else in DC ops should proof them by following them for this repair: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions

Be careful when you type the drive letters: you want to copy from sda to sdb, not overwrite sda.
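
Purely as an illustrative sketch (the wikitech page above remains the canonical procedure), a typical md RAID1 rebuild with sda as the surviving disk and sdb as the replacement looks roughly like this; note the direction of the partition-table copy, which is exactly the caution above:

# Dump the partition table FROM the good disk (sda) and apply it TO the new disk (sdb)
$ sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
# Add the new partition back into the array; md then resyncs automatically
$ sudo mdadm --manage /dev/md0 --add /dev/sdb2
# Check resync progress
$ cat /proc/mdstat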

just tried these steps and the disk is not being seen. I need to reseat the disk and try again

Updated the directions to include checking for the new disk; I now realize the directions made assumptions they shouldn't have made!

IRC Update:

Reseating the disk (Chris did so) did not fix this, as reseating doesn't automatically trigger redetection of the disk.

I think if we follow the directions listed here, we'll be able to detect the disk without reboot: https://www.golinuxhub.com/2014/07/how-to-detect-new-hard-disk-attached/
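
For context, the technique that page describes boils down to asking each SCSI host adapter to rescan its bus; a rough sketch:

# Trigger a bus rescan on every SCSI host adapter, then check whether the disk appeared
$ for h in /sys/class/scsi_host/host*; do echo '- - -' | sudo tee "$h/scan" > /dev/null; done
$ lsblk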

I don't think we should do this without also having the ability to reboot the machine if it doesn't immediately work, so we need to schedule a maint window.

Next steps:

@andrewbogott or @dcaro: the disk did not show up as available even after attempting @RobH's update. We will need to schedule downtime to reboot the server.

@Cmjohnson, we don't need to do a lot regarding downtime here, but I would like to be present along with @dcaro when we shut this down. Is it possible to schedule this for some morning this week? The best possible timing would be between 9 and 10 AM in your timezone (13:00-14:00 UTC).

This is planned for tomorrow, 17 August.

Mentioned in SAL (#wikimedia-cloud) [2021-08-17T15:11:17Z] <andrewbogott> rebooting cloudcephosd1008 to force raid rebuild -- T287838

The issue was that the RAID controller had claimed the disk as RAID-capable. I booted into the RAID BIOS, changed the disk to non-RAID, and power cycled. Back at the OS, I was able to rebuild the disk using the steps @RobH posted previously.

cmjohnson@cloudcephosd1008:~$ sudo cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[2] sda2[0]
      234005504 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  2.1% (5034624/234005504) finish=190.2min speed=20055K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

cmjohnson@cloudcephosd1008:~$ sudo cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[2] sda2[0]
      234005504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk
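
For completeness, the resync can be followed and the final state confirmed with something like:

# Watch the resync until the array shows [2/2] [UU]
$ watch -n 60 cat /proc/mdstat
# Confirm both members are active and the array state is clean
$ sudo mdadm --detail /dev/md0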