
Degraded RAID on an-worker1128
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1128. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 2
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 12
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 14
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 447.130 GB [0x37e436b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Solid State Device
				Drive Temperature: 34C (93.20 F)

=== RaidStatus completed

Details

Other Assignee
BTullis

Event Timeline

Since this unit is out of warranty, I will locate another disk to use as a replacement.

@elukey I was able to locate a spare 480 GB SSD for this unit. Would you be able to let me know a good time to replace this?

@LSobanski it seems that Luca is out, but you were listed on the original install ticket for this unit. Would you be able to assist us in scheduling time for this maintenance?

Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128

Jclark-ctr subscribed.

@VRiley-WMF @wiki_willy Ben mentioned in previous tickets from both of us that we should check that Data-Platform-SRE is added as a subtask. This might help get some traction.

Also, I am quoting here in this ticket what he mentioned: “Had the failed drive been one of the O/S drives, these use a RAID1 configuration, so the hot-swap would have been handled completely at the RAID controller level, and the rebuild would have been automatic.”

Hi @VRiley-WMF - Many thanks for sourcing and replacing this drive. I hope you don't mind if I make a request, though.

In future, we would be really grateful if you could let us know in advance of performing a drive swap (or similar), so that we can prepare the system and correlate any errors that we see.
For reference: I created T358691: Hadoop datanode on an-worker1173 is showing errors to investigate the hardware/HDFS errors relating to this host, then a Phabricator search led me to this ticket.


In this instance (in fact, on every an-worker* host), these 12 data drives are all configured as individual RAID 0 volumes, so there is no redundancy at the RAID controller level.
The redundancy for this data is performed in software by the Hadoop HDFS file system, which creates three copies of all data, spread across all of the (~90) hosts in the cluster.
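As a rough illustration, this HDFS-level redundancy can be checked from the command line. This is a sketch only; whether these need to run as the hdfs superuser (shown here via sudo -u hdfs) depends on the cluster configuration:

$ # Cluster-wide capacity and datanode health
$ sudo -u hdfs hdfs dfsadmin -report | head -n 15
$ # Replication health: fsck summarises under-replicated and missing blocks
$ sudo -u hdfs hdfs fsck / | tail -n 20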

So when you removed the failed drive, what the operating system observed was the hardware path to /dev/sdi locking up. An HDFS process was still trying to access that device, irrespective of whether or not it was actually working at the time.
When you inserted the new drive it registered itself with the RAID controller correctly, but with a status of Unconfigured Good, so it didn't actually get passed through to the operating system for use.
Now that I've found out what the issue was, I can configure this RAID 0 volume myself and put the new drive into service. All good :-)
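For reference, a sketch of the megacli steps involved in bringing such a drive into service. This assumes the Debian megacli binary, and [E:S] is a placeholder for the actual Enclosure:Slot pair reported by -PDList, not a value taken from this host:

$ # Find the new drive; it should show "Firmware state: Unconfigured(good)"
$ sudo megacli -PDList -a0 | grep -E 'Enclosure Device ID|Slot Number|Firmware state'
$ # If the drive carries a foreign configuration from a previous array, clear it
$ sudo megacli -CfgForeign -Scan -a0
$ sudo megacli -CfgForeign -Clear -a0
$ # Create a single-drive RAID 0 virtual drive so the O/S sees a new block device
$ sudo megacli -CfgLdAdd -r0 '[E:S]' -a0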

Thankfully, this isn't a serious issue, but I worry a little that in another case it could have been a bit worse.
There are several other systems and teams that use this distributed storage model with RAID 0 drives (or controllers in JBOD mode), so I'd recommend erring on the side of caution and contacting the service owners every time.

Had the failed drive been one of the O/S drives, which use a RAID1 configuration, the hot-swap would have been handled completely at the RAID controller level and the rebuild would have been automatic.
Similarly, if it had been a hardware RAID10 array, such as one of the dbstore* servers, then the operating system wouldn't have noticed anything apart from, perhaps, a reduction in performance while the rebuild takes place.
However, even in these cases where the O/S should be oblivious to the hardware replacement, we would still prefer to know about it, if that's OK with you.
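(For what it's worth, a rebuild on one of those hardware arrays can be watched from the O/S with something like the following; again a sketch, with [E:S] standing in for the replaced drive's Enclosure:Slot pair:)

$ # Show rebuild progress for the newly inserted drive
$ sudo megacli -PDRbld -ShowProg -PhysDrv '[E:S]' -a0
$ # Confirm the virtual drive returns to Optimal once the rebuild finishes
$ sudo megacli -LDInfo -Lall -a0 | grep -E 'Virtual Drive|State'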

Feel free to tag us with Data-Platform-SRE or ping any of us in the team if you have queries or concerns.

Gehel triaged this task as High priority. Sep 2 2025, 1:42 PM

Thanks for checking with us @VRiley-WMF, and apologies for the delay in getting back to you. You can replace this disk at any time convenient to you.
This failure is one of the O/S disks in this host, so it is part of a hardware RAID 1 mirror. This means that we don't need to take any special action before you replace the disk.

This disk has been replaced.