Degraded RAID on es1022
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host es1022. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 12

			PD: 6 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed
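For anyone triaging a snapshot like the one above, a minimal sketch of pulling out the degraded virtual drive and the affected physical drive slot might look like this. It assumes the plugin output keeps the line shapes shown in this task (`Virtual Drive: N`, `State: ... Degraded`, `PD: N Information`, `MISSING DRIVE INFO`). The function name `summarize_raid_status` is hypothetical and not part of the Nagios plugin.

```python
import re


def summarize_raid_status(text):
    """Return (degraded virtual drive ids, PD ids with missing info).

    Hypothetical helper: it only recognizes the line shapes seen in the
    snapshot attached to this task, not every form of megacli output.
    """
    degraded = []
    missing_pds = []
    current_vd = None
    current_pd = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Virtual Drive:\s*(\d+)", line)
        if m:
            current_vd = int(m.group(1))
            continue
        m = re.match(r"PD:\s*(\d+)\s+Information", line)
        if m:
            current_pd = int(m.group(1))
            continue
        if line.startswith("State:") and "Degraded" in line and current_vd is not None:
            degraded.append(current_vd)
        elif "MISSING DRIVE INFO" in line and current_pd is not None:
            missing_pds.append(current_pd)
    return degraded, missing_pds


# Trimmed copy of the snapshot from this task, used as sample input.
SNAPSHOT = """\
name: Adapter #0

    Virtual Drive: 0 (Target Id: 0)
    State: =====> Degraded <=====
    Number Of Drives: 12

        PD: 6 Information
        ERROR: =====> MISSING DRIVE INFO <=====
"""

print(summarize_raid_status(SNAPSHOT))  # ([0], [6])
```

For this task the result points at virtual drive 0 and physical drive slot 6, which matches the drive that was later replaced.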

Event Timeline

Restricted Application added a subscriber: ABran-WMF.

Mentioned in SAL (#wikimedia-operations) [2024-09-20T08:37:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'depool es1022 - T375257', diff saved to https://phabricator.wikimedia.org/P69377 and previous config saved to /var/cache/conftool/dbconfig/20240920-083722-arnaudb.json

ABran-WMF changed the task status from Open to In Progress (Sep 20 2024, 8:38 AM).
ABran-WMF triaged this task as High priority.

This instance has been depooled; it is ready to be handled.

Those servers are a bit sensitive. @wiki_willy, would it be possible to check this week whether we have a spare disk?

Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.

Hey @ABran-WMF, as it turns out, we don't have any 2TB drives to use as a replacement. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with swapping it out for a 4TB drive?

Thanks for the clarifications!

@VRiley-WMF Please proceed and replace the drive with one of the 4TB drives you have in abundance. You can go for it anytime, as the server has been depooled.

This drive has been replaced. Please let us know if there are any further issues.

I double-checked the RAID status and everything looks good (overall status and all disk statuses). I repooled the server at 100% weight. Thanks to @VRiley-WMF for the quick response. I will resolve the ticket.