Degraded RAID on es1022
Closed, ResolvedPublic

Assigned To

Authored By

	ops-monitoring-bot
	Sep 20 2024, 8:31 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host es1022. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 12

			PD: 6 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed
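For readers who want to triage a snapshot like the one above programmatically, here is a minimal sketch that extracts the flagged virtual drives and the physical-drive slots with missing information. It assumes the `=====> ... <=====` marker format seen in this snapshot; the function name `summarize_raid_status` and the `SAMPLE` constant are hypothetical and not part of the Nagios plugin.

```python
import re

# Abridged copy of the snapshot attached to this task.
SAMPLE = """\
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1

		Span: 0 - Number of PDs: 12

			PD: 6 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed
"""


def summarize_raid_status(text):
    """Return (flagged virtual drives, PD slots with missing drive info)."""
    flagged_vds = []   # list of (virtual drive id, state string)
    missing_pds = []   # list of PD slot numbers
    current_vd = None
    current_pd = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Virtual Drive: (\d+)", line)
        if m:
            current_vd = int(m.group(1))
            continue
        m = re.match(r"PD: (\d+) Information", line)
        if m:
            current_pd = int(m.group(1))
            continue
        # Assumption: non-optimal states are wrapped in =====> ... <===== markers.
        m = re.match(r"State: =====> (.+?) <=====", line)
        if m and current_vd is not None:
            flagged_vds.append((current_vd, m.group(1)))
            continue
        if "MISSING DRIVE INFO" in line and current_pd is not None:
            missing_pds.append(current_pd)
    return flagged_vds, missing_pds


vds, pds = summarize_raid_status(SAMPLE)
print(vds)  # [(0, 'Degraded')]
print(pds)  # [6]
```

For this snapshot the sketch reports virtual drive 0 as Degraded and PD 6 as the slot with missing drive info, which matches the disk that was later replaced.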

Related Objects

Mentioned Here: P69377 dbctl commit (dc=all): 'depool es1022 - T375257'

Event Timeline

ops-monitoring-bot created this task. Sep 20 2024, 8:31 AM

Restricted Application added projects: DC-Ops, DBA. Sep 20 2024, 8:31 AM

Restricted Application added a subscriber: ABran-WMF.

ABran-WMF claimed this task. Sep 20 2024, 8:36 AM

Mentioned in SAL (#wikimedia-operations) [2024-09-20T08:37:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'depool es1022 - T375257', diff saved to https://phabricator.wikimedia.org/P69377 and previous config saved to /var/cache/conftool/dbconfig/20240920-083722-arnaudb.json

This instance has been depooled and is ready to be handled.

ABran-WMF moved this task from Triage to In progress on the DBA board. Sep 20 2024, 8:38 AM

Those servers are a bit sensitive. @wiki_willy, do you think it would be possible to check this week whether we have a spare disk?

Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.

In T375257#10166448, @ABran-WMF wrote:

Those servers are a bit sensitive. @wiki_willy, do you think it would be possible to check this week whether we have a spare disk?

Hey @ABran-WMF, as it turns out, we don't have any 2TB drives to use as a replacement. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with swapping it out with a 4TB drive?

In T375257#10168468, @wiki_willy wrote:

Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.

thanks for the clarifications!

In T375257#10168489, @VRiley-WMF wrote:

Hey @ABran-WMF, as it turns out, we don't have any 2TB drives to use as a replacement. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with swapping it out with a 4TB drive?

@VRiley-WMF Please proceed and replace the drive with one of the 4TB drives you have in abundance. You can go ahead anytime, as the server has been depooled.
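As a rough illustration of why a 4TB drive can stand in for a 2TB member: hardware RAID controllers generally accept a replacement that is at least as large as the member it replaces, and the extra capacity simply goes unused. The sketch below encodes that rule of thumb. It is an assumption that this particular controller behaves this way, and the function names `can_replace` and `unused_tb` are hypothetical.

```python
def can_replace(member_tb, replacement_tb):
    """Rule of thumb (assumed): the replacement must be at least as large as the member."""
    return replacement_tb >= member_tb


def unused_tb(member_tb, replacement_tb):
    """Capacity on the replacement drive that the array will not use."""
    if not can_replace(member_tb, replacement_tb):
        raise ValueError("replacement drive is smaller than the array member")
    return replacement_tb - member_tb


print(can_replace(2, 4))  # True
print(unused_tb(2, 4))    # 2
```

Nominal sizes are used here for simplicity; in practice the comparison would be made on the exact sector counts reported by the drives.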

This drive has been replaced. Please let us know if there are any further issues.

I double-checked the RAID status and everything looks good (both the overall status and the status of every disk). I repooled the server at 100% weight. Thanks to @VRiley-WMF for the quick response. I will resolve the ticket.

Maintenance_bot moved this task from In progress to Done on the DBA board. Sep 27 2024, 1:29 PM
