
Degraded RAID on db1123
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1123. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed
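
For anyone post-processing snapshots like the one above, here is a minimal sketch of pulling out the virtual drives that are not in an optimal state. It assumes the text layout shown in this task ("Virtual Drive:" followed by a "State:" line). The function name `find_degraded_drives` and the `SAMPLE` string are hypothetical and are not part of the Icinga handler.

```python
import re
from typing import Dict, List


def find_degraded_drives(report: str) -> List[Dict]:
    """Return one dict per virtual drive whose State line is not Optimal."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        match = re.match(r"Virtual Drive:\s*(\d+)", line)
        if match:
            # Start a new record each time a virtual drive header appears.
            current = {"virtual_drive": int(match.group(1)), "state": None}
            drives.append(current)
            continue
        if current is not None and line.startswith("State:"):
            # Drop the "=====>" / "<=====" highlight markers around the state.
            current["state"] = line.split(":", 1)[1].strip(" =<>")
    return [d for d in drives if d["state"] and d["state"].lower() != "optimal"]


# Sample copied from the snapshot in this task (tabs as in the original).
SAMPLE = (
    "name: Adapter #0\n"
    "\n"
    "\tVirtual Drive: 0 (Target Id: 0)\n"
    "\tRAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0\n"
    "\tState: =====> Degraded <=====\n"
    "\tNumber Of Drives: 10\n"
)

print(find_degraded_drives(SAMPLE))
```

An empty list would mean no virtual drive in the snapshot reported a non-optimal state.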

Event Timeline

Restricted Application added a subscriber: Marostegui. · Dec 12 2019, 8:42 AM
Marostegui triaged this task as High priority. · Dec 12 2019, 8:44 AM
Marostegui added a project: DBA.
Marostegui added subscribers: wiki_willy, Papaul, Jclark-ctr.

This is the s3 primary master. Can we make sure it gets replaced before the holiday break?

Marostegui moved this task from Triage to In progress on the DBA board.

Assigning to @wiki_willy to make sure it is on DC-Ops' radar.

@Jclark-ctr - looks like this one is still under warranty, so you should be able to just RMA it. Thanks, Willy

jcrespo moved this task from Backlog to Acknowledged on the Operations board. · Dec 12 2019, 6:03 PM


Could we try to RMA it today? It usually takes 2-4 days for the disk to arrive, so we're on a tight schedule if we want it replaced by Friday.

Confirmed: Service Request 1007375142 was successfully submitted

@Marostegui the disk arrived today. Message me on IRC if you are available for the swap.

Can you coordinate with @jcrespo to get it replaced next week?
His IRC nick is jynus, in case IRC is easier.
Thanks!

drive changed

root@db1123:~$ megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL
                                     
Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 12% in 9 Minutes.

Exit Code: 0x00
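
As a rough aid for planning around the rebuild, the progress line above can be turned into a remaining-time estimate. This is only a sketch: it assumes the rebuild proceeds at a constant rate, which may not hold under production load, and the helper name `estimate_remaining_minutes` is hypothetical.

```python
import re

# Matches the progress line printed by "megacli -PDRbld -ShowProg" in this task.
PROGRESS_RE = re.compile(
    r"Rebuild Progress on Device at Enclosure (\d+), Slot (\d+) "
    r"Completed (\d+)% in (\d+) Minutes"
)


def estimate_remaining_minutes(line):
    """Linearly extrapolate the minutes left; None if no progress is reported."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    percent = int(match.group(3))
    elapsed = int(match.group(4))
    if percent == 0:
        return None  # nothing to extrapolate from yet
    total = elapsed * 100 / percent  # assumed constant rebuild rate
    return total - elapsed


line = "Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 12% in 9 Minutes."
print(round(estimate_remaining_minutes(line)))  # 66
```

With 12% done after 9 minutes, the linear estimate is 75 minutes in total, so about 66 minutes would remain.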

@Marostegui I've noticed this server has a strip size of 64K. I think that, as a rule, we should audit the RAID configuration on first setup, since it cannot be changed later without destroying all data on the disks. This is not the first time the requested 256K stripe size has not been set up, which causes performance degradation.
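
A first-setup audit of this kind could be scripted along the following lines. This is a sketch under assumptions: it expects text containing lines of the form "Strip Size : 64 KB", which is how I recall `megacli -LDInfo -Lall -aALL` reporting it, but that format is not shown in this task and should be checked against real output. The names `strip_sizes_kb` and `audit` are hypothetical.

```python
import re

REQUESTED_STRIP_KB = 256  # the size this comment says was requested


def strip_sizes_kb(ldinfo_text):
    """Return the strip sizes in KB, one per 'Strip Size' line found."""
    sizes = []
    # Assumed line format: "Strip Size          : 64 KB" (or a value in MB).
    for match in re.finditer(r"Strip Size\s*:\s*(\d+)\s*(KB|MB)", ldinfo_text):
        value = int(match.group(1))
        sizes.append(value * 1024 if match.group(2) == "MB" else value)
    return sizes


def audit(ldinfo_text, expected_kb=REQUESTED_STRIP_KB):
    """Return every strip size that differs from the expected one."""
    return [size for size in strip_sizes_kb(ldinfo_text) if size != expected_kb]


# Hypothetical sample mirroring the 64K value reported for db1123.
SAMPLE = "Virtual Drive: 0 (Target Id: 0)\nStrip Size          : 64 KB\n"
print(audit(SAMPLE))  # [64]
```

A non-empty result would flag the host before any data is loaded, while the array can still be recreated safely.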

jcrespo closed this task as Resolved. · Dec 24 2019, 8:37 AM
megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL
                                     
Device(Encl-32 Slot-9) is not in rebuild process

Exit Code: 0x00

MegaRAID OK: optimal, 1 logical, 10 physical, WriteBack policy