Degraded RAID on db1171
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1171. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed
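
For readers who want to process a snapshot like the one above programmatically, here is a minimal Python sketch. It assumes the report keeps the "Key: value" layout shown here; the function names `parse_raid_status` and `degraded_drives` are hypothetical and are not part of the Nagios/Icinga plugin.

```python
import re

# Sample input copied from the snapshot attached to this task.
SNAPSHOT = """\
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed
"""


def parse_raid_status(text):
    """Return one dict per virtual drive listed in the snapshot text."""
    drives = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Virtual Drive:\s*(\d+)", line)
        if match:
            # A new virtual drive section starts here.
            current = {"virtual_drive": int(match.group(1))}
            drives.append(current)
            continue
        if current is None or ":" not in line:
            # Skip adapter headers, banners, and blank lines.
            continue
        key, _, value = line.partition(":")
        # Drop the "=====> ... <=====" emphasis around the state value.
        current[key.strip()] = value.strip().strip("=<> ")
    return drives


def degraded_drives(drives):
    """Return the IDs of virtual drives whose reported state is not Optimal."""
    return [
        d["virtual_drive"]
        for d in drives
        if "State" in d and d["State"] != "Optimal"
    ]


if __name__ == "__main__":
    for drive in parse_raid_status(SNAPSHOT):
        print(drive["virtual_drive"], drive.get("State"))
```

Because the report omits components in an optimal state, every virtual drive it lists should already be non-optimal; the explicit state check is only a safeguard in case that assumption does not hold.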

Event Timeline

@jcrespo this is a backup source. I would assume you want the disk replaced with a spare, but just tagging you to confirm.

Yes, please. DC ops, please file a servicing request or help us with a spare here.

@jcrespo can this be swapped at any time, or do we need to schedule it?

Go ahead if it doesn't require a shutdown. If a shutdown is required or preferred, just let me know and I will perform it myself right away and tell you once the server is stopped. Otherwise, it can be done at any time.

@jcrespo This server is out of warranty. I replaced the disk with one from a decommissioned server; the drive was erased prior to installation. No new errors are currently listed, aside from previous ones on the controller from 11/8. The RAID appears healthy, but iDRAC still shows a warning. There are no failed drives, and the rebuild is currently in progress. I will check it again later. Additionally, I updated the iDRAC firmware from version 4.40.00.00 to 7.00.00.182.

I will keep an eye on it until the rebuild finishes; thanks for the quick help. I will also have a look at the warnings.

I saw the warnings, but I see no problem in the logs other than the detection of your disk change and firmware update. Once the disk rebuild finishes, I will restart the server to see if that clears the warnings.

iDRAC is showing SYSTEM IS HEALTHY after the rebuild.

Icinga downtime and Alertmanager silence (ID=efc351b5-c9f2-4bac-808a-7ec10adef598) set by jynus@cumin1003 for 2:00:00 on 1 host(s) and their services with reason: Restart

db1171.eqiad.wmnet