
(OoW) Degraded RAID on analytics1039
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1039. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'analytics1039', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

STDERR:
None
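
The socket timeout only tells us the remote check did not answer within 10 seconds; megacli can be slow to respond when a controller is unhealthy. When re-running the check by hand, the NRPE timeout can be raised; a minimal sketch, assuming the same paths as above and check_nrpe's standard -t option:

# From the Icinga host, with a 60 second timeout instead of the default (sketch):
$ /usr/lib/nagios/plugins/check_nrpe -4 -H analytics1039 -c get_raid_status_megacli -t 60

# Or run the wrapped plugin directly on the affected host:
$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli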

Event Timeline

jbond triaged this task as Normal priority · Jun 26 2019, 10:51 AM
jbond added a subscriber: jbond (Edited) · Jun 26 2019, 11:14 AM

sudo megacli -LDInfo -Lall -aALL

Virtual Drive: 4 (Target Id: 4)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 3.637 TB
Sector Size         : 512
Parity Size         : 0
State               : Offline
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: Yes
Is VD Cached: Yes
Cache Cade Type : Read Only

sudo megacli -PDList -aALL

Enclosure Device ID: 32
Slot Number: 3
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 3
WWN: 5000c50066a8d3b1
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: GA0A
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b36789abef
Connected Port Number: 0(path0) 
Inquiry Data: ATA     ST4000NM0033-9ZMGA0A            Z1Z3PDQ5
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :32C (89.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No
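
For dc-ops, the failed drive (enclosure 32, slot 3 in the output above) can be located physically by blinking its slot LED; a sketch, assuming adapter 0 as reported by the automation:

# Start blinking the locate LED on enclosure 32, slot 3 (sketch):
sudo megacli -PdLocate -start -physdrv[32:3] -a0
# Stop it again once the drive has been identified:
sudo megacli -PdLocate -stop -physdrv[32:3] -a0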

Volans added a subscriber: Volans · Jun 26 2019, 11:18 AM

@jbond FYI if you want to mimic the automation, just run:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 4 (Target Id: 4)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 3
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 32C (89.60 F)

=== RaidStatus completed
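
The wrapper only prints components that are not in an optimal state; a rough grep-based equivalent over the raw megacli output (a sketch, not the script's actual implementation) is:

# Pull out the key per-drive fields; anything whose Firmware state is not
# "Online, Spun Up" is worth a closer look:
sudo megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state|Media Error Count|Predictive Failure Count'
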
jbond added a comment (Edited) · Jun 26 2019, 11:21 AM

Thanks @Volans, perhaps the runbook should be updated. I'm not sure if the full info requested there is still useful to dc-ops?

I'll let them reply :) We also have an hpssacli version of basically the same script, FWIW.

Is there a way to stop this check for some hosts? In this case this is the Hadoop testing cluster, all OoW hardware.

@elukey you can disable Icinga notifications for a cluster via Hiera. Alternatively, to disable only this kind of check, you can disable the event handler from the Icinga UI. I'm not sure if we have a more fine-grained way to skip these.
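
For the Hiera route, the change is a one-line host override in the puppet repo; the exact key depends on the current monitoring profiles, so the one below is only a hypothetical illustration:

# Hypothetical hiera override for this host (the key name is illustrative only;
# check the monitoring profiles in puppet for the real one):
$ cat hieradata/hosts/analytics1039.yaml
profile::monitoring::notifications_enabled: false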

Cmjohnson moved this task from Backlog to Stalled on the ops-eqiad board · Jun 27 2019, 4:20 PM
wiki_willy renamed this task from Degraded RAID on analytics1039 to (OoW) Degraded RAID on analytics1039 · Jul 2 2019, 9:41 PM
Cmjohnson moved this task from Stalled to Blocked on the ops-eqiad board · Jul 12 2019, 12:35 AM
wiki_willy assigned this task to elukey · Jul 15 2019, 7:25 PM

@Cmjohnson - if you have any spare 4TB SATA drives lying around onsite that match the disks in analytics1039, feel free to use them for this task. Thanks, Willy

@wiki_willy I'll try to disable this alarm for good; the host does not use the disk and there is no real reason to waste a spare :)
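
Since the logical drive is unused, an alternative to just silencing the alert would be to drop the offline LD from the controller configuration so megacli stops reporting it. That is not what was done here; purely as a sketch, assuming adapter 0 and virtual drive 4 from the output above:

# Delete the offline single-disk virtual drive 4 on adapter 0 (destructive; sketch only):
sudo megacli -CfgLdDel -L4 -a0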

Thanks @elukey, much appreciated! ~Willy

elukey closed this task as Resolved · Aug 6 2019, 10:12 AM

The alert should not fire again (I hope); I have disabled it via the Icinga UI. Closing :)