Degraded RAID on analytics1039
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1039. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'analytics1039', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

STDERR:
None

Event Timeline

We're pretty sure this is a false alarm.

colewhite closed this task as Resolved. Apr 16 2019, 6:02 PM
Volans reopened this task as Open. May 24 2019, 9:38 AM
Volans triaged this task as Normal priority.

Re-opening because the host ended up in a failed state, with 2 failed disks!
The automatic task was not opened because the check was already in alarm in Icinga, so it didn't re-trigger. Here is the status from megacli:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 1 (Target Id: 1)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 1, Span: 0, Arm: 0
			Media Error Count: 266
			Other Error Count: 15
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: N/A

	Virtual Drive: 9 (Target Id: 9)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 8, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 3055
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

=== RaidStatus completed
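
For anyone triaging similar reports, the status output above can be reduced to a per-slot summary of the failed drives. The following is only a minimal sketch: the function name `failed_drives` and the `SAMPLE` text are hypothetical, and it assumes the field labels appear exactly as in the output pasted above (it is not part of get-raid-status-megacli).

```python
# Abbreviated copy of the status output above, used only as sample input.
SAMPLE = """\
Slot Number: 0
Media Error Count: 266
Other Error Count: 15
Firmware state: =====> Failed <=====
Slot Number: 8
Media Error Count: 0
Other Error Count: 3055
Firmware state: =====> Failed <=====
"""


def failed_drives(status_text):
    """Return one dict per physical drive whose firmware state is 'Failed'."""
    drives = []
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Slot Number:"):
            # A new physical drive section starts at its slot number.
            current = {"slot": int(line.split(":", 1)[1])}
            drives.append(current)
        elif current is None:
            # Skip adapter / virtual drive lines seen before the first slot.
            continue
        elif line.startswith("Media Error Count:"):
            current["media_errors"] = int(line.split(":", 1)[1])
        elif line.startswith("Other Error Count:"):
            current["other_errors"] = int(line.split(":", 1)[1])
        elif line.startswith("Firmware state:"):
            # Drop the "=====>" / "<=====" highlighting around the state.
            current["state"] = line.split(":", 1)[1].strip(" =<>")
    return [d for d in drives if d.get("state") == "Failed"]


for drive in failed_drives(SAMPLE):
    print(drive)
```

On the output above this reports slots 0 and 8, with the media and other error counts shown for each.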

This host is currently part of the Hadoop testing cluster, which uses old/to-be-decommed nodes; really sorry for the noise. I have put in a request for new (not OOW) hardware for a new testing cluster next fiscal year.

These nodes have a lot of extra drives, and 1039 is currently only using the ones that hold the root partition, so there is no need to swap any of the failed ones.

@elukey I do not have any 4TB disks left over in eqiad. If I understand your comment correctly you are saying it's okay to ignore this for now.

Correct, no need, thanks!

Cmjohnson closed this task as Resolved. Jun 11 2019, 3:45 PM
Cmjohnson claimed this task.

@elukey

I found a spare disk and added it back; it's now online.

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 11, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C500847F30AC
Sequence Number: 8
Media Error Count: 0
Other Error Count: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0004
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500847f30ad
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST4000NM0023 0004Z1Z9XYSY
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :26C (78.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
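
As a side note on sizes, the "Raw Size: 3.638 TB" reported above and the "4TB" disks mentioned earlier appear to describe the same capacity. The sketch below converts the hexadecimal sector count into bytes. It assumes 512-byte sectors (the output above shows "Sector Size: 0", so this value is a guess), and the helper name `sectors_to_sizes` is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: not taken from the output, which shows "Sector Size: 0"


def sectors_to_sizes(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Return (bytes, decimal terabytes, binary tebibytes) for a hex sector count."""
    total = int(hex_sectors, 16) * sector_bytes
    return total, total / 10**12, total / 2**40


# Raw size of the replacement disk, as reported above.
raw_bytes, raw_tb, raw_tib = sectors_to_sizes("0x1d1c0beb0")
print(raw_bytes)          # total bytes
print(round(raw_tb, 3))   # decimal terabytes (10**12 bytes)
print(round(raw_tib, 4))  # binary tebibytes (2**40 bytes)
```

Under that assumption the disk holds 4,000,787,030,016 bytes, which is about 4.0 TB in decimal units and about 3.6387 in units of 2**40 bytes. That suggests the "TB" in the output above is a binary unit, with the displayed figure truncated to 3.638.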