
Degraded RAID on helium
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host helium. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Partially Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'helium', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

STDERR:
None
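
(For reference: the 10-second limit above is check_nrpe's default socket timeout. When pulling the status manually, the client-side timeout can be raised with the standard -t flag; the 300-second value below is only an illustrative guess, and the NRPE daemon's own command_timeout on helium may also need raising:)

$ /usr/lib/nagios/plugins/check_nrpe -4 -H helium -c get_raid_status_megacli -t 300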

Event Timeline

jijiki triaged this task as Normal priority.Jun 18 2019, 1:30 PM
jijiki added subscribers: Cmjohnson, akosiaris, jijiki.

That looks about right

Physical Disk: 7
Enclosure Device ID: 15
Slot Number: 7
Drive's position: DiskGroup: 0, Span: 0, Arm: 7
Enclosure position: N/A
Device Id: 8
WWN: 5000C500629CAA04
Sequence Number: 3
Media Error Count: 0
Other Error Count: 121
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS


Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: GS0F
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500629caa05
SAS Address(1): 0x5000c500629caa06
Connected Port Number: 1(path0) 0(path1)
Inquiry Data: SEAGATE ST4000NM0023    GS0FZ1Z65R2W
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature : N/A
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
Volans added a subscriber: Volans.Jul 1 2019, 10:18 AM

The automatic gathering times out because megacli takes ~3 minutes to return the status of the disks: it blocks at PD7 (the broken one) and takes a very long time to get info from that disk.

As stated in T226908, I've disabled the event handler in Icinga for this specific check on helium; it needs to be re-enabled once the disk has been changed. For the same reason, the Icinga check itself flaps between unknown and critical, hence creating the duplicate tasks.

PD7 is broken and there is another disk in predictive failure; see the full output of sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli here:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-6, Secondary-0, RAID Level Qualifier-3
	State: =====> Partially Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 12

			PD: 7 Information
			Enclosure Device ID: 15
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 0, Arm: 7
			Media Error Count: 0
			Other Error Count: 121
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: N/A

			PD: 11 Information
			Enclosure Device ID: 15
			Slot Number: 11
			Drive's position: DiskGroup: 0, Span: 0, Arm: 11
			Media Error Count: 33586
			Other Error Count: 6
			Predictive Failure Count: =====> 9 <=====
			Last Predictive Failure Event Seq Number: 210876

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 41C (105.80 F)

=== RaidStatus completed
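
(Individual drives can also be queried directly with MegaCli instead of going through the full wrapper output above; a sketch, assuming the Debian megacli binary name and adapter 0, with enclosure 15 and slots 7/11 taken from the output. The failed PD7 itself will likely still respond slowly:)

$ sudo megacli -PDInfo -PhysDrv [15:7] -a0
$ sudo megacli -PDInfo -PhysDrv [15:11] -a0
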
This comment was removed by Cmjohnson.

I am not sure what I was looking at yesterday, but this server is out of warranty. However, I think I have a 4TB disk that I can replace it with. I will confirm when I get back to eqiad next week.

Cmjohnson reassigned this task from Cmjohnson to RobH.Jul 16 2019, 7:51 PM
Cmjohnson added subscribers: wiki_willy, RobH.

I do not have any spare 4TB SAS disks...this will need to go to @RobH and @wiki_willy for a procurement task.

RobH reassigned this task from RobH to wiki_willy.Jul 16 2019, 7:55 PM

So this is a host that is well over 5 years old, but it is now asking for more disks. Do we want to just order some, or is it slated for replacement somewhere that I am unaware of? (I do not see it on the upcoming procurement sheet.)

Dzahn added a comment.Jul 16 2019, 7:58 PM

Also see T186816, which created backup1001, the intended replacement for helium. I am not involved in the project to replace it though, so there might be more to it that others know better.

RobH added a comment.EditedJul 16 2019, 8:01 PM

Indeed, and backup1001 was set up via T189801. So it seems we shouldn't waste money on replacing disks in a system whose replacement has already been brought online with an OS and is ready for use (backup1001 is role spare).

T189801 is unclear, though, and I'm not sure of the last status for this.

ayounsi removed a subscriber: ayounsi.Jul 16 2019, 8:01 PM
Dzahn added a comment.EditedJul 16 2019, 9:24 PM

Looks like backup1001 is blocked by T227335 though, so it can't use its disks; that task is set to High priority.

@akosiaris or @Volans - we can order drive replacements for this, since it's out of warranty, but I was trying to figure out how this correlates with the new replacement of backup1001. Do you need replacement drives on helium, to be able to complete the migration of data over to backup1001? I'll follow up on IRC with you later tonight as well. Thanks, Willy

@wiki_willy sorry, but I cannot help, as I have no special knowledge of this host or backup1001.

backup1001 is indeed meant to replace helium. For a number of reasons this hasn't happened yet, but it is finally scheduled as a goal for this quarter, albeit a stretch one. I would advise procuring that disk in the meantime, just so that we can have some peace of mind (and minimize the chances of a failure) while moving forward with the migration.

Thanks for the back history, @akosiaris. We'll get the replacement drives ordered for you via procurement task T228302. ~Willy

Please note that replacement disks have now been ordered on T228302 and should arrive sometime next week. The 3-day shipping option was selected, so we currently expect them to ship on Friday/Monday and arrive on Wednesday/Thursday. At that point we can either coordinate with @Cmjohnson (who will be on vacation that week, returning the following) or have @Jclark-ctr swap them out.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Jul 31 2019, 6:09 PM

Drives were received last Wed, July 31, by @Jclark-ctr.

Replaced the disk at slot 7. Letting that rebuild, and will then replace the second failing disk.
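
(Rebuild progress after a swap can be followed with MegaCli; a sketch, again assuming the megacli binary name, adapter 0, and enclosure 15 / slot 7 as above:)

$ sudo megacli -PDRbld -ShowProg -PhysDrv [15:7] -a0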

Replaced the disk at slot 11
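
(Once both rebuilds complete, the virtual drive state should move from "Partially Degraded" back to "Optimal"; a quick way to confirm, under the same megacli assumptions:)

$ sudo megacli -LDInfo -Lall -aALL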

Dzahn removed a subscriber: Dzahn.Aug 8 2019, 10:21 PM

@Jclark-ctr - can we resolve this task? Thanks, Willy

Jclark-ctr closed this task as Resolved.Wed, Aug 28, 12:15 PM

The automatic gathering times out because megacli takes ~3 minutes to return the status of the disks: it blocks at PD7 (the broken one) and takes a very long time to get info from that disk.
As stated in T226908, I've disabled the event handler in Icinga for this specific check on helium; it needs to be re-enabled once the disk has been changed.

And we failed to re-enable it, so we never found out that slot 3 has now failed.
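
(For the record, re-enabling the event handler on the Icinga side is a single external command written to the command pipe; a sketch only, since the command-file path and the service description "MegaRAID" are assumptions that would need to match the actual Icinga configuration:)

$ printf '[%lu] ENABLE_SVC_EVENT_HANDLER;helium;MegaRAID\n' "$(date +%s)" | sudo tee /var/lib/icinga/rw/icinga.cmd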

Talked to @akosiaris, who will open up a new task to replace the newly failed drive. We ordered a few of them last time, so hopefully we'll have more spares lying around.