
Degraded RAID on analytics1054
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1054. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 3
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

=== RaidStatus completed
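
For reference, the failed physical drive can be confirmed and its slot LED blinked before the disk is pulled (a sketch, assuming adapter 0 and the [32:1] enclosure/slot address reported above):

# Show the drive's state; [32:1] = Enclosure Device ID 32, Slot Number 1
$ sudo megacli -PDInfo -PhysDrv [32:1] -a0 | grep -E 'Firmware state|Slot Number'
# Blink the locate LED on that slot so the right disk gets swapped
$ sudo megacli -PdLocate -start -physdrv[32:1] -a0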

Event Timeline

herron triaged this task as High priority. Jan 7 2019, 3:27 PM
Milimetric moved this task from Operational Excellence to Incoming on the Analytics board.

The disk at slot 1 has failed. The server is out of warranty, but I do have a spare 4 TB SATA.

cmjohnson@analytics1054:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Failed
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

@elukey the disk still shows failed; do you have to manually add it back?

Milimetric moved this task from Incoming to Radar on the Analytics board. Jan 7 2019, 4:59 PM
elukey added a comment. Jan 7 2019, 5:03 PM

> @elukey the disk still shows failed; do you have to manually add it back?

Sorry Chris, I didn't get the question. Do you mean that the disk has been replaced and it still shows Offline?

elukey assigned this task to Cmjohnson. Jan 7 2019, 5:14 PM

@elukey sorry, I replaced the disk and it is still showing failed. I don't know if the disk needs to be manually added back to the array?

elukey added a comment. Jan 7 2019, 5:53 PM

@Cmjohnson so I got different output than usual from:

elukey@analytics1054:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Failed
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: AA63
Firmware state: Online, Spun Up
Device Firmware Level: AA63

(reading from my notes: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk)

I've never seen a Firmware state of Failed before; any ideas or experience with this from the past?
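
For reference, a replacement drive that keeps reporting Failed can often be recovered by clearing any foreign configuration carried over from the old disk and then marking the drive good (a sketch, assuming adapter 0 and the [32:1] address from the status above):

# Check whether the controller sees a foreign configuration
$ sudo megacli -CfgForeign -Scan -a0
# Clear it if one is found
$ sudo megacli -CfgForeign -Clear -a0
# Flip the drive from Failed to Unconfigured(good)
$ sudo megacli -PDMakeGood -PhysDrv [32:1] -a0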

elukey added a comment. Jan 7 2019, 6:52 PM
Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 1
WWN: 500003964b700233
Sequence Number: 3
Media Error Count: 0
Other Error Count: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Failed
Device Firmware Level: FL1H
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b31234abc1
Connected Port Number: 0(path0)
Inquiry Data: ATA     TOSHIBA MG03ACA4FL1H           5521K0O0F
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :31C (87.80 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
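
One way to tell whether the controller has registered the new drive, or is still showing a cached record of the old one, is to compare the per-slot WWN and Inquiry Data fields shown above (a sketch):

$ sudo megacli -PDList -a0 | grep -E 'Slot Number|WWN|Inquiry Data|Firmware state'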

Change 483062 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prevent using /var/lib/hadoop/data/c partition on an1054

https://gerrit.wikimedia.org/r/483062

Change 483062 merged by Elukey:
[operations/puppet@production] Prevent using /var/lib/hadoop/data/c partition on an1054

https://gerrit.wikimedia.org/r/483062

elukey moved this task from Backlog to Stalled on the User-Elukey board. Jan 10 2019, 9:05 AM

The disk is replaced and now shows as Unconfigured(good).
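
From Unconfigured(good), the drive can be re-added using the same layout shown in the description, one single-disk RAID-0 virtual drive per physical disk (a sketch, assuming adapter 0 and [32:1]):

# Create a new single-drive RAID-0 virtual drive on the replacement disk
$ sudo megacli -CfgLdAdd -r0 [32:1] -a0
# Verify the new virtual drive comes up Optimal
$ sudo megacli -LDInfo -Lall -a0 | grep -E 'Virtual Drive|State'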

Change 486433 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove hiera overrides for analytics1054 after disk swap

https://gerrit.wikimedia.org/r/486433

Mentioned in SAL (#wikimedia-operations) [2019-01-25T07:40:38Z] <elukey> drain + reboot analytics1054 after disk swap (verify reboot + restore correct fstab mountpoints) - T213038
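
Restoring the mountpoint once the new virtual drive is visible would look roughly like this (a sketch; the /dev/sdX device name is a placeholder, and the mountpoint comes from the patches above):

# Rebuild the filesystem on the new virtual drive (device name is a placeholder)
$ sudo mkfs.ext4 /dev/sdX
# Remount via the restored fstab entry
$ sudo mount /var/lib/hadoop/data/c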

Change 486433 merged by Elukey:
[operations/puppet@production] Remove hiera overrides for analytics1054 after disk swap

https://gerrit.wikimedia.org/r/486433

elukey closed this task as Resolved. Jan 25 2019, 7:47 AM

All good, thanks a lot @Cmjohnson!