
Degraded RAID on analytics1054
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1054. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 3
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

=== RaidStatus completed
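
For reference, the failed physical drive can be confirmed and its slot LED blinked before the disk is pulled (a sketch, assuming adapter 0 and the [32:1] enclosure/slot address reported above):

# Show the drive's state; [32:1] = Enclosure Device ID 32, Slot Number 1
$ sudo megacli -PDInfo -PhysDrv [32:1] -a0 | grep -E 'Firmware state|Slot Number'
# Blink the locate LED on that slot so the right disk gets swapped
$ sudo megacli -PdLocate -start -physdrv[32:1] -a0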

Event Timeline

herron triaged this task as High priority. Jan 7 2019, 3:27 PM
Milimetric moved this task from Operational Excellence to Incoming on the Analytics board.

The disk at slot 1 has failed. The server is out of warranty, but I do have a spare 4 TB SATA.

cmjohnson@analytics1054:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Failed
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

@elukey the disk still shows failed; do you have to manually add it back?

Milimetric moved this task from Incoming to Radar on the Analytics board. Jan 7 2019, 4:59 PM
elukey added a comment. Jan 7 2019, 5:03 PM

> @elukey the disk still shows failed; do you have to manually add it back?

Sorry Chris, I didn't get the question. Do you mean that the disk has been replaced and it still shows Offline?

elukey assigned this task to Cmjohnson. Jan 7 2019, 5:14 PM

@elukey sorry, I replaced the disk and it is still showing failed. I don't know if the disk needs to be manually added back to the array?

elukey added a comment. Jan 7 2019, 5:53 PM

@Cmjohnson so I got different output than usual from:

elukey@analytics1054:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Failed
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Firmware state: Online, Spun Up
Device Firmware Level: AA63
Firmware state: Online, Spun Up
Device Firmware Level: AA63

(reading from my notes: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk)

I've never seen a Firmware state of Failed before; any ideas or experience with this from the past?
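
For reference, a replacement drive that keeps reporting Failed can often be recovered by clearing any foreign configuration carried over from the old disk and then marking the drive good (a sketch, assuming adapter 0 and the [32:1] address from the status above):

# Check whether the controller sees a foreign configuration
$ sudo megacli -CfgForeign -Scan -a0
# Clear it if one is found
$ sudo megacli -CfgForeign -Clear -a0
# Flip the drive from Failed to Unconfigured(good)
$ sudo megacli -PDMakeGood -PhysDrv [32:1] -a0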

elukey added a comment. Jan 7 2019, 6:52 PM
Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 1
WWN: 500003964b700233
Sequence Number: 3
Media Error Count: 0
Other Error Count: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Failed
Device Firmware Level: FL1H
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b31234abc1
Connected Port Number: 0(path0)
Inquiry Data: ATA     TOSHIBA MG03ACA4FL1H           5521K0O0F
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :31C (87.80 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
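
One way to tell whether the controller has registered the new drive, or is still showing a cached record of the old one, is to compare the per-slot WWN and Inquiry Data fields shown above (a sketch):

$ sudo megacli -PDList -a0 | grep -E 'Slot Number|WWN|Inquiry Data|Firmware state'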

Change 483062 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prevent using /var/lib/hadoop/data/c partition on an1054

https://gerrit.wikimedia.org/r/483062

Change 483062 merged by Elukey:
[operations/puppet@production] Prevent using /var/lib/hadoop/data/c partition on an1054

https://gerrit.wikimedia.org/r/483062

elukey moved this task from Backlog to Stalled on the User-Elukey board. Jan 10 2019, 9:05 AM

The disk is replaced and now shows as Unconfigured(good).
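
From Unconfigured(good), the drive can be re-added using the same layout shown in the description, one single-disk RAID-0 virtual drive per physical disk (a sketch, assuming adapter 0 and [32:1]):

# Create a new single-drive RAID-0 virtual drive on the replacement disk
$ sudo megacli -CfgLdAdd -r0 [32:1] -a0
# Verify the new virtual drive comes up Optimal
$ sudo megacli -LDInfo -Lall -a0 | grep -E 'Virtual Drive|State'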

Change 486433 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove hiera overrides for analytics1054 after disk swap

https://gerrit.wikimedia.org/r/486433

Mentioned in SAL (#wikimedia-operations) [2019-01-25T07:40:38Z] <elukey> drain + reboot analytics1054 after disk swap (verify reboot + restore correct fstab mountpoints) - T213038
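
Restoring the mountpoint once the new virtual drive is visible would look roughly like this (a sketch; the /dev/sdX device name is a placeholder, and the mountpoint comes from the patches above):

# Rebuild the filesystem on the new virtual drive (device name is a placeholder)
$ sudo mkfs.ext4 /dev/sdX
# Remount via the restored fstab entry
$ sudo mount /var/lib/hadoop/data/c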

Change 486433 merged by Elukey:
[operations/puppet@production] Remove hiera overrides for analytics1054 after disk swap

https://gerrit.wikimedia.org/r/486433

elukey closed this task as Resolved. Jan 25 2019, 7:47 AM

All good, thanks a lot @Cmjohnson!