
Degraded RAID on cloudvirt1024
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed

Event Timeline

A ticket with Dell has been created

You have successfully submitted request SR986375888.

Volans added a subscriber: Volans. Feb 13 2019, 1:48 PM

It seems that PD3 is totally gone, according to the output of sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -a:

			PD: 3 Information

			PD: 4 Information
			Enclosure Device ID: 32

Sorry, I have to amend what I said above: both PD0 and PD3 are missing. I'm sending a patch to improve the get-raid-status-megacli script. With it, the output would have been:

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 10

			PD: 0 Information
			ERROR: =====> MISSING DRIVE INFO <=====

			PD: 3 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed
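
The check described above (a "PD: N Information" header with no detail lines under it) could be sketched roughly as follows. This is not the actual patch to get-raid-status-megacli; the function name `find_missing_pds` and the regular expression are hypothetical, and the sketch assumes the input contains only physical-drive blocks like the ones pasted in this task.

```python
import re

# Matches header lines such as "PD: 3 Information" (hypothetical pattern,
# based only on the output pasted in this task).
PD_HEADER = re.compile(r"^\s*PD: (\d+) Information\s*$")


def find_missing_pds(megacli_output):
    """Return the PD numbers whose header is not followed by any detail line."""
    missing = []
    current = None       # PD number of the block currently being scanned
    has_details = False  # True once a non-blank, non-header line is seen
    for line in megacli_output.splitlines():
        match = PD_HEADER.match(line)
        if match:
            # A new header closes the previous block; record it if it was empty.
            if current is not None and not has_details:
                missing.append(current)
            current = int(match.group(1))
            has_details = False
        elif current is not None and line.strip():
            has_details = True
    # Close the last block at end of input.
    if current is not None and not has_details:
        missing.append(current)
    return missing
```

Fed the two-header excerpt quoted earlier in this task (an empty "PD: 3 Information" followed by a populated "PD: 4 Information"), this returns `[3]`.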

colewhite triaged this task as Normal priority. Feb 13 2019, 8:32 PM

Yep, slots 0 and 3 are gone and need replacement.

cloudvirt1024 / SAS Addr 0x500056b37c0f19c0 / Slot 0 / Model BTYS810309EP1P9DGNSSDSC2KB019T7R / Serial SCV1DL58
cloudvirt1024 / SAS Addr 0x500056b37c0f19c3 / Slot 3 / Model BTYS811208EB1P9DGNSSDSC2KB019T7R / Serial SCV1DL58

Mentioned in SAL (#wikimedia-operations) [2019-02-14T13:39:35Z] <arturo> T215892 icinga downtime cloudvirt1024 for 2 weeks

Andrew added a subscriber: Andrew. Feb 14 2019, 3:12 PM

This host is now fully drained, so the dcops folks can do whatever, whenever.

It seems that PD 8 has now failed as well:

			PD: 8 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 8
			Media Error Count: 0
			Other Error Count: 110
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Solid State Device
				Drive Temperature: 24C (75.20 F)
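
For anyone triaging a block like this by hand, a minimal sketch of turning it into key/value fields and flagging the drive is shown below. The helper names `parse_pd_block` and `is_healthy` are hypothetical, and the sketch assumes that a healthy drive reports a firmware state beginning with "Online"; it also strips the "=====>" and "<=====" highlight markers that appear in the output above.

```python
import re

# Highlight markers around abnormal values, e.g. "=====> Failed <=====".
MARKERS = re.compile(r"=+>|<=+")


def parse_pd_block(text):
    """Parse 'Key: value' lines of one PD block into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = MARKERS.sub("", value).strip()
    return fields


def is_healthy(fields):
    """Assumption: a healthy drive's firmware state starts with 'Online'."""
    return fields.get("Firmware state", "").startswith("Online")
```

A block whose "Firmware state" field parses to "Failed", as for PD 8 here, would be reported as unhealthy regardless of its media error count.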

Cmjohnson closed this task as Resolved. Feb 19 2019, 5:24 PM
Cmjohnson claimed this task.

@GTirloni The disk has been replaced

Return Shipping Info
USPS 9202 3946 5301 2441 0201 84
FEDEX 9611918 2393026 77770201

GTirloni reopened this task as Open. Feb 19 2019, 6:21 PM
GTirloni closed this task as Resolved. Feb 19 2019, 6:33 PM