Page MenuHomePhabricator

Degraded RAID on labsdb1005
Closed, DuplicatePublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host labsdb1005. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Partially Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-6, Secondary-0, RAID Level Qualifier-3
	State: =====> Partially Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 12

			PD: 7 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 7
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: =====> 2 <=====
			Last Predictive Failure Event Seq Number: 2513893

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Offline <=====
				Media Type: Hard Disk Device
				Drive Temperature: 33C (91.40 F)

=== RaidStatus completed

Related Objects

Event Timeline

Marostegui added subscribers: Bstorm, bd808, GTirloni.
Marostegui added subscribers: Cmjohnson, jcrespo, Marostegui.

cloud-services-team I would suggest you coordinate with @Cmjohnson to get this disk replaced

Marostegui edited projects, added Toolforge; removed Data-Services.Feb 15 2019, 7:24 AM
colewhite triaged this task as Normal priority.Feb 15 2019, 7:31 PM

Directly related to T216202

This was triggered when I manually offlined the failing disk in order to recover the database filesystem. So far, it hasn't helped as much as I was really hoping.