
Degraded RAID on db1101
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1101. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 10

			PD: 7 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 0, Arm: 7
			Media Error Count: 0
			Other Error Count: 1424
			Predictive Failure Count: =====> 1 <=====
			Last Predictive Failure Event Seq Number: 1617

				Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Solid State Device
				Drive Temperature: N/A

=== RaidStatus completed
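
For anyone triaging similar snapshots, here is a minimal Python sketch that pulls out the highlighted fields. It assumes the plugin marks problem values with `=====> ... <=====`, as it appears to in the output above. The `flagged_fields` helper and the trimmed `SNAPSHOT` sample are illustrative only and are not part of the Nagios/Icinga handler.

```python
import re

# Sample lines copied from the snapshot above (indentation trimmed).
SNAPSHOT = """\
Virtual Drive: 0 (Target Id: 0)
State: =====> Degraded <=====
Number Of Drives: 10
PD: 7 Information
Enclosure Device ID: 32
Slot Number: 7
Other Error Count: 1424
Predictive Failure Count: =====> 1 <=====
Firmware state: =====> Failed <=====
"""

# Assumption: problem values are wrapped as "key: =====> value <=====".
FLAGGED = re.compile(
    r"^\s*(?P<key>[^:]+):\s*=====>\s*(?P<value>.*?)\s*<=====\s*$"
)


def flagged_fields(text):
    """Return (key, value) pairs for every line carrying the highlight markers."""
    results = []
    for line in text.splitlines():
        match = FLAGGED.match(line)
        if match:
            results.append((match.group("key").strip(), match.group("value")))
    return results


for key, value in flagged_fields(SNAPSHOT):
    print(f"{key}: {value}")
```

On the sample this lists the virtual drive state, the predictive failure count, and the firmware state of the disk in slot 7.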

Event Timeline

Marostegui added a project: DBA.
Marostegui added a subscriber: wiki_willy.

@wiki_willy this host is out of warranty. Do we have any spare disks (used ones are fine) in the DC that we could use to replace this one?
Thanks

wiki_willy added subscribers: Jclark-ctr, Cmjohnson.

@Jclark-ctr or @Cmjohnson - do we have any decom'd servers onsite with this drive size? Thanks, Willy

Not sure who changed the disk, but thank you either John or Chris!

19:06:25 <+icinga-wm> RECOVERY - MegaRAID on db1101 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
Time: Tue Dec 22 16:20:46 2020

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 07(e0x20/s7) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b3969947c7,0000000000000000

seqNum: 0x00000c4a
Time: Tue Dec 22 17:57:31 2020

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 07(e0x20/s7) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 7
Enclosure Index: 32
Slot Number: 7
Previous state: 20
New state: 24


seqNum: 0x00000c4b
Time: Tue Dec 22 17:57:31 2020

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x00000c4c
Time: Tue Dec 22 17:57:31 2020

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
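
As a rough check on how long the replacement disk took to come online, the sketch below parses the two relevant events from the log above and subtracts their timestamps. It also converts the state codes in the description, which appear to be hexadecimal (14 and 18), to the decimal values shown under "Event Data" (20 and 24). The `parse_events` helper is hypothetical, and the interval runs from disk insertion to ONLINE, so it only approximates the rebuild time.

```python
import re
from datetime import datetime

# Two events copied from the controller log above.
EVENTS = """\
Time: Tue Dec 22 16:20:46 2020
Event Description: Inserted: PD 07(e0x20/s7) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b3969947c7,0000000000000000
Time: Tue Dec 22 17:57:31 2020
Event Description: State change on PD 07(e0x20/s7) from REBUILD(14) to ONLINE(18)
"""

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumes English day and month names


def parse_events(text):
    """Pair each 'Time:' line with the 'Event Description:' line that follows it."""
    events, when = [], None
    for line in text.splitlines():
        if line.startswith("Time: "):
            when = datetime.strptime(line[len("Time: "):].strip(), TIME_FORMAT)
        elif line.startswith("Event Description: ") and when is not None:
            events.append((when, line[len("Event Description: "):].strip()))
    return events


events = parse_events(EVENTS)
inserted_at = next(t for t, d in events if d.startswith("Inserted: PD 07"))
online_at, online_desc = next((t, d) for t, d in events if "to ONLINE" in d)
elapsed = online_at - inserted_at

# Assumption: the codes in parentheses are hex; Event Data shows them in decimal.
codes = re.search(r"REBUILD\(([0-9a-fA-F]+)\) to ONLINE\(([0-9a-fA-F]+)\)", online_desc)
previous_state = int(codes.group(1), 16)
new_state = int(codes.group(2), 16)

print(elapsed, previous_state, new_state)  # 1:36:45 20 24
```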