Degraded RAID on db1001
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 3 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 0, Span: 3, Arm: 0
			Media Error Count: 3
			Other Error Count: 3
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 32C (89.60 F)

=== RaidStatus completed
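
The snapshot above identifies the failed drive by enclosure and slot, which is the same `[32:6]` address megacli expects in `-physdrv`. A minimal sketch of pulling that address out of such a snapshot, assuming the text has been captured as a string in the layout shown here (the function name `failed_drives` and the sample constant are illustrative, not part of the Icinga handler):

```python
import re

# Excerpt of the snapshot attached to this task, used as sample input.
SNAPSHOT = """\
PD: 0 Information
Enclosure Device ID: 32
Slot Number: 6
Drive's position: DiskGroup: 0, Span: 3, Arm: 0
Media Error Count: 3
Firmware state: =====> Failed <=====
"""


def failed_drives(text):
    """Return '[enclosure:slot]' for every drive whose firmware state is Failed."""
    failed = []
    enclosure = slot = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Enclosure Device ID:\s*(\d+)", line)
        if m:
            enclosure = m.group(1)
            continue
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            slot = m.group(1)
            continue
        # The state line follows the enclosure and slot lines of the same PD.
        if line.startswith("Firmware state:") and "Failed" in line:
            failed.append(f"[{enclosure}:{slot}]")
    return failed


print(failed_drives(SNAPSHOT))
```

The parser relies on the enclosure and slot lines appearing before the firmware state line for each physical drive, as they do in this snapshot; other megacli output layouts are not handled.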

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 20 2017, 9:56 PM

@Cmjohnson - you should have old 300 GB disks, but if you don't, I can tell you where to get some (decommissioned/unused servers). This host is going to be retired soon, but right now it is still in use.

fgiunchedi triaged this task as Medium priority. · Jul 21 2017, 10:07 AM

Lovely, after I mentioned yesterday this host doesn't have any HW issues, a disk fails :-)
I should have kept my mouth closed!

@Cmjohnson remember that there are some hosts fully ready for you to decommission, whose disks could be used to replace this faulty disk if needed: T166486 T164702 T163778
Thanks!

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Jul 25 2017, 3:10 PM

Disk replaced and rebuilding

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
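
A list like the one above is easier to read as a tally: here 11 drives are online and 1 is rebuilding, which matches the 6 spans of 2 drives reported in the snapshot. A rough sketch of that tally, assuming the `Firmware state:` lines have been captured as plain text (the helper name `firmware_state_counts` is hypothetical):

```python
from collections import Counter


def firmware_state_counts(text):
    """Count drives per firmware state from 'Firmware state: ...' lines."""
    counts = Counter()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware state":
            counts[value.strip()] += 1
    return counts


# Same sequence of states as pasted in this comment.
STATES = "\n".join(
    ["Firmware state: Online, Spun Up"] * 6
    + ["Firmware state: Rebuild"]
    + ["Firmware state: Online, Spun Up"] * 5
)

counts = firmware_state_counts(STATES)
for state, n in counts.items():
    print(f"{n:2d}  {state}")
```

Any state other than `Online, Spun Up` in the result means the array is not yet back to Optimal.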

Marostegui closed this task as Resolved. · Jul 25 2017, 7:37 PM
Marostegui assigned this task to Cmjohnson.

RAID back to Optimal
Thanks Chris!!

root@db1001:~# megacli -pdrbld -showprog -physdrv\[32:6\] -aALL

Device(Encl-32 Slot-6) is not in rebuild process

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal