
Degraded RAID on db1048
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID was detected on host db1048. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 17
			Other Error Count: 0
			Predictive Failure Count: =====> 50 <=====
			Last Predictive Failure Event Seq Number: 43221

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: =====> Offline <=====
				Media Type: Hard Disk Device
				Drive Temperature: 35C (95.00 F)

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 414
			Other Error Count: 1
			Predictive Failure Count: =====> 51 <=====
			Last Predictive Failure Event Seq Number: 43222

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 33C (91.40 F)

=== RaidStatus completed
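
For triage, the snapshot above can be reduced to one row per flagged drive. The following is a minimal sketch, not part of the event handler: it assumes the snapshot is available as plain text, and the names `SNAPSHOT`, `parse_drives` and `summarize` are made up for illustration. Only a subset of the fields from the snapshot is reproduced.

```python
import re

# Excerpt of the snapshot in this task (subset of fields, indentation dropped).
SNAPSHOT = """\
PD: 0 Information
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 17
Predictive Failure Count: =====> 50 <=====
Firmware state: =====> Offline <=====

PD: 0 Information
Enclosure Device ID: 32
Slot Number: 2
Media Error Count: 414
Predictive Failure Count: =====> 51 <=====
Firmware state: Online, Spun Up
"""

def parse_drives(text):
    """Split the snapshot into one dict of 'key: value' fields per PD block."""
    drives = []
    for line in text.splitlines():
        line = line.strip()
        if re.match(r"PD: \d+ Information", line):
            drives.append({})  # a new physical drive block starts here
            continue
        if ":" not in line or not drives:
            continue
        key, value = line.split(":", 1)
        # Drop the '=====>' / '<=====' highlight markers added by the handler.
        value = re.sub(r"=====>|<=====", "", value).strip()
        drives[-1][key.strip()] = value
    return drives

def summarize(drives):
    """Return (enclosure:slot, media errors, predictive failures, state)."""
    return [
        (
            f"{d['Enclosure Device ID']}:{d['Slot Number']}",
            int(d["Media Error Count"]),
            int(d["Predictive Failure Count"]),
            d["Firmware state"],
        )
        for d in drives
    ]

DRIVES = summarize(parse_drives(SNAPSHOT))
for row in DRIVES:
    print(row)
```

Both drives report a non-zero predictive failure count, but only 32:0 is already offline in this snapshot.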

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper. · Dec 5 2016, 3:23 PM

We manually marked 32:0 as failed because the server is lagging.

Marostegui set Security to None.

Mentioned in SAL (#wikimedia-operations) [2016-12-05T15:25:55Z] <marostegui> Set disk 32:2 as failed db1048 - T152411

We have also marked 32:2 as failed.

Both disks had media errors. Can we get them replaced?

@Cmjohnson for this ticket, let's go one by one: replace one disk, let the RAID rebuild, and then replace the other one.
Thanks!

Chris has replaced disk 32:0 (which is part of span #0).
It is rebuilding now:

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 12% in 3 Minutes.
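
A rough remaining-time estimate can be read off a progress line like the one above. This is only a sketch: it assumes the rebuild proceeds at a constant rate, which the ticket does not confirm, and `estimate_remaining_minutes` is a hypothetical helper, not a megacli feature.

```python
import re

# Matches the progress line format shown in this task.
PROGRESS_RE = re.compile(
    r"Rebuild Progress on Device at Enclosure (\d+), Slot (\d+) "
    r"Completed (\d+)% in (\d+) Minutes\."
)

def estimate_remaining_minutes(line):
    """Linear extrapolation of the rebuild time left; None if not estimable."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None  # not a progress line (e.g. the drive is not rebuilding)
    percent, elapsed = int(m.group(3)), int(m.group(4))
    if percent == 0 or elapsed == 0:
        return None  # too early to extrapolate
    total = elapsed * 100 / percent  # assumed constant rebuild rate
    return total - elapsed

line = ("Rebuild Progress on Device at Enclosure 32, Slot 0 "
        "Completed 12% in 3 Minutes.")
print(estimate_remaining_minutes(line))  # 22.0
```

At 12% after 3 minutes, the extrapolated total is 25 minutes, so about 22 minutes would remain under that assumption.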

Span #0 is now rebuilt.

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL

Device(Encl-32 Slot-0) is not in rebuild process

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 1
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 1
  Failed Disks    : 0
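
The one-by-one approach amounts to a gate between the two swaps: the first rebuild must be finished before the second disk is pulled. Below is a minimal sketch of such a check over the two outputs above. The readiness criterion (rebuild idle and no offline virtual drive) is an assumption made for illustration, not a documented procedure, and `parse_summary` and `rebuild_idle` are hypothetical names.

```python
import re

# Outputs copied from this comment.
REBUILD_OUTPUT = "Device(Encl-32 Slot-0) is not in rebuild process"

SUMMARY = """\
Virtual Drives    : 1
  Degraded        : 1
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 1
  Failed Disks    : 0
"""

def parse_summary(text):
    """Turn 'Name : N' lines into a {name: count} dict."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

def rebuild_idle(output):
    """True when the controller reports no rebuild running on the device."""
    return "is not in rebuild process" in output

COUNTS = parse_summary(SUMMARY)

# Assumed criterion: previous rebuild done and no virtual drive offline.
READY_FOR_NEXT_SWAP = rebuild_idle(REBUILD_OUTPUT) and COUNTS["Offline"] == 0
# The array stays degraded until the second disk (32:2) is also rebuilt.
STILL_DEGRADED = COUNTS["Degraded"] > 0

print(READY_FOR_NEXT_SWAP, STILL_DEGRADED)
```

With the values in this comment, the array is ready for the second swap but still degraded, which matches 32:2 still being marked as failed at this point.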

32:2 has been replaced and is now rebuilding:

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 2% in 0 Minutes.
Marostegui closed this task as Resolved. · Jan 4 2017, 5:11 PM
Marostegui assigned this task to Cmjohnson.

All good now - thanks Chris!!

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Device(Encl-32 Slot-2) is not in rebuild process

Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
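
As a sanity check, the reported sizes can be cross-checked with a little arithmetic. This sketch assumes the 512-byte sector size listed for the virtual drive also applies to the physical drives, and that the controller's "GB" and "TB" are binary units (2^30 and 2^40 bytes). With 6 spans of 2 mirrored drives each (RAID 10), usable capacity is at most one drive per span.

```python
SECTORS_HEX = "22ecb25c"  # from "Raw Size: 279.396 GB [0x22ecb25c Sectors]"
SECTOR_BYTES = 512        # from "Sector Size: 512"; assumed to apply to the PDs
SPANS = 6                 # from "Number of Spans: 6", each span a mirrored pair
GIB = 2**30

raw_bytes = int(SECTORS_HEX, 16) * SECTOR_BYTES
raw_gib = raw_bytes / GIB

# One drive's worth of capacity per mirrored span; an upper bound only.
usable_tib = SPANS * raw_gib / 1024

print(raw_bytes)             # 300000000000
print(round(raw_gib, 2))     # 279.4
print(round(usable_tib, 2))  # 1.64
```

The sector count works out to exactly 300 GB (decimal) per drive, which is the 279.396 GB raw size in binary units. The upper bound of about 1.637 TB is slightly above the reported 1.633 TB; the ticket does not say where the difference goes.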