Degraded RAID on db1048
Closed, Resolved, Public

Assigned To

	Cmjohnson

Authored By

	ops-monitoring-bot
	Dec 5 2016, 3:23 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID was detected on host db1048. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 17
			Other Error Count: 0
			Predictive Failure Count: =====> 50 <=====
			Last Predictive Failure Event Seq Number: 43221

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: =====> Offline <=====
				Media Type: Hard Disk Device
				Drive Temperature: 35C (95.00 F)

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 414
			Other Error Count: 1
			Predictive Failure Count: =====> 51 <=====
			Last Predictive Failure Event Seq Number: 43222

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 33C (91.40 F)

=== RaidStatus completed
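
For triage, the snapshot above can be reduced to one record per physical drive. The following is only a minimal sketch: it assumes the snapshot is available as plain text in the layout shown above, and the function name `flagged_drives` and the dictionary keys are made up for illustration, not part of any monitoring tool.

```python
def flagged_drives(snapshot):
    """Return one dict per physical drive found in a RaidStatus snapshot.

    Assumes the "Key: value" layout of the snapshot in this task; the
    "=====> ... <=====" highlight markers are stripped before parsing.
    """
    drives = []
    current = None
    for raw in snapshot.splitlines():
        line = raw.strip().replace("=====>", "").replace("<=====", "")
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # In this layout every drive block starts with the enclosure ID.
            current = {"enclosure": int(value)}
            drives.append(current)
        elif current is None:
            continue  # adapter / virtual drive header lines
        elif key == "Slot Number":
            current["slot"] = int(value)
        elif key == "Media Error Count":
            current["media_errors"] = int(value)
        elif key == "Predictive Failure Count":
            current["predictive_failures"] = int(value)
        elif key == "Firmware state":
            current["firmware_state"] = value
    return drives
```

Applied to the snapshot in this task, this would yield two records, for 32:0 (17 media errors, firmware state Offline) and 32:2 (414 media errors, firmware state Online, Spun Up).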

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Marostegui	T151039 Unknown cause is creating lag on db1048 under write load (but not on the other m3 slaves)
		Resolved		• Cmjohnson	T152411 Degraded RAID on db1048

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Dec 5 2016, 3:23 PM

ops-monitoring-bot subscribed.

Restricted Application added subscribers: Southparkfan, Aklapper. Dec 5 2016, 3:23 PM

We manually marked 32:0 as failed because the server is lagging.

Marostegui added a project: DBA. Dec 5 2016, 3:23 PM

Marostegui set Security to None.

Mentioned in SAL (#wikimedia-operations) [2016-12-05T15:25:55Z] <marostegui> Set disk 32:2 as failed db1048 - T152411

We have also marked 32:2 as failed.

Both disks had media errors. Can we get them replaced?

Marostegui added a parent task: T151039: Unknown cause is creating lag on db1048 under write load (but not on the other m3 slaves). Dec 5 2016, 5:33 PM

@Cmjohnson for this ticket, let's replace the disks one at a time: swap the first, let the RAID finish rebuilding, and only then swap the second.
Thanks!

Chris has replaced disk 32:0 (which is part of span #0).
It is rebuilding now:

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 12% in 3 Minutes.
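
A rough remaining-time estimate can be read off a progress line like the one above. This is a sketch only: it assumes the rebuild proceeds at a constant rate (which real rebuilds under database load may not), and `estimate_remaining_minutes` is a hypothetical helper, not something megacli provides.

```python
import re

# Matches the progress line printed by the -PDRbld -ShowProg call in this task.
PROGRESS_RE = re.compile(
    r"Rebuild Progress on Device at Enclosure (\d+), Slot (\d+) "
    r"Completed (\d+)% in (\d+) Minutes"
)


def estimate_remaining_minutes(line):
    """Linear extrapolation of the time left; None if it cannot be estimated."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    percent = int(match.group(3))
    elapsed = int(match.group(4))
    if percent == 0 or elapsed == 0:
        return None  # too early for a meaningful extrapolation
    return elapsed * (100 - percent) / percent
```

For the line above (12% in 3 minutes) this gives about 22 minutes remaining under the constant-rate assumption.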

Span #0 is now rebuilt.

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL

Device(Encl-32 Slot-0) is not in rebuild process

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 1
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 1
  Failed Disks    : 0
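
The one-disk-at-a-time approach used here could be gated by polling until the first rebuild has finished before the second disk is pulled. Below is a minimal sketch under stated assumptions: `wait_for_rebuild` and its parameters are hypothetical, and `read_status` stands for any callable that returns the text printed by the `-PDRbld -ShowProg` command shown in this task.

```python
import time

# String megacli printed in this task once the drive was no longer rebuilding.
DONE_MARKER = "is not in rebuild process"


def wait_for_rebuild(read_status, poll_seconds=60, sleep=time.sleep, max_polls=None):
    """Poll read_status() until DONE_MARKER appears; return the number of polls.

    Raises TimeoutError if max_polls is set and reached first.
    """
    polls = 0
    while True:
        output = read_status()
        polls += 1
        if DONE_MARKER in output:
            return polls
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("rebuild still in progress after %d polls" % polls)
        sleep(poll_seconds)
```

Note that the marker only says the drive is not rebuilding; it would presumably also be printed for a drive whose rebuild never started, so the virtual drive state should still be checked (as in the summary above) before swapping the next disk.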

32:2 has been replaced and is now rebuilding:

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 2% in 0 Minutes.

All good now - thanks Chris!!

root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Device(Encl-32 Slot-2) is not in rebuild process

Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
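
As a sanity check, the sizes reported in this task are consistent with each other. The arithmetic below is a sketch; it uses the sector count from the RAID snapshot (0x22ecb25c), the 512-byte sector size and the 6 mirrored spans, and it assumes that megacli's "GB" and "TB" are binary units, which is what the numbers suggest.

```python
SECTOR_BYTES = 512          # "Sector Size" reported for the virtual drive
SPANS = 6                   # "Number of Spans"; each span is a 2-disk mirror

sectors = 0x22ecb25c        # "Raw Size" sector count of one physical drive
drive_bytes = sectors * SECTOR_BYTES
drive_gib = drive_bytes / 2**30

# In RAID 10 each mirrored span contributes the capacity of one drive.
usable_tib = SPANS * drive_gib / 1024

print(sectors)                 # 585937500
print(drive_bytes)             # 300000000000
print(round(drive_gib, 3))     # 279.397
print(round(usable_tib, 3))    # 1.637
```

So each drive is a 300 GB (decimal) disk, shown as 279.396 GB, and six spans give roughly 1.637 TB of raw mirrored capacity. The reported 1.633 TB is slightly lower; the difference is presumably space the controller reserves or rounds off per drive, but that is an assumption and not something this task confirms.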
