
Degraded RAID on db1068
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1068. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 21
			Other Error Count: 71
			Predictive Failure Count: =====> 3 <=====
			Last Predictive Failure Event Seq Number: 3205

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 43C (109.40 F)

		Span: 4 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 9
			Drive's position: DiskGroup: 0, Span: 4, Arm: 1
			Media Error Count: 46
			Other Error Count: 0
			Predictive Failure Count: =====> 555 <=====
			Last Predictive Failure Event Seq Number: 3206

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 38C (100.40 F)

=== RaidStatus completed

Event Timeline

Marostegui added a project: DBA.
Marostegui subscribed.

This is s4 primary master - please replace the disk as soon as you can.
Thanks!

@Cmjohnson Please be extra careful here: there are 2 degraded disks, but we want to replace first _only_ the one shown at the top of the list above. Once we are no longer in non-redundant mode, we can go for the other.

We had an issue on codfw, but I think it was only because 2 disks of the same span failed at the same time: T187722. Let us know if you need help with "blinking" leds for identifying spans :-)

I believe only the one marked as failed should be blinking in a different colour. The other one only shows errors as far as the report goes, but yeah, better to be careful if two of them are showing as bad, let us know first.
Thanks!
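
For reference, a drive bay is usually located by making its locate LED blink; a sketch of the megacli invocations for that, using the enclosure:slot from the report above (this was not run as part of this task):

# Start blinking the locate LED on the drive in enclosure 32, slot 2
megacli -PdLocate -start -physdrv[32:2] -aALL
# Stop blinking once the bay has been identified
megacli -PdLocate -stop -physdrv[32:2] -aALL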

Thanks Chris:

root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 1% in 1 Minutes.

Once this is finished, we should try to replace the one on slot 9.
Will ping you when we are ready for the second swap

Thanks again
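
(Just a sketch, not something run in this ticket: rebuild progress on that slot can be polled periodically with something like the following.)

# Re-check rebuild progress every 5 minutes on the drive in enclosure 32, slot 2
watch -n 300 'megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL'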

The rebuild failed for this disk, I guess this disk was not in a good state:

PD: 0 Information
Enclosure Device ID: 32
Slot Number: 2
Drive's position: DiskGroup: 0, Span: 1, Arm: 0
Enclosure position: 1
Device Id: 2
WWN: 5000C50023C4C988
Sequence Number: 12
Media Error Count: 12
Other Error Count: 7
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: 0008

Can we get another one?
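
(For reference only: had the drive itself still been usable, a rebuild on that slot could have been restarted manually with something along these lines; in this case the disk was replaced instead.)

# Manually start a rebuild on the drive in enclosure 32, slot 2
megacli -PDRbld -Start -PhysDrv [32:2] -aALL
# ...then follow its progress
megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL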

root@db1068:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Degraded
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6

Thanks Chris!

root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 6% in 8 Minutes.

It worked this time!

root@db1068:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

@Cmjohnson do you have more spares so we can replace the other degraded disk next week or do we need to place an order?

I have one left and now I see db1064 is degraded. We needed to order more
@RobH.

Yeah...just saw that.
Let's save that spare disk for db1068 and order more.
db1064 is a slave, so it is less important.

For db1068, would Monday work for you to replace the disk? We need to make sure we mark it as failed using megacli before you can pull it out.

@Marostegui Feel free to fail the disk...I am ready w/a replacement

Thanks - I will do in a sec once I get someone to double check the command :)
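
For the record, the usual megacli sequence to take a drive out of the array before pulling it is roughly the following (a sketch using the enclosure:slot from this ticket; worth double-checking the flags against the installed megacli version):

# Force the drive in enclosure 32, slot 9 offline
megacli -PDOffline -PhysDrv [32:9] -aALL
# Mark it as missing so the controller stops using it
megacli -PDMarkMissing -PhysDrv [32:9] -aALL
# Prepare it for removal (spins the drive down so it can be pulled safely)
megacli -PdPrpRmv -PhysDrv [32:9] -aALL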

Mentioned in SAL (#wikimedia-operations) [2018-03-05T15:28:53Z] <marostegui> Mark as failed disk 32:9 on db1068 (s4 primary master) - T188187

This has been replaced by Chris:

root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 1% in 12 Minutes.

Almost there:

root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 97% in 920 Minutes.
Marostegui mentioned this in Unknown Object (Task). Mar 6 2018, 7:08 AM
icinga-wm 8:26> RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
root@db1068:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write

And this is the status of the errors for the record (even the new disk shows errors):

root@db1068:~# megacli -LDPDInfo -aAll | egrep -i "slot|error"
Slot Number: 0
Media Error Count: 0
Other Error Count: 2
Slot Number: 1
Media Error Count: 0
Other Error Count: 0
Slot Number: 2
Media Error Count: 0
Other Error Count: 1
Slot Number: 3
Media Error Count: 0
Other Error Count: 0
Slot Number: 4
Media Error Count: 0
Other Error Count: 0
Slot Number: 5
Media Error Count: 0
Other Error Count: 0
Slot Number: 6
Media Error Count: 0
Other Error Count: 0
Slot Number: 7
Media Error Count: 0
Other Error Count: 0
Slot Number: 8
Media Error Count: 0
Other Error Count: 0
Slot Number: 9
Media Error Count: 0
Other Error Count: 5
Slot Number: 10
Media Error Count: 0
Other Error Count: 0
Slot Number: 11
Media Error Count: 0
Other Error Count: 0

Way fewer errors than before, but still a few. The new disk took very long to rebuild (almost 950 minutes, roughly 16 hours).
Closing this for now.