
Degraded RAID on db2107
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db2107. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 6
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.
Marostegui added a subscriber: Papaul.

@Papaul this host is under support. Can we get a new disk from Dell? This is the s2 codfw master.

Create Dispatch: Success
You have successfully submitted request SR1060544173.

Thanks @Papaul - however, the disk doesn't appear to be rebuilding:

seqNum: 0x000002e2
Time: Wed May 26 15:14:34 2021

Code: 0x000000b9
Class: 2
Locale: 0x04
Event Description: Enclosure PD 20(c None/p1) phy bad for slot 5
Event Data:
===========
Device ID: 32
Enclosure Index: 1
Slot Number: 255
Index: 5

Can you try removing it, waiting a couple of minutes, and then inserting it again? If that doesn't work, maybe request a new disk? If the new one is also bad, the RAID controller itself may be broken.

Mentioned in SAL (#wikimedia-operations) [2021-05-26T16:10:56Z] <marostegui> Reboot db2103 (codfw master) T282072

Mentioned in SAL (#wikimedia-operations) [2021-05-26T16:12:51Z] <marostegui> Reboot db2107 (codfw master) T282072

After the reboot I can see the disk:

Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors]
Coerced Size: 1.745 TB [0xdf7c0000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  4096
Firmware state: Rebuild
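For reference, the hex sector counts in the output above can be converted back into sizes. This is a minimal sketch: it assumes the 512-byte sector size reported by megacli and that the "TB" figures are binary terabytes (2^40 bytes), which is an inference rather than something the output states. The function and variable names are made up for illustration.

```python
SECTOR_BYTES = 512   # "Sector Size" from the megacli output
TIB = 2 ** 40        # assumption: megacli's "TB" means 2^40 bytes

# Sector counts copied from the physical drive output above.
sizes_in_sectors = {
    "Raw Size": 0xDF8FE2B0,
    "Non Coerced Size": 0xDF7FE2B0,
    "Coerced Size": 0xDF7C0000,
}


def sectors_to_tib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to binary terabytes."""
    return sectors * sector_bytes / TIB


for label, sectors in sizes_in_sectors.items():
    print(f"{label}: {sectors} sectors = {sectors_to_tib(sectors):.4f} TiB")
```

Under that assumption the results land within about 0.002 of the values megacli prints; the small differences are probably down to how megacli rounds or truncates.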

The disk is rebuilding:

root@db2107:~# megacli -pdrbld -showprog -physdrv\[32:5\] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 5% in 3 Minutes.

Let's hope it reaches 100%!

This keeps progressing well:

root@db2107:~# megacli -pdrbld -showprog -physdrv\[32:5\] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 37% in 31 Minutes.
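A rough ETA can be derived from these progress lines. The sketch below assumes the rebuild continues at the average rate seen so far, which is only an approximation since the rate depends on load on the host. The regular expression is written against the line format shown above, and `estimate_remaining_minutes` is a hypothetical helper, not part of megacli.

```python
import re

# Matches the progress line printed by "megacli -pdrbld -showprog" as seen above.
PROGRESS_RE = re.compile(
    r"Enclosure (?P<enclosure>\d+), Slot (?P<slot>\d+) "
    r"Completed (?P<percent>\d+)% in (?P<minutes>\d+) Minutes"
)


def estimate_remaining_minutes(line):
    """Linearly extrapolate the remaining rebuild time from one progress line.

    Returns None when no progress has been made yet (0%), because no rate
    can be computed in that case.
    """
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError(f"not a rebuild progress line: {line!r}")
    percent = int(match["percent"])
    minutes = int(match["minutes"])
    if percent == 0:
        return None
    return minutes * (100 - percent) / percent


line = ("Rebuild Progress on Device at Enclosure 32, Slot 5 "
        "Completed 37% in 31 Minutes.")
print(f"about {estimate_remaining_minutes(line):.0f} minutes left")
```

With 37% done in 31 minutes this works out to a bit under an hour remaining, assuming the rate holds.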

RAID back to optimal

root@db2107:~# megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 5.237 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 5.237 TB
State               : Optimal
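As a sanity check, the virtual drive size is consistent with six mirrored drives. This is a sketch under assumptions: all six drives have the same coerced size as the replaced one (0xdf7c0000 sectors of 512 bytes), half of the raw capacity holds mirror copies, and "TB" means 2^40 bytes. `mirrored_capacity_tib` is a name invented for this example.

```python
COERCED_SECTORS = 0xDF7C0000  # coerced size of the replaced disk, in sectors
SECTOR_BYTES = 512            # sector size reported by megacli
DRIVES = 6                    # "Number Of Drives" from the RAID status snapshot
TIB = 2 ** 40                 # assumption: megacli's "TB" means 2^40 bytes


def mirrored_capacity_tib(drives, sectors_per_drive, sector_bytes=SECTOR_BYTES):
    """Usable capacity when every drive has one mirror partner."""
    if drives % 2 != 0:
        raise ValueError("mirroring needs an even number of drives")
    usable_bytes = (drives // 2) * sectors_per_drive * sector_bytes
    return usable_bytes / TIB


print(f"{mirrored_capacity_tib(DRIVES, COERCED_SECTORS):.3f} TiB usable")
```

The result is within about 0.001 of the 5.237 TB reported for both Size and Mirror Data.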