Degraded RAID on db2148
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db2148. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 10
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 10

			PD: 3 Information
			Enclosure Device ID: 32
			Slot Number: 3
			Drive's position: DiskGroup: 0, Span: 0, Arm: 3
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 38C (100.40 F)

=== RaidStatus completed

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.
Marostegui added a subscriber: Papaul.

This is no longer correct: the RAID is OK, and so is the BBU.

Something strange happened. The RAID went degraded and then recovered:

[8023918.895348] megaraid_sas 0000:18:00.0: scanning for scsi0...
[8023918.895420] megaraid_sas 0000:18:00.0: 878 (676789824s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[8028387.345703] Process accounting resumed
[8029038.265021] megaraid_sas 0000:18:00.0: scanning for scsi0...
root@db2148:~#  megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 8.729 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 8.729 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No


root@db2148:~#  megacli -PDList -aALL | egrep -i "Slot|Firm|Err"
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 1
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 2
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 3
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 4
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 5
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 6
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 7
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 8
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 9
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
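
With ten drives the filtered listing above is easy to misread, so here is a minimal Python sketch of the same check. It assumes the text comes from `megacli -PDList -aALL` piped through the `egrep` filter shown above; the function name `slots_not_online` and the sample text are made up for illustration (the sample is constructed, not captured from db2148).

```python
import re


def slots_not_online(pdlist_text):
    """Return {slot: firmware_state} for drives whose state does not start with 'Online'."""
    bad = {}
    slot = None
    for line in pdlist_text.splitlines():
        line = line.strip()
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            # Remember which slot the following lines belong to.
            slot = int(m.group(1))
            continue
        # 'Device Firmware Level' does not match because re.match anchors at the start.
        m = re.match(r"Firmware state:\s*(.+)", line)
        if m and slot is not None:
            state = m.group(1).strip()
            if not state.startswith("Online"):
                bad[slot] = state
    return bad


# Constructed sample (hypothetical): slot 3 still rebuilding, slot 2 healthy.
sample = """\
Slot Number: 2
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: J004
Slot Number: 3
Media Error Count: 0
Other Error Count: 0
Firmware state: Rebuild
Device Firmware Level: J004
"""

print(slots_not_online(sample))
```

An empty dictionary means every listed drive reports an online state, which is what the output above shows for slots 0 to 9.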

The controller's log shows that too:

seqNum: 0x00000366
Time: Sat Jun  5 03:00:00 2021

Code: 0x00000027
Class: 0
Locale: 0x20
Event Description: Patrol Read started
Event Data:
===========
None


seqNum: 0x00000367
Time: Sat Jun  5 07:27:55 2021

Code: 0x00000023
Class: 0
Locale: 0x20
Event Description: Patrol Read complete
Event Data:
===========
None


seqNum: 0x00000368
Time: Sat Jun 12 03:00:00 2021

Code: 0x00000027
Class: 0
Locale: 0x20
Event Description: Patrol Read started
Event Data:
===========
None


seqNum: 0x00000369
Time: Sat Jun 12 05:10:24 2021

Code: 0x0000010c
Class: 1
Locale: 0x02
Event Description: PD 03(e0x20/s3) Path 500056b36928cfc3  reset (Type 03)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Error: 3


seqNum: 0x0000036a
Time: Sat Jun 12 05:10:24 2021

Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x0000036b
Time: Sat Jun 12 05:10:24 2021

Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 03(e0x20/s3) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b36928cfc3,0000000000000000
Event Data:
===========
Device ID: 3
Enclosure Device ID: 32
Enclosure Index: 1
Slot Number: 3
SAS Address 1: 500056b36928cfc3
SAS Address 2: 0


seqNum: 0x0000036c
Time: Sat Jun 12 05:10:24 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 24
New state: 17


seqNum: 0x0000036d
Time: Sat Jun 12 05:10:24 2021

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
Event Data:
===========
Target Id: 0
Previous state: 3
New state: 2


seqNum: 0x0000036e
Time: Sat Jun 12 05:10:24 2021

Code: 0x000000fb
Class: 2
Locale: 0x01
Event Description: VD 00/0 is now DEGRADED
Event Data:
===========
Target Id: 0


seqNum: 0x0000036f
Time: Sat Jun 12 05:10:25 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from FAILED(11) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 17
New state: 1


seqNum: 0x00000370
Time: Sat Jun 12 05:10:25 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x00000371
Time: Sat Jun 12 05:11:08 2021

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x00000372
Time: Sat Jun 12 05:11:08 2021

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 03(e0x20/s3) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b36928cfc3,0000000000000000
Event Data:
===========
Device ID: 3
Enclosure Device ID: 32
Enclosure Index: 1
Slot Number: 3
SAS Address 1: 500056b36928cfc3
SAS Address 2: 0


seqNum: 0x00000373
Time: Sat Jun 12 05:11:08 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 1
New state: 0


seqNum: 0x00000374
Time: Sat Jun 12 05:11:08 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 0
New state: 16


seqNum: 0x00000375
Time: Sat Jun 12 05:11:08 2021

Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x00000376
Time: Sat Jun 12 05:11:08 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 16
New state: 20


seqNum: 0x00000377
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 00(e0x20/s0)
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0


seqNum: 0x00000378
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 01(e0x20/s1)
Event Data:
===========
Device ID: 1
Enclosure Index: 32
Slot Number: 1


seqNum: 0x00000379
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 02(e0x20/s2)
Event Data:
===========
Device ID: 2
Enclosure Index: 32
Slot Number: 2


seqNum: 0x0000037a
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 04(e0x20/s4)
Event Data:
===========
Device ID: 4
Enclosure Index: 32
Slot Number: 4


seqNum: 0x0000037b
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 05(e0x20/s5)
Event Data:
===========
Device ID: 5
Enclosure Index: 32
Slot Number: 5


seqNum: 0x0000037c
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 06(e0x20/s6)
Event Data:
===========
Device ID: 6
Enclosure Index: 32
Slot Number: 6


seqNum: 0x0000037d
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 07(e0x20/s7)
Event Data:
===========
Device ID: 7
Enclosure Index: 32
Slot Number: 7


seqNum: 0x0000037e
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 08(e0x20/s8)
Event Data:
===========
Device ID: 8
Enclosure Index: 32
Slot Number: 8


seqNum: 0x0000037f
Time: Sat Jun 12 05:11:08 2021

Code: 0x000001bd
Class: 1
Locale: 0x02
Event Description: Patrol Read aborted on PD 09(e0x20/s9)
Event Data:
===========
Device ID: 9
Enclosure Index: 32
Slot Number: 9


seqNum: 0x000003cc
Time: Sat Jun 12 06:35:44 2021

Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0


seqNum: 0x000003cd
Time: Sat Jun 12 06:35:44 2021

Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x000003ce
Time: Sat Jun 12 06:35:44 2021

Code: 0x00000194
Class: 0
Locale: 0x02
Event Description: Drive Cache settings restored after rebuild for PD 03(e0x20/s3)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3


seqNum: 0x000003cf
Time: Sat Jun 12 06:35:45 2021

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 03(e0x20/s3) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
Previous state: 20
New state: 24


seqNum: 0x000003d0
Time: Sat Jun 12 06:35:45 2021

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x000003d1
Time: Sat Jun 12 06:35:45 2021

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0
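
To put a number on how long the virtual drive stayed degraded, the log above can be parsed mechanically. This is a rough sketch, assuming the event text has the `seqNum:` / `Time:` / `Event Description:` layout pasted above and that timestamps parse under an English locale. The helper names `parse_events` and `degraded_window` are hypothetical; the sample contains only events 0x36e and 0x3d1 copied from the log.

```python
from datetime import datetime


def parse_events(log_text):
    """Turn the controller event log text into a list of dicts (seq, time, desc)."""
    events = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("seqNum:"):
            # Sequence numbers are printed in hex, e.g. 0x0000036e.
            current = {"seq": int(line.split(":", 1)[1], 16)}
            events.append(current)
        elif line.startswith("Time:") and current is not None:
            stamp = " ".join(line.split(":", 1)[1].split())  # collapse double spaces
            current["time"] = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        elif line.startswith("Event Description:") and current is not None:
            current["desc"] = line.split(":", 1)[1].strip()
    return events


def degraded_window(events):
    """Return (start, end) of the first DEGRADED -> OPTIMAL interval, or None for missing ends."""
    start = end = None
    for ev in events:
        desc = ev.get("desc", "")
        if start is None and desc.endswith("is now DEGRADED"):
            start = ev["time"]
        elif start is not None and desc.endswith("is now OPTIMAL"):
            end = ev["time"]
            break
    return start, end


sample_log = """\
seqNum: 0x0000036e
Time: Sat Jun 12 05:10:24 2021

Code: 0x000000fb
Class: 2
Locale: 0x01
Event Description: VD 00/0 is now DEGRADED
Event Data:
===========
Target Id: 0

seqNum: 0x000003d1
Time: Sat Jun 12 06:35:45 2021

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0
"""

start, end = degraded_window(parse_events(sample_log))
print(end - start)  # 1:25:21
```

For these two events the window is 1 hour 25 minutes, which matches the rebuild on PD 03 starting at 05:11:08 and completing at 06:35:44.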

@Papaul can you confirm you didn't remove/insert any disk from this host?

Mentioned in SAL (#wikimedia-operations) [2021-06-14T07:29:32Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2148 T284852', diff saved to https://phabricator.wikimedia.org/P16454 and previous config saved to /var/cache/conftool/dbconfig/20210614-072930-marostegui.json

I have rebooted the host and everything came up as normal: all disks online, RAID optimal.
Leaving this open until @Papaul confirms he wasn't touching these disks while on-site.

@Marostegui I haven't been at rack B8 for the past 2 weeks, so no, I was not touching these disks while on-site.

Thanks - closing this. It might have been a glitch.