
Degraded RAID on es2026
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host es2026. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 12
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 12

			PD: 2 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 0, Arm: 2
			Media Error Count: 18
			Other Error Count: 4
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 28C (82.40 F)

=== RaidStatus completed
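The snapshot above can be reduced to the failed drive's location with a few lines of parsing. This is a minimal sketch, not part of the Icinga handler: the function name `failed_slots` is hypothetical, and it assumes the field labels (`Enclosure Device ID`, `Slot Number`, `Firmware state`) appear exactly as in the excerpt.

```python
import re

# Excerpt of the status snapshot above (only the fields the sketch needs).
SNAPSHOT = """\
PD: 2 Information
Enclosure Device ID: 32
Slot Number: 2
Media Error Count: 18
Firmware state: =====> Failed <=====
"""


def failed_slots(status_text):
    """Return (enclosure, slot) pairs whose firmware state is reported as Failed."""
    failed = []
    enclosure = slot = None
    for raw in status_text.splitlines():
        line = raw.strip()
        m = re.match(r"Enclosure Device ID:\s*(\d+)", line)
        if m:
            enclosure = int(m.group(1))
            continue
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            slot = int(m.group(1))
            continue
        # The plugin decorates non-optimal states with arrows, so match loosely.
        if line.startswith("Firmware state:") and "Failed" in line:
            failed.append((enclosure, slot))
    return failed


print(failed_slots(SNAPSHOT))
```

The resulting pair corresponds to the `[32:2]` enclosure:slot notation used in the `megacli -PDRbld` commands later in this task.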

Event Timeline

Restricted Application added subscribers: Marostegui, Aklapper. Sep 25 2020, 9:44 AM
Papaul claimed this task. Sep 25 2020, 4:43 PM
Papaul triaged this task as Medium priority.

Create Dispatch: Success
You have successfully submitted request SR1037735666.

Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your Dell Technologies TechDirect dashboard.

Thank you Papaul

Marostegui moved this task from Triage to In progress on the DBA board. Sep 28 2020, 5:10 AM
Papaul reassigned this task from Papaul to Marostegui. Sep 28 2020, 3:43 PM
Papaul added a subscriber: Papaul.

Less than a month after we received this server, we already have a bad disk.

Disk replaced.

Thanks @Papaul, is the disk blinking there? I still don't see it from the OS.

Time: Mon Sep 28 15:39:48 2020
Event Description: PD 02(e0x20/s2) Path 500056b34b011fc2  reset (Type 03)
Time: Mon Sep 28 15:39:48 2020
Event Description: Removed: PD 02(e0x20/s2)
Time: Mon Sep 28 15:39:48 2020
Event Description: Removed: PD 02(e0x20/s2) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc2,0000000000000000
Time: Mon Sep 28 15:39:48 2020
Event Description: State change on PD 02(e0x20/s2) from FAILED(11) to UNCONFIGURED_BAD(1)
Time: Mon Sep 28 15:43:53 2020
Event Description: PD 110(e0x00/s0) Path 500056b34b011fc2  reset (Type 03)
Time: Mon Sep 28 15:44:07 2020
Event Description: Enclosure PD 20(c None/p1) phy bad for slot 2
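The `State change` entries in the controller log encode the enclosure and slot as `e0x20/s2`. A small parser makes the mapping explicit. This is a sketch under the assumption that the enclosure ID is hexadecimal (0x20 = 32, which matches the `Enclosure Device ID: 32` reported by the RAID status snapshot); the regular expression and the name `parse_state_change` are illustrative, not taken from any MegaCLI tooling.

```python
import re

# Pattern for lines such as:
#   State change on PD 02(e0x20/s2) from FAILED(11) to UNCONFIGURED_BAD(1)
STATE_CHANGE = re.compile(
    r"State change on PD (?P<pd>[0-9a-f]+)"
    r"\(e(?P<encl>0x[0-9a-f]+)/s(?P<slot>\d+)\) "
    r"from (?P<old>\w+)\(\d+\) to (?P<new>\w+)\(\d+\)"
)


def parse_state_change(description):
    """Return the enclosure, slot and state transition, or None if no match."""
    m = STATE_CHANGE.search(description)
    if m is None:
        return None
    return {
        "enclosure": int(m.group("encl"), 16),  # e0x20 -> 32
        "slot": int(m.group("slot")),
        "old": m.group("old"),
        "new": m.group("new"),
    }


line = ("Event Description: State change on PD 02(e0x20/s2) "
        "from FAILED(11) to UNCONFIGURED_BAD(1)")
print(parse_state_change(line))
```

Lines that are not state changes, such as the `phy bad for slot 2` entry, return `None` and would need their own pattern.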

Shall we try removing it, waiting a couple of minutes, and then putting it back in?

@Papaul after putting the disk back in, I am seeing the same errors on the controller:

[1764225.764609] megaraid_sas 0000:af:00.0: 1103 (654623787s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 2
Time: Mon Sep 28 15:56:17 2020
Event Description: PD 110(e0x00/s0) Path 500056b34b011fc2  reset (Type 03)
Time: Mon Sep 28 15:56:27 2020
Event Description: Enclosure PD 20(c None/p1) phy bad for slot 2

But maybe a bad disk was shipped?
The HW logs do see the disk (but they obviously know nothing about the RAID):

-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/25/2020 08:59:20
Source:      system
Severity:    Critical
Description: Fault detected on drive 2 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/28/2020 15:39:43
Source:      system
Severity:    Critical
Description: Drive 2 is removed from disk drive bay 1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   09/28/2020 15:39:43
Source:      system
Severity:    Ok
Description: Drive 2 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/28/2020 15:42:39
Source:      system
Severity:    Ok
Description: Drive 2 is installed in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   09/28/2020 15:54:54
Source:      system
Severity:    Critical
Description: Drive 2 is removed from disk drive bay 1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   09/28/2020 15:55:05
Source:      system
Severity:    Ok
Description: Drive 2 is installed in disk drive bay 1.
-------------------------------------------------------------------------------

But the controller doesn't (slot 2 is missing from the list):

Slot Number: 0
Slot Number: 1
Slot Number: 3
Slot Number: 4
Slot Number: 5
Slot Number: 6
Slot Number: 7
Slot Number: 8
Slot Number: 9
Slot Number: 10
Slot Number: 11
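Spotting the gap in a slot listing like the one above is easy to automate. The following is a minimal sketch, assuming the 12 drives are numbered 0 through 11 (consistent with `Number Of Drives: 12` in the RAID status) and that the input is the `Slot Number:` lines from `megacli -PDList`; the helper name `missing_slots` is hypothetical.

```python
import re


def missing_slots(pdlist_output, expected_count):
    """Return the slot numbers in 0..expected_count-1 absent from the listing."""
    seen = {
        int(n)
        for n in re.findall(
            r"^\s*Slot Number:\s*(\d+)\s*$", pdlist_output, flags=re.MULTILINE
        )
    }
    return sorted(set(range(expected_count)) - seen)


# The listing from the comment above: every slot except 2.
listing = "\n".join(f"Slot Number: {n}" for n in range(12) if n != 2)
print(missing_slots(listing, expected_count=12))
```

An empty result means the controller sees a drive in every expected slot.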

I will try to reboot it tomorrow and see what happens.

Change 630718 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es2026: Disable notifications

https://gerrit.wikimedia.org/r/630718

Mentioned in SAL (#wikimedia-operations) [2020-09-29T05:12:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es2026 T263837', diff saved to https://phabricator.wikimedia.org/P12822 and previous config saved to /var/cache/conftool/dbconfig/20200929-051236-marostegui.json

Change 630718 merged by Marostegui:
[operations/puppet@production] es2026: Disable notifications

https://gerrit.wikimedia.org/r/630718

Mentioned in SAL (#wikimedia-operations) [2020-09-29T05:13:36Z] <marostegui> Stop mysql and reboot es2026 - T263837

Marostegui reassigned this task from Marostegui to Papaul. Sep 29 2020, 5:29 AM
Marostegui added a subscriber: wiki_willy.

@Papaul I think we need to ask for another disk or for advice from Dell.
These are the controller logs after the reboot:

Time: Tue Sep 29 05:17:10 2020
Event Description: Shutdown command received from host
Event Description: Firmware initialization started (PCI ID 005d/1000/1f42/1028)
Event Description: Firmware version 4.300.00-8352
Event Description: Battery Present
Event Description: Package version 25.5.6.0009
Event Description: Board Revision A01
Event Description: Battery temperature is normal
Event Description: Current capacity of the battery is above threshold
Event Description: Enclosure PD 20(c None/p1) communication restored
Event Description: Inserted: Encl PD 20
Event Description: Inserted: PD 20(c None/p1) Info: enclPd=20, scsiType=d, portMap=00, sasAddr=500056b34b011ffd,0000000000000000
Event Description: Inserted: PD 00(e0x20/s0)
Event Description: Inserted: PD 00(e0x20/s0) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc0,0000000000000000
Event Description: Inserted: PD 01(e0x20/s1)
Event Description: Inserted: PD 01(e0x20/s1) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc1,0000000000000000
Event Description: Inserted: PD 03(e0x20/s3)
Event Description: Inserted: PD 03(e0x20/s3) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc3,0000000000000000
Event Description: Inserted: PD 04(e0x20/s4)
Event Description: Inserted: PD 04(e0x20/s4) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc4,0000000000000000
Event Description: Inserted: PD 05(e0x20/s5)
Event Description: Inserted: PD 05(e0x20/s5) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc5,0000000000000000
Event Description: Inserted: PD 06(e0x20/s6)
Event Description: Inserted: PD 06(e0x20/s6) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc6,0000000000000000
Event Description: Inserted: PD 07(e0x20/s7)
Event Description: Inserted: PD 07(e0x20/s7) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc7,0000000000000000
Event Description: Inserted: PD 08(e0x20/s8)
Event Description: Inserted: PD 08(e0x20/s8) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc8,0000000000000000
Event Description: Inserted: PD 09(e0x20/s9)
Event Description: Inserted: PD 09(e0x20/s9) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fc9,0000000000000000
Event Description: Inserted: PD 0a(e0x20/s10)
Event Description: Inserted: PD 0a(e0x20/s10) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fca,0000000000000000
Event Description: Inserted: PD 0b(e0x20/s11)
Event Description: Inserted: PD 0b(e0x20/s11) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b34b011fcb,0000000000000000
Event Description: Controller operating temperature within normal range, full operation restored
Time: Tue Sep 29 05:18:59 2020
Event Description: Time established as 09/29/20  5:18:59; (88 seconds since power on)
Elapsed Time since power-on: 88
Time: Tue Sep 29 05:18:59 2020
Time: Tue Sep 29 05:19:59 2020
Event Description: Patrol Read resumed
Time: Tue Sep 29 05:20:05 2020
Event Description: Host driver is loaded and operational
Time: Tue Sep 29 05:20:19 2020
Event Description: Enclosure PD 20(c None/p1) phy bad for slot 2

Checking the controller configuration during boot confirms that the disk isn't being recognized: the listing jumps from #1 to #3.


This host was delivered by Dell on 9th September 2020.

The plan agreed with Papaul is to use an old disk from a decommissioned es host and see whether the controller recognizes it.
If it does, the new disk is probably bad. If it doesn't, we'll need to talk to Dell, as the RAID controller is probably at fault.

The disk from one of the decommissioned es servers works:

Name                 State   Slot Number  Size       Security Status  Bus Protocol  Media Type  Hot Spare  Remaining Rated Write Endurance
Physical Disk 0:1:0  Online  0            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:1  Online  1            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:2  Online  2            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:3  Online  3            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:4  Online  4            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:5  Online  5            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:6  Online  6            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:7  Online  7            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:8  Online  8            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:9  Online  9            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable

I can see the disk now:

Time: Wed Sep 30 14:44:09 2020

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 02(e0x20/s2) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 2
Enclosure Index: 32
Slot Number: 2


root@es2026:~# megacli -PDList -aALL | grep "Slot"
Slot Number: 0
Slot Number: 1
Slot Number: 2
Slot Number: 3
Slot Number: 4
Slot Number: 5
Slot Number: 6
Slot Number: 7
Slot Number: 8
Slot Number: 9
Slot Number: 10
Slot Number: 11

So Dell sent a bad disk

Not sure if it is actually going to work but at least the disk is seen:

root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 0% in 21 Minutes.

I created another dispatch to request a new disk and shipped back the one received on 9/25/2020.

Create Dispatch: Success
You have successfully submitted request SR1038277901.

Great!
The rebuild is progressing slowly, but at least it has started:

root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 1% in 28 Minutes.

Return tracking information

root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 62% in 875 Minutes.
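A rough remaining-time estimate can be derived from a progress line like the one above by linear extrapolation. This is only a sketch: it assumes the rebuild rate stays constant, which the samples in this task do not fully support (1% took 28 minutes, while 62% took 875 minutes, roughly 14 minutes per percent), and the function name is hypothetical.

```python
import re

# Matches the progress sentence printed by `megacli -PDRbld -ShowProg`.
PROGRESS = re.compile(r"Completed (\d+)% in (\d+) Minutes")


def estimate_remaining_minutes(progress_line):
    """Linearly extrapolate the remaining rebuild time; None if not computable."""
    m = PROGRESS.search(progress_line)
    if m is None:
        return None
    percent, minutes = int(m.group(1)), int(m.group(2))
    if percent == 0:
        return None  # no progress yet, so there is no rate to extrapolate from
    return minutes * (100 - percent) / percent


line = ("Rebuild Progress on Device at Enclosure 32, Slot 2 "
        "Completed 62% in 875 Minutes.")
print(round(estimate_remaining_minutes(line)))  # about 536 minutes
```

Treat the figure as an order-of-magnitude guide, since controller load and patrol reads can change the rebuild rate.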

The RAID rebuild finished correctly.
@Papaul what do you want to do once the new disk arrives? Should we leave this old one in, or should we pull it out and insert the new one?

root@es2026:~# megacli -LDInfo -L0 -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 10.913 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 12
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: Yes
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No

Marostegui closed this task as Resolved. Oct 1 2020, 2:55 PM

Going to close this as resolved. @Papaul let me know your thoughts on the above comment!
Thank you

@Marostegui since the server is under warranty, it is best to use a disk that is under warranty as well.

@Papaul sounds good, so maybe let's remove the old disk, give it 5 minutes, and then place the new one in?

Will do that once on site

Going to depool the host just in case, thanks!

Mentioned in SAL (#wikimedia-operations) [2020-10-13T12:49:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es2026 for on-site maintenance T263837 ', diff saved to https://phabricator.wikimedia.org/P12975 and previous config saved to /var/cache/conftool/dbconfig/20201013-124940-marostegui.json

New disk in place:

Name                 State   Slot Number  Size       Security Status  Bus Protocol  Media Type  Hot Spare  Remaining Rated Write Endurance
Physical Disk 0:1:0  Online  0            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:1  Online  1            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:2  Online  2            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:3  Online  3            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:4  Online  4            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:5  Online  5            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:6  Online  6            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:7  Online  7            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:8  Online  8            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable
Physical Disk 0:1:9  Online  9            1862.5 GB  Not Capable      SATA          HDD         No         Not Applicable

pt1979@es2026:~$ sudo megacli -PDRbld -ShowProg -physdrv[32:2] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 3% in 6 Minutes.

return tracking information

Just for the record, the rebuild process for the new disk finished correctly:

root@es2026:~# megacli -LDInfo -L0 -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 10.913 TB
State               : Optimal

Thank you Papaul!