
Degraded RAID on sodium
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host sodium. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
	State: =====> Degraded <=====
	Number Of Drives: 4
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed

Event Timeline

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH subscribed.

So failed disk, but under warranty until June 17, 2019.

We cannot really 'test' the failed disk, since the others have data and we cannot move them around. So this will just need a support case opened directly with Dell.

Dzahn triaged this task as Medium priority. (Dec 14 2018, 8:23 PM)
Dzahn subscribed.

service is up and disk still in warranty -> normal

Sodium does not have any failed disks. One of the disks is listed as a hotspare.

cmjohnson@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
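
For reference, the per-state tally behind that observation can be sketched as below. This is only an illustration, not part of the check script: the helper name `firmware_state_counts` and the `SAMPLE` constant are hypothetical, and the input is assumed to be the grep-filtered `megacli -PDList` text exactly as pasted above.

```python
from collections import Counter

# Grep-filtered PDList output as pasted in the comment above.
SAMPLE = """\
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
"""


def firmware_state_counts(text):
    """Count physical drives per firmware state (hypothetical helper)."""
    states = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware state":
            # Keep only the part before the comma, e.g. "Online" or "Hotspare".
            states.append(value.split(",")[0].strip())
    return Counter(states)


counts = firmware_state_counts(SAMPLE)
print(dict(counts))
```

A tally like this only describes the physical drives. It says nothing about how many members the virtual drive expects, which is a separate question.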

Though the nagios plugin calls this "degraded". @Volans Is this maybe a bug in the check script?

[sodium:~] $  sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli 
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
	State: =====> Degraded <=====
	Number Of Drives: 4
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed

@Dzahn it's reported as degraded by megacli:

$ sudo /usr/sbin/megacli -LdPdInfo -aAll -NoLog
Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 10.914 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 3.637 TB
State               : Degraded
Strip Size          : 64 KB
[...SNIP...]

So maybe some missing configuration to tell it that the 4th disk is spare, if that's the intended configuration?

I don't know what the intended configuration is but it says RAID5 which just needs 3 disks as a minimum. The thing is that no human changed anything as far as we know yet this turned into "degraded" state on December 14. So some kind of event must have triggered that.
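
A rough way to estimate how many members this virtual drive expects is to use the sizes megacli printed above. This is a sketch under two assumptions that the output does not confirm by itself: that this is a single-span RAID 5, and that the reported parity size equals the capacity of one member disk. The function name `raid5_member_count` is made up for illustration.

```python
def raid5_member_count(vd_size_tb, parity_size_tb):
    """Estimate the member count of a single-span RAID 5 virtual drive.

    Assumes parity size equals one member's capacity, so that
    usable size = (members - 1) * parity size.
    """
    return round(vd_size_tb / parity_size_tb) + 1


# Figures taken from the -LdPdInfo output: Size 10.914 TB, Parity Size 3.637 TB.
members = raid5_member_count(10.914, 3.637)
online = 3  # drives shown as "Online" in the PDList output; the hotspare is not counted

print(members, "members expected,", online, "online")
print("degraded" if online < members else "optimal")
```

With these figures the estimate comes out at four members, which matches "Number Of Drives: 4". Under that reading, three online disks plus an unassigned hotspare would still leave the array one member short.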

Isn't this RAID "50" and therefore needs 6 disks minimum?

It seems that one disk has failed in a way that is not even reported by megacli. The new version of the script reports:

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
	State: =====> Degraded <=====
	Number Of Drives: 4
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 4

			PD: 1 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed

faidon raised the priority of this task from Medium to High. (Mar 4 2019, 1:02 PM)
faidon subscribed.

I just merged a duplicate in. @Cmjohnson what's the status of this?

Ok, the steps previously taken on this, and the next steps to take, are as follows:

Past steps:

  • sodium had a failed disk 6 months ago on T202705, and Chris opened request WO10392987.
  • Dell sent a replacement disk, which appears to have replaced disk 2 of the 4 disks.
    • The replacement disk was the wrong size at 5.458 TB, when all the other 3 disks are 3.638 TB.
    • Dell sent the wrong disk
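
The two sizes quoted above line up with marketed 4 TB and 6 TB drives if megacli's "TB" figures are binary units (TiB). That is an assumption here, not something the output states. A small sketch of the conversion, with the hypothetical helper `marketed_tb_to_tib`:

```python
TIB = 2 ** 40  # bytes in one binary terabyte (TiB)


def marketed_tb_to_tib(tb):
    """Convert a marketed (decimal, 10**12 bytes) capacity to binary TiB."""
    return tb * 10 ** 12 / TIB


for marketed in (4, 6):
    print(f"{marketed} TB marketed ~= {marketed_tb_to_tib(marketed):.3f} TiB")
```

Real drives usually hold slightly more than the nominal byte count, so the reported values can differ from this conversion in the last digit.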

Future steps:

  • Chris reopens the previous Dell case, or opens a new one referencing it, and has Dell send the properly sized disk that matches the rest.
  • Warranty ends on June 17, 2019. This needs to get handled sooner rather than later.

@RobH interesting response from Dell regarding the disk

Denial Notes
1. The previous dispatch with this part was done on 8/30/2018. As per DOA policy a part can be reported DOA only within stipulated time frame.
2. This system shipped with 6TB hard drive however you are requesting 4TB hard drive please verify the service tag. If this part was purchased separately then please resubmit the request with the Dell Order number under which the part was purchased.

Ok, in reviewing the ordering task of T137132, it isn't 100% clear at first review what was ordered (packing slip only has the second page attached to task, and the task has a lot of different specs on it.)

End result, the sodium system WAS ordered with 4 * 6TB 7.2K RPM SAS 12Gbps 4Kn 3.5in Hot-plug Hard Drive and then swapped to 4TB via T139171#2582840.

I've emailed our team to get this fixed:

Issues with service tag JFTXHB2

Dell Team,

We have an ongoing issue with support, that I expect can be more easily resolved by looping you in than our attempting to explain the history of this service tag to a random support agent.

We ordered JFTXHB2 back on 2016-06-14 via Dell Order 993614052. It shipped with 4*6TB disks. The 6tb 4k disks were not compatible with the controller and our required setup, and Chris opened case 9726888 to get the 6TB swapped to 4TB via Dell, which was done.

Now, one of the 4TB disks died 6 months ago, and Dell dispatch via WO10392987 sent us a 6TB replacement. We recently filed another request to try to get the properly sized 4TB disk and got the following reply:

1. The previous dispatch with this part was done on 8/30/2018. As per DOA policy a part can be reported DOA only within stipulated time frame.
2. This system shipped with 6TB hard drive however you are requesting 4TB hard drive please verify the service tag. If this part was purchased separately then please resubmit the request with the Dell Order number under which the part was purchased.

What can be done so we can get this 6TB disk, which is not useful to us and was previously swapped by Dell to 4TB, replaced with a 4TB disk? Back when the disks were downgraded to 4TB, we went via Dell to prevent this exact issue from happening. End result, we have a degraded system and need the 6TB disk swapped for a 4TB disk like the ones previously sent to us.

Please advise,

Reply from a Dell SR:

Hello Rob/Chris,

My name is Ivan, Resolution Manager at Dell EMC. I was engaged on the case above regarding the HDD replacement issue you all have been running into. Do you happen to have the 4TB part number or PPID? Normally we can see these in the TSR if you’re able to provide one.

Also, do you have an SR number or order number for the drives when they were swapped from the 6TB to the 4TB? I checked under service tag JFTXHB2 but I’m not seeing it.

The case below was referenced but it’s not bringing up any Dell cases.

“Chris opened case 9726888 to get the 6TB swapped to 4TB via Dell, which was done.”

Thank you,

Ivan Martinez

Resolution Manager, Support Resolution Team

Dell EMC | Support & Deployment Services

Office +1 512 723 4111

Mobile +1 512 496 4495

Ivan.Martinez@Dell.com

I was referencing:

Still working on getting the disks replaced w/out any costs to us and possibly a refund. This is the latest message.

Chris,

Base on your request below return request order number 935064839, To return this order we submitted a FSR (Financial Services Request) as it is out of policy.

An FSR does not guarantee that the order will be approved for the return. Moreover, I will do all my effort to help on this request.

Request Id 9726888.

I have created a Service Request for this issue. Feel free to contact me if there is any question or doubt.

Regards,

Carlos Brown
Customer Care Analyst
Dell | Dell Business Operations
Carlos_Brown@DellTeam.com
Work Hours: Monday to Friday: 8am-6pm
Customer feedback | How am I doing? Please contact my manager Claudia_taboada@dell.com

  • Please do not remove your unique tracking number! ------

<<#3075-36464440#>>

I don't want to reply back to Dell until we sync on this today.

Chris,

What was the case you opened with Dell support recently to attempt to get the 6TB swapped to 4TB, referenced in this comment?

@RobH interesting response from Dell regarding the disk

Denial Notes
1. The previous dispatch with this part was done on 8/30/2018. As per DOA policy a part can be reported DOA only within stipulated time frame.
2. This system shipped with 6TB hard drive however you are requesting 4TB hard drive please verify the service tag. If this part was purchased separately then please resubmit the request with the Dell Order number under which the part was purchased.

Please chat with me about this task, thanks!

Basically we need the email thread where this swap was approved and done by Dell, to demonstrate that the 4TB disks are Dell-supported and under their warranty.

Ok, synced up with Chris and have the following going on:

  • emailed our dell account team, they opened SR 987845644 with Ivan Martinez

His email (last Friday):

Hello Rob/Chris,

My name is Ivan, Resolution Manager at Dell EMC. I was engaged on the case above regarding the HDD replacement issue you all have been running into. Do you happen to have the 4TB part number or PPID? Normally we can see these in the TSR if you’re able to provide one.

Also, do you have an SR number or order number for the drives when they were swapped from the 6TB to the 4TB? I checked under service tag JFTXHB2 but I’m not seeing it.

The case below was referenced but it’s not bringing up any Dell cases.

“Chris opened case 9726888 to get the 6TB swapped to 4TB via Dell, which was done.”

Thank you,

Ivan Martinez

Resolution Manager, Support Resolution Team

Dell EMC | Support & Deployment Services

My reply from today:

Ivan,

So, initially Chris opened case 935064839 in August 2016 to have the swap happen, and that generated a FSR request 9726888. That email also had a unique tracking number of <<#3075-36464440#>>. Dell Fusion Incident #34794881 <<#3075-36464440#>>

At that time, Dell sent us 4 * 4TB disks with the model info (pulled from os just now) of: SEAGATE ST4000NM0025.

I'm not sure why 9726888 isn't showing anything? Do the other numbers assist?

No update back from Dell, so sent a followup today:

Ivan,

Any updates on this?

Dell sent the correct size disk, thanks to @RobH. RAID is rebuilding.

cmjohnson@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

robh@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up