
es2004 has a dead disk, but it is not under warranty
Closed, Resolved, Public

Description

es2004 does not provide "live" service, but I asked for it to be kept around (with es2001-3) as an offline backup for ES systems until backup storage is expanded next fiscal year.

Currently, the data it keeps is duplicated on es2001-es2003. Data on es2004 can be erased.

Recently one of its disks failed, but it is in a RAID10 HW configuration.

  • I think these systems have a special disk size/type; confirm whether there are any spares (if there are, just replace the failed disk, but let's not buy new ones for what are essentially end-of-life servers!)
  • Retire the broken disk and recycle it; retire another disk (maybe keep it as a spare?); redo the RAID10 with 2 fewer disks
Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 50014ee6ac5872ac
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: 1D02
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500065b36789abe8
Connected Port Number: 0(path0) 
Inquiry Data: ATA     WDC WD2003FYYS-11D02     WD-WMAY03516757

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper.

@jcrespo I have 10 x 2TB 7.2k disks on site as spares, and I can use one of them to replace the faulty disk.

Disk replacement complete.

@Papaul, if you have 10, and you do not mind using 1 for this (not a priority server), just replacing the disk will be faster than rebuilding the RAID! So proceed if you are ok with it.

Right now, es2004 still shows:

CRITICAL: 1 failed LD(s) (Degraded)
Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 50014ee6ac5872ac
Sequence Number: 3
Media Error Count: 0
Other Error Count: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: 1D02
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500065b36789abe8
Connected Port Number: 0(path0) 
Inquiry Data:             Z4Z3LLB1ST2000DM001-1ER164                      CC26    
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature : N/A
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

@Papaul It seems that the disk you added was already non-working.

@jcrespo the disk is a brand new disk that was in an anti-static plastic bag, never used.

I believe you, I am just copying and pasting:
"Firmware state: Failed"

Either you changed the wrong disk (I *do not* believe that: the serial number is different, so you replaced the right one), or its lifespan was minutes. Let me see if it lived for at least some minutes.

There is a more likely possibility: the controller has a problem with that particular port. Earlier, the controller did not fail in the usual way; some of its information requests ended in a timeout.
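To make the serial number comparison concrete, here is a minimal Python sketch that compares the `Inquiry Data` field of the two slot 10 records pasted earlier in this task. The helper names (`parse_pd_record`, `inquiry_changed`) are hypothetical, and the parsing assumes the plain `Key: Value` layout seen in the pasted output; it is not part of any existing tooling.

```python
# Sketch only: helper names are made up for illustration.
BEFORE = """Slot Number: 10
Firmware state: Failed
Inquiry Data: ATA     WDC WD2003FYYS-11D02     WD-WMAY03516757"""

AFTER = """Slot Number: 10
Firmware state: Failed
Inquiry Data:             Z4Z3LLB1ST2000DM001-1ER164                      CC26    """


def parse_pd_record(text):
    """Turn 'Key: Value' lines of one physical-drive record into a dict."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "DiskGroup: 0, Span: 5, Arm: 0" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


def inquiry_changed(old, new):
    """True if the Inquiry Data differs once whitespace is normalized."""
    def normalize(rec):
        return " ".join(rec.get("Inquiry Data", "").split())
    return normalize(old) != normalize(new)


old, new = parse_pd_record(BEFORE), parse_pd_record(AFTER)
print("same slot:", old["Slot Number"] == new["Slot Number"])
print("different drive:", inquiry_changed(old, new))
print("state after swap:", new["Firmware state"])
```

With the two records from this task, the slot is the same, the Inquiry Data differs (a WDC drive before, an ST2000DM001 after), and the firmware state is still reported as Failed, which is consistent with the right disk having been swapped.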

@jcrespo true but now it returns immediately, so maybe it was just not recognized?
Maybe you could try to unplug it and plug it again.

 es2004  0 ~$ time sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 5 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 10
			Drive's position: DiskGroup: 0, Span: 5, Arm: 0
			Media Error Count: 0
			Other Error Count: =====> 1 <=====
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: N/A

=== RaidStatus completed

real	0m0.091s
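For context on the "redo the RAID10 with 2 fewer disks" option from the description, here is a rough Python sketch of the usable capacity with the current 6 spans versus 5 spans. It assumes 512-byte sectors (the controller output reports `Sector Size: 0`, so this is an assumption) and that each 2-drive span contributes the coerced size of one disk; the function name is hypothetical.

```python
# Coerced size per disk, taken from the PDList output in this task.
COERCED_SECTORS = 0xE8D00000
SECTOR_BYTES = 512  # assumption: not reported by the controller output


def raid10_usable_tib(spans, coerced_sectors=COERCED_SECTORS,
                      sector_bytes=SECTOR_BYTES):
    """Usable capacity in TiB, assuming each mirrored span holds one disk's worth."""
    return spans * coerced_sectors * sector_bytes / 2**40


current = raid10_usable_tib(6)   # 6 spans x 2 drives, as reported above
reduced = raid10_usable_tib(5)   # after retiring one mirrored pair
print(f"6 spans: {current:.2f} TiB, 5 spans: {reduced:.2f} TiB")
```

Under these assumptions, dropping one span costs roughly 1.8 TiB of usable space, on top of the time needed to recreate the array.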

Mentioned in SAL [2016-08-23T16:27:34Z] <jynus> rebooting es2004 for hardware maintenance T143220

Yes, it seems that it may need a reboot + configuration; that is the working hypothesis now.

I've shut it down and scheduled downtime for it for a day. @Papaul, feel free to start it and do anything with it configuration-wise (it is not urgent).

@jcrespo the RAID controller is showing that it sees the disk. What you need to do is put the new disk into the RAID10; see the images below.

papaul@debianwiki: ~_009.png (873×1 px, 94 KB)

papaul@debianwiki: ~_011.png (873×1 px, 106 KB)

Ok, now things are nice: it says Unconfigured(good), Spun Up rather than Failed.

Now rebuilding; I didn't know the drive wouldn't rebuild automatically.

root@es2004:~$ megacli -Pdgetmissing -a0
                                     
    Adapter 0 - Missing Physical drives

    No.   Array   Row   Size Expected
    0     5       0     1907200 MB

Exit Code: 0x00
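As a sanity check on the "Size Expected" value, the following small Python sketch converts the coerced sector count from the PDList output into MiB. It assumes 512-byte sectors and that the controller's "MB" means 2^20 bytes; both are assumptions, and `sectors_to_mib` is a hypothetical helper.

```python
def sectors_to_mib(sectors, sector_bytes=512):
    """Convert a sector count to MiB (sector size is an assumption)."""
    return sectors * sector_bytes // (1024 * 1024)


# Coerced Size: 1.818 TB [0xe8d00000 Sectors]
print(sectors_to_mib(0xE8D00000))  # 1907200
```

The result matches the 1907200 MB expected for the missing drive in Array 5, Row 0, so a disk with the same coerced size should be accepted in that position.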

root@es2004:~$ megacli -PdReplaceMissing -PhysDrv '[32:10]' -Array5 -Row0 -a0
                                     
Adapter: 0: Missing PD at Array 5, Row 0 is replaced.

Exit Code: 0x00

root@es2004:~$ megacli -PDRbld -Start -PhysDrv '[32:10]' -a0
                                     
Started rebuild progress on device(Encl-32 Slot-10)

Exit Code: 0x00

root@es2004:~$ megacli -PDList -aALL | less

Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 5000c500870f7025
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Rebuild
RECOVERY - MegaRAID on es2004 is OK: OK: optimal, 1 logical, 2 physical
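The manual sequence used above (list missing drives, assign the new disk to the missing slot, start the rebuild) can be sketched in Python as follows. This only builds and prints the command lines; it does not run them. The function name and its parameters are hypothetical, and the flags are copied from the commands shown in this task rather than from a full MegaCli reference.

```python
import shlex


def rebuild_commands(enclosure, slot, array, row, adapter=0):
    """Return the three megacli invocations used in this task, as argv lists."""
    drive = f"[{enclosure}:{slot}]"
    return [
        # 1. Show which array/row is missing a physical drive.
        ["megacli", "-Pdgetmissing", f"-a{adapter}"],
        # 2. Assign the new disk to that array/row.
        ["megacli", "-PdReplaceMissing", "-PhysDrv", drive,
         f"-Array{array}", f"-Row{row}", f"-a{adapter}"],
        # 3. Start the rebuild on the new disk.
        ["megacli", "-PDRbld", "-Start", "-PhysDrv", drive, f"-a{adapter}"],
    ]


# Values taken from the es2004 output: enclosure 32, slot 10, array 5, row 0.
for argv in rebuild_commands(32, 10, array=5, row=0):
    print(shlex.join(argv))
```

Printing the commands first, instead of executing them, leaves room to compare the array and row against the `-Pdgetmissing` output before touching the controller.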