
es2004 has a dead disk, but it is not under warranty
Closed, Resolved, Public

Description

es2004 does not provide "live" service, but I asked for it to be kept around (with es2001-3) as an offline backup for ES systems until backup storage is expanded next fiscal year.

Currently, the data it keeps is duplicated on es2001-es2003. Data on es2004 can be erased.

Recently one of its disks failed, but it is in a RAID10 HW configuration.

  • I think these systems have a special disk size/type; confirm there are no spares (if there are, just replace the failed disk with one of them, but let's not buy new ones for what are essentially end-of-life servers!). A quick check is sketched below the controller output.
  • Retire the broken disk and recycle it; retire another disk (maybe keep it as a spare?), redo the RAID10 with 2 fewer disks
Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 50014ee6ac5872ac
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: 1D02
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500065b36789abe8
Connected Port Number: 0(path0) 
Inquiry Data: ATA     WDC WD2003FYYS-11D02     WD-WMAY03516757
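
For the first checklist item, one quick way to confirm which slot failed and whether any hot spare is configured is to filter the controller's physical-drive list. A minimal sketch, assuming the same megacli binary used later in this task and the failed drive at enclosure 32, slot 10 (the -PDInfo option is an assumption, not taken from the pastes in this task):

# One line per attribute of interest; a configured hot spare would
# show up with "Firmware state: Hotspare".
megacli -PDList -aALL | grep -E 'Slot Number|Firmware state|Inquiry Data'

# Details for the failed drive only (enclosure 32, slot 10).
megacli -PDInfo -PhysDrv '[32:10]' -a0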

Event Timeline

jcrespo created this task. Aug 17 2016, 3:44 PM
Restricted Application added a project: Operations. Aug 17 2016, 3:44 PM
Restricted Application added subscribers: Southparkfan, Aklapper.
jcrespo triaged this task as Low priority. Aug 17 2016, 3:46 PM
Papaul added a subscriber: Papaul. Aug 22 2016, 3:01 PM

@jcrespo I have 10x 2TB 7.2k disks on site as spares that I can use to replace the faulty disk.

Papaul assigned this task to jcrespo. Aug 22 2016, 3:05 PM

Disk replacement complete.

@Papaul, if you have 10 and do not mind using 1 for this (not a priority server), just replacing the disk will be faster than rebuilding the RAID! So proceed if you are ok with it.

Right now, es2004 still shows:

CRITICAL: 1 failed LD(s) (Degraded)
Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 50014ee6ac5872ac
Sequence Number: 3
Media Error Count: 0
Other Error Count: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: 1D02
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500065b36789abe8
Connected Port Number: 0(path0) 
Inquiry Data:             Z4Z3LLB1ST2000DM001-1ER164                      CC26    
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature : N/A
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

@Papaul It seems that the disk you added was already non-working.

@jcrespo the disk is a brand new disk that was in an anti-static plastic bag, never used.

I believe you, I am just copying and pasting:
"Firmware state: Failed"

Either you changed the wrong disk (I *do not* believe that; the serial number seems different, so you replaced the right one), or its lifespan was minutes. Let me see if it lived for at least some minutes.

There is a more likely possibility: the controller has a problem with that particular port. The controller did not fail in the usual way before; some of its information ended up in a request timeout.
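
One way to check whether the new drive ever came up, and when the controller flagged it as failed, is the adapter event log. A sketch, where /tmp/es2004-events.log is just an illustrative output path:

# Dump the full controller event log to a file, then look at the
# state changes around the time the disk was swapped.
megacli -AdpEventLog -GetEvents -f /tmp/es2004-events.log -a0
grep -i -B2 -A4 'state change' /tmp/es2004-events.log | less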

Volans added a subscriber: Volans. Aug 23 2016, 4:10 PM

@jcrespo true, but now it returns immediately, so maybe it was just not recognized?
Maybe you could try to unplug it and plug it back in.

 es2004  0 ~$ time sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 5 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 10
			Drive's position: DiskGroup: 0, Span: 5, Arm: 0
			Media Error Count: 0
			Other Error Count: =====> 1 <=====
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: N/A

=== RaidStatus completed

real	0m0.091s
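
If the drive is going to be physically reseated as suggested above, the controller can be asked to spin it down first. A sketch, assuming megacli's prepare-for-removal option and the same enclosure/slot:

# Prepare enclosure 32, slot 10 for removal (spins the drive down);
# it can be undone with -PdPrpRmv -Undo if the drive stays in place.
megacli -PdPrpRmv -PhysDrv '[32:10]' -a0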

Mentioned in SAL [2016-08-23T16:27:34Z] <jynus> rebooting es2004 for hardware maintenance T143220

Yes, it seems that it may need a reboot + configuration; that is the working theory now.

I've shut it down and downtimed it for a day. @Papaul, feel free to start it and do anything with it configuration-wise (it is not urgent).

@jcrespo the RAID controller is now seeing the disk; what you need to do is put the new disk into the RAID10, see image below.

Ok, now things are nice: it says Unconfigured(good), Spun Up rather than Failed.

Now rebuilding; I didn't know the drive wouldn't rebuild automatically.
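
For context on why the rebuild did not start on its own: the controller only rebuilds automatically onto a drive it knows as a hot spare; an unconfigured replacement has to be added back by hand, as in the commands below. A sketch of the hot-spare alternative, assuming the same enclosure/slot:

# Marking the new drive as a global hot spare would have let the
# controller pick it up and start the rebuild by itself.
megacli -PDHSP -Set -PhysDrv '[32:10]' -a0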

root@es2004:~$ megacli -Pdgetmissing -a0
                                     
    Adapter 0 - Missing Physical drives

    No.   Array   Row   Size Expected
    0     5       0     1907200 MB

Exit Code: 0x00

root@es2004:~$ megacli -PdReplaceMissing -PhysDrv '[32:10]' -Array5 -Row0 -a0
                                     
Adapter: 0: Missing PD at Array 5, Row 0 is replaced.

Exit Code: 0x00

root@es2004:~$ megacli -PDRbld -Start -PhysDrv '[32:10]' -a0
                                     
Started rebuild progress on device(Encl-32 Slot-10)

Exit Code: 0x00

root@es2004:~$ megacli -PDList -aALL | less

Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 5000c500870f7025
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Rebuild
RECOVERY - MegaRAID on es2004 is OK: OK: optimal, 1 logical, 2 physical
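
For reference, the rebuild can be watched until the logical drive goes back to Optimal (as the recovery notification above confirms it did). A sketch using the same device addressing:

# Rebuild progress for the re-added drive.
megacli -PDRbld -ShowProg -PhysDrv '[32:10]' -a0

# Overall state of the logical drive; it should end up "Optimal".
megacli -LDInfo -L0 -a0 | grep -i state
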
jcrespo closed this task as Resolved. Aug 24 2016, 12:17 PM