backup1001 failed disk (degraded RAID)
Closed, ResolvedPublic

Assigned To

Authored By

	jcrespo
	Sep 13 2019, 6:15 PM

Description

Device not healthy -SMART-
device=megaraid,10 instance=backup1001:9100 job=node site=eqiad

Enclosure Device ID: 66
Slot Number: 0
Enclosure position: 1
Device Id: 10
WWN: 5000039898180FEC
Sequence Number: 2
Media Error Count: 0
Other Error Count: 172
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 0 KB [0x0 Sectors]
Non Coerced Size: 0 KB [0x0 Sectors]
Coerced Size: 0 KB [0x0 Sectors]
Sector Size:  0
Firmware state: Unconfigured(bad)
Device Firmware Level: DR07
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000039898180fee
SAS Address(1): 0x5000039898180fef
Connected Port Number: 1(path0) 0(path1) 
Inquiry Data: TOSHIBA MG04SCA60EE     DR074870A0RMFEGC
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device
Drive:  Not Supported
Drive Temperature : N/A
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: Unknown 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : No
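
For reference, a record like the one above can be picked out of the full `megacli -PDList -aALL` listing with a small filter. This is only a sketch: the function name `unhealthy_drives` is hypothetical, and it assumes that healthy array members report a firmware state beginning with `Online` (hot spares or unconfigured-good drives would also be printed and need to be judged by eye).

```shell
# Hypothetical helper (not part of megacli): reads `megacli -PDList -aALL`
# output on stdin and prints one summary line for every drive whose
# "Firmware state" does not start with "Online".
# Assumption: "Online..." is the state reported by healthy array members.
unhealthy_drives() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc  = $2 }   # remember the current enclosure
        /^Slot Number/         { slot = $2 }   # remember the current slot
        /^Device Id/           { id   = $2 }   # remember the current device id
        /^Firmware state/ && $2 !~ /^Online/ {
            printf "enclosure=%s slot=%s device=%s state=%s\n", enc, slot, id, $2
        }
    '
}
```

Usage would be `sudo megacli -PDList -aALL | unhealthy_drives`; for the drive in this description it would print the enclosure, slot and device id together with `Unconfigured(bad)`.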

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		jcrespo	T229209 Strengthen backup infrastructure and support
		Resolved		Jclark-ctr	T232882 backup1001 failed disk (degraded RAID)

Event Timeline

jcrespo created this task. Sep 13 2019, 6:15 PM

Restricted Application added a project: SRE. Sep 13 2019, 6:15 PM

jcrespo mentioned this in T227335: backup1001 can't address the disk shelf's drives. Sep 13 2019, 6:15 PM

wiki_willy assigned this task to Jclark-ctr. Sep 13 2019, 6:24 PM

Contacted Dell regarding the failed drive; will update with their response.

jcrespo added a parent task: T229209: Strengthen backup infrastructure and support. Sep 16 2019, 3:07 PM

Thanks @Jclark-ctr, can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple of other things completed so that backup1001 is up and running. Thanks, Willy

Just to be clear, there may be new work coming (RAID setup), but it is not set in stone yet for either the eqiad or the codfw DC. I will create one or two tickets once I have the specific configuration ready for you. I should have a request ready by tomorrow.

@Jclark-ctr @jcrespo SR# 997901435, DPS# 717467224. The replacement is set up to arrive during normal business hours on Wednesday, and the disk will be replaced that day.

@jcrespo The drive arrived early. The failed drive has been replaced.

Jclark-ctr closed this task as Resolved. Sep 17 2019, 8:08 PM

jcrespo awarded a token. Sep 17 2019, 8:35 PM

jcrespo rescinded a token. Sep 18 2019, 8:19 AM

Now, instead of a failed disk, I can only see 23 of 24 disks; one disk in the second enclosure is gone. See:

root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Device Id'
Device Id: 23
Device Id: 17
Device Id: 21
Device Id: 20
Device Id: 12
Device Id: 18
Device Id: 16
Device Id: 19
Device Id: 14
Device Id: 22
Device Id: 11
Device Id: 15
Device Id: 7
Device Id: 8
Device Id: 3
Device Id: 9
Device Id: 13
Device Id: 5
Device Id: 0
Device Id: 2
Device Id: 4
Device Id: 1
Device Id: 6
root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Device Id' | wc -l
23

I believe it is 0:1:0, the first disk of the second enclosure, the same one previously reported as failed (but present). I suggest checking the connection; otherwise this could be an actual enclosure or RAID adapter issue.
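
To name the missing disk rather than only count 23, the `Device Id` lines can be compared against the expected range. This is a minimal sketch: the function name `missing_device_ids` is hypothetical, and it assumes the controller numbers the 24 drives contiguously from 0 to 23, which matches the listing above but is not guaranteed on every controller.

```shell
# Hypothetical helper (not part of megacli): reads `megacli -PDList -aALL`
# output on stdin and prints every Device Id in 0..(expected-1) that is absent.
# Assumption: drive ids are contiguous and start at 0.
missing_device_ids() {
    expected="${1:-24}"                      # number of drives we expect to see
    present=$(grep 'Device Id' | awk '{print $3}' | sort -n)
    i=0
    while [ "$i" -lt "$expected" ]; do
        # -x forces a whole-line match so that "1" does not match "10"
        if ! printf '%s\n' "$present" | grep -qx "$i"; then
            echo "$i"
        fi
        i=$((i + 1))
    done
}
```

Usage would be `sudo megacli -PDList -aALL | missing_device_ids 24`. Applied to the 23 ids listed above, the only absent id is 10, the same Device Id that was reported as failed in the task description.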

jcrespo mentioned this in T229209: Strengthen backup infrastructure and support. Sep 18 2019, 9:20 AM

@jcrespo All disks show a connection light. Can you please re-verify?

I can see 24 now, thanks!

megacli -PDList -aALL | grep 'Device Id' | wc -l
24

jcrespo awarded a token. Sep 19 2019, 7:41 AM

	F30388238: disks2.png
	Sep 18 2019, 8:21 AM

	F30388239: disks1.png
	Sep 18 2019, 8:21 AM

	F30388237: disks3.png
	Sep 18 2019, 8:21 AM
