
backup1001 failed disk (degraded RAID)
Closed, Resolved · Public

Description

Device not healthy -SMART-
device=megaraid,10 instance=backup1001:9100 job=node site=eqiad
Enclosure Device ID: 66
Slot Number: 0
Enclosure position: 1
Device Id: 10
WWN: 5000039898180FEC
Sequence Number: 2
Media Error Count: 0
Other Error Count: 172
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 0 KB [0x0 Sectors]
Non Coerced Size: 0 KB [0x0 Sectors]
Coerced Size: 0 KB [0x0 Sectors]
Sector Size:  0
Firmware state: Unconfigured(bad)
Device Firmware Level: DR07
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000039898180fee
SAS Address(1): 0x5000039898180fef
Connected Port Number: 1(path0) 0(path1) 
Inquiry Data: TOSHIBA MG04SCA60EE     DR074870A0RMFEGC
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device
Drive:  Not Supported
Drive Temperature : N/A
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: Unknown 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : No

Event Timeline

jcrespo created this task. Sep 13 2019, 6:15 PM
Restricted Application added a project: Operations. Sep 13 2019, 6:15 PM

Contacted Dell regarding the failed drive; will update with their response.

Thanks @Jclark-ctr , can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple other things completed to get backup1001 up and running. Thanks, Willy

Just to be clear, there may be new work coming (RAID setup), but it is not set in stone yet for either the eqiad or codfw DCs. I will create one or two tickets when I have the specific configuration ready for you. I should have a request ready by tomorrow.

@Jclark-ctr @jcrespo SR# 997901435, DPS# 717467224. The replacement is set up to arrive during normal business hours on Wednesday, and the disk will be replaced that day.

@jcrespo The drive arrived early. Replaced the failed drive.

Jclark-ctr closed this task as Resolved. Tue, Sep 17, 8:08 PM
jcrespo rescinded a token. Wed, Sep 18, 8:19 AM
jcrespo reopened this task as Open. Wed, Sep 18, 8:21 AM (edited)

Now, instead of a failed disk, I can only see 23 of 24 disks: one disk of the second enclosure is gone. See:

root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Device Id'
Device Id: 23
Device Id: 17
Device Id: 21
Device Id: 20
Device Id: 12
Device Id: 18
Device Id: 16
Device Id: 19
Device Id: 14
Device Id: 22
Device Id: 11
Device Id: 15
Device Id: 7
Device Id: 8
Device Id: 3
Device Id: 9
Device Id: 13
Device Id: 5
Device Id: 0
Device Id: 2
Device Id: 4
Device Id: 1
Device Id: 6
root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Device Id' | wc -l
23

I believe it to be 0:1:0, the first disk of the second enclosure, the same one previously reported as failed (but present). I suggest checking the connection; otherwise this could be an actual enclosure or RAID adapter issue.
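
For reference, the missing disk can be picked out of the listing above mechanically. This is only a rough sketch: it assumes the controller numbers the 24 disks contiguously as Device Id 0 to 23, which may not hold on every setup, and the helper name `missing_device_ids` is made up for illustration. It takes the text that `megacli -PDList -aALL` prints and reports which expected ids do not appear.

```python
import re
from typing import List


def missing_device_ids(pdlist_output: str, expected_count: int = 24) -> List[int]:
    """Return the Device Ids in range(expected_count) that are absent.

    Assumption (not confirmed by the controller docs): ids are
    contiguous, starting at 0.
    """
    # Collect every "Device Id: N" line from the megacli output.
    seen = {
        int(value)
        for value in re.findall(
            r"^Device Id:[ \t]*(\d+)[ \t]*$", pdlist_output, flags=re.MULTILINE
        )
    }
    return sorted(set(range(expected_count)) - seen)


# The 23 Device Id lines pasted above, in the order megacli printed them.
listed = [23, 17, 21, 20, 12, 18, 16, 19, 14, 22, 11, 15,
          7, 8, 3, 9, 13, 5, 0, 2, 4, 1, 6]
sample = "\n".join(f"Device Id: {i}" for i in listed)

print(missing_device_ids(sample))  # [10]
```

Under that assumption the only absent id is 10, which matches the `device=megaraid,10` / `Device Id: 10` disk from the original alert.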

@jcrespo All disks show a connection light. Can you please reverify?

jcrespo closed this task as Resolved. Thu, Sep 19, 7:41 AM

I can now see 24, thanks!

megacli -PDList -aALL | grep 'Device Id' | wc -l
24