
db1069 bad disk
Closed, Resolved · Public

Description

Hello,

Can we get a replacement for disk #0 on db1069?
It has SMART errors.

db1069 is the x1 master.

root@db1069:~# megacli -PDList -aALL

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C50071ABF4E0
Sequence Number: 2
Media Error Count: 57
Other Error Count: 0
Predictive Failure Count: 2
Last Predictive Failure Event Seq Number: 2766
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50071abf4e1
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL831BP
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :38C (100.40 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes

Let's coordinate, so we can put the disk OFFLINE manually before you replace it.
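
For anyone triaging similar output, here is a minimal sketch of picking out the drives worth replacing from `megacli -PDList` text. It assumes the `Key: Value` layout shown above. The function names and the second, healthy drive in the sample are hypothetical and only there for illustration; the sketch treats a drive as suspect when it reports a S.M.A.R.T alert or a non-zero Predictive Failure Count.

```python
def parse_pdlist(text):
    """Split megacli -PDList text into one dict of fields per physical drive."""
    drives = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank lines and headers such as "Adapter #0"
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # Each drive record starts with this field in the output above.
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def suspect_drives(drives):
    """Return (enclosure, slot, media_errors) for drives that look unhealthy."""
    suspects = []
    for drive in drives:
        smart = drive.get("Drive has flagged a S.M.A.R.T alert", "No") == "Yes"
        predictive = int(drive.get("Predictive Failure Count", "0")) > 0
        if smart or predictive:
            suspects.append((
                drive["Enclosure Device ID"],
                drive.get("Slot Number"),
                drive.get("Media Error Count"),
            ))
    return suspects


# First record is abridged from this task; the slot 1 record is a
# hypothetical healthy drive added for contrast.
SAMPLE = """\
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 57
Predictive Failure Count: 2
Drive has flagged a S.M.A.R.T alert : Yes

Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 0
Predictive Failure Count: 0
Drive has flagged a S.M.A.R.T alert : No
"""

print(suspect_drives(parse_pdlist(SAMPLE)))  # [('32', '0', '57')]
```

The enclosure and slot pair it returns is the same `32:0` address used when setting the disk offline.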

Event Timeline

Restricted Application added a project: Operations. Jul 8 2018, 1:35 PM
Marostegui triaged this task as Medium priority. Jul 8 2018, 1:35 PM
Marostegui moved this task from Triage to In progress on the DBA board. Jul 8 2018, 3:06 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-09T16:09:21Z] <marostegui> Set disk 32:0 offline on db1069 for a replacement - T199056

@Cmjohnson you can now proceed, I have set the disk offline:

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C50071ABF4E0
Sequence Number: 3
Media Error Count: 57
Other Error Count: 0
Predictive Failure Count: 4
Last Predictive Failure Event Seq Number: 2768
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Offline

Disk swapped by Chris:

root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0

Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 28% in 16 Minutes.
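
As a rough aid while waiting, the progress line above can be turned into an estimate of the time left. This is only a sketch: it assumes the rebuild proceeds at a constant rate, which the controller does not guarantee, and the function name is hypothetical.

```python
import re

# Matches the "Completed N% in M Minutes" part of the megacli progress line.
PROGRESS_RE = re.compile(r"Completed (\d+)% in (\d+) Minutes")


def estimate_remaining_minutes(line):
    """Linearly extrapolate the minutes left; None if it cannot be estimated."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    percent, elapsed = int(match.group(1)), int(match.group(2))
    if percent == 0:
        return None  # no progress yet, nothing to extrapolate from
    total = elapsed * 100 / percent
    return total - elapsed


line = ("Rebuild Progress on Device at Enclosure 32, Slot 0 "
        "Completed 28% in 16 Minutes.")
print(round(estimate_remaining_minutes(line)))  # 41
```

At 28% after 16 minutes, the whole rebuild would take about 57 minutes under that assumption, leaving roughly 41.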
Marostegui added a subscriber: ops-monitoring-bot.

Marostegui closed this task as Resolved. Jul 10 2018, 4:41 AM

All good!

root@db1069:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Marostegui reopened this task as Open. Jul 10 2018, 4:44 AM

Actually, this replacement disk has SMART errors too.
Was this a re-used disk or a new one, @Cmjohnson?

PD: 0 Information
Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C5004797A870
Sequence Number: 13
Media Error Count: 1
Other Error Count: 3
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 2895
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c5004797a871
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES646SL2Y71X
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :38C (100.40 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jul 10 2018, 2:27 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-10T14:31:12Z] <marostegui> Set disk #0 offline for replacement - T199056

Marostegui added a comment. Jul 10 2018, 2:31 PM

@Cmjohnson disk #0 is now offline, feel free to replace it when you can.

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C5004797A870
Sequence Number: 14
Media Error Count: 1
Other Error Count: 3
Predictive Failure Count: 2
Last Predictive Failure Event Seq Number: 2897
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Offline

Disk replaced by Chris, let's see if this time it turns out fine!

root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0

Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 1% in 1 Minutes.
Marostegui closed this task as Resolved. Jul 10 2018, 4:35 PM

All good this time

root@db1069:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C500479122EC
Sequence Number: 24
Media Error Count: 6
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500479122ed
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES646SL2V019
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :39C (102.20 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
Marostegui reopened this task as Open. Jul 16 2018, 7:43 AM

This has happened again in the same slot, disk #0. Can we get another one?
Please ping me before replacing it so I can manually put it offline.

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C500479122EC
Sequence Number: 24
Media Error Count: 1097
Other Error Count: 0
Predictive Failure Count: 3
Last Predictive Failure Event Seq Number: 5211
PD Type: SAS

Drive has flagged a S.M.A.R.T alert : Yes

Hey @Cmjohnson, can we try to get this disk swapped soon? It is x1's primary master.

Just talked to Chris: as this disk is in predictive failure but has not failed yet, we are going to wait for the new disks to arrive rather than try again with used ones.

Marostegui closed this task as Resolved. Jul 24 2018, 7:49 PM

The disk got replaced and this is all good now: T200287#4448846