
db1062 (s7 db primary master) disk with predictive failure
Closed, Declined · Public

Description

Looks like the s7 primary db master has disk #0 reporting a predictive failure:

PROBLEM - Device not healthy -SMART- on db1062 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1062:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1062&var-datasource=eqiad+prometheus/ops

It is disk #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C500712219A8
Sequence Number: 2
Media Error Count: 37
Other Error Count: 0
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 4886
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500712219a9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL7TCQE
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :35C (95.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes
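The dump above looks like MegaCli physical-drive output. As a minimal sketch of how one might flag a drive from such a dump, the check below parses the pasted counters rather than querying live hardware; the MegaCli binary path and `[enclosure:slot]` addressing in the comment are assumptions typical of LSI tooling, not confirmed for this host:

```shell
#!/bin/sh
# Sketch: flag a drive when its predictive failure count is non-zero.
# A live query would look something like (path/flags are assumptions):
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv '[32:0]' -aALL
# Here we parse counter lines copied from this task instead.

dump='Media Error Count: 37
Other Error Count: 0
Predictive Failure Count: 1
Firmware state: Online, Spun Up'

# Pull the predictive failure count out of the "Key: Value" lines.
pfc=$(printf '%s\n' "$dump" | awk -F': ' '/^Predictive Failure Count/ {print $2}')

if [ "$pfc" -gt 0 ]; then
  echo "WARN: predictive failure count $pfc"
fi
```

The same `awk` pattern extracts any of the other counters (e.g. `Media Error Count`) for threshold checks in monitoring.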

We should probably replace it.

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

The server is out of warranty and we would need to order more 600GB disks.

I would suggest taking one from a less important service and using it here; I will check with @Marostegui where to take it from.

The failure is predictive, so the disk should hold for some time. I suggest we wait for the db1068 switch (T224852) and, once that is resolved, use one of its good disks here. It is not worth ordering more disks for a host we will soon fully decommission. What do you think of that plan, @Marostegui: stall until then? We should also stay vigilant about increases in write latency; see https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Disks_about_to_fail
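Per the troubleshooting link, the signal to watch is average write latency. node_exporter (the `job=node` scraper in the alert) exposes the raw counters; a back-of-envelope check from two scrapes might look like this. The counter values below are made up for illustration, and the metric names assumed are node_exporter's `node_disk_write_time_seconds_total` and `node_disk_writes_completed_total`:

```shell
#!/bin/sh
# Sketch: average write latency between two node_exporter scrapes.
# latency = delta(seconds spent writing) / delta(writes completed)

t0_time=120.0;  t0_writes=4000   # first scrape: seconds busy, writes done
t1_time=126.0;  t1_writes=5000   # second scrape, some interval later

# Per-write latency over the interval, in milliseconds.
lat_ms=$(awk -v a="$t0_time" -v b="$t1_time" -v w0="$t0_writes" -v w1="$t1_writes" \
  'BEGIN { printf "%.1f", (b - a) / (w1 - w0) * 1000 }')
echo "avg write latency: ${lat_ms} ms"
```

A sustained upward trend in this number on the Grafana host-overview dashboard would be the cue to stop stalling and swap the disk.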

Yeah, let's use a used disk to replace this one.

And we can schedule the s7 failover after s4; the new server is ready in s7 as well.
I scheduled s4 first because its memory problems have happened more often.

I think we should wait until the disk has fully failed.

jcrespo changed the task status from Open to Stalled.Jun 6 2019, 5:01 PM
jcrespo lowered the priority of this task from High to Medium.

Declining this for now, since the server is out of warranty and the disk has not fully failed.