
db2018 failed disk (degraded RAID)
Closed, ResolvedPublic

Description

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 1 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 1 
  Failed Disks    : 1
Enclosure Device ID: 32
Slot Number: 3
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: N/A
Device Id: 3
WWN: 5000C50076A5A818
Sequence Number: 3
Media Error Count: 12874
Other Error Count: 13
Predictive Failure Count: 55
Last Predictive Failure Event Seq Number: 30498
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50076a5a819
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST3600057SS     ES666SL8SK0L            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :44C (111.20 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : Yes
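
For reference, output like the above is typically gathered with MegaCli's adapter and physical-drive listing commands. A minimal sketch, assuming the megacli wrapper is in the PATH (the grep filters are illustrative):

# Adapter summary, including the "Device Present" counters at the top of this task
sudo megacli -AdpAllInfo -aALL | grep -A 8 'Device Present'

# Per-drive details: firmware state, media errors, predictive failure counters
sudo megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Media Error|Predictive Failure Count|Firmware state'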

Event Timeline

The system is out of warranty; I will open a task for procurement of a new disk.

Papaul triaged this task as Medium priority.Mar 3 2016, 2:57 PM
Papaul subscribed.

@jcrespo I was about to open a ticket to order a replacement drive for this system, so I did a count of all the db servers showing failed drives. There are a total of 9 boxes with failed drives, with 12 bad drives between them plus the 1 bad drive in db2018:

db2004 slot 8
db2007 slot 0
db2010 slot 0 and slot 8
db2011 slot 7 and slot 11
db2013 slot 7
db2017 slot 11
db2021 slot 6 and slot 7
db2023 slot 9
db2024 slot 4

Please confirm whether we also need to replace the drives on those boxes. It would make more sense to order all those drives at once than to open a ticket just to order one drive for db2018.

Thanks.
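
For reference, a quick way to re-check which of those boxes actually report failed or predicted-failure drives is to query the controller on each host. A minimal sketch, assuming SSH access and the megacli binary on every box (the host list is copied from above and is illustrative):

# List slot numbers and drive states per host; look for "Failed" or "Predictive"
for h in db2004 db2007 db2010 db2011 db2013 db2017 db2018 db2021 db2023 db2024; do
  echo "== ${h}.codfw.wmnet =="
  ssh "${h}.codfw.wmnet" "sudo megacli -PDList -aALL | grep -E 'Slot Number|Firmware state|Predictive Failure Count'"
done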

These are my comments after rechecking both the disks and the server roles:

db2004 slot 8 - confirmed, but unused? Maybe pending decommission? I need to confirm it
db2007 slot 0 - confirmed, but unused? Maybe pending decommission? I need to confirm it
db2010 slot 0 and slot 8 - confirmed, m1 slave
db2011 slot 7 and slot 11 - confirmed, m2 slave
db2013 slot 7 - I do not have access, and it is not part of the regular mysql core production; ask someone else (fundraising? decommissioned?)
db2018 slot 3 - confirmed, s3 master, important
db2017 slot 11 - confirmed, s2 master, important
db2021 slot 6 and slot 7 - I do not have access, and it is not part of the regular mysql core production; ask someone else (fundraising? decommissioned?)
db2023 slot 9 - confirmed, s5 master, important
db2024 slot 4 - I do not have access, and it is not part of the regular mysql core production; ask someone else (fundraising? decommissioned?)

Let me check if db2004 and db2007 are unused. If they are, we could decommission them and reuse their disks.

The following servers respond to salt but do not have any known function:

db2001.codfw.wmnet: 
db2002.codfw.wmnet:
db2003.codfw.wmnet:
db2004.codfw.wmnet:
db2005.codfw.wmnet:
db2007.codfw.wmnet:

I do not know if there was any plan to decommission them, but they may be used for parts.
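
For reference, a list like the one above can be produced with a blanket salt ping against the naming pattern. A minimal sketch, assuming the standard salt CLI run from the salt master:

# From the salt master: which db2* hosts in codfw still respond?
sudo salt 'db2*.codfw.wmnet' test.ping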

@jcrespo thanks, so I will be waiting on @RobH for final confirmation on T125827.

db2017 slot 11 failed completely today.

@Papaul @RobH: any news on this?

In particular for db2017 (failed), db2018 (failed) and db2023 (predicted failure), which are masters in codfw, it would be better to have those replaced and the RAID rebuilt before the codfw switchover.

mark raised the priority of this task from Medium to Unbreak Now!.
mark subscribed.

@RobH: please buy 4 appropriate disks today, fastest delivery. Hereby approved.

RobH mentioned this in Unknown Object (Task).Apr 13 2016, 4:54 PM
RobH added a subtask: Unknown Object (Task).Apr 13 2016, 4:59 PM

Thanks to @Papaul, we replaced the 2 failed disks on db2017 and 1 on db2018.
I'm keeping the last spare disk as a spare for now in case another DB disk breaks in the coming days, since db2023 is only in predictive failure. I'll check with @jcrespo that this is OK with him.

volans@db2017:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:02] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 0% in 2 Minutes.

Exit Code: 0x00
volans@db2017:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 1% in 6 Minutes.

Exit Code: 0x00

volans@db2018:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:03] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 0% in 0 Minutes.

Exit Code: 0x00

Drive replacement on db2017 slot 11 and slot 2
Drive replacement on db2018 slot 3

The rebuild is halfway through. I've disabled notifications for the RAID check on db2017 and db2018 to avoid being woken up in the middle of the night by the recovery. I'll re-enable them tomorrow after the rebuild is complete.
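
The notification silencing can be done from the Icinga web UI or via the Icinga external command file. A minimal sketch of the command-file approach, where the command-file path and the service description ("MegaRAID") are assumptions and need to match the local Icinga setup:

# Hypothetical command-file path and service description; adjust to the local setup
CMDFILE=/var/lib/icinga/rw/icinga.cmd
for h in db2017 db2018; do
  printf '[%s] DISABLE_SVC_NOTIFICATIONS;%s;MegaRAID\n' "$(date +%s)" "$h" > "$CMDFILE"
done
# After the rebuild completes, send ENABLE_SVC_NOTIFICATIONS the same way to re-enable them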

volans@db2017:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 53% in 206 Minutes.

Exit Code: 0x00
volans@db2017:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:02] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 52% in 202 Minutes.

Exit Code: 0x00

volans@db2018:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:03] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 53% in 198 Minutes.

Exit Code: 0x00

db2017 and db2018 RAID is back to Optimal; re-enabled notifications in Icinga for the RAID checks.
Leaving the task open for the remaining hosts.
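
To confirm the arrays are back to Optimal, the virtual-drive state can be checked on the controller. A minimal sketch, to be run on db2017 and db2018:

# Expect "State : Optimal" for each virtual drive
sudo megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|^State'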

RobH removed RobH as the assignee of this task.Apr 20 2016, 7:51 PM

I don't think this should be assigned to me, since I already ordered disks...

Volans lowered the priority of this task from Unbreak Now! to Medium.Apr 20 2016, 7:52 PM

We are going to have lots of spares with T129452. I would retire es2005-es2010 and use their disks as spares, and reuse es2001-4 for ES disaster recovery and archival.

jcrespo claimed this task.
RobH closed subtask Unknown Object (Task) as Resolved.Oct 12 2016, 5:48 PM