db2011 disk media errors
Closed, Resolved, Public

Description

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 3 
  Failed Disks    : 0
megacli -PDList -aALL | egrep '(rror|Firm|S\.M\.)'
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 24
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 65
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : Yes
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 222
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : Yes
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 2085
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : Yes
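
The filtered output above repeats the same five lines per disk, so it can be grouped mechanically. This is a minimal sketch, assuming each disk's group starts with its "Media Error Count" line; the function name and the sample (three of the groups above) are illustrative, and the position in the output is not necessarily the slot number.

```python
def parse_pd_errors(text):
    """Group filtered PDList lines into one dict per physical disk."""
    disks = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Media Error Count":
            # Assumption: every disk's group begins with this line.
            disks.append({"media_errors": int(value), "smart_alert": False})
        elif key.startswith("Drive has flagged a S.M.A.R.T alert") and disks:
            disks[-1]["smart_alert"] = (value == "Yes")
    return disks


sample = """\
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 24
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 65
Other Error Count: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Drive has flagged a S.M.A.R.T alert : Yes
"""

disks = parse_pd_errors(sample)
# Positions (in output order) of disks that report any media errors.
with_errors = [i for i, d in enumerate(disks) if d["media_errors"] > 0]
# Positions of disks that also raised a S.M.A.R.T alert.
with_alert = [i for i, d in enumerate(disks) if d["smart_alert"]]
print(with_errors, with_alert)
```

Separating "has media errors" from "has flagged a S.M.A.R.T alert" mirrors the distinction made in the thread between disks to replace now and disks that can wait.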

@Papaul From the output I would replace disks #4, #7 and #11, which should be the ones with the light on.

Disk #1 has some media errors, but I suppose we can live with it for now.

RobH changed the task status from Open to Stalled. Oct 25 2016, 9:00 PM
RobH subscribed.

So these are 300GB SEAGATE ST3300657SS 3.5" 15K SAS disks, and we don't keep any of these as spares. (We've moved on to SSDs in new databases.)

@Papaul is checking the 600GB SAS disks we have (from decommissioned es systems) to see if they have the same (or greater) speed/sector count. If so, they could be used as replacements.

If they cannot, I'll create a sub-task in procurement S4 for the pricing discussion on this. The main factor will be the price of replacement versus the cost of system loss until the projected replacements arrive.

@RobH I think we are okay to use the disks from the decommissioned es servers. Please see below for the disk information:

Dell
ST3600057SS
3.5"
SAS
15K

I agree: the 15K speed and the larger size typically mean they can replace smaller-capacity disks without issues. Since they are larger, they'll likely be re-added to the RAID array and use only 300GB of the disk rather than the entire 600GB.

If @jcrespo is cool with trying this out, it would let us fix this out-of-warranty system without any purchases.

I am cool with this; it worked the last time we tried.

jcrespo changed the task status from Stalled to Open. Oct 26 2016, 5:03 PM
jcrespo claimed this task.

The disks are unconfigured; they still need to be added to the RAID:

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50088bb22a5
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST3600057SS     ES666SLA78YG            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: Foreign 
Foreign Secure: Drive is not secured by a foreign lock key
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :37C (98.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : No
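
The three size lines above can be cross-checked from their hexadecimal sector counts. This is a worked sketch under two assumptions that are not stated in the output: 512-byte sectors (the output itself prints "Sector Size: 0"), and truncation to whole MiB before dividing by 1024. The truncation step is an inference that happens to reproduce the printed figures, not documented megacli behaviour.

```python
SECTOR_BYTES = 512  # assumption: the output above reports "Sector Size: 0"


def sectors_to_gb(hex_sectors):
    """Convert a hex sector count to GB (binary units), via truncated MiB."""
    sectors = int(hex_sectors, 16)
    mib = sectors * SECTOR_BYTES // (1024 * 1024)  # whole MiB, truncated
    return mib / 1024


# Sector counts copied from the PDList output above.
sizes = {
    "Raw Size": "0x45dd2fb0",
    "Non Coerced Size": "0x45cd2fb0",
    "Coerced Size": "0x45cc0000",
}

for label, hex_sectors in sizes.items():
    print(f"{label}: {sectors_to_gb(hex_sectors):.3f} GB")
```

The coerced size is the one that matters when comparing a replacement against the disk it replaces, since that is the capacity the controller reports after rounding down.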
jcrespo triaged this task as Medium priority. Oct 26 2016, 5:06 PM

I would like another pair of eyes here because, if this goes wrong, we might need to rebuild the whole server :-(

There are currently 3 new disks that were not included in the RAID because they still carry a foreign configuration, which needs to be cleared.

root@db2011:~# megacli -CfgForeign -Scan -aALL

There are 1 foreign configuration(s) on controller 0.

root@db2011:~# megacli -PDList -aall | grep -B16 -A15 Unco | egrep "Enclosure Device ID|Device Id|Firmware state|Foreign State"
Enclosure Device ID: 32
Device Id: 4
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
Enclosure Device ID: 32
Device Id: 7
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
Enclosure Device ID: 32
Device Id: 11
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign

The disks are:

Enclosure Device ID: 32
Slot 4
Device id 4

Enclosure Device ID: 32
Slot 7
Device id 7

Enclosure Device ID: 32
Slot 11
Device id 11

They all report Foreign State: Foreign.
They are all in Unconfigured(good) state, so we only need to remove the foreign configuration.
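
The precondition described here (every new disk is Unconfigured(good) and still marked Foreign) can be checked from the filtered output before clearing anything. A minimal sketch, assuming each record starts with its "Enclosure Device ID" line as in the grep output above; the function names are illustrative.

```python
def parse_records(text):
    """Split 'key: value' lines into one dict per physical disk."""
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # Assumption: this line opens each disk's record.
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def only_foreign_clear_needed(records):
    """True if every disk is Unconfigured(good) and still marked Foreign."""
    return bool(records) and all(
        r.get("Firmware state", "").startswith("Unconfigured(good)")
        and r.get("Foreign State") == "Foreign"
        for r in records
    )


sample = """\
Enclosure Device ID: 32
Device Id: 4
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
Enclosure Device ID: 32
Device Id: 7
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
Enclosure Device ID: 32
Device Id: 11
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
"""

records = parse_records(sample)
print([r["Device Id"] for r in records], only_foreign_clear_needed(records))
```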

This is what I would run; it should only clear the foreign configuration and do nothing else to the RAID:

megacli -CfgForeign -Clear -aALL

The SPANs with missing disk information are:

SPAN 2
Physical Disk: 0

SPAN 3
Physical Disk: 1

SPAN 5
Physical Disk: 1

So I believe we would need to insert the disks in the following order (only one at a time):

32:4 -> array 2
32:7 -> array 3
32:11 -> array 5

I would run this (again, one at a time):

megacli -PdReplaceMissing -PhysDrv[32:4] -array2 -row0 -a0

megacli -PDRbld -Start -PhysDrv[32:4] -a0

megacli -PdReplaceMissing -PhysDrv[32:7] -array3 -row1 -a0

megacli -PDRbld -Start -PhysDrv[32:7] -a0

megacli -PdReplaceMissing -PhysDrv[32:11] -array5 -row1 -a0

megacli -PDRbld -Start -PhysDrv[32:11] -a0
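
The plan above is the same pair of commands repeated for three drive/array/row combinations, so it can be written as data and expanded. This sketch only builds and prints the command strings from the mapping given in this ticket; it does not run megacli, and the names PLAN and commands_for are illustrative.

```python
# (enclosure:slot, array, row), taken from the plan above.
PLAN = [
    ("32:4", 2, 0),
    ("32:7", 3, 1),
    ("32:11", 5, 1),
]


def commands_for(drive, array, row, adapter=0):
    """Return the replace-missing and rebuild-start commands for one drive."""
    return [
        f"megacli -PdReplaceMissing -PhysDrv[{drive}] -array{array} -row{row} -a{adapter}",
        f"megacli -PDRbld -Start -PhysDrv[{drive}] -a{adapter}",
    ]


for drive, array, row in PLAN:
    for cmd in commands_for(drive, array, row):
        print(cmd)
    # One drive at a time: the next pair is only meant to be run
    # after this rebuild has completed.
    print(f"# wait for the rebuild of {drive} to complete before continuing")
```

Keeping the array and row next to each drive makes it harder to pair a drive with the wrong span when the commands are typed out by hand.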

Looks good. Maybe rebuild one at a time, to avoid IO exhaustion?

Yeah, as I said, I would only add (and rebuild) one at a time.

Sorry, I overlooked that and looked only at the commands.

No worries! Better safe than sorry :)

I am going to try to fix db2011 today. This server belongs to the m2 shard.
This is what I am going to do, so that we can roll back if this box happens to fail.

First, I plan to stop MySQL on this host and copy its data to dbstore2001:/srv/tmp. If the fix fails, we will need to rebuild the server, and having that copy will make the rebuild easier and faster.

Mentioned in SAL (#wikimedia-operations) [2016-11-02T07:19:07Z] <marostegui> Stopping MySQL db2011 for maintenance - T149099

The backup finished, and I was able to extract it, so proceeding now.

Clearing the foreign config

root@db2011:~# megacli -CfgForeign -Scan -aALL

There are 1 foreign configuration(s) on controller 0.

Exit Code: 0x00
root@db2011:~# megacli -CfgForeign -Clear -aALL

Foreign configuration 0 is cleared on controller 0.

Exit Code: 0x00
root@db2011:~# megacli -CfgForeign -Scan -aALL

There is no foreign configuration on controller 0.

Exit Code: 0x00
root@db2011:~#

Replacing the first disk and starting to rebuild the RAID

root@db2011:~# megacli -PdReplaceMissing -PhysDrv[32:4] -array2 -row0 -a0

Adapter: 0: Missing PD at Array 2, Row 0 is replaced.

Exit Code: 0x00
root@db2011:~# megacli -PDRbld -Start -PhysDrv[32:4] -a0

Started rebuild progress on device(Encl-32 Slot-4)

Checking the status of the rebuild

root@db2011:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 9% in 2 Minutes.
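
A progress line like the one above can be parsed to get a rough idea of how long the rebuild has left. This is a sketch only: the regular expression is written against the exact wording shown in this ticket, and the estimate assumes the rebuild proceeds at a constant rate, which may not hold on a disk under load.

```python
import re

# Pattern written against the progress line shown in this ticket.
PROGRESS_RE = re.compile(
    r"Enclosure (\d+), Slot (\d+) Completed (\d+)% in (\d+) Minutes"
)


def parse_progress(line):
    """Extract enclosure, slot, percent and elapsed minutes, or None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    enclosure, slot, percent, minutes = map(int, match.groups())
    return {
        "enclosure": enclosure,
        "slot": slot,
        "percent": percent,
        "minutes": minutes,
    }


def estimate_remaining_minutes(percent, minutes):
    """Linear extrapolation; returns None when no progress is reported yet."""
    if percent <= 0:
        return None
    return minutes * (100 - percent) / percent


line = "Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 9% in 2 Minutes."
progress = parse_progress(line)
remaining = estimate_remaining_minutes(progress["percent"], progress["minutes"])
print(progress, round(remaining, 1))
```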

I am going to document all of these steps and the procedure for future cases.

32:4 finished the rebuild correctly
Starting 32:7

root@db2011:~# megacli -PdReplaceMissing -PhysDrv[32:7] -array3 -row1 -a0

Adapter: 0: Missing PD at Array 3, Row 1 is replaced.

Exit Code: 0x00
root@db2011:~# megacli -PDRbld -Start -PhysDrv[32:7] -a0

Started rebuild progress on device(Encl-32 Slot-7)

root@db2011:~# megacli -PDRbld -ShowProg -PhysDrv [32:7] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 7 Completed 5% in 1 Minutes.

Exit Code: 0x00

32:7 finished fine.
Starting 32:11

root@db2011:~# megacli -PdReplaceMissing -PhysDrv[32:11] -array5 -row1 -a0

Adapter: 0: Missing PD at Array 5, Row 1 is replaced.

Exit Code: 0x00
root@db2011:~# megacli -PDRbld -Start -PhysDrv[32:11] -a0

Started rebuild progress on device(Encl-32 Slot-11)

Exit Code: 0x00

root@db2011:~# megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 3% in 1 Minutes.

Exit Code: 0x00

The last disk finished fine and the RAID is now Optimal:

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 2
  Failed Disks    : 0

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
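
The final state above can be verified the same way as the earlier output, by reading the "key : value" lines into a dictionary. A minimal sketch with an illustrative function name and a sample copied from the lines above; it reports the "Critical Disks" counter as printed rather than interpreting it.

```python
def parse_key_values(text):
    """Read 'key : value' lines into a dict; other lines are skipped."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            result[key.strip()] = value.strip()
    return result


sample = """\
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 2
  Failed Disks    : 0
State               : Optimal
"""

info = parse_key_values(sample)
# The array counts as healthy here when the virtual drive is Optimal
# and nothing is degraded, offline or failed.
healthy = (
    info["State"] == "Optimal"
    and all(int(info[k]) == 0 for k in ("Degraded", "Offline", "Failed Disks"))
)
print(healthy, "critical disks:", info["Critical Disks"])
```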

I believe this can be closed. I will link the documentation in this ticket for the record anyway.