Reimage db1060 due to physical disk corruption (was: Degraded RAID on db1060)
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1060. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU

		Span: 2 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 4
			Drive's position: DiskGroup: 0, Span: 2, Arm: 0
			Media Error Count: 15
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Offline <=====
				Media Type: Hard Disk Device
				Drive Temperature: 36C (96.80 F)

		Span: 3 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 3, Arm: 1
			Media Error Count: 5
			Other Error Count: 1
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Offline <=====
				Media Type: Hard Disk Device
				Drive Temperature: 35C (95.00 F)

=== RaidStatus completed
Restricted Application added a subscriber: Aklapper. · Feb 15 2017, 3:00 PM
Marostegui changed the task status from Open to Stalled. · Feb 16 2017, 8:32 AM
Marostegui added a subscriber: Marostegui.

Wait for this to happen before we replace any disks on this task: https://phabricator.wikimedia.org/T158194

We should replace these two disks, as they both had media errors. I would suggest we do them one at a time (see the command sketch after the list):

  • Replace one
  • Wait for the RAID to rebuild
  • Replace the second one.
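
A minimal command sketch of that procedure (illustrative only; the slot numbers are taken from the RAID report above, and the exact invocations may have differed):

megacli -PDList -aALL | grep -E "Slot Number|Firmware state"   # confirm which slots are Offline
megacli -PdLocate -start -PhysDrv [32:4] -a0                   # blink the LED so the right drive gets pulled
# after the physical swap, watch the rebuild before touching the second disk:
megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL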
Marostegui changed the task status from Stalled to Open. · Feb 28 2017, 7:35 AM
Marostegui added a project: DBA.

Replaced the disk in slot 4. Will wait for it to rebuild and then replace slot 7:

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Offline
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
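
For reference, a listing like the one above can be produced with the one-liner used later in this task:

sudo megacli -PDList -aALL | grep "Firmware state"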

Thanks Chris!
It will take a long time, so probably best to replace 7 on Monday :-)

root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL

Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 25% in 69 Minutes.
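
Assuming roughly linear progress, a full rebuild at that rate would take about 69 min / 0.25 ≈ 276 minutes, i.e. around four and a half hours.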

It finished its rebuild, so we can go ahead and replace #7:

root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL

Device(Encl-32 Slot-4) is not in rebuild process

Exit Code: 0x00
PD: 0 Information
Enclosure Device ID: 32
Slot Number: 4
Drive's position: DiskGroup: 0, Span: 2, Arm: 0
Enclosure position: 1
Device Id: 4
WWN: 5000C5005F08D720
Sequence Number: 12
Media Error Count: 0
Other Error Count: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
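
Before physically pulling a drive, a typical MegaCli sequence would be something like the following (a hedged sketch; not necessarily the exact steps run here):

megacli -PDOffline -PhysDrv [32:7] -a0      # force the drive offline, if it is not already
megacli -PDMarkMissing -PhysDrv [32:7] -a0  # mark it as missing in the array
megacli -PDPrpRmv -PhysDrv [32:7] -a0       # spin it down so it is safe to remove
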
Ottomata triaged this task as Normal priority. · Mar 6 2017, 6:47 PM
Ottomata assigned this task to Marostegui.
Ottomata added a subscriber: Ottomata.

Assigning, feel free to reassign.

Slot 7 is just offline for some reason. Changed its status back to online:

cmjohnson@db1060:~$ sudo megacli -PDOnline -PhysDrv [32:7] -a0

EnclId-32 SlotId-7 state changed to OnLine.

Exit Code: 0x00
cmjohnson@db1060:~$ sudo megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

jcrespo renamed this task from Degraded RAID on db1060 to Reimage db1060 due to physical disk corruption (was: Degraded RAID on db1060). · Mar 6 2017, 7:24 PM

Even though the server's data is now corrupted and it needs to be reimaged, the RAID is back in Optimal state:

root@db1060:~# megacli -LDInfo -L0 -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: Yes
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db1060.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703070704_marostegui_2099.log.

Completed auto-reimage of hosts:

['db1060.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2017-03-07T07:39:41Z] <marostegui> Stop MySQL db1067 to clone db1060 from it - T158193

The data transfer between db1067 and db1060 was started around 20 minutes ago.

db1060 has been reimaged and recloned, and it is now catching up with replication (GTID is enabled).
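
A quick way to check how far behind replication it is (a generic sketch; the exact checks used may have differed):

mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"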

For the record, we are seeing the following disk errors (the RAID is fine and the disks are online, though):

#1
Media Error Count: 2

#4
Other Error Count: 5
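
Those counters can be pulled per slot with something like:

megacli -PDList -aALL | grep -E "Slot Number|Media Error Count|Other Error Count"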

Change 341520 had a related patch set uploaded (by marostegui):
[operations/mediawiki-config] db-eqiad.php: Repool db1060 with less weight

https://gerrit.wikimedia.org/r/341520

Change 341520 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Repool db1060 with less weight

https://gerrit.wikimedia.org/r/341520

Mentioned in SAL (#wikimedia-operations) [2017-03-07T12:34:13Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1060 with less weight - T158193 (duration: 00m 40s)

Change 341530 had a related patch set uploaded (by marostegui):
[operations/mediawiki-config] db-eqiad.php: Increase weight db1060

https://gerrit.wikimedia.org/r/341530

Change 341530 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Increase weight db1060

https://gerrit.wikimedia.org/r/341530

Mentioned in SAL (#wikimedia-operations) [2017-03-07T13:58:50Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase db1060 weight - T158193 (duration: 00m 58s)

Change 341757 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config] db-eqiad.php: Restore db1060 normal weight

https://gerrit.wikimedia.org/r/341757

Change 341757 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Restore db1060 normal weight

https://gerrit.wikimedia.org/r/341757

Mentioned in SAL (#wikimedia-operations) [2017-03-08T07:13:29Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1060 original weight - T158193 (duration: 00m 47s)

Marostegui closed this task as Resolved. · Mar 8 2017, 7:14 AM

Server's original weight has been restored.
I will close this ticket, even though there are still some errors on a couple of disks. If we see further issues, we can reopen it, or at least use this ticket for future reference.
Thanks to everyone involved!!