
Degraded RAID on cloudvirt1024
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 9
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 0
			Other Error Count: 847
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 30C (86.00 F)

=== RaidStatus completed
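The get-raid-status-megacli plugin wraps MegaCli output like the snapshot above. As a rough illustration only (this is not the actual plugin; the sample data and file path are made up), the non-optimal components can be pulled out of a saved `-PDList` dump with a short awk pass:

```shell
# Illustrative sketch: flag any drive whose firmware state is not
# "Online, Spun Up" in a saved MegaCli -PDList dump (sample data below).
cat > /tmp/pdlist.sample <<'EOF'
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 9
Firmware state: Rebuild
EOF
awk -F': ' '/^Slot Number/ {slot = $2}
            /^Firmware state/ && $2 != "Online, Spun Up" {print "slot " slot ": " $2}' \
  /tmp/pdlist.sample
```

On the sample above this prints `slot 9: Rebuild`, which is the kind of line the Nagios handler turns into an alert.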

Event Timeline

Bstorm added a subscriber: Bstorm.Jan 4 2020, 2:56 PM

I see

[Sat Jan  4 08:56:39 2020] megaraid_sas 0000:18:00.0: 155794 (631458161s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 4

in dmesg. Checking a few other things quickly, because this system has had RAID controller issues before.

I see that the controller has a warning on it in iDRAC, and it has now actually removed 2 disks from the volume.

Bstorm added a comment.Jan 4 2020, 3:00 PM

This is the behavior that led to T216218 and then T230289

In fact, this is pretty much exactly the same as T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only. Checking that the filesystem is still mounted ok.

Bstorm added a comment.Jan 4 2020, 3:03 PM

It seems like the filesystem is ok, but there are no hot spares at this point, so if it kicks 2 more disks out, it'll cause problems. So far so good on that.
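A quick way to confirm the root filesystem has not flipped read-only (as it did in T230289) is to look at the mount options in /proc/mounts, whose first option is always `rw` or `ro`. A sketch against a sample line (on the host you would read /proc/mounts itself; the device path here is illustrative):

```shell
# The 4th field of a /proc/mounts line carries the mount options; the first
# option is "rw" or "ro". Note that an "errors=remount-ro" fallback option
# elsewhere in the list does NOT mean the filesystem is currently read-only.
echo '/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0' |
  awk '$2 == "/" {split($4, o, ","); print o[1]}'
```

This prints `rw` for a healthy mount and `ro` once the kernel has remounted the filesystem read-only after an I/O error.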

Bstorm added a comment.Jan 4 2020, 3:06 PM

To be clear, I am relating this to T230289 because the controller reports the disks as removed, not failed.

Change 561987 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Depool cloudvirt1024, raid controller issues

https://gerrit.wikimedia.org/r/561987

Change 561987 merged by Andrew Bogott:
[operations/puppet@production] Depool cloudvirt1024, raid controller issues

https://gerrit.wikimedia.org/r/561987

Bstorm added a comment.Jan 4 2020, 3:11 PM

According to the Lifecycle logs in iDRAC, it had trouble communicating with the disks and then marked them removed. Basically the same as before, and again 2 disks on the same day. It makes me very suspicious of this RAID controller.

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:01:31Z] <bstorm_> moving vm puppetmaster-1001 from cloudvirt1024 to cloudvirt1009 due to hardware error T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:04:09Z] <arturo> moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:05:21Z] <bstorm_> moving VM meza-full from cloudvirt1024 to cloudvirt1003 due to hardware error T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:06:02Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:07:39Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:08:42Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:09:30Z] <bstorm_> moving VMs encoding04 and encoding05 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:09:57Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:11:22Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:13:44Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:16:20Z] <bd808> Draining tools-worker-10{05,12,28} due to hardware errors (T241884)

bd808 triaged this task as High priority.Jan 4 2020, 4:18 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:34:54Z] <arturo> icinga downtime cloudvirt1024 for 2 months because hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:47:20Z] <bstorm_> moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:54:07Z] <bstorm_> moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

@aborrero When migrating cyberbot-db-01, the script died with:

total size is 304,445,262,469  speedup is 1.00
wmcs-cold-migrate: INFO: cyberbot-db-01 instance copied. Now updating nova db...
wmcs-cold-migrate: INFO: Needed image is deactivated
wmcs-cold-migrate: INFO: activating image e02770ae-b45f-4776-a852-d9a13217611e
wmcs-cold-migrate: INFO: current status is SHUTOFF; waiting for it to change to ACTIVE
Traceback (most recent call last):
  File "/usr/local/sbin/wmcs-cold-migrate", line 314, in <module>
    instance.migrate(config)
  File "/usr/local/sbin/wmcs-cold-migrate", line 182, in migrate
    self.wait_for_status('ACTIVE')
  File "/usr/local/sbin/wmcs-cold-migrate", line 99, in wait_for_status
    self.refresh_instance()
  File "/usr/local/sbin/wmcs-cold-migrate", line 86, in refresh_instance
    self.instance = self.novaclient.servers.get(self.instance_id)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 762, in get
    return self._get("/servers/%s" % base.getid(server), "server")
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 346, in _get
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 217, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 117, in request
    raise exceptions.from_response(resp, body, url, method)
novaclient.exceptions.BadRequest: Networking client is experiencing an unauthorized exception. (HTTP 400) (Request-ID: req-b48d74e6-8e6b-4988-8d5f-ddb8bd18a0cb)

I verified that the instance is running in the new location and that it doesn't show up in virsh list in the old location (anything else to check, @Andrew?). I manually removed the disk and related files on the old location. I think the image still needs to be deactivated as well.

With that done, this host is evacuated of user VMs and ready for whatever troubleshooting to see what is going on with the hardware.
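For reference, the post-failure verification amounts to confirming the instance is absent from libvirt on the old hypervisor and active in nova on the new one. A dry-run sketch (`RUN=echo` so nothing executes; the instance name is from above, and the exact checks used may have differed):

```shell
# Dry run: drop RUN= to execute for real on the respective hosts.
RUN=echo
$RUN virsh list --all                                         # on cloudvirt1024: instance should be gone
$RUN openstack server show cyberbot-db-01 -f value -c status  # should report ACTIVE on the new host
```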

JHedden added a subscriber: JHedden.

Looks like we're missing drives in slots 2 and 9 on this host.

# megacli -PDList -aALL | egrep 'Slot|Firmware'
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: DL63
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 5
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 6
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 7
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 8
Firmware state: Online, Spun Up
Device Firmware Level: DL61

And slot 4!

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board.

@Jclark-ctr - if no errors pop up after the firmware and BIOS upgrades, maybe it's a bad RAID controller.

Mentioned in SAL (#wikimedia-cloud) [2020-01-23T21:09:32Z] <jeh> cloudvirt1024 set icinga downtime and powering down for hardware maintenance T241884

JHedden closed this task as Resolved.Jan 23 2020, 9:42 PM

Drives 2 and 4 had a foreign configuration. I've cleared the configuration and reassigned them as global hot spares.

The originally failed drive reported in this ticket (slot 9) is currently rebuilding and not showing any errors. Closing this ticket for now, but I suspect we'll be hearing from this drive again.
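For the record, the clear-and-respare sequence uses standard MegaCli verbs. Enclosure 32 and the slot numbers are from the output earlier in this task; `MEGACLI="echo megacli"` below makes this a dry-run sketch rather than a claim about the exact invocations used:

```shell
# Dry run: replace MEGACLI="echo megacli" with the real binary path to execute.
MEGACLI="echo megacli"
$MEGACLI -CfgForeign -Scan -a0               # list any foreign configurations
$MEGACLI -CfgForeign -Clear -a0              # drop them
$MEGACLI -PDHSP -Set -PhysDrv '[32:2]' -a0   # re-add slot 2 as a global hot spare
$MEGACLI -PDHSP -Set -PhysDrv '[32:4]' -a0   # likewise slot 4
```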

JHedden reopened this task as Open.Jan 24 2020, 2:41 PM

Drive 9 reported a lot of errors while rebuilding the RAID array, and now drives 2, 4, and 9 are missing from the RAID set again. I'll leave drive 9 out of the pool and test rebuilding the array with only 2 and 4.

During the next rebuild the RAID array kicked out drive 4. Either we have three bad drives (2, 4, and 9) or the RAID adapter is bad. I'll send the TSR for this host to @Jclark-ctr.

This host has been depooled from production and has no running workloads.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:24:59Z] <jeh> upgrade BIOS firmware on cloudvirt1024 to 2.4.8 T241884

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:39:38Z] <jeh> clearing foreign drive RAID configuration on cloudvirt1024 T241884

Mentioned in SAL (#wikimedia-operations) [2020-02-12T17:27:19Z] <jeh> upgrade RAID firmware on cloudvirt1024 to 25.5.6.0009 T241884

Firmware upgrades I applied:

Dell PERC H730/H730P/H830/FD33xS/FD33xD Mini/Adapter RAID Controllers firmware version 25.5.6.0009
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=g7n2c&oscode=biosa&productcode=poweredge-r440

Dell EMC Server PowerEdge BIOS R440/R540/T440 Version 2.4.8
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=wgm2r&oscode=us004&productcode=poweredge-r440

After the upgrade, drive 2 is still missing, but drives 4 and 9 have remained online and the array was finally able to rebuild successfully.

Mentioned in SAL (#wikimedia-operations) [2020-02-13T22:13:30Z] <jeh> running filesystem tests on cloudvirt1024 T241884

We're still having the same issue after the firmware and BIOS upgrades. We've been trying to identify this issue since last August (T230289#5429400); please open a new ticket with Dell so we can get this resolved.

I'll collect a TSR for this host and send it to @Jclark-ctr. If there's anything else I can do to help, please let me know.

Worse, we've been trying to find it since last February (T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure); August was another round of efforts. I'll link that task to the older outage.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T21:07:00Z] <jeh> power down and set icinga downtime on cloudvirt1022 T241884

Updated the Dell ticket with a new TSR report.

Any updates on this?

@JHedden Reached out to Dell today and opened another service request (1023973621) requesting a new RAID adapter.

Great! Thanks for the update. This host is currently out of service and can be taken offline anytime.

Mentioned in SAL (#wikimedia-operations) [2020-04-30T19:42:57Z] <jeh> reboot cloudvirt1024 for NIC firmware updates T241884

Mentioned in SAL (#wikimedia-operations) [2020-04-30T20:05:48Z] <jeh> cloudvirt1024 upgrade iDRAC firmware from 2.4.8 to 2.5.4 T241884

That last log message used the BIOS version numbers by mistake; the correct iDRAC versions and log output are below:

# bash ./iDRAC-with-Lifecycle-Controller_Firmware_KTC95_LN_4.10.10.10_A00.BIN 
Collecting inventory.....
Running validation...

iDRAC

The version of this Update Package is newer than the currently installed version.
Software application name: iDRAC
Package version: 4.10.10.10
Installed version: 3.30.30.30

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
................................................................................................................................................
Update Successful.
The update completed successfully.

I'm unable to upgrade the SATA drive firmware because of the degraded drive state:

ERROR Serial ATA firmware
# bash ./Serial-ATA_Firmware_V141M_LN_DL5C_A00.BIN    
..
..
............................................................................
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was successful.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was successful.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

I'm also having issues upgrading the server BIOS (using the same process as the last upgrade):

ERROR BIOS upgrade
# bash ./BIOS_FP00W_LN_2.5.4.BIN
Collecting inventory...

Running validation...

Taurus BIOS

The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 2.5.4
Installed version: 2.4.8

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.terminate called after throwing an instance of 'smbios::InternalErrorImpl'
  what():  Could not instantiate SMBIOS table.

These completed successfully:

Broadcom NIC Firmware
# bash ./Network_Firmware_YK81Y_LN64_21.60.22.11_01.BIN
Collecting inventory...
Running validation...

enp175s0f1d1

The version of this Update Package is older than the currently installed version.
Software application name: Broadcom Adv. Dual 10Gb Ethernet
Package version: 21.60.22.11
Installed version: FFV20.06.05.11

enp175s0f0

The version of this Update Package is older than the currently installed version.
Software application name: Broadcom Adv. Dual 10Gb Ethernet
Package version: 21.60.22.11
Installed version: FFV20.06.05.11

Continue? Y/N:Y
Y entered; update was forced by user
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
...................................

Update success
Would you like to reboot your system now?
Continue? Y/N:Y
W: molly-guard: SSH session detected!
Please type in hostname of the machine to shutdown: cloudvirt1024
Connection to cloudvirt1024.eqiad.wmnet closed by remote host.

Expander Backplane
# bash ./Firmware_2F90T_LN_2.46_A00_03.BIN 
Collecting inventory...
Running validation...

14G Expander Backplane

The version of this Update Package is newer than the currently installed version.
Software application name: 14G Expander Backplane
Package version: 2.46
Installed version: 2.17

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
....................................................................................
The operation was successful.
The update completed successfully.

I'd also like to point out that we have another system purchased in the same batch T192119, and 6 more with the same configuration T201352 that are running the same workloads without any problems.

@wiki_willy This server and/or RAID card has been giving us problems since February 2019 [1]. Do we have any options here? We seem to be stuck in an infinite loop of firmware upgrades and not getting anywhere.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps

I've cleared the foreign configuration on drives 4 and 9 again, once the rebuild completes I'll attempt the SATA firmware and system BIOS upgrades.

@Jclark-ctr - let me know if Dell is giving you problems with the replacement RAID adapter.

The RAID card took drive 9 offline again during the virtual disk rebuild. We cannot update the SATA drive firmware until all the devices are healthy, and since that never stays the case, the update cannot be applied.

I've updated the other firmware and emailed a new TSR report to @Jclark-ctr

Called Dell after no response to the email. Dell is sending out a new backplane and a new RAID card.

@JHedden Finished replacement of the backplane, RAID card, and drive 9.

Thanks! I've imported the RAID config, restored the boot order settings and will verify it's fixed.

JHedden closed this task as Resolved.May 7 2020, 9:23 PM

The virtual drive rebuild process was MUCH faster, the firmware upgrades completed successfully, and all drives have remained online.

I'll continue running some stress tests, but I think everything looks good now.