
Degraded RAID on cloudvirt1024
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 9
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 0
			Other Error Count: 847
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 30C (86.00 F)

=== RaidStatus completed
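The get-raid-status-megacli plugin wraps MegaCli output like the snapshot above. As a rough illustration only (this is not the actual plugin; the sample data and file path are made up), the non-optimal components can be pulled out of a saved `-PDList` dump with a short awk pass:

```shell
# Illustrative sketch: flag any drive whose firmware state is not
# "Online, Spun Up" in a saved MegaCli -PDList dump (sample data below).
cat > /tmp/pdlist.sample <<'EOF'
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 9
Firmware state: Rebuild
EOF
awk -F': ' '/^Slot Number/ {slot = $2}
            /^Firmware state/ && $2 != "Online, Spun Up" {print "slot " slot ": " $2}' \
  /tmp/pdlist.sample
```

On the sample above this prints `slot 9: Rebuild`, which is the kind of line the Nagios handler turns into an alert.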

Event Timeline

Bstorm added a subscriber: Bstorm.Jan 4 2020, 2:56 PM

I see

[Sat Jan  4 08:56:39 2020] megaraid_sas 0000:18:00.0: 155794 (631458161s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 4

in dmesg. Checking a few other things quickly, because this system has had RAID controller issues before.

I see that the controller has a warning on it in iDRAC, and it has now actually removed 2 disks from the volume.

Bstorm added a comment.Jan 4 2020, 3:00 PM

This is the behavior that led to T216218 and then T230289

In fact, this is pretty much exactly the same as T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only. Checking that the filesystem is still mounted ok.

Bstorm added a comment.Jan 4 2020, 3:03 PM

It seems like the filesystem is ok, but there are no hot spares at this point, so if it kicks 2 more disks out, it'll cause problems. So far so good on that.
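A quick way to confirm the root filesystem has not flipped read-only (as it did in T230289) is to look at the mount options in /proc/mounts, whose first option is always `rw` or `ro`. A sketch against a sample line (on the host you would read /proc/mounts itself; the device path here is illustrative):

```shell
# The 4th field of a /proc/mounts line carries the mount options; the first
# option is "rw" or "ro". Note that an "errors=remount-ro" fallback option
# elsewhere in the list does NOT mean the filesystem is currently read-only.
echo '/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0' |
  awk '$2 == "/" {split($4, o, ","); print o[1]}'
```

This prints `rw` for a healthy mount and `ro` once the kernel has remounted the filesystem read-only after an I/O error.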

Bstorm added a comment.Jan 4 2020, 3:06 PM

To be clear, I am relating this to T230289 because the controller reports the disks as removed, not failed.

Change 561987 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Depool cloudvirt1024, raid controller issues

https://gerrit.wikimedia.org/r/561987

Change 561987 merged by Andrew Bogott:
[operations/puppet@production] Depool cloudvirt1024, raid controller issues

https://gerrit.wikimedia.org/r/561987

Bstorm added a comment.Jan 4 2020, 3:11 PM

According to the Lifecycle logs in iDRAC, it had trouble communicating with the disks and then marked them removed. Basically the same as before, and again 2 disks on the same day. It makes me very suspicious of this RAID controller.

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:01:31Z] <bstorm_> moving vm puppetmaster-1001 from cloudvirt1024 to cloudvirt1009 due to hardware error T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:04:09Z] <arturo> moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:05:21Z] <bstorm_> moving VM meza-full from cloudvirt1024 to cloudvirt1003 due to hardware error T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:06:02Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:07:39Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:08:42Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:09:30Z] <bstorm_> moving VMs encoding04 and encoding05 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:09:57Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:11:22Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:13:44Z] <arturo> moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:16:20Z] <bd808> Draining tools-worker-10{05,12,28} due to hardware errors (T241884)

bd808 triaged this task as High priority.Jan 4 2020, 4:18 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:34:54Z] <arturo> icinga downtime cloudvirt1024 for 2 months because hardware errors (T241884)

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:47:20Z] <bstorm_> moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

Mentioned in SAL (#wikimedia-cloud) [2020-01-04T16:54:07Z] <bstorm_> moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884

@aborrero When migrating cyberbot-db-01, the script died with:

total size is 304,445,262,469  speedup is 1.00
wmcs-cold-migrate: INFO: cyberbot-db-01 instance copied. Now updating nova db...
wmcs-cold-migrate: INFO: Needed image is deactivated
wmcs-cold-migrate: INFO: activating image e02770ae-b45f-4776-a852-d9a13217611e
wmcs-cold-migrate: INFO: current status is SHUTOFF; waiting for it to change to ACTIVE
Traceback (most recent call last):
  File "/usr/local/sbin/wmcs-cold-migrate", line 314, in <module>
    instance.migrate(config)
  File "/usr/local/sbin/wmcs-cold-migrate", line 182, in migrate
    self.wait_for_status('ACTIVE')
  File "/usr/local/sbin/wmcs-cold-migrate", line 99, in wait_for_status
    self.refresh_instance()
  File "/usr/local/sbin/wmcs-cold-migrate", line 86, in refresh_instance
    self.instance = self.novaclient.servers.get(self.instance_id)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 762, in get
    return self._get("/servers/%s" % base.getid(server), "server")
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 346, in _get
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 217, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 117, in request
    raise exceptions.from_response(resp, body, url, method)
novaclient.exceptions.BadRequest: Networking client is experiencing an unauthorized exception. (HTTP 400) (Request-ID: req-b48d74e6-8e6b-4988-8d5f-ddb8bd18a0cb)

I verified that the instance is running in the new location and that it doesn't show up in virsh list in the old location (anything else to check, @Andrew?). I manually removed the disk and related files on the old location. I think the image still needs to be deactivated as well.

With that done, this host is evacuated of user VMs and ready for whatever troubleshooting to see what is going on with the hardware.
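For reference, the post-failure verification amounts to confirming the instance is absent from libvirt on the old hypervisor and active in nova on the new one. A dry-run sketch (`RUN=echo` so nothing executes; the instance name is from above, and the exact checks used may have differed):

```shell
# Dry run: drop RUN= to execute for real on the respective hosts.
RUN=echo
$RUN virsh list --all                                         # on cloudvirt1024: instance should be gone
$RUN openstack server show cyberbot-db-01 -f value -c status  # should report ACTIVE on the new host
```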

JHedden added a subscriber: JHedden.

Looks like we're missing drives in slots 2 and 9 on this host.

# megacli -PDList -aALL | egrep 'Slot|Firmware'
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: DL63
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 5
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 6
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 7
Firmware state: Online, Spun Up
Device Firmware Level: DL58
Slot Number: 8
Firmware state: Online, Spun Up
Device Firmware Level: DL61

And slot 4!

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board.

@Jclark-ctr - if no errors pop up after the firmware and BIOS upgrades, maybe it's a bad RAID controller.

Mentioned in SAL (#wikimedia-cloud) [2020-01-23T21:09:32Z] <jeh> cloudvirt1024 set icinga downtime and powering down for hardware maintenance T241884

JHedden closed this task as Resolved.Jan 23 2020, 9:42 PM

Drives 2 and 4 had a foreign configuration. I've cleared the configuration and reassigned them as global hot spares.

The originally failed drive reported in this ticket (slot 9) is currently rebuilding and not showing any errors. Closing this ticket for now, but I suspect we'll be hearing from this drive again.
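For the record, the clear-and-respare sequence uses standard MegaCli verbs. Enclosure 32 and the slot numbers are from the output earlier in this task; `MEGACLI="echo megacli"` below makes this a dry-run sketch rather than a claim about the exact invocations used:

```shell
# Dry run: replace MEGACLI="echo megacli" with the real binary path to execute.
MEGACLI="echo megacli"
$MEGACLI -CfgForeign -Scan -a0               # list any foreign configurations
$MEGACLI -CfgForeign -Clear -a0              # drop them
$MEGACLI -PDHSP -Set -PhysDrv '[32:2]' -a0   # re-add slot 2 as a global hot spare
$MEGACLI -PDHSP -Set -PhysDrv '[32:4]' -a0   # likewise slot 4
```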

JHedden reopened this task as Open.Jan 24 2020, 2:41 PM

Drive 9 reported a lot of errors while rebuilding the RAID array, and now drives 2, 4, and 9 are missing from the RAID set again. I'll leave drive 9 out of the pool and test rebuilding the array with only 2 and 4.

During the next rebuild the RAID array kicked out drive 4. Either we have three bad drives (2, 4, and 9) or the RAID adapter is bad. I'll send the TSR for this host to @Jclark-ctr.

This host has been depooled from production and has no running workloads.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:24:59Z] <jeh> upgrade BIOS firmware on cloudvirt1024 to 2.4.8 T241884

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:39:38Z] <jeh> clearing foreign drive RAID configuration on cloudvirt1024 T241884

Mentioned in SAL (#wikimedia-operations) [2020-02-12T17:27:19Z] <jeh> upgrade RAID firmware on cloudvirt1024 to 25.5.6.0009 T241884

Firmware upgrades I applied:

Dell PERC H730/H730P/H830/FD33xS/FD33xD Mini/Adapter RAID Controllers firmware version 25.5.6.0009
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=g7n2c&oscode=biosa&productcode=poweredge-r440

Dell EMC Server PowerEdge BIOS R440/R540/T440 Version 2.4.8
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=wgm2r&oscode=us004&productcode=poweredge-r440

After the upgrade, drive 2 is still missing, but drives 4 and 9 have remained online and the array was finally able to rebuild successfully.

Mentioned in SAL (#wikimedia-operations) [2020-02-13T22:13:30Z] <jeh> running filesystem tests on cloudvirt1024 T241884

We're still having the same issue after the firmware and BIOS upgrades. We've been trying to identify this issue since last August (T230289#5429400); please open a new ticket with Dell so we can get this resolved.

I'll collect a TSR for this host and send it to @Jclark-ctr. If there's anything else I can do to help, please let me know.

Worse, we've been trying to find it since last February (T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure); August was another round of efforts. I'll link that task to the older outage.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T21:07:00Z] <jeh> power down and set icinga downtime on cloudvirt1022 T241884

Updated the Dell ticket with a new TSR report.

Any updates on this?

@JHedden Reached out to Dell today and opened another service request (1023973621) requesting a new RAID adapter.

Great! Thanks for the update. This host is currently out of service and can be taken offline anytime.

Mentioned in SAL (#wikimedia-operations) [2020-04-30T19:42:57Z] <jeh> reboot cloudvirt1024 for NIC firmware updates T241884

Mentioned in SAL (#wikimedia-operations) [2020-04-30T20:05:48Z] <jeh> cloudvirt1024 upgrade iDRAC firmware from 2.4.8 to 2.5.4 T241884

That last log message used the BIOS version numbers by mistake; the correct iDRAC versions and log output are below:

# bash ./iDRAC-with-Lifecycle-Controller_Firmware_KTC95_LN_4.10.10.10_A00.BIN 
Collecting inventory.....
Running validation...

iDRAC

The version of this Update Package is newer than the currently installed version.
Software application name: iDRAC
Package version: 4.10.10.10
Installed version: 3.30.30.30

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
................................................................................................................................................
Update Successful.
The update completed successfully.

I'm unable to upgrade the SATA drive firmware because of the degraded drive state:

ERROR Serial ATA firmware
# bash ./Serial-ATA_Firmware_V141M_LN_DL5C_A00.BIN    
..
..
............................................................................
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was successful.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was successful.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

SSDSC2KB019T7R
The operation was aborted because one or more logical drives are in a degraded state.

I'm also having issues upgrading the server BIOS (using the same process as the last upgrade):

ERROR BIOS upgrade
# bash ./BIOS_FP00W_LN_2.5.4.BIN
Collecting inventory...

Running validation...

Taurus BIOS

The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 2.5.4
Installed version: 2.4.8

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.terminate called after throwing an instance of 'smbios::InternalErrorImpl'
  what():  Could not instantiate SMBIOS table.

These completed successfully:

Broadcom NIC Firmware
# bash ./Network_Firmware_YK81Y_LN64_21.60.22.11_01.BIN
Collecting inventory...
Running validation...

enp175s0f1d1

The version of this Update Package is older than the currently installed version.
Software application name: Broadcom Adv. Dual 10Gb Ethernet
Package version: 21.60.22.11
Installed version: FFV20.06.05.11

enp175s0f0

The version of this Update Package is older than the currently installed version.
Software application name: Broadcom Adv. Dual 10Gb Ethernet
Package version: 21.60.22.11
Installed version: FFV20.06.05.11

Continue? Y/N:Y
Y entered; update was forced by user
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
...................................

Update success
Would you like to reboot your system now?
Continue? Y/N:Y
W: molly-guard: SSH session detected!
Please type in hostname of the machine to shutdown: cloudvirt1024
Connection to cloudvirt1024.eqiad.wmnet closed by remote host.

Expander Backplane
# bash ./Firmware_2F90T_LN_2.46_A00_03.BIN 
Collecting inventory...
Running validation...

14G Expander Backplane

The version of this Update Package is newer than the currently installed version.
Software application name: 14G Expander Backplane
Package version: 2.46
Installed version: 2.17

Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
....................................................................................
The operation was successful.
The update completed successfully.

I'd also like to point out that we have another system purchased in the same batch T192119, and 6 more with the same configuration T201352 that are running the same workloads without any problems.

@wiki_willy This server and/or RAID card has been giving us problems since February 2019 [1]. Do we have any options here? We seem to be stuck in an infinite loop of firmware upgrades and not getting anywhere.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps

I've cleared the foreign configuration on drives 4 and 9 again, once the rebuild completes I'll attempt the SATA firmware and system BIOS upgrades.

@Jclark-ctr - let me know if Dell is giving you problems with the replacement RAID adapter.

The RAID card took drive 9 offline again during the virtual disk rebuild. We cannot update the SATA drive firmware until all the devices are healthy, and since that never stays the case, the update cannot be applied.

I've updated the other firmware and emailed a new TSR report to @Jclark-ctr

Called Dell after no response to the email. Dell is sending out a new backplane and a new RAID card.

@JHedden Finished replacement of the backplane, RAID card, and drive 9.

Thanks! I've imported the RAID config, restored the boot order settings and will verify it's fixed.

JHedden closed this task as Resolved.May 7 2020, 9:23 PM

The virtual drive rebuild process was MUCH faster, the firmware upgrades completed successfully, and all drives have remained online.

I'll continue running some stress tests, but I think everything looks good now.