
hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system.

labstore1005.eqiad.wmnet

  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.

@Bstorm, @Andrew can you help with this? I'm not sure what/how to depool this machine, and/or what repercussion it will have.

  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)

The server is hitting IO issues that end up with the megaraid controller resetting itself. This renders the server unavailable for a few minutes and forces a drbd sync to catch up every time it happens. The RAID is stable and none of the disks is failing.
The frequency varies from once an hour to half a dozen times per hour.

Dell support assist report - F34630916

Events log - F34630921
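For reference, the controller's own event log (the one attached above) can be re-exported from the adapter itself; a minimal sketch, run as root on labstore1005, with illustrative output file names:

# Export the full controller event log to a file
megacli -AdpEventLog -GetEvents -f /root/labstore1005-ctrl-events.log -aALL
# Or just the events since the last reboot
megacli -AdpEventLog -GetSinceReboot -f /root/labstore1005-ctrl-events-reboot.log -aALL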

The portion of the journal log showing the reset loop:

# First instance:
Sep 02 09:14:36 labstore1005 kernel: drbd tools: meta connection shut down by peer.
Sep 02 09:14:36 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection from 208.80.153.84 port 57818 on 10.64.37.20 port 22
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection closed by 208.80.153.84 port 57818 [preauth]
Sep 02 09:14:38 labstore1005 kernel: megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: meta connection shut down by peer.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:48 labstore1005 kernel: megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW now in Ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928         LDIO threshold: 0
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW supports sync cache        : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Init cmd success
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: firmware type        : Extended VD(240 VD)firmware
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: controller type        : MR(1024MB)
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)        : Enabled
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Secure JBOD support        : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4980
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Reset successful for scsi0.
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset

Frequency of incidence by hour:

root@labstore1005:~# journalctl -S "2021-09-01" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
      2 Sep 02 09
      1 Sep 02 10
     11 Sep 02 11
      1 Sep 02 12
      2 Sep 02 13
      1 Sep 02 14
      1 Sep 02 15
      1 Sep 02 16
      1 Sep 02 17
      1 Sep 02 18
      2 Sep 02 19
      6 Sep 02 20
      3 Sep 02 21
      2 Sep 02 22
      5 Sep 02 23
      5 Sep 03 00
      2 Sep 03 01
      5 Sep 03 02
      1 Sep 03 03
      3 Sep 03 04
      6 Sep 03 05
      2 Sep 03 06
      2 Sep 03 07
      4 Sep 03 08
      2 Sep 03 09
      1 Sep 03 10
      1 Sep 03 11
      2 Sep 03 12
      2 Sep 03 13

The disk statuses (all online):

root@labstore1005:~# sudo megacli -PDList -aALL | grep -i 'firmware state' | uniq -c
     26 Firmware state: Online, Spun Up
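The per-drive error counters can be checked the same way to support the claim that no disk is failing; a minimal sketch, with an illustrative grep pattern:

# Summarize media/other/predictive-failure error counters across the drives
megacli -PDList -aALL | grep -iE 'media error count|other error count|predictive failure count' | sort | uniq -c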
  • Assign the correct project tag and appropriate owner (based on the above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Summary of work

A backup controller was ordered via T290602.
Firmware was flashed to idrac and raid controller via T290318#7357112
System returned to service, monitoring for errors via T290318#7359652
If no errors are generated by 2021-09-18, we can likely resolve this task. (If errors occur after that time, a new task can be created and linked to this one, and the controller will simply be swapped.)

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T15:20:13Z] <bstorm> stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2001 T290318

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T15:24:15Z] <bstorm> stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2002 T290318

The commands to disconnect this server from the cluster are hanging, so I have to reboot it in order to break the link. I've downtimed it in Icinga until Tuesday and am doing that now.

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T15:34:17Z] <bstorm> rebooting labstore1005 to disconnect the drives from labstore1004 T290318

That was successful:

root@labstore1004:~# drbd-overview
 1:test/0   StandAlone Primary/Unknown UpToDate/DUnknown /srv/test  ext4 9.8G 535M 8.7G 6%
 3:misc/0   StandAlone Primary/Unknown UpToDate/Outdated /srv/misc  ext4 5.0T 1.9T 3.0T 39%
 4:tools/0  StandAlone Primary/Unknown UpToDate/Outdated /srv/tools ext4 8.0T 6.1T 1.5T 81%

With that in "standalone" mode, it won't try to reconnect without manual help.
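For later reference, reconnecting from the standalone state is just a few drbdadm calls once labstore1005 is healthy again; a minimal sketch, assuming the resource names shown in the overview above, run on whichever host reports StandAlone (possibly both):

# Re-establish the replication link for each resource
drbdadm connect test
drbdadm connect misc
drbdadm connect tools
# Watch the resync catch up
drbd-overview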

The urgency is moderate here. The NFS service for the cloud (a core function of Toolforge) is still up and should function fine. This is the failover and backup host. We will not be able to take weekly backups of Toolforge or any other Cloud VPS NFS share until labstore1005 is stable again, and failover will be impossible. This is the only warm standby host and backup source for the majority of Cloud Services NFS.

tl;dr: We aren't down, but one of the core services we need to keep up is now a single point of failure, with backups stopped.

Change 717948 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud NFS: tighten up the traffic shaping a little

https://gerrit.wikimedia.org/r/717948

Change 717948 merged by Bstorm:

[operations/puppet@production] cloud NFS: tighten up the traffic shaping a little

https://gerrit.wikimedia.org/r/717948

The first thing we will need to do is try to update the RAID controller firmware along with the BIOS.

Bstorm mentioned this in Unknown Object (Task). Sep 8 2021, 5:47 PM
wiki_willy added a subtask: Unknown Object (Task). Sep 8 2021, 6:00 PM
RobH mentioned this in Unknown Object (Task). Sep 9 2021, 10:44 PM

Update from sub-task and IRC discussions:

Updating the firmware is the first thing to try. However, the system is currently experiencing transient failures, and any troubleshooting could result in either repair or complete failure of the controller. Rather than potentially take an important backup server offline without any recourse, a replacement controller is in the process of being ordered via T290602.

When it arrives, the current controller should have its firmware updated to see if the error clears. It is far easier to just clear the error than to swap the controller and import the config.

If the firmware update doesn't clear the error, then we should do a couple of things:

  • back up the RAID config if possible; if not, just note it down manually
  • swap the controller; the new controller may detect a 'foreign config' on the disks. If it does, import that foreign config, as it is the old controller's configuration data written to the RAID array, and the new controller should simply pick up the config (see the megacli sketch below).
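A minimal sketch of both steps with megacli, run as root on labstore1005, with illustrative file paths (the same can also be done from the controller's boot-time configuration utility):

# Before the swap: save the current RAID configuration to a file
megacli -CfgSave -f /root/labstore1005-raid-cfg.bin -a0
# Also keep a human-readable record of the layout
megacli -CfgDsply -aALL
# After the swap: scan for and import the foreign config the old controller left on the disks
megacli -CfgForeign -Scan -a0
megacli -CfgForeign -Import -a0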

If the new controller isn't needed here, it is still an H730P that could be used in other systems in the future.

RobH changed the task status from Open to In Progress. Sep 15 2021, 6:35 PM
RobH claimed this task.
RobH added a subscriber: Cmjohnson.

Attempting to flash the RAID firmware fails with an error verifying the package contents of the firmware file. The firmware file itself is fine (re-downloaded and checksum-checked locally before upload, twice, with two upload attempts per downloaded file). It seems the iDRAC is just ancient, so it has to be updated first.

Each of these attempts included a reboot, so I've left the system powered off for the iDRAC flash so we don't have to keep rebooting the OS for no reason.

Unfortunately it's on 2.41, and when you try to flash to 2.81 (latest) it will flash to 2.61 first, then require a reflash of the same file to get to 2.81. Version 2.61 introduces bad SSL self-signed certificate generation that breaks Firefox and Chrome, but has worked in Safari in the past. So I'm flashing this iDRAC now and will complete the flash in Safari (SSH tunneling through a local proxy), then reattempt the RAID BIOS.

Attempted flash of the BIOS and RAID BIOS; failed because the package could not be verified. Upload attempted twice from the same set of downloaded files.
Reattempted flash of the BIOS and RAID BIOS with new downloads and uploads; failed again on package verification, with two upload attempts from the newly downloaded files.
Attempted flash of the iDRAC to resolve the package verification issues; the iDRAC HTTPS interface locked up and required a racreset command via SSH (see the racadm sketch below).
Attempted flash of the iDRAC from 2.30 to 2.81; failed due to package verification. We're going to have to step this up manually, one version at a time.
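For reference, the version check and the reset can both be done over SSH to the iDRAC without touching the OS; a minimal sketch (exact racadm subcommand support varies a bit between iDRAC generations):

# Show the firmware versions the iDRAC currently reports
racadm getversion
# Reset a hung iDRAC web interface without affecting the running host
racadm racreset soft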

Tried to split the difference and go with a firmware version half as old; that didn't work (package verification failed), so I started just incrementing one version at a time.

iDRAC
2.30 to 2.40 successful
2.40 to 2.41 successful
2.41 to 2.52 successful

I incremented the iDRAC from 2.52 to 2.81, bypassing the known bad version 2.61. It seems 2.81 also has an HTTPS issue (different from 2.61's), so John had to connect a crash cart and roll back the firmware via the Lifecycle Controller.

I've updated https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Rolling_back_Firmware_updates with the directions for doing that with a crash cart.

I've now updated the RAID firmware to the newest version, 25.5.9.0001_A17.
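The running firmware can be confirmed from the OS; a minimal sketch, with an illustrative grep pattern:

# Confirm the controller reports the new firmware package
megacli -AdpAllInfo -aALL | grep -i 'FW Package Build'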

This system should now be returned to service to see if it still generates the errors it had before. If it doesn't, this task can be resolved. If it does, then we'll swap the controller with the spare purchased.

I'm assigning this over to @Bstorm since I've been working with them on IRC to downtime this host. If someone else should handle it, please feel free to reassign as needed!

Mentioned in SAL (#wikimedia-cloud) [2021-09-16T15:56:03Z] <bstorm> removing downtime for labstore1005 so we'll know if it has another issue T290318

So far so good:

[bstorm@labstore1005]:~ $ sudo journalctl -S "2021-09-15" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
[bstorm@labstore1005]:~ $
RobH changed the task status from In Progress to Stalled. Sep 16 2021, 5:28 PM

\o/ so far so good!

I forgot to change the status from In Progress when I finished, so I'm changing it to 'stalled' for now since we're just waiting to see if it errors again. If it doesn't generate any errors in a day or two, we can likely resolve this task.

robh@labstore1005:~$ sudo journalctl -S "2021-09-15" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
robh@labstore1005:~$ sudo journalctl -S "2021-09-16" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
robh@labstore1005:~$ sudo journalctl -S "2021-09-17" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
robh@labstore1005:~$

still going strong

nskaggs added a subscriber: nskaggs.

This also worked without issue over the weekend. Closing for now as resolved. Thanks!

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Sep 21 2021, 8:59 PM