
Upgrade BIOS and IDRAC firmware on R440 cp systems
Closed, ResolvedPublic

Description

We should upgrade the BIOS and iDRAC firmware on the esams cache hosts, which are crashing frequently (T238305).
This task was expanded on 2020-02-10 to include eqiad cache systems.

BIOS Version: 2.4.8
iDRAC Firmware Version: 4.00.00.00

eqiad hosts

upload:

  • cp1076
  • cp1078
  • cp1080
  • cp1082
  • cp1084
  • cp1086
  • cp1088
  • cp1090

text:

  • cp1075
  • cp1077
  • cp1079
  • cp1081
  • cp1083
  • cp1085
  • cp1087
  • cp1089

esams hosts

Please upgrade cache_upload hosts with precedence:

  • cp3051.esams.wmnet
  • cp3053.esams.wmnet
  • cp3055.esams.wmnet
  • cp3057.esams.wmnet
  • cp3059.esams.wmnet
  • cp3061.esams.wmnet
  • cp3063.esams.wmnet
  • cp3065.esams.wmnet

And there's also the cache_text:

  • cp3050.esams.wmnet
  • cp3052.esams.wmnet
  • cp3054.esams.wmnet
  • cp3056.esams.wmnet
  • cp3058.esams.wmnet
  • cp3060.esams.wmnet
  • cp3062.esams.wmnet
  • cp3064.esams.wmnet

Please coordinate depooling/pooling of the servers with the #wikimedia-traffic channel.

Update Checklist

CP system BIOS update directions:

  • Confirm with Traffic that the host can be taken offline.
  • Shut down the host via OS commands; this automatically depools the host from pybal.
  • Update the firmware via the mgmt interface.
  • Boot the host back into the OS; the puppet run should clear all icinga checks to green. (May need to manually refire puppet checks to speed things up.)
  • Once the host is green in icinga, run 'pool' from the command line of the host.
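
The last two checklist steps amount to a gate: the host is repooled only once every icinga check is green. A minimal sketch of that decision follows, assuming the check results are already available as a name-to-state mapping. The check names and state strings are illustrative placeholders, not real icinga output or a real icinga API.

```python
# Hypothetical sketch of the "green in icinga, then pool" gate.
# Check names and state strings are illustrative, not real icinga output.

def ready_to_repool(check_states):
    """Return True only if at least one check exists and every check is OK."""
    return bool(check_states) and all(
        state == "OK" for state in check_states.values()
    )


def next_action(check_states):
    """Describe what the operator should do next for a freshly booted host."""
    if not check_states:
        return "no check results yet; wait for the initial puppet run"
    if ready_to_repool(check_states):
        return "run 'pool' on the host"
    # List the checks that are still not green, in a stable order.
    pending = sorted(name for name, state in check_states.items() if state != "OK")
    return "wait or refire checks; not green: " + ", ".join(pending)


if __name__ == "__main__":
    print(next_action({"puppet": "OK", "varnish": "CRITICAL"}))
```

An empty result set is treated as "not ready" so that a host is never repooled before any check has reported.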

Checks to run between system updates & changing pool state:

Event Timeline

Restricted Application added a subscriber: Aklapper.
wiki_willy added a project: DC-Ops.
wiki_willy added a subscriber: wiki_willy.

@RobH - can you work with the Traffic team to get the BIOS upgraded on the cp hosts in esams? In T240177, @Papaul found a Dell bulletin that is associated with the random crashes:

"Fixed a continuous reboot issue and Out of Resource error with PCIe IO resource allocation which was observed in the 2.4.7 version." see link below:
https://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverid=wgm2r&oscode=wst14&productcode=poweredge-r440

Thanks,
Willy

RobH moved this task from Backlog to Hardware Failure / Repair on the ops-esams board.
RobH added subscribers: BBlack, RobH.

Please note that Traffic (and @BBlack) previously asked me NOT to do this on these hosts, while they worked out why they are crashing. (reference T238305)

@BBlack: Are we cleared to upgrade the bios on these?

Please note the idrac 4.0 firmware drastically changes the behavior of idrac, and I'm not sure we should upgrade the fleet without testing. Bios is different, and may fix the crash issue, so no objection there.

@BBlack: Please comment and assign back to me with the procedure and timelines for upgrading. Typically, I can do these one at a time (one in upload, one in text) and bring them fully back online and hand back to you before starting on the next. We can also schedule these out with 1 from each group per day, or I can provide directions for traffic to flash. Any of these work, so please advise what would work best for Traffic and assign back to me!

@BBlack, can we modify this task to include the eqiad caches that need updating as well? I'll be handling these remotely. During this process, if any single server fails and requires on-site work, I'll make a sub-task for its repair off this task.

If that is acceptable, we'll just need to add the eqiad cp systems to be updated into the task description. As I understand it now, we want to update ALL R440 bios versions in our cp clusters, correct?

@RobH - Yes, let's edit this to include eqiad as well. We've had the same symptoms both places, and they're the same approximate generation of hardware configuration (IIRC, only the NVMe changed to a slightly newer/better model from eqiad to esams, but the base system is otherwise fairly identical).

As for process - in general we should be able to cleanly shut these down from the OS level, and they'll self-depool during their shutdown. When a node is ready to go back into service, running the command 'pool' from a root prompt will re-pool its services into production use. As long as icinga checks for the nodes are coming back all-green after the reboot (some may not until the initial puppet run has completed!), we should normally be good to repool. The only other caveat is to watch out for the ongoing Debian Buster upgrades as well - if it's EU time (US morning), check with the Traffic channel in case they're upgrading the same nodes.

RobH renamed this task from Upgrade BIOS and IDRAC firmware on esams caches to Upgrade BIOS and IDRAC firmware on R440 cp systems. Feb 10 2020, 8:26 PM
RobH updated the task description.
RobH updated the task description.

My plan is to do one from each service group (upload/text) at a time, batched together. (It is just as easy to watch two bios updates as one, it doesn't quite scale more than that for close supervision.) Would that be ok?
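
As a rough illustration of that batching plan, the sketch below pairs one text host with one upload host per batch, using the eqiad host numbers from the task description. The function name and the use of `zip_longest` are assumptions made for the example, not part of any existing tooling.

```python
from itertools import zip_longest

# Host lists taken from the task description (eqiad).
EQIAD_UPLOAD = [f"cp{n}" for n in range(1076, 1091, 2)]  # cp1076 .. cp1090
EQIAD_TEXT = [f"cp{n}" for n in range(1075, 1090, 2)]    # cp1075 .. cp1089


def paired_batches(text_hosts, upload_hosts):
    """Pair one text host with one upload host per batch.

    If the lists differ in length, the shorter side is padded with None,
    so the leftover hosts are flashed on their own.
    """
    return list(zip_longest(text_hosts, upload_hosts))


if __name__ == "__main__":
    for text_host, upload_host in paired_batches(EQIAD_TEXT, EQIAD_UPLOAD):
        print(text_host, upload_host)
```

Each batch never contains two hosts from the same service group, so neither cluster loses more than one node at a time.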

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:38:30Z] <robh> cp1075 & cp1076 offline for bios updates per T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:58:41Z] <robh> cp107[56] returned to service, cp107[78] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:42:10Z] <robh> cp107[89] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:43:23Z] <robh> cp107[789] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:06:05Z] <robh> cp108[01] returned to service, cp108[23] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:30:14Z] <robh> cp108[23] returned to service via T243167
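
The SAL entries above abbreviate host groups with a bracket shorthand, for example cp107[56] for cp1075 and cp1076. A small sketch of how that shorthand expands is shown below. It only handles a single bracketed set of digits as seen in these log lines; the function name is made up for this example, and it is not a parser for any general host-range syntax.

```python
import re

# One bracketed run of digits, with optional text before and after it.
_BRACKET = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<digits>[0-9]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(shorthand):
    """Expand 'cp107[56]' into ['cp1075', 'cp1076'].

    Strings without a bracketed digit set are returned unchanged
    as a one-element list.
    """
    match = _BRACKET.match(shorthand)
    if match is None:
        return [shorthand]
    # Each digit inside the brackets yields one host name.
    return [
        match["prefix"] + digit + match["suffix"] for digit in match["digits"]
    ]


if __name__ == "__main__":
    print(expand_hosts("cp107[56]"))
```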

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:38:12Z] <robh> depooling cp108[45] for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:57:30Z] <robh> cp108[45] returned to service, depooling cp108[67]for firmware update via T243167

Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.)

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:12:36Z] <robh> cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:48:16Z] <robh> cp1089 cp1090 returned to service via T243167

Please note that as of now, all eqiad cp systems have been updated to the latest BIOS revision. If these hosts experience any further crashes, it's vital to note it for review.

I'll move onto the esams cp systems in my afternoon, as that will be closer to post-peak esams time.

RobH removed a project: ops-eqiad.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:37:52Z] <robh> taking cp3050 & cp3051 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:54:57Z] <robh> cp3050 & cp3051 returned to service via T243167

RobH changed the task status from Open to Stalled. Feb 26 2020, 6:53 PM

We've had a lot of work unrelated to this task ongoing in esams. That work takes priority, so this is going to sit stalled until Traffic gives sign-off to continue.

Ok, this has now sat neglected for a while. @BBlack: Should I resume updating the BIOS on these hosts in a rotating, one-per-cluster fashion again? Assigning to you for update, but feel free to kick back to me if/when I should continue!

RobH changed the task status from Stalled to Open. Jul 30 2020, 5:27 PM

This has sat ignored while I was doing procurement, but I'll pick this back up next week and start updating these again. (On clinic duty this week so I don't want to split my attention quite that much.)

There is currently an experiment going on with caching hosts in esams (T264398#6772586), and flashing firmware would interrupt that. When that is done, this can resume, likely in a week or so.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Can we resurrect this so I can finish out the esams hosts? I'd like to close this out; it's just shaming me with its age.

Checking first, since the Traffic team could be running tests that would be affected by a rolling restart of the majority of esams hosts. (It would happen at a regular cadence, but it would still affect test metrics.)

Synced with Brandon via IRC, and I'm good to resume this. Each host, one per cluster at a time (one upload, one text), disabling puppet agent, depooling, rebooting for firmware, then repooling and reenabling puppet.
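
The per-batch sequence described above can be written down as an ordered plan. The sketch below only produces step descriptions in the order given in this comment and does not execute anything; the function name and the step wording are illustrative, and the actual commands behind each step are left out on purpose.

```python
# Phases in the order stated in the comment above. Step wording is illustrative.
PHASES = [
    "disable puppet agent",
    "depool",
    "reboot and flash firmware via mgmt interface",
    "repool once all icinga checks are green",
    "re-enable puppet agent",
]


def firmware_plan(text_host, upload_host):
    """Return an ordered list of (host, step) tuples for one batch.

    Each phase is applied to both hosts before moving on to the next phase,
    matching the "one upload, one text" batching used in this task.
    """
    plan = []
    for phase in PHASES:
        for host in (text_host, upload_host):
            plan.append((host, phase))
    return plan


if __name__ == "__main__":
    for host, step in firmware_plan("cp3052", "cp3053"):
        print(f"{host}: {step}")
```

Keeping puppet disabled from the first step to the last means an agent run cannot change the host's state partway through the flash.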

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:01:10Z] <robh> cp305[23] going offline via T243167 for firmware updates (puppet agent disabled and depooled prior to reboot)

RobH changed the task status from Open to In Progress. May 11 2022, 9:01 PM

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:34:00Z] <robh> cp50[23] returned to service and all green in icinga, cp50[45] depooling for firmware update T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:34:37Z] <robh> cp30[23] returned to service and all green in icinga, cp30[45] depooling for firmware update T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-11T22:00:55Z] <robh> cp305[45] returned to service and all green in icinga, cp305[67] depooling for firmware update T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-11T22:28:12Z] <robh> cp305[67] returned to service and all green in icinga, cp305[89] depooling for firmware update T243167

RobH changed the task status from In Progress to Open. May 12 2022, 12:03 AM

Will resume tomorrow, late evening for esams / afternoon for me.

Mentioned in SAL (#wikimedia-sre) [2022-05-12T20:50:30Z] <robh> resuming last 6 esams cp host firmware updates via T243167. cp306[01] going offline

cp3060 refuses to load its idrac https interface, even after I clear browser history and do a racreset on the idrac interface. Skipping it and continuing with the rest of them.

Mentioned in SAL (#wikimedia-sre) [2022-05-12T21:15:22Z] <robh> cp306[01] returned to service, cp306[23] coming down for firmware update via T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-12T21:43:12Z] <robh> cp306[23] returned to service, cp306[45] coming down for firmware update via T243167

RobH lowered the priority of this task from High to Low. May 12 2022, 10:27 PM

All done except cp3060, which refuses to pull up https on idrac for the firmware flash; it just endlessly loads the login screen until timeout.

I'll need to loop back and try powering it off, and then turn off power to its ports to drain and fully reset the idrac interface, since a racreset command didn't fix it.

Mentioned in SAL (#wikimedia-operations) [2022-05-19T22:07:30Z] <robh> cp3060 idrac interface frozen, rebooted via power outlet control on T243167

RobH closed this task as Resolved. Edited May 19 2022, 10:13 PM

cp3060's idrac https interface pulls up and then is endlessly 'loading' (see attached screenshot).

I tried a racreset command via the idrac command line, as well as completely draining power to the device via the PDU controls for ps1-oe15-esams. While the idrac interface comes back up for command line ssh control, the endless loading over https still occurs.

As this is now a hardware failure, I'm going to create a new task to troubleshoot this, and include the firmware update of cp3060 in that troubleshooting task.

endless.loading.png (2×2 px, 752 KB)

Mentioned in SAL (#wikimedia-operations) [2022-05-20T08:53:58Z] <vgutierrez> re-enabling puppet and repooling cp3060 - T308797 T243167