
Upgrade BIOS and IDRAC firmware on R440 cp systems
Open, High, Public

Description

We should upgrade the BIOS and iDRAC firmware in esams; these hosts are crashing frequently (T238305).
This task was expanded on 2020-02-10 to include eqiad cache systems.

BIOS Version: 2.4.8
iDRAC Firmware Version: 4.00.00.00
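
For reference, one way to confirm which BIOS/iDRAC versions a host is currently running is to query the iDRAC's Redfish API from the mgmt network. A minimal sketch, assuming the standard Dell iDRAC 9 Redfish paths on the R440s; the mgmt hostname pattern and credentials are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: report current BIOS and iDRAC versions for a list of cp hosts.

Assumes the Redfish endpoints commonly exposed by iDRAC 9 on R440s:
  /redfish/v1/Systems/System.Embedded.1   -> BiosVersion
  /redfish/v1/Managers/iDRAC.Embedded.1   -> FirmwareVersion
The mgmt DNS pattern and credentials below are placeholders.
"""
import requests

MGMT_SUFFIX = ".mgmt.eqiad.wmnet"   # placeholder mgmt DNS pattern
AUTH = ("root", "CHANGE_ME")        # placeholder credentials
HOSTS = ["cp1075", "cp1076"]        # hosts to check

def redfish_get(host: str, path: str) -> dict:
    """Fetch a Redfish resource from the host's iDRAC, skipping TLS
    verification (mgmt interfaces typically use self-signed certs)."""
    url = f"https://{host}{MGMT_SUFFIX}{path}"
    resp = requests.get(url, auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    return resp.json()

for host in HOSTS:
    system = redfish_get(host, "/redfish/v1/Systems/System.Embedded.1")
    manager = redfish_get(host, "/redfish/v1/Managers/iDRAC.Embedded.1")
    print(f"{host}: BIOS {system.get('BiosVersion')}, "
          f"iDRAC {manager.get('FirmwareVersion')}")
```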

eqiad hosts

upload:

  • cp1076
  • cp1078
  • cp1080
  • cp1082
  • cp1084
  • cp1086
  • cp1088
  • cp1090

text:

  • cp1075
  • cp1077
  • cp1079
  • cp1081
  • cp1083
  • cp1085
  • cp1087
  • cp1089

esams hosts

Please upgrade the cache_upload hosts first (they take precedence):

  • cp3051.esams.wmnet
  • cp3053.esams.wmnet
  • cp3055.esams.wmnet
  • cp3057.esams.wmnet
  • cp3059.esams.wmnet
  • cp3061.esams.wmnet
  • cp3063.esams.wmnet
  • cp3065.esams.wmnet

Then the cache_text hosts:

  • cp3050.esams.wmnet
  • cp3052.esams.wmnet
  • cp3054.esams.wmnet
  • cp3056.esams.wmnet
  • cp3058.esams.wmnet
  • cp3060.esams.wmnet
  • cp3062.esams.wmnet
  • cp3064.esams.wmnet

Please coordinate depooling/pooling of the servers with the #wikimedia-traffic channel.

Update Checklist

CP system BIOS update directions (a rough automation sketch follows the checklist):

  • - ensure with Traffic that the host can be offline
  • - shut down the host via OS commands; this automatically depools it from pybal
  • - update the firmware via the mgmt interface
  • - boot the host back into the OS; the puppet run should turn all icinga checks green (you may need to manually re-run checks to speed things up)
  • - once green in icinga, run 'pool' from the command line on the host
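
Below is a minimal sketch of how that per-host cycle could be scripted. The firmware flash itself stays manual via the mgmt/iDRAC console; the 'pool' wrapper matches what's referenced in this task, while the SSH invocation, puppet command, and wait timings are assumptions to adapt to local tooling.

```python
#!/usr/bin/env python3
"""Sketch of the per-host update cycle from the checklist above."""
import subprocess
import time

def ssh(host: str, command: str, check: bool = True) -> int:
    """Run a command on the host over SSH and return its exit code."""
    return subprocess.run(["ssh", host, command], check=check).returncode

def wait_for_ssh(host: str, timeout: int = 1800, interval: int = 30) -> None:
    """Poll until the host answers SSH again after the firmware reboot."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if ssh(host, "true", check=False) == 0:
            return
        time.sleep(interval)
    raise TimeoutError(f"{host} did not come back within {timeout}s")

def update_host(host: str) -> None:
    # 1. Shut down via the OS; this self-depools the host from pybal.
    ssh(host, "sudo shutdown -h now", check=False)

    # 2. Flash the BIOS via the mgmt interface, then power the host back on.
    input(f"Flash firmware on {host} via mgmt, boot it, then press Enter... ")

    # 3. Wait for the OS, trigger a puppet run, confirm icinga is green.
    wait_for_ssh(host)
    ssh(host, "sudo puppet agent --test", check=False)  # or the local wrapper
    input(f"Confirm icinga is all green for {host}, then press Enter... ")

    # 4. Repool from the host itself.
    ssh(host, "sudo pool")

if __name__ == "__main__":
    for host in ["cp1075.eqiad.wmnet", "cp1076.eqiad.wmnet"]:
        update_host(host)
```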

Checks to run between system updates & changing pool state:

Event Timeline

Restricted Application added a subscriber: Aklapper.
wiki_willy added a project: DC-Ops.
wiki_willy added a subscriber: wiki_willy.

@RobH - can you work with Traffic to get the BIOS upgraded on the cp hosts in esams? In T240177, @Papaul
found a Dell bulletin associated with the random crashes:

"Fixed a continuous reboot issue and Out of Resource error with PCIe IO resource allocation which was observed in the 2.4.7 version." see link below:
https://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverid=wgm2r&oscode=wst14&productcode=poweredge-r440"

Thanks,
Willy

RobH moved this task from Backlog to Hardware Failure / Repair on the ops-esams board.
RobH added subscribers: BBlack, RobH.

Please note that Traffic (and @BBlack) previously asked me NOT to do this on these hosts while they worked out why the hosts were crashing (reference T238305).

@BBlack: Are we cleared to upgrade the BIOS on these?

Please note the iDRAC 4.0 firmware drastically changes iDRAC's behavior, and I'm not sure we should upgrade the fleet without testing. The BIOS is different and may fix the crash issue, so no objection there.

@BBlack: Please comment and assign back to me with the procedure and timelines for upgrading. Typically, I can do these one at a time (one in upload, one in text) and bring them fully back online and hand back to you before starting on the next. We can also schedule these out with 1 from each group per day, or I can provide directions for traffic to flash. Any of these work, so please advise what would work best for Traffic and assign back to me!

@BBlack, can we modify this task to include the eqiad caches that need updating as well? I'll be handling these remotely. During this process, if any single server fails and requires on-site work, I'll make a sub-task for its repair off this task.

If that is acceptable, we'll just need to add the eqiad cp systems to be updated into the task description. As I understand it now, we want to update ALL R440 BIOS versions in our cp clusters, correct?

@RobH - Yes, let's edit this to include eqiad as well. We've had the same symptoms both places, and they're the same approximate generation of hardware configuration (IIRC, only the NVMe changed to a slightly newer/better model from eqiad to esams, but the base system is otherwise fairly identical).

As for process - in general we should be able to cleanly shut these down from the OS level, and they'll self-depool during their shutdown. When a node is ready to go back into service, the command 'pool' from a root prompt will re-pool its services into production use. As long as the icinga checks for the node are coming back all-green after the reboot (some may not until the initial puppet run has completed!), we should normally be good to repool. The only other caveat is to watch out for the ongoing Debian Buster upgrades as well: if it's EU daytime (US morning), check with the Traffic channel in case they're upgrading the same nodes.
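
As an illustration of the "all-green after the initial puppet run" gate described above, here is a small sketch that checks the agent's last run summary before suggesting a repool. It assumes a stock Debian puppet agent layout and requires PyYAML; the summary path and the example host name are placeholders:

```python
#!/usr/bin/env python3
"""Sketch of a 'safe to repool?' check: confirm the initial puppet run has
completed recently with no failed events before running 'pool'."""
import subprocess
import time
import yaml  # PyYAML

SUMMARY_PATH = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed path

def puppet_ran_cleanly(host: str, max_age: int = 3600) -> bool:
    """True if the host's last puppet run is recent and had no failures."""
    out = subprocess.run(
        ["ssh", host, f"sudo cat {SUMMARY_PATH}"],
        capture_output=True, text=True, check=True,
    ).stdout
    summary = yaml.safe_load(out)
    recent = time.time() - summary["time"]["last_run"] < max_age
    clean = summary["events"]["failure"] == 0
    return recent and clean

host = "cp3051.esams.wmnet"  # example host
if puppet_ran_cleanly(host):
    print(f"{host}: puppet run clean; double-check icinga, then run 'pool'")
else:
    print(f"{host}: puppet has not completed cleanly yet; do not repool")
```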

RobH renamed this task from Upgrade BIOS and IDRAC firmware on esams caches to Upgrade BIOS and IDRAC firmware on R440 cp systems. Feb 10 2020, 8:26 PM
RobH updated the task description.

My plan is to do one from each service group (upload/text) at a time, batched together. (It is just as easy to watch two BIOS updates as one, but it doesn't scale much beyond that for close supervision.) Would that be ok?
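
For illustration, pairing one upload host with one text host per round (using the eqiad lists from the description) could look like this minimal sketch:

```python
# Sketch of the proposed batching: one upload host and one text host per
# round, mirroring the eqiad host lists in the task description.
upload = [f"cp{n}" for n in range(1076, 1091, 2)]  # cp1076, cp1078, ... cp1090
text = [f"cp{n}" for n in range(1075, 1090, 2)]    # cp1075, cp1077, ... cp1089

for round_number, (u, t) in enumerate(zip(upload, text), start=1):
    print(f"round {round_number}: {u} (upload) + {t} (text)")
```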

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:38:30Z] <robh> cp1075 & cp1076 offline for bios updates per T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:58:41Z] <robh> cp107[56] returned to service, cp107[78] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:42:10Z] <robh> cp107[89] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:43:23Z] <robh> cp107[789] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:06:05Z] <robh> cp108[01] returned to service, cp108[23] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:30:14Z] <robh> cp108[23] returned to service via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:38:12Z] <robh> depooling cp108[45] for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:57:30Z] <robh> cp108[45] returned to service, depooling cp108[67]for firmware update via T243167

Please note the outage caused the SAL entry of my adding cp108[67] back to service (as the rest weren't really returned, but depooled).

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:12:36Z] <robh> cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:48:16Z] <robh> cp1089 cp1090 returned to service via T243167

Please note that as of now, all eqiad cp systems have been updated to the latest BIOS revision. If these hosts experience any further crashes, it's vital to note it here for review.

I'll move on to the esams cp systems in my afternoon, as that will be closer to post-peak time for esams.

RobH removed a project: ops-eqiad.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:37:52Z] <robh> taking cp3050 & cp3051 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:54:57Z] <robh> cp3050 & cp3051 returned to service via T243167

RobH changed the task status from Open to Stalled. Feb 26 2020, 6:53 PM

We've had a lot of work unrelated to this task ongoing in esams. That work takes priority, so this is going to sit stalled until Traffic gives sign-off to continue.

Ok, this has now sat neglected for a while. @BBlack: Should I resume updating the BIOS on these hosts in a rotating, one-per-cluster fashion again? Assigning to you for an update, but feel free to kick it back to me if/when I should continue!

RobH changed the task status from Stalled to Open. Jul 30 2020, 5:27 PM

This has sat ignored while I was doing procurement, but I'll pick this back up next week and start updating these again. (On clinic duty this week so I don't want to split my attention quite that much.)

There is currently an experiment going on with the caching hosts in esams (T264398#6772586), and flashing firmware would interrupt it. When that is done, this can resume, likely in a week or so.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!