
Upgrade BIOS and IDRAC firmware on R440 cp systems
Open, High, Public

Description

We should upgrade the BIOS and iDRAC firmware on the esams cache hosts, which are crashing frequently (T238305).
This task was expanded on 2020-02-10 to include eqiad cache systems.

BIOS Version: 2.4.8
iDRAC Firmware Version: 4.00.00.00

eqiad hosts

upload:

  • cp1076
  • cp1078
  • cp1080
  • cp1082
  • cp1084
  • cp1086
  • cp1088
  • cp1090

text:

  • cp1075
  • cp1077
  • cp1079
  • cp1081
  • cp1083
  • cp1085
  • cp1087
  • cp1089

esams hosts

Please upgrade cache_upload hosts with precedence:

  • cp3051.esams.wmnet
  • cp3053.esams.wmnet
  • cp3055.esams.wmnet
  • cp3057.esams.wmnet
  • cp3059.esams.wmnet
  • cp3061.esams.wmnet
  • cp3063.esams.wmnet
  • cp3065.esams.wmnet

And there's also the cache_text:

  • cp3050.esams.wmnet
  • cp3052.esams.wmnet
  • cp3054.esams.wmnet
  • cp3056.esams.wmnet
  • cp3058.esams.wmnet
  • cp3060.esams.wmnet
  • cp3062.esams.wmnet
  • cp3064.esams.wmnet

Please coordinate depooling/pooling of the servers with the #wikimedia-traffic channel.

Update Checklist

CP system BIOS update directions:

  • ensure with Traffic that the host can be taken offline
  • shut down the host via OS commands; this will automatically depool the host from pybal
  • update firmware via the mgmt interface
  • boot the host back into the OS; the puppet run should clear all icinga checks to green (you may need to manually re-fire the puppet checks to speed things up)
  • once green in icinga, run 'pool' from the command line of the host
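
The gate in the last two steps (repool only once every icinga check is green) can be sketched as a small predicate. This is a minimal illustration, not tooling referenced by this task: the function names, the check names, and the dictionary shape are all hypothetical, and in practice the states would be read from icinga by hand or by whatever tooling Traffic prefers.

```python
# Hypothetical helpers: decide whether a freshly updated cp host may be repooled.
# `checks` maps an icinga check name to its current state string. Both the
# function names and this data shape are assumptions made for illustration.

def ready_to_pool(checks: dict[str, str]) -> bool:
    """Return True only when there is at least one check and all are OK."""
    if not checks:
        # No data yet (e.g. the first puppet run has not finished): do not pool.
        return False
    return all(state == "OK" for state in checks.values())


def failing_checks(checks: dict[str, str]) -> list[str]:
    """List the checks still blocking a repool, sorted for stable output."""
    return sorted(name for name, state in checks.items() if state != "OK")


if __name__ == "__main__":
    # Made-up check names, roughly what a host might show right after boot.
    after_boot = {"puppet last run": "CRITICAL", "example http check": "OK"}
    print(ready_to_pool(after_boot), failing_checks(after_boot))
```

Treating "no checks reported yet" as not ready is a deliberate choice here, since some checks may not report until the initial puppet run has completed.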

Checks to run between system updates & changing pool state:

Event Timeline

Restricted Application added a project: Operations.Jan 20 2020, 8:28 AM
Restricted Application added a subscriber: Aklapper.
MoritzMuehlenhoff triaged this task as High priority.Jan 20 2020, 8:28 AM
wiki_willy assigned this task to RobH.Tue, Jan 21, 4:08 PM
wiki_willy added a project: DC-Ops.
wiki_willy added a subscriber: wiki_willy.

@RobH - can you work with Traffic to get the BIOS upgraded on the cp hosts in esams? In T240177, @Papaul
found a Dell bulletin that is associated with the random crashes:

"Fixed a continuous reboot issue and Out of Resource error with PCIe IO resource allocation which was observed in the 2.4.7 version." See the link below:
https://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverid=wgm2r&oscode=wst14&productcode=poweredge-r440

Thanks,
Willy

CDanis added a subscriber: CDanis.Tue, Jan 21, 4:14 PM
RobH reassigned this task from RobH to BBlack.Tue, Jan 21, 4:53 PM
RobH moved this task from Backlog to Break/Fix on the ops-esams board.
RobH added subscribers: BBlack, RobH.

Please note that Traffic (and @BBlack) previously asked me NOT to do this on these hosts, while they worked out why they are crashing. (reference T238305)

@BBlack: Are we cleared to upgrade the BIOS on these?

Please note the iDRAC 4.0 firmware drastically changes the behavior of iDRAC, and I'm not sure we should upgrade the fleet without testing. The BIOS is a different matter and may fix the crash issue, so no objection there.

@BBlack: Please comment and assign back to me with the procedure and timelines for upgrading. Typically, I can do these one at a time (one in upload, one in text) and bring them fully back online and hand back to you before starting on the next. We can also schedule these out with 1 from each group per day, or I can provide directions for traffic to flash. Any of these work, so please advise what would work best for Traffic and assign back to me!

RobH moved this task from Triage to Hardware on the Traffic board.Tue, Jan 21, 4:53 PM
RobH added a comment.Mon, Feb 10, 5:44 PM

@BBlack, can we modify this task to include the eqiad caches that need the update as well? I'll be handling these remotely. During this process, if any single server fails and requires on-site work, I'll make a sub-task for its repair off this task.

If that is acceptable, we'll just need to add the eqiad cp systems to be updated into the task description. As I understand it now, we want to update ALL R440 bios versions in our cp clusters, correct?

@RobH - Yes, let's edit this to include eqiad as well. We've had the same symptoms both places, and they're the same approximate generation of hardware configuration (IIRC, only the NVMe changed to a slightly newer/better model from eqiad to esams, but the base system is otherwise fairly identical).

As for process - in general we should be able to cleanly shut these down from the OS level, and they'll self-depool during their shutdown. When a node is ready to go back into service, the command 'pool' from a root prompt will re-pool its services into production use. As long as icinga checks for the nodes are coming back all-green after the reboot (some may not until the initial puppet run has completed!), we should normally be good to repool. The only other caveat is to watch out for the ongoing Debian Buster upgrades as well - if it's EU time (US morning), check with the Traffic channel in case they're upgrading the same nodes.

BBlack reassigned this task from BBlack to RobH.Mon, Feb 10, 6:45 PM
RobH renamed this task from Upgrade BIOS and IDRAC firmware on esams caches to Upgrade BIOS and IDRAC firmware on R440 cp systems.Mon, Feb 10, 8:26 PM
RobH updated the task description. (Show Details)
RobH updated the task description. (Show Details)
RobH added a comment.Mon, Feb 10, 8:28 PM

My plan is to do one from each service group (upload/text) at a time, batched together. (It is just as easy to watch two BIOS updates as one, but it doesn't quite scale beyond that for close supervision.) Would that be ok?
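
The batching plan above can be sketched as pairing the two host lists position by position. This is only an illustration of the plan, not a script used for the work: `paired_batches` is a hypothetical name, and the host lists are the eqiad ones from the task description.

```python
# Host lists copied from the eqiad section of the task description.
EQIAD_UPLOAD = [f"cp{n}" for n in range(1076, 1091, 2)]  # cp1076 ... cp1090
EQIAD_TEXT = [f"cp{n}" for n in range(1075, 1090, 2)]    # cp1075 ... cp1089


def paired_batches(text_hosts: list[str], upload_hosts: list[str]) -> list[tuple[str, str]]:
    """Pair one text host with one upload host per batch (hypothetical helper)."""
    if len(text_hosts) != len(upload_hosts):
        # The plan assumes equally sized groups; refuse to drop hosts silently.
        raise ValueError("text and upload groups must be the same size")
    return list(zip(text_hosts, upload_hosts))


if __name__ == "__main__":
    for text_host, upload_host in paired_batches(EQIAD_TEXT, EQIAD_UPLOAD):
        # Each batch is taken offline, flashed, verified, and repooled
        # before the next batch is started.
        print(f"batch: {text_host} + {upload_host}")
```

With these lists, each batch takes at most one host out of each service group at any time.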

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:38:30Z] <robh> cp1075 & cp1076 offline for bios updates per T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:58:41Z] <robh> cp107[56] returned to service, cp107[78] offline for bios update via T243167

RobH updated the task description. (Show Details)Mon, Feb 10, 10:19 PM
RobH updated the task description. (Show Details)Mon, Feb 10, 10:39 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:42:10Z] <robh> cp107[89] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:43:23Z] <robh> cp107[789] returned to service, cp108[01] offline for bios update via T243167

RobH updated the task description. (Show Details)Mon, Feb 10, 10:44 PM
RobH updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:06:05Z] <robh> cp108[01] returned to service, cp108[23] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:30:14Z] <robh> cp108[23] returned to service via T243167
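
The SAL entries above use a bracket shorthand such as cp107[56] for host pairs. A minimal sketch of how that shorthand expands, assuming each character inside the brackets is one alternative (so cp107[56] means cp1075 and cp1076); `expand_hosts` is a hypothetical helper, not an existing tool, and it deliberately rejects range syntax like [5-7] since the log entries here do not use it.

```python
import re

# One bracket group per pattern, e.g. "cp107[56]" or "cp107[789]".
_BRACKET = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<chars>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single character-class style bracket into host names."""
    match = _BRACKET.match(pattern)
    if match is None:
        # No bracket group: treat the pattern as a plain host name.
        return [pattern]
    if "-" in match["chars"]:
        # Ranges are out of scope for this sketch.
        raise ValueError(f"range syntax not supported: {pattern}")
    return [match["prefix"] + ch + match["suffix"] for ch in match["chars"]]


if __name__ == "__main__":
    for entry in ("cp107[56]", "cp107[789]", "cp1088"):
        print(entry, "->", expand_hosts(entry))
```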

RobH updated the task description. (Show Details)Mon, Feb 10, 11:30 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:38:12Z] <robh> depooling cp108[45] for firmware update via T243167

RobH updated the task description. (Show Details)Tue, Feb 11, 8:56 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:57:30Z] <robh> cp108[45] returned to service, depooling cp108[67]for firmware update via T243167

RobH added a comment.Wed, Feb 12, 4:05 PM

Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.)

RobH updated the task description. (Show Details)Wed, Feb 12, 4:06 PM
Papaul removed a subscriber: Papaul.Wed, Feb 12, 5:09 PM
RobH updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:12:36Z] <robh> cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:48:16Z] <robh> cp1089 cp1090 returned to service via T243167

RobH updated the task description. (Show Details)Wed, Feb 19, 5:51 PM

Please note that as of now, all eqiad cp systems have been updated to the latest BIOS revision. If these hosts experience any further crashes, it's vital to note it for review.

I'll move on to the esams cp systems in my afternoon, as that will be closer to post-peak esams time.

RobH updated the task description. (Show Details)Wed, Feb 19, 5:54 PM
RobH removed a project: ops-eqiad.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:37:52Z] <robh> taking cp3050 & cp3051 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:54:57Z] <robh> cp3050 & cp3051 returned to service via T243167

RobH updated the task description. (Show Details)Wed, Feb 19, 10:55 PM