Update NIC firmware on all Elastic PowerEdge R440 elastic hosts
Closed, Resolved | Public | 3 Estimated Story Points

Description

Before we can update our hosts to Bullseye, we'll need to update the firmware on the Dell PowerEdge R440 hosts with Broadcom NetXtreme NICs (the installer won't work until the NIC firmware is updated). As of this writing, the current firmware version is 21.85.21.92.

AC:

  • All hosts are updated to the latest approved NIC firmware.
  • The process is thoroughly tested and documented.

See https://phabricator.wikimedia.org/T309343#7971329 and https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices: for more details.

Event Timeline

bking renamed this task from Update firmware on all elastic hosts to Update NIC firmware on all Elastic PowerEdge R440 elastic hosts. Jul 11 2022, 4:22 PM
bking closed this task as Invalid.
bking updated the task description.
bking updated the task description.

Looks like Puppetboard has the relevant information; here is an example entry for a single host.

Using this info, I made a list of all the Elastic hosts with an R440 chassis.

The NIC firmware version info is in PuppetDB too, but it looks like it'll require a more complex query. Working on it...
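
A minimal sketch of the filtering this implies, once chassis and firmware data have been exported per host. This is not the actual query used here: the record keys (`certname`, `chassis_model`, `nic_firmware`) and the sample hosts are hypothetical, and it assumes 21.85.21.92 is the target version.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: the version quoted in the task description is the target.
TARGET_FIRMWARE = "21.85.21.92"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '21.85.21.92' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def hosts_needing_update(
    records: Iterable[Dict[str, str]], target: str = TARGET_FIRMWARE
) -> List[str]:
    """Return sorted names of R440 hosts whose NIC firmware is older than target.

    Each record is a dict with hypothetical keys 'certname',
    'chassis_model' and 'nic_firmware'; these are illustrative names,
    not real Puppet fact names.
    """
    wanted = parse_version(target)
    return sorted(
        rec["certname"]
        for rec in records
        if "R440" in rec.get("chassis_model", "")
        and parse_version(rec["nic_firmware"]) < wanted
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"certname": "host-a.example", "chassis_model": "PowerEdge R440", "nic_firmware": "20.8.4"},
        {"certname": "host-b.example", "chassis_model": "PowerEdge R440", "nic_firmware": "21.85.21.92"},
        {"certname": "host-c.example", "chassis_model": "PowerEdge R450", "nic_firmware": "20.8.4"},
    ]
    print(hosts_needing_update(sample))
```

Comparing versions as integer tuples avoids the string-comparison trap where "20.8.4" would sort after "20.10.0".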

bking reopened this task as In Progress. Jul 11 2022, 9:05 PM

Mentioned in SAL (#wikimedia-operations) [2022-07-12T15:48:48Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic1065.eqiad.wmnet with reason: firmware update T312298

Mentioned in SAL (#wikimedia-operations) [2022-07-12T15:49:01Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic1065.eqiad.wmnet with reason: firmware update T312298

Mentioned in SAL (#wikimedia-operations) [2022-07-12T19:19:53Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on elastic2038.codfw.wmnet with reason: firmware update T312298

Mentioned in SAL (#wikimedia-operations) [2022-07-12T19:20:07Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic2038.codfw.wmnet with reason: firmware update T312298

Firmware is staged on all R440 hosts and will be applied automatically during the reimage. I've reimaged several hosts already, and I can verify that the firmware updates apply correctly without disrupting the reimage process.

MPhamWMF set the point value for this task to 3. Jul 18 2022, 3:56 PM
MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

After some thought, I think the current state (completely applied in CODFW, to be finished in EQIAD) is enough to close. We've had 100% success in CODFW and I don't foresee a problem in EQIAD.

Follow the EQIAD work in T289135.