⚓ T243167 Upgrade BIOS and IDRAC firmware on R440 cp systems

		Status	Subtype	Assigned	Task
		Stalled		None	T238305 Servers freezing across the caching cluster
		Resolved		RobH	T243167 Upgrade BIOS and IDRAC firmware on R440 cp systems

Restricted Application added a project: SRE. · Jan 20 2020, 8:28 AM

Restricted Application added a subscriber: Aklapper.

MoritzMuehlenhoff triaged this task as High priority. · Jan 20 2020, 8:28 AM

wiki_willy assigned this task to RobH. · Jan 21 2020, 4:08 PM

@RobH - can you work with Traffic to get the BIOS upgraded on the cp hosts in esams? In T240177, @Papaul found a Dell bulletin associated with the random crashes:

"Fixed a continuous reboot issue and Out of Resource error with PCIe IO resource allocation which was observed in the 2.4.7 version." See the link below:
https://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverid=wgm2r&oscode=wst14&productcode=poweredge-r440

Thanks,
Willy

CDanis subscribed. · Jan 21 2020, 4:14 PM

Please note that Traffic (and @BBlack) previously asked me NOT to do this on these hosts, while they worked out why they are crashing. (reference T238305)

@BBlack: Are we cleared to upgrade the bios on these?

Please note the idrac 4.0 firmware drastically changes the behavior of idrac, and I'm not sure we should upgrade the fleet without testing. Bios is different, and may fix the crash issue, so no objection there.

@BBlack: Please comment and assign back to me with the procedure and timelines for upgrading. Typically, I can do these one at a time (one in upload, one in text) and bring them fully back online and hand back to you before starting on the next. We can also schedule these out with 1 from each group per day, or I can provide directions for traffic to flash. Any of these work, so please advise what would work best for Traffic and assign back to me!

RobH moved this task from Backlog to Hardware on the Traffic board. · Jan 21 2020, 4:53 PM

@BBlack, can we modify this task to include the eqiad caches that need updating as well? I'll be handling these remotely. During this process, if any single server fails and requires on-site work, I'll make a sub-task for its repair off this task.

If that is acceptable, we'll just need to add the eqiad cp systems to be updated into the task description. As I understand it now, we want to update ALL R440 bios versions in our cp clusters, correct?

@RobH - Yes, let's edit this to include eqiad as well. We've had the same symptoms both places, and they're the same approximate generation of hardware configuration (IIRC, only the NVMe changed to a slightly newer/better model from eqiad to esams, but the base system is otherwise fairly identical).

As for process - in general we should be able to cleanly shut these down from the OS level, and they'll self-depool during their shutdown. When a node is ready to go back into service, running the pool command from a root prompt will re-pool its services into production use. As long as icinga checks for the nodes come back all-green after the reboot (some may not until the initial puppet run has completed!), we should normally be good to repool. The only other caveat is to watch out for the ongoing Debian Buster upgrades as well - if it's EU time (US morning), check with the Traffic channel in case they're upgrading the same nodes.
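The repool condition described above can be sketched as a small gate. This is only an illustration in Python: the function names, the check names, and the "OK" state string are assumptions made for the example, not the actual icinga or pooling tooling.

```python
from typing import List, Mapping


def failing_checks(check_states: Mapping[str, str]) -> List[str]:
    """Return the names of checks that are not green, sorted for stable output."""
    return sorted(name for name, state in check_states.items() if state != "OK")


def ready_to_repool(check_states: Mapping[str, str], puppet_run_done: bool) -> bool:
    """Allow a repool only after the first puppet run and with every check green.

    An empty mapping is treated as "not ready", since no checks reporting
    is not the same thing as all checks passing.
    """
    if not puppet_run_done or not check_states:
        return False
    return not failing_checks(check_states)


# Hypothetical check names, purely for illustration.
states = {"puppet last run": "OK", "cache service": "CRITICAL"}
print(ready_to_repool(states, puppet_run_done=True), failing_checks(states))
```

Treating "no data" as not ready is a deliberate choice here: right after a reboot, checks may simply not have reported yet.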

BBlack reassigned this task from BBlack to RobH. · Feb 10 2020, 6:45 PM

RobH renamed this task from Upgrade BIOS and IDRAC firmware on esams caches to Upgrade BIOS and IDRAC firmware on R440 cp systems. · Feb 10 2020, 8:26 PM

RobH updated the task description.

My plan is to do one from each service group (upload/text) at a time, batched together. (Watching two BIOS updates is just as easy as watching one, but close supervision doesn't really scale beyond that.) Would that be OK?
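A minimal sketch of that batching plan, in Python. It assumes the hosts have already been split into an upload list and a text list; the function name and the placeholder host names are hypothetical, since the task does not state which host belongs to which group.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List


def paired_batches(upload_hosts: Iterable[str], text_hosts: Iterable[str]) -> Iterator[List[str]]:
    """Yield batches holding at most one host from each service group."""
    for upload, text in zip_longest(upload_hosts, text_hosts):
        # zip_longest pads the shorter group with None; drop the padding.
        yield [host for host in (upload, text) if host is not None]


# Placeholder names, not real cluster assignments.
for batch in paired_batches(["upload-a", "upload-b", "upload-c"], ["text-a", "text-b"]):
    print(batch)
```

Each batch never takes more than one host out of a given service group, which matches the supervision limit described above.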

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:38:30Z] <robh> cp1075 & cp1076 offline for bios updates per T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:58:41Z] <robh> cp107[56] returned to service, cp107[78] offline for bios update via T243167
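The SAL entries in this task use a bracket shorthand for host groups, e.g. cp107[56] for cp1075 and cp1076. A minimal Python sketch for expanding that shorthand, assuming a single bracket group that lists individual characters (no ranges); expand_hosts is a hypothetical helper, not an existing tool.

```python
import re
from typing import List

# prefix[chars]suffix, with exactly one bracket group.
_BRACKET = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<chars>[^\[\]]+)\](?P<suffix>[^\[\]]*)$")


def expand_hosts(pattern: str) -> List[str]:
    """Expand 'cp107[56]' into ['cp1075', 'cp1076'].

    Names without a bracket group are returned unchanged. Ranges such as
    [0-9] are NOT supported: every character inside the brackets is taken
    literally.
    """
    match = _BRACKET.match(pattern)
    if match is None:
        return [pattern]
    prefix, chars, suffix = match["prefix"], match["chars"], match["suffix"]
    return [f"{prefix}{char}{suffix}" for char in chars]


print(expand_hosts("cp107[56]"))
print(expand_hosts("cp1088"))
```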

RobH updated the task description. · Feb 10 2020, 10:19 PM

RobH updated the task description. · Feb 10 2020, 10:39 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:42:10Z] <robh> cp107[89] returned to service, cp108[01] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T22:43:23Z] <robh> cp107[789] returned to service, cp108[01] offline for bios update via T243167

RobH updated the task description. · Feb 10 2020, 10:44 PM

RobH added a project: ops-eqiad. · Feb 10 2020, 11:05 PM

RobH updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:06:05Z] <robh> cp108[01] returned to service, cp108[23] offline for bios update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-10T23:30:14Z] <robh> cp108[23] returned to service via T243167

RobH updated the task description. · Feb 10 2020, 11:30 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:38:12Z] <robh> depooling cp108[45] for firmware update via T243167

RobH updated the task description. · Feb 11 2020, 8:56 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-11T20:57:30Z] <robh> cp108[45] returned to service, depooling cp108[67] for firmware update via T243167

Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.)

RobH updated the task description. · Feb 12 2020, 4:06 PM

Papaul unsubscribed. · Feb 12 2020, 5:09 PM

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. · Feb 19 2020, 4:20 PM

RobH mentioned this in T245645: cp1088 · Feb 19 2020, 5:08 PM

RobH updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:12:36Z] <robh> cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:48:16Z] <robh> cp1089 cp1090 returned to service via T243167

Please note that as of now, all eqiad cp systems have been updated to the latest BIOS revision. If these hosts experience any further crashes, it's vital to note it for review.

I'll move on to the esams cp systems in my afternoon, as that will be closer to post-peak esams time.

RobH updated the task description. · Feb 19 2020, 5:54 PM

RobH removed a project: ops-eqiad.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:37:52Z] <robh> taking cp3050 & cp3051 offline for firmware update via T243167

Mentioned in SAL (#wikimedia-operations) [2020-02-19T22:54:57Z] <robh> cp3050 & cp3051 returned to service via T243167

RobH updated the task description. · Feb 19 2020, 10:55 PM

We've had a lot of work unrelated to this task ongoing in esams. That work takes priority, so this is going to sit stalled until Traffic gives sign-off to continue.

RobH added a subtask: T244127: cp3057 crash (was: network down). · Mar 2 2020, 11:09 PM

RobH mentioned this in T244127: cp3057 crash (was: network down).

faidon added a parent task: T238305: Servers freezing across the caching cluster. · Apr 1 2020, 9:40 PM

faidon removed a subtask: T244127: cp3057 crash (was: network down).

Vgutierrez mentioned this in T238305: Servers freezing across the caching cluster. · Apr 16 2020, 12:53 PM

OK, this has now sat neglected for a while. @BBlack: Should I resume updating the BIOS on these hosts in a rotating, one-per-cluster fashion again? Assigning to you for update, but feel free to kick back to me if/when I should continue!

RobH changed the task status from Stalled to Open. · Jul 30 2020, 5:27 PM

@RobH please do when able

BBlack moved this task from Hardware to Radar/Not for service by Traffic on the Traffic board. · Sep 29 2020, 8:16 PM

This has sat ignored while I was doing procurement, but I'll pick this back up next week and start updating these again. (On clinic duty this week so I don't want to split my attention quite that much.)

So there is currently an experiment going on with caching hosts in esams, and flashing firmware would interrupt that (see T264398#6772586). When that is done, this can resume, likely in a week or so.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Can we resurrect this so I can finish out the esams hosts? I'd like to close this out; it's just shaming me with its age.

Checking first, since the Traffic team could be running tests that would be affected by a rolling restart of the majority of esams hosts (it would happen at a regular cadence, but it would still affect test metrics).

Synced with Brandon via IRC, and I'm good to resume this. Each host, one per cluster at a time (one upload, one text), disabling puppet agent, depooling, rebooting for firmware, then repooling and reenabling puppet.
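The per-batch sequence described above can be written down as an ordered plan. This is a sketch in Python only: the step strings are labels taken from the comment, not real commands, and maintenance_plan is a hypothetical helper.

```python
from typing import Iterable, List, Tuple

# Labels in the order given above; these are NOT actual command lines.
STEPS = (
    "disable puppet agent",
    "depool",
    "reboot for firmware",
    "repool",
    "re-enable puppet agent",
)


def maintenance_plan(batch: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (step, host) pairs, applying each step to every host in the
    batch before moving on to the next step."""
    hosts = list(batch)
    return [(step, host) for step in STEPS for host in hosts]


for step, host in maintenance_plan(["cp3052", "cp3053"]):
    print(f"{host}: {step}")
```

Keeping the steps in one ordered tuple makes it harder to forget the last one; on cp3060, puppet was only re-enabled and the host repooled in a later SAL entry.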

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:01:10Z] <robh> cp305[23] going offline via T243167 for firmware updates (puppet agent disabled and depooled prior to reboot)

RobH changed the task status from Open to In Progress. · May 11 2022, 9:01 PM

RobH updated the task description. · May 11 2022, 9:33 PM

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:34:00Z] <robh> cp50[23] returned to service and all green in icinga, cp50[45] depooling for firmware update T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-11T21:34:37Z] <robh> cp30[23] returned to service and all green in icinga, cp30[45] depooling for firmware update T243167

Mentioned in SAL (#wikimedia-sre) [2022-05-11T22:00:55Z] <robh> cp305[45] returned to service and all green in icinga, cp305[67] depooling for firmware update T243167

RobH updated the task description. · May 11 2022, 10:01 PM

Mentioned in SAL (#wikimedia-sre) [2022-05-11T22:28:12Z] <robh> cp305[67] returned to service and all green in icinga, cp305[89] depooling for firmware update T243167

RobH updated the task description. · May 11 2022, 10:28 PM

RobH updated the task description. · May 11 2022, 11:15 PM

Will resume tomorrow, late evening for esams / afternoon for me.

Mentioned in SAL (#wikimedia-sre) [2022-05-12T20:50:30Z] <robh> resuming last 6 esams cp host firmware updates via T243167. cp306[01] going offline

cp3060 refuses to load its idrac https interface, even after I clear the browser history and do a racreset on the idrac interface. Skipping it and continuing with the rest of them.

RobH updated the task description. · May 12 2022, 9:14 PM

Mentioned in SAL (#wikimedia-sre) [2022-05-12T21:15:22Z] <robh> cp306[01] returned to service, cp306[23] coming down for firmware update via T243167

RobH updated the task description. · May 12 2022, 9:42 PM

Mentioned in SAL (#wikimedia-sre) [2022-05-12T21:43:12Z] <robh> cp306[23] returned to service, cp306[45] coming down for firmware update via T243167

RobH updated the task description. · May 12 2022, 10:25 PM

All done except cp3060, which refuses to pull up https on idrac for the firmware flash; it just endlessly loads the login screen until timeout.

I'll need to loop back and try powering it off, and then turn off power to its ports to drain and fully reset the idrac interface, since it didn't fix itself with a racreset command.

Mentioned in SAL (#wikimedia-operations) [2022-05-19T22:07:30Z] <robh> cp3060 idrac interface frozen, rebooted via power outlet control on T243167

cp3060's idrac https interface pulls up but then sits endlessly on 'loading' (see attached screenshot).

I tried a racreset command via the idrac command line, as well as completely draining power to the device via the PDU controls for ps1-oe15-esams. While the idrac interface comes back up for command line ssh control, the https interface still loads endlessly.

As this is now a hardware failure, I'm going to create a new task to troubleshoot this, and include the firmware update of cp3060 in that troubleshooting task.

RobH mentioned this in T308797: cp3060 idrac https interface failures. · May 19 2022, 10:15 PM

Mentioned in SAL (#wikimedia-operations) [2022-05-20T08:53:58Z] <vgutierrez> re-enabling puppet and repooling cp3060 - T308797 T243167

Upgrade BIOS and IDRAC firmware on R440 cp systems
Closed, Resolved · Public

Description

eqiad hosts

esams hosts

Update Checklist

Related Objects

Event Timeline

	F35156154: endless.loading.png
	May 19 2022, 10:14 PM
