Upgrade db1104 firmware
Open, High, Public

Description

After upgrading db1104's kernel (this is s8 eqiad master) the host didn't come back up.
This is probably another case of T216240; we need to schedule a maintenance window with @Cmjohnson to get its firmware upgraded.

For now I have just started it with an older kernel.

Event Timeline

Marostegui triaged this task as High priority.
Marostegui created this task.
Marostegui moved this task from Triage to Ready on the DBA board.

@wiki_willy we're good to move ahead with this one. The host needs to be downtimed and shut down in advance; what (EU-friendly) time would work for you?

I can handle this. A firmware upgrade takes anywhere from 5 to 30 minutes (depending on whether it's only BIOS, etc.). This is just BIOS, so I can do it anytime this week.

@LSobanski: Please have someone take it offline and then assign this task back to me for the actual firmware update. I can handle this anytime this week, no problem!

Firmware Update Notes:

Workflow:

  • Server is depooled; it can be shut down or left online as needed by the DBA team. Assign the task to @RobH once the host is ready for the firmware update.
  • Rob uploads new iDRAC firmware (best practice is to always update iDRAC to the latest version before applying other firmware updates); this doesn't reboot the host.
  • Rob uploads new BIOS firmware; this will reboot the host.
  • Host has new firmware; this task is resolved, which lets the DBA team know to bring it back into service.
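
The workflow above can be written down as an ordered checklist. What follows is only an illustrative sketch, not tooling that exists for this task: the `Step` type, the `WORKFLOW` list and both helper functions are hypothetical names. It records who owns each step and whether the step reboots the host, and checks that the iDRAC update comes before the BIOS update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    owner: str          # who performs the step (names taken from the task)
    action: str         # what is done
    reboots_host: bool  # whether the host itself restarts


# Hypothetical encoding of the workflow; order matters because
# iDRAC is updated before any other firmware.
WORKFLOW = [
    Step("DBA team", "depool host and assign task to RobH", False),
    Step("RobH", "upload new iDRAC firmware", False),
    Step("RobH", "upload new BIOS firmware", True),
    Step("RobH", "resolve task so DBA team can repool host", False),
]


def steps_that_reboot(workflow):
    """Return the actions that restart the host."""
    return [s.action for s in workflow if s.reboots_host]


def idrac_before_bios(workflow):
    """Check that the iDRAC update is listed before the BIOS update."""
    idrac = next(i for i, s in enumerate(workflow) if "iDRAC" in s.action)
    bios = next(i for i, s in enumerate(workflow) if "BIOS" in s.action)
    return idrac < bios


if __name__ == "__main__":
    for number, step in enumerate(WORKFLOW, 1):
        note = " (reboots host)" if step.reboots_host else ""
        print(f"{number}. [{step.owner}] {step.action}{note}")
```

Only the BIOS step reboots the host, so that is the step the downtime window needs to cover.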

Mentioned in SAL (#wikimedia-operations) [2021-07-13T16:57:59Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on db1104.eqiad.wmnet with reason: Firmware upgrade T286226

Mentioned in SAL (#wikimedia-operations) [2021-07-13T16:58:05Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1104.eqiad.wmnet with reason: Firmware upgrade T286226

Mentioned in SAL (#wikimedia-operations) [2021-07-13T16:59:20Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Firmware upgrade T286226

Mentioned in SAL (#wikimedia-operations) [2021-07-13T16:59:27Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Firmware upgrade T286226

Machine is depooled and ready for firmware upgrade.

As a best practice, I upload new iDRAC firmware before anything else. Since iDRAC handles the firmware updates of other components, that seems safest.

That said, this host is taking forever to update the iDRAC firmware properly. I started it immediately after Kormat depooled the host, and it's now in an iDRAC reboot (host accessible, iDRAC down) as I wait for it to come back so I can update the BIOS.

This has gone exceedingly poorly.

I updated the iDRAC firmware, and iDRAC is accessible via SSH but not via HTTPS, as its self-signed cert changed to one the browser does not support (even with the bypass option via the Advanced button).

I think the solution then is to use a crash cart and apply the BIOS update via USB stick, or roll the iDRAC version back one to a version that doesn't break iDRAC. Without a crash cart connection, though, I'm not sure how to roll this back to an earlier version.

I'm hunting around the internet for potential solutions that don't involve our onsites laying hands on the host.

I thought I had this solved, but I was incorrect: it won't load in any browser.

I am at a loss on how to proceed without connecting a crash cart and rolling back the iDRAC version that way. Basically, the latest iDRAC firmware made the HTTPS iDRAC interface unreachable. That interface is what we use to flash the BIOS firmware, so now we're stuck with an outdated BIOS on this host.

@LSobanski: We won't have our normal on-sites there until next week. This host is unfortunately outside of support coverage from Dell, so we cannot have them dispatch a tech. We will need to fix this ourselves, and I'm not 100% certain of the process for using the crash cart to access and roll back firmware, so it would be the blind leading the blind if I tried to walk an EQ tech through it.

Thoughts?

@RobH: as the host is still accessible, I'm going to repool it in the meantime.

@RobH Thanks for digging into it. Let's wait until we have people onsite and can crash cart into the host. Who would be the best person to assign this to so it gets scheduled as soon as possible?

Moving over to @Cmjohnson, who will be back before John next week. Thanks, Willy

FYI: My understanding is that with a crash cart, one can manually activate the Lifecycle Controller during POST when rebooting the host and then roll back to the previous version of any firmware pushed. This option is also available via the HTTPS interface, which is not reachable due to this issue.

wiki_willy added a subscriber: Jclark-ctr.

It looks like Chris is going to be out for a while. Moving this task over to @Jclark-ctr, who should be back Tuesday or Wednesday. Thanks, Willy