
PXE boot defaults to automatically reimaging (normally destroying the OS and all filesystem data) on all servers
Open, Medium, Public

Description

This is not theoretical: it happened multiple times before the hack at the parent ticket was set up to prevent stateful services from getting reimaged and causing outages and/or data loss. One example was T160242, which caused an outage; at least one critical patch proxy was also affected, causing (I don't remember the details and may be wrong here) gerrit downtime.

No issues have happened for a few years, as we manually enable and disable the "reimageability" of both backup and mysql hosts.

Accidental reimage of servers has happened or may happen for some of these reasons:

  • replacement of the motherboard, which resets BIOS settings
  • flash/upgrade of BIOS
  • typo on wmf-auto-reimage run
  • restart of a misconfigured host (for example, if its BIOS was incorrectly set up)
  • failure of disk boot, falling through to the second boot option (e.g. primary partition corruption while the secondary partition still holds important data, or grub misconfiguration)

An accidental reboot can happen and is not a big deal (it should only cause an outage for as long as the reboot takes). A reimaged big server (e.g. a labsdb host) can take a week to be put back into service, as it has to load 12TB of data back, plus other setup processes.

Potential solutions:

  • Make a puppet flag, outside of partman, disallowing the reimage of a certain set of servers
  • Make a service control this property, on netbox or other stateful service
  • Make a check on wmf-auto-reimage disabling its run for certain servers (this would only fix manual reimages, but not accidental ones like those caused on reboot after maintenance)
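
As a rough illustration of the third option, a guard in the reimage tooling could look something like the sketch below. Everything in it is hypothetical: the function name, the exception, the `override` parameter and the host patterns are made up for the example and are not taken from wmf-auto-reimage. In practice the protected list would presumably come from puppet or netbox rather than being hardcoded.

```python
from fnmatch import fnmatchcase

# Assumed deny-list for illustration only; the real source of truth could be
# a puppet flag or a netbox attribute, as suggested in the list above.
PROTECTED_PATTERNS = ("db*", "labsdb*", "backup*")


class ReimageNotAllowed(Exception):
    """Raised when a protected host is targeted without an explicit override."""


def check_reimage_allowed(hostname, protected=PROTECTED_PATTERNS, override=False):
    """Refuse to continue for protected hosts unless the operator overrides.

    Matching is done on the short hostname (the part before the first dot).
    """
    short_name = hostname.split(".")[0]
    for pattern in protected:
        if fnmatchcase(short_name, pattern):
            if override:
                return  # operator explicitly accepted the risk
            raise ReimageNotAllowed(
                f"{hostname} matches protected pattern {pattern!r}; "
                "refusing to reimage without an explicit override"
            )
```

Calling `check_reimage_allowed("db1001.example.org")` raises, while a host that matches no pattern passes silently. As noted in the list, a check like this only covers manual runs; it does nothing for a host that boots into PXE on its own after maintenance.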

Event Timeline

colewhite triaged this task as Medium priority. May 5 2020, 4:20 PM

Can this task be closed? By default hosts reimage now but they do keep /srv (T251768)

From the perspective of DBA, this issue is mostly resolved. Most DB machines will keep /srv on reimage by default (the exceptions are in T255768: Create reuse recipes for tendril/zarcillo/dbprov/backup hosts). The same needs to be done for the backup machines.

As the general issue isn't resolved, I'm adding this to the SRE-tools tag as discussed with @Volans for future work.

Kormat edited projects, added SRE-tools; removed DBA.

https://phabricator.wikimedia.org/T277007 was a recent case where reimaging the OS (after a BIOS update) was problematic for DBA as well.

@LSobanski I'll try to give you some context from the SRE I/F team side of things. Any feedback will be greatly appreciated, also to help set the team's priority around those items.

Current status
  • We've noticed that the PXE next-boot override setting is sometimes not very reliable. We have seen both:
    • Hosts where setting force PXE via remote IPMI succeeded and checking the override returned force PXE, but the host then rebooted into the local system
    • Hosts where, after a force PXE reboot, the force setting was not reset back to non-force
  • There have also been cases in which the default boot was set to PXE at the BIOS level (as opposed to just forcing the next boot to PXE via IPMI)
  • Monitoring of the BMC is kept to a minimum because in the past we've seen that polling the BMC too often via remote IPMI or SSH for monitoring purposes leads to failures, making them either noisy or unreliable at the moment of need.
What has been done in the past
  • I did an audit on remote IPMI in 2018 (see T193155) where, among other things, I found ~150 hosts had force PXE set. I then fixed all the outstanding issues that the audit found
  • Thanks to @Kormat's work, the partman recipes allow local data to be preserved on an accidental reimage, when used
  • The reimage script checks both that force PXE is set at reimage time before rebooting and that it gets reset after the reboot, alerting the user with a warning that I fear is mostly ignored because it is not very clear.
Short term possible actions
  • Perform a new audit of the whole fleet to assess the current situation
  • Improve the reimage script check so that it fails more clearly if it detects the force PXE bit after the reimage is done and/or tries to reset it multiple times.
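
To make the second short term action concrete, here is a minimal sketch of a "retry, then fail loudly" check. It is not the actual reimage script code: `is_force_pxe_set` and `clear_force_pxe` are hypothetical callables standing in for whatever remote IPMI calls the script really makes, and the attempt count and delay are arbitrary.

```python
import time


class ForcePxeStillSet(Exception):
    """Raised when the force PXE override survives every reset attempt."""


def ensure_force_pxe_cleared(is_force_pxe_set, clear_force_pxe,
                             attempts=3, delay=5.0, sleep=time.sleep):
    """Try to clear the force PXE override, failing hard if it stays set.

    Both callables are placeholders for the real BMC queries (assumption).
    Returns the number of reset attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        if not is_force_pxe_set():
            return attempt - 1  # resets issued before the override cleared
        clear_force_pxe()
        sleep(delay)  # give the BMC some time before checking again
    if is_force_pxe_set():
        # An exception instead of a warning, so the run cannot be ignored.
        raise ForcePxeStillSet(
            f"force PXE is still set after {attempts} reset attempts; "
            "the host would be reimaged again on its next reboot"
        )
    return attempts
```

Raising instead of printing a warning is the point of the exercise: the run stops with a message that says what will happen on the next reboot, rather than leaving a line in the output that is easy to skip.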
Long term solutions
  • There is a work-in-progress project to automate all the basic BIOS settings of Dell hosts that should be live by the end of this fiscal year (end of June) and that would prevent all cases of manual misconfiguration where the BIOS setting was left to boot from PXE instead of local disks
  • There is a plan (no ETA on this yet) to create a PXE boot menu that will have various functionalities like reimage, wipe, live-os, etc., and that will default to booting from local disks. When this is available, the idea is to have all hosts boot from PXE by default, which will then default to local disks, and of course have the BIOS fall back to local disks in case PXE is unreachable

If there is some agreement on the short term actions I could take care of both of them in the next week or two.

@Volans thanks for the explanation!
There's something I have been wondering about for a long time. I'm not sure if it fits in this part of the project or if it needs to be addressed somewhere else, but I'm posting it here anyway.
Is it doable at some point to also upgrade the firmware/BIOS to the latest available version when reimaging a host? We've found many hosts that come from the factory with a very old BIOS version and, despite a newer version being available, it is not installed. Is that something the installation script could handle?

Thanks!

@Marostegui yes, I didn't mention all the other efforts because they are a bit off topic, but firmware upgrades are also in scope and on our roadmap. Once we have that, it totally makes sense to integrate it into the reimage workflow so that we naturally upgrade firmware along with OS upgrades.

@Volans Both the short and long term actions make sense to me, thanks for the summary.