PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers
Closed, Resolved (Public)

Assigned To

Authored By

	jcrespo
	Apr 29 2020, 2:32 PM

Description

This is not theoretical: it happened multiple times before the hack in the parent ticket was set up to prevent stateful services from getting reimaged and causing outages and/or data loss. One example was T160242, which caused an outage; at least a critical patch proxy was also affected, causing (I don't remember the details, may be wrong here) gerrit downtime.

For the last few years no issues have happened, as we manually enable and disable the "reimageability" of a host for both backup and mysql hosts.

Accidental reimage of servers has happened or may happen for some of these reasons:

Replacement of the motherboard, which resets BIOS settings
Flash/upgrade of the BIOS
Typo on a wmf-auto-reimage run
Restart of a misconfigured host (for example, if its BIOS was incorrectly set up)
Failure of disk boot, falling through to the second boot option (e.g. primary partition corruption while the secondary partition holds important data, or grub misconfiguration)

An accidental reboot can happen and is not a big deal (it should only cause an outage for as long as it takes to reboot). Reimaging a big server (e.g. a labsdb host), however, can take a week to put back into service, as 12TB of data has to be loaded back on top of the other setup processes.

Potential solutions:

Make a puppet flag, outside of partman, disallowing the reimage of a certain set of servers
Make a service control this property, on netbox or another stateful service
Make a check in wmf-auto-reimage that refuses to run for certain servers (this would only fix manual reimages, not accidental ones like those caused by a reboot after maintenance)
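
As a rough illustration of the third option, a guard in the reimage tooling could look like the sketch below. The `PROTECTED_PATTERNS` list, the `ReimageNotAllowed` exception and the `check_reimage_allowed` function are all hypothetical names, not part of wmf-auto-reimage, and the hostname patterns are only examples. As noted in the list, a check like this would only catch manual runs, not a host that PXE-boots by itself.

```python
import fnmatch

# Hypothetical patterns for stateful hosts; real values would more likely
# come from puppet or netbox than from a hard-coded list.
PROTECTED_PATTERNS = ["db*", "labsdb*", "backup*"]


class ReimageNotAllowed(Exception):
    """Raised when a reimage is requested for a protected host."""


def check_reimage_allowed(hostname, explicitly_enabled=frozenset()):
    """Refuse to reimage protected hosts unless they were explicitly enabled."""
    short_name = hostname.split(".")[0]
    protected = any(fnmatch.fnmatch(short_name, p) for p in PROTECTED_PATTERNS)
    if protected and short_name not in explicitly_enabled:
        raise ReimageNotAllowed(
            f"{hostname} is protected; enable reimaging for it explicitly first"
        )
    return True
```

The `explicitly_enabled` set stands in for the manual enable/disable step described above, so the default for a stateful host is to refuse.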

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Kormat	T251392 Make enabling reimaging for db hosts more humane
		Resolved		Volans	T251416 PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers

Event Timeline

jcrespo created this task. Apr 29 2020, 2:32 PM

jcrespo added a project: DC-Ops. Apr 29 2020, 2:36 PM

Marostegui moved this task from Triage to Meta/Epic on the DBA board. Apr 30 2020, 4:56 AM

jcrespo mentioned this in T251392: Make enabling reimaging for db hosts more humane. Apr 30 2020, 6:38 AM

Kormat mentioned this in T251768: Make partman/custom/no-srv-format.cfg work. May 4 2020, 1:48 PM

jcrespo added a project: Sustainability (Incident Followup). May 4 2020, 5:06 PM

colewhite triaged this task as Medium priority. May 5 2020, 4:20 PM

Can this task be closed? By default hosts reimage now but they do keep /srv (T251768)

jcrespo assigned this task to Kormat. Jun 25 2020, 2:03 PM

From the perspective of DBA, this issue is mostly resolved. Most DB machines will keep /srv on reimage by default (the exceptions are in T255768: Create reuse recipes for tendril/zarcillo/dbprov/backup hosts). The same needs to be done for the backup machines.

As the general issue isn't resolved, I'm adding this to the SRE-tools tag as discussed with @Volans for future work.

Kormat removed Kormat as the assignee of this task. Jun 25 2020, 2:19 PM

Kormat edited projects, added SRE-tools; removed DBA.

https://phabricator.wikimedia.org/T277007 was a recent case where reimaging the OS (after a BIOS update) was problematic for DBA as well.

@LSobanski I'll try to give you some context from the SRE I/F team side of things. Any feedback will be greatly appreciated, also to help set the team's priority around those items.

Current status

We've noticed that the PXE next-boot override setting is sometimes not very reliable. We had both cases of:
- Hosts where setting force PXE via remote IPMI succeeded and checking the override returned force PXE, but the host then rebooted into the local system
- Hosts where, after a force PXE reboot, the force setting was not reset back to non-force
There have also been cases in which the default boot was set to PXE at the BIOS level (as opposed to just forcing the next boot to PXE via IPMI).
Monitoring of the BMC is kept to a minimum because in the past we've seen that pinging the BMC too often via remote IPMI or SSH for monitoring purposes led to failures, making them either noisy or unreliable in the moment of need.
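
For reference, the override state described above can be inspected by reading boot parameter 5 with `ipmitool chassis bootparam get 5`. The parser below is a minimal sketch that assumes the usual human-readable output of that command (lines such as `Boot Device Selector : Force PXE`, `Boot Flag Valid` and `Options apply to only next boot`). The function name and the returned fields are made up for illustration, and as the cases above show, what the BMC reports is not always what the host ends up doing.

```python
def parse_boot_override(bootparam_output):
    """Summarise the text printed by `ipmitool chassis bootparam get 5`.

    The exact wording of the output is an assumption and may vary between
    ipmitool versions and BMC vendors.
    """
    valid = False
    next_boot_only = False
    device = None
    for raw_line in bootparam_output.splitlines():
        # Flag lines are printed as "   - Some text"; drop the bullet.
        line = raw_line.strip().lstrip("- ").strip()
        if line == "Boot Flag Valid":
            valid = True
        elif line == "Options apply to only next boot":
            next_boot_only = True
        elif line.startswith("Boot Device Selector"):
            device = line.partition(":")[2].strip()
    force_pxe = valid and device == "Force PXE"
    return {
        "force_pxe": force_pxe,
        # A persistent override is the dangerous case: every reboot goes to PXE.
        "persistent": force_pxe and not next_boot_only,
        "device": device,
    }
```

Keeping the parsing separate from the IPMI call makes it possible to run the same check against output collected during an audit, without hitting the BMC again.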

What has been done in the past

I did an audit of remote IPMI in 2018 (see T193155) where, among other things, I found that ~150 hosts had force PXE set. I then fixed all the outstanding issues that the audit found.
Thanks to @Kormat's work, the partman recipe allows local data to be preserved on an accidental reimage, when used.
The reimage script checks both that force PXE is set at reimage time before rebooting and that it gets reset after the reboot, alerting the user with a warning that, I fear, is mostly ignored because it is not very clear.

Short term possible actions

Perform a new audit of the whole fleet to see the current status and assess the current situation
Improve the reimage script check so that it fails more clearly if it detects the force PXE bit after the reimage is done and/or try to reset it multiple times.
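
The second action could be sketched roughly as follows. `get_force_pxe` and `clear_force_pxe` are hypothetical callables standing in for whatever the reimage script uses to talk to the BMC, and the retry count and delay are arbitrary.

```python
import time


class ForcePxeStillSet(Exception):
    """Raised when the force PXE override survives every reset attempt."""


def ensure_force_pxe_cleared(hostname, get_force_pxe, clear_force_pxe,
                             attempts=3, delay=5.0):
    """Try to clear a leftover force PXE override and fail loudly otherwise.

    get_force_pxe(hostname) -> bool and clear_force_pxe(hostname) are
    hypothetical hooks, not functions of the real reimage script.
    Returns the number of reset attempts that were needed.
    """
    for attempt in range(attempts):
        if not get_force_pxe(hostname):
            return attempt
        clear_force_pxe(hostname)
        time.sleep(delay)
    if not get_force_pxe(hostname):
        return attempts
    raise ForcePxeStillSet(
        f"{hostname}: force PXE is still set after {attempts} reset attempts; "
        "the next reboot may reimage the host, fix it manually"
    )
```

Raising an exception instead of logging a warning is the point of the sketch: the run ends in a visible failure rather than a message that is easy to overlook.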

Long term solutions

There is a work-in-progress project to automate all the basic BIOS settings of Dell hosts. It should be live by the end of this fiscal year (end of June) and would prevent all cases of manual misconfiguration where the BIOS was left set to boot from PXE instead of local disks.
There is a plan (no ETA on this yet) to create a PXE boot menu with various functionalities such as reimage, wipe, live-os, etc., that will default to booting from local disks. Once this is available, the idea is to have all hosts boot by default into PXE, which will then default to local disks, and of course to have the BIOS fall back to local disks in case PXE is unreachable.

If there is some agreement on the short term actions I could take care of both of them in the next week or two.

@Volans thanks for the explanation!
There's something I have been wondering about for a long time. I'm not sure if it fits in this part of the project or needs to be addressed somewhere else; anyway, I'm posting it here.
Is it doable at some point to also upgrade the firmware/BIOS to the latest available version when reimaging a host? We've found many hosts that come from the factory with a very old BIOS version, and even though a newer version is available, it is not installed. Is that something the installation script could handle?

Thanks!

@Marostegui yes, I didn't mention all the other efforts because they are a bit off topic, but firmware upgrades are also in scope and on our roadmap. Once we have that, it totally makes sense to integrate it into the reimage workflow so that we naturally upgrade firmware along with OS upgrades.

@Volans Both the short and long term actions make sense to me, thanks for the summary.

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 8:59 PM

Removing SRE, has already been triaged to a more specific SRE subteam

As an update, with the current situation even if a host reboots into PXE the DHCP will not provide any IP so the reimage will NOT happen. The DHCP is now dynamic and set by the reimage/dhcp cookbooks on the fly just during the time of the reimage.
The only current risk is if a dhcp snippet is left behind by the reimage/dhcp cookbooks (because they were ctrl+C-ed twice or killed with -9) and that host does reboot into PXE, but this is easily fixable in various ways by deleting those potentially leftover files.
So I think that the original concern has been almost completely removed.
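
One of those ways could be a periodic cleanup of stale snippets. This is only a sketch: it assumes the cookbooks write one file per host into a single directory, and the function names, the `*.conf` glob and the one-hour threshold are all hypothetical. A real version would probably also need to reload the DHCP server after deleting files.

```python
import time
from pathlib import Path


def find_stale_snippets(snippet_dir, max_age_seconds=3600, now=None):
    """Return snippet files older than max_age_seconds, oldest first."""
    now = time.time() if now is None else now
    stale = [
        path
        for path in Path(snippet_dir).glob("*.conf")
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds
    ]
    return sorted(stale, key=lambda path: path.stat().st_mtime)


def remove_stale_snippets(snippet_dir, max_age_seconds=3600, dry_run=True):
    """Delete stale snippets (or only list them when dry_run is True)."""
    stale = find_stale_snippets(snippet_dir, max_age_seconds)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to `dry_run=True` keeps the sketch from deleting a snippet that belongs to a reimage still in progress unless the caller asks for it.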

Agreed, I think we can simply resolve the task.

As agreed also on IRC resolving. Additional work on this space can be done in separate tasks.

Joe moved this task from Backlog to Done on the SRE-Sprint-Week-Sustainability-March2023 board. Mar 20 2023, 12:22 PM

Volans mentioned this in T351418: Upgrade from ISC-DHCP Server to KEA-DHCP Server. Apr 10 2024, 9:31 AM
