This is not theoretical: it happened multiple times before the hack on the parent ticket was set up to prevent stateful services from getting reimaged and causing outages and/or data loss. An example of this was T160242, which caused an outage; at least a critical patch proxy was also affected, causing (I don't remember the details, so I may be wrong here) Gerrit downtime.
For the last few years no issues have happened, as we manually enable and disable the "reimageability" of a host for both backup and mysql hosts.
Accidental reimage of servers has happened or may happen for some of these reasons:
- replacement of the motherboard resets BIOS settings
- flash/upgrade of BIOS
- typo in a wmf-auto-reimage run
- restart of a misconfigured host (for example, if its BIOS was incorrectly set up)
- failure of disk boot, falling back to the second boot option (e.g. primary partition corruption while the secondary partition still holds important data, or grub misconfiguration)
An accidental reboot can happen and is not a big deal (it should only cause an outage for as long as it takes to reboot). Reimaging a big server (e.g. a labsdb host), however, can take a week to put back into service, as it has to reload 12TB of data plus run other setup processes.
Potential solutions:
- Make a puppet flag, outside of partman, disallowing the reimage of a certain set of servers
- Make a service control this property, on netbox or another stateful service
- Make a check in wmf-auto-reimage disabling its run for certain servers (this would only fix manual reimages, but not accidental ones like those caused by a reboot after maintenance); a rough sketch of such a check follows below
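As an illustration of the last option, here is a minimal sketch of what a pre-flight guard in a reimage script could look like. This is only an assumption of how it could be wired up, not the real wmf-auto-reimage code: the denylist path and function names are made up, and the list itself could be populated from puppet, netbox, or any other source of truth.

```python
#!/usr/bin/env python3
"""Sketch of a pre-flight guard for a reimage script (hypothetical)."""
import argparse
import sys

# Hypothetical file, e.g. managed by puppet, listing hosts that must never
# be reimaged (one FQDN per line, '#' starts a comment).
DENYLIST_PATH = '/etc/reimage-denylist'


def load_denylist(path=DENYLIST_PATH):
    """Return the set of protected FQDNs, or an empty set if the file is absent."""
    try:
        with open(path) as f:
            return {
                line.strip()
                for line in f
                if line.strip() and not line.startswith('#')
            }
    except FileNotFoundError:
        return set()


def check_reimageable(hosts, denylist):
    """Abort the run if any target host is marked as stateful/protected."""
    blocked = sorted(set(hosts) & denylist)
    if blocked:
        sys.exit(
            'Refusing to reimage protected host(s): {}. '
            'Remove them from the denylist first.'.format(', '.join(blocked))
        )


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Reimage pre-flight check (sketch)')
    parser.add_argument('hosts', nargs='+', help='FQDNs to reimage')
    args = parser.parse_args()
    check_reimageable(args.hosts, load_denylist())
    print('All target hosts are reimageable, continuing...')
```

Note that, as said above, such a check would only stop manual runs; a puppet or netbox-controlled flag would still be needed to cover accidental reimages triggered by a reboot into the installer.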