Presently, maintenance jobs are systemd timers. They run in both DCs, but inside a wrapper script that first checks the MW config state in etcd to make sure the local DC is the current primary_dc and is not read-only; otherwise, the wrapper quits without invoking the maintenance script.
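As a rough illustration of that gate (not the actual wrapper), here is a minimal Python sketch. It assumes the etcd state has already been fetched into a plain dict, with primary_dc as a string and read_only as a per-DC map; the function name `should_run` and the dict layout are hypothetical.

```python
def should_run(config: dict, local_dc: str) -> bool:
    """Return True only if local_dc is the primary DC and is not read-only.

    `config` is a hypothetical in-memory copy of the MW config state,
    e.g. {"primary_dc": "eqiad", "read_only": {"eqiad": False, "codfw": False}}.
    """
    if config["primary_dc"] != local_dc:
        # This DC is the secondary: quit without running the job.
        return False
    # Primary DC: run only if it is currently writable.
    return not config["read_only"][local_dc]


if __name__ == "__main__":
    state = {"primary_dc": "eqiad", "read_only": {"eqiad": False, "codfw": False}}
    for dc in ("eqiad", "codfw"):
        print(dc, "runs" if should_run(state, dc) else "skips")
```

Note that nothing in this check distinguishes "maintenance has been stopped for a switchover" from normal operation, which is the gap described next.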
Today, when we do a DC switchover, we kill the jobs that are currently running, but we don't actually prevent new jobs from starting. That means that in the window between stopping maintenance and setting read-only, new maintenance jobs can start, and they'll run into trouble when the RW DC disappears out from under them. This race condition adds coordination complexity, and a little rush, to the early stages of the switchover, both of which we could eliminate.
When we move periodic maintenance jobs to Kubernetes CronJobs, we'll add a config value to etcd specifically for controlling them. There are a few different ways we might structure that, each with pros and cons:
1. A new string value maintenance_dc that could be eqiad, codfw, or none. (In an eqiad-codfw switch, we'd change it from eqiad to none, then to codfw.) The downside is this is partly redundant with the primary_dc value; primary_dc: eqiad; maintenance_dc: codfw is a nonsense state that we shouldn't be able to encode.
2. A new map value maintenance_enabled like {eqiad: true, codfw: false}. (In an eqiad-codfw switch, we'd change it to {eqiad: false, codfw: false}, then to {eqiad: false, codfw: true}.) This has the same downside as #1, plus it allows the additional nonsense state {eqiad: true, codfw: true}. (On the other hand, read_only has a similar structure already.)
3. A new boolean value maintenance_enabled which would be true most of the time and false during a switchover. (In an eqiad-codfw switch, we'd change it from true to false, then switch the primary_dc, then change it back to true.) We'd get the DC state from primary_dc, and check maintenance_enabled along with (or instead of) read_only. The only nonsense state here is if maintenance is enabled while the primary DC is RO, and we can mitigate that by checking both.
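To make the boolean option concrete, here is a hedged Python sketch of a gate that checks both maintenance_enabled and read_only, walked through an eqiad-to-codfw switch. The dict layout, the function names, and the exact point at which read_only is set and cleared are assumptions for illustration, not the real switchover procedure.

```python
DCS = ("eqiad", "codfw")


def maintenance_allowed(config: dict, local_dc: str) -> bool:
    """Gate for the boolean option: the DC comes from primary_dc, and both
    maintenance_enabled and read_only are checked."""
    return (
        config["primary_dc"] == local_dc
        and config["maintenance_enabled"]
        and not config["read_only"][local_dc]
    )


def runnable_dcs(config: dict) -> list:
    """List the DCs in which a newly scheduled job would actually start."""
    return [dc for dc in DCS if maintenance_allowed(config, dc)]


RW = {"eqiad": False, "codfw": False}
RO = {"eqiad": True, "codfw": True}  # assumed: both DCs RO mid-switch

# Hypothetical sequence of config states during the switch.
STEPS = [
    ("steady state", {"primary_dc": "eqiad", "maintenance_enabled": True, "read_only": RW}),
    ("maintenance disabled", {"primary_dc": "eqiad", "maintenance_enabled": False, "read_only": RW}),
    ("read-only set", {"primary_dc": "eqiad", "maintenance_enabled": False, "read_only": RO}),
    ("primary_dc switched", {"primary_dc": "codfw", "maintenance_enabled": False, "read_only": RO}),
    ("read-only cleared", {"primary_dc": "codfw", "maintenance_enabled": False, "read_only": RW}),
    ("maintenance re-enabled", {"primary_dc": "codfw", "maintenance_enabled": True, "read_only": RW}),
]

if __name__ == "__main__":
    for label, config in STEPS:
        print(f"{label}: {runnable_dcs(config)}")
```

In this sketch, new jobs can start only in the first and last states. The one nonsense state (maintenance_enabled true while the primary DC is RO) also yields no runnable DC, because the gate checks both values.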
Clearly I'm leaning toward #3 for simplicity, but I'm open to other possibilities. (I prefer "enabled" to "disabled," because maintenance_disabled: false is an unnecessarily confusing double negative. Unfortunately, that means the sense is reversed from read_only, but I think that's okay.)