We currently have three scripts, deployed via puppet to `/usr/local/sbin`, that restart the three primary traffic-handling daemons in our edge stack: `haproxy-restart`, `varnish-frontend-restart`, and `ats-backend-restart` (which is actually a thin wrapper around `ats-restart`, kept from when we ran more than one distinct instance of that daemon).
The fundamental pattern in all of these is essentially the same, and more or less amounts to `depool foo; sleep; restart bar.service; sleep; repool foo`, where `foo` is some confctl service key and `bar.service` is the related daemon. We currently have two confctl keys: `cdn`, which controls front-edge traffic to haproxy (and thus indirectly to varnish as well), and `ats-be`, which controls traffic to the ats backend daemon (from cross-node varnishes in multi-backend datacenters).
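The shared pattern can be sketched roughly as follows. The confctl selector syntax matches conftool's CLI, but the function name, service key, unit name, and drain times are all illustrative, not the exact contents of the production scripts:

```shell
# Hedged sketch of the depool/sleep/restart/sleep/repool pattern shared by
# all three restart scripts. Arguments: confctl service key, systemd unit,
# optional settle time in seconds.
safe_restart() {
    local key="$1" unit="$2" settle="${3:-30}"
    # Take this host out of the pool for the given confctl service key.
    confctl select "name=$(hostname -f),service=${key}" set/pooled=no
    sleep "${settle}"                  # drain in-flight connections
    systemctl restart "${unit}"
    sleep "${settle}"                  # let the daemon warm up before repool
    # Put the host back into rotation.
    confctl select "name=$(hostname -f),service=${key}" set/pooled=yes
}

# e.g.: safe_restart cdn haproxy.service
```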
These scripts currently suffer from a couple of problems that should probably be addressed by writing a better restarter that handles all of these cases properly. (Adapting/re-using safe-service-restart might be an option as well, but it's already complex and in use for other cases, and the required customization might be substantial.)
Anyway, the problems:
1. `ats-backend-restart` and `haproxy-restart` both check at the start whether `foo` is already confctl-depooled, and fail fast if so. `varnish-frontend-restart` lacks this sanity check. The check is important and necessary because we only have a single depooled state (which is a much deeper issue), so it is the only way to avoid very slow race conditions. For example: one person is restarting all the varnishes in ulsfo for a parameter change while another person has just finished reimaging a ulsfo host that isn't ready to be repooled yet; the person doing the parameter change then inadvertently pools the fresh reimage before it's ready.
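The missing sanity check could look roughly like the sketch below. The function name is hypothetical, and the parsing of confctl's `get` output via grep is an assumption about its JSON shape, shown only to illustrate the fail-fast behavior:

```shell
# Refuse to proceed if this host is already depooled for the given key.
# Because there is only one shared "depooled" state, proceeding anyway
# would let this script repool a host that someone else intentionally
# took out of service (e.g. mid-reimage).
ensure_pooled() {
    local key="$1"
    if confctl select "name=$(hostname -f),service=${key}" get \
            | grep -q '"pooled": "no"'; then
        echo "ERROR: ${key} is already depooled on this host; aborting." >&2
        return 1
    fi
}
```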
2. In the `ats-be` / `ats-backend-restart` case, the hieradata setting `profile::cache::varnish::frontend::single_backend` needs to determine very different behaviors. The current behavior (depool only `ats-be` to safely restart ats) is correct only when that setting is `false`, and currently datacenters are split roughly 50/50 between the two settings, moving towards more `true` in the long run. When `profile::cache::varnish::frontend::single_backend` is `true`, the script instead needs to depool both the `ats-be` **and** the `cdn` key, because in single-backend mode a given ATS instance only receives traffic from the machine-local varnish, and the machine-local varnish (and thus also haproxy) can't accept any traffic without a working ats backend daemon.
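The single-backend branching above might be factored as below. The function name is hypothetical, and the assumption is that puppet would render the value of `profile::cache::varnish::frontend::single_backend` into the script (the keys themselves are the real `ats-be` and `cdn` confctl keys):

```shell
# Return the confctl service keys that must be depooled before restarting
# the ats backend daemon, depending on single-backend mode.
ats_restart_keys() {
    local single_backend="$1"   # "true"/"false", assumed rendered by puppet
    if [ "${single_backend}" = "true" ]; then
        # Single-backend mode: the machine-local varnish (and thus haproxy)
        # can't serve without the local ATS backend, so front-edge traffic
        # must come out of the pool as well.
        echo "ats-be cdn"
    else
        # Multi-backend datacenter: cross-node varnishes can still reach
        # other ATS backends, so only ats-be needs depooling.
        echo "ats-be"
    fi
}
```

The restarter would then depool every key this returns before restarting, and repool them all afterwards.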