
Traffic cache daemon restart scripts need some rework
Open, Medium, Public

Description

We currently have three scripts, deployed via puppet, that restart the three primary traffic-handling daemons in our edge stack, all in /usr/local/sbin: haproxy-restart, varnish-frontend-restart, and ats-backend-restart (itself a thin wrapper over ats-restart, left over from when we ran more than one distinct instance of that daemon).

The fundamental pattern in all of these is essentially the same, and more or less amounts to `depool foo; sleep; restart bar.service; sleep; repool foo`, where foo is some confctl service key and bar.service is the related daemon. Currently we have two confctl keys: `cdn`, which controls front edge traffic to haproxy (and thus indirectly to varnish as well), and `ats-be`, which controls traffic to the ats backend daemon (from cross-node varnishes in multi-backend datacenters).
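
For illustration, here's a minimal sketch of that shared pattern. This is not the current code: the confctl selector shape, the unit names, and the sleep durations are assumptions/placeholders.

```
#!/bin/bash
# Sketch of the shared depool / restart / repool pattern.
# CONFCTL_KEY and UNIT are placeholders; the real scripts hardcode their own
# (e.g. cdn for haproxy, ats-be for the ATS backend).
set -e

CONFCTL_KEY="$1"      # confctl service key, e.g. cdn or ats-be
UNIT="$2"             # systemd unit to restart
GRACE_SECONDS=30      # drain / warm-up time; value is an arbitrary placeholder

# Depool this host for the given service key.
confctl select "name=$(hostname -f),service=${CONFCTL_KEY}" set/pooled=no

# Let in-flight traffic drain before restarting.
sleep "${GRACE_SECONDS}"

systemctl restart "${UNIT}"

# Give the freshly restarted daemon a moment before taking traffic again.
sleep "${GRACE_SECONDS}"

# Repool.
confctl select "name=$(hostname -f),service=${CONFCTL_KEY}" set/pooled=yes
```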

These scripts currently suffer from a couple of problems that should probably be addressed by writing a better restarter that handles all these cases properly (adapting/re-using safe-service-restart might be an option as well, but it's already complex and in use for other cases, and the amount of customization required might be substantial).

Anyway, the problems:

  1. ats-backend-restart and haproxy-restart both check at startup whether foo is already confctl-depooled, and if so they fail fast. varnish-frontend-restart lacks this sanity check. The check is important and necessary because we only have a single depooled state (which is a much deeper issue), and so it's the only way to avoid slow race conditions. For example: one person is restarting all the varnishes in ulsfo for a parameter change while another person has just finished reimaging a ulsfo host that isn't ready to be repooled yet, and the parameter-change restart then inadvertently repools the fresh reimage before it's ready. A minimal sketch of such a guard follows this list.
  2. In the ats-be / ats-backend-restart case, the hieradata setting profile::cache::varnish::frontend::single_backend needs to drive very different behaviors. The current behavior (depool only ats-be to safely restart ATS) is only correct when that setting is false, and today the datacenters are split roughly 50/50 between the two settings, with more moving to true in the long run. When profile::cache::varnish::frontend::single_backend is true, the script needs to depool both the ats-be and the cdn keys, because in single-backend mode a given ATS instance only receives traffic from the machine-local varnish, and that varnish (and thus also haproxy) can't accept any traffic without a working local ats backend daemon. See the second sketch below the list.
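
For problem 1, a minimal sketch of the missing already-depooled guard, assuming confctl's `get` action prints the current pooled state as JSON (the exact output shape and selector are assumptions, hence the permissive grep):

```
#!/bin/bash
# Sketch: refuse to proceed if the cdn key is already depooled, so we never
# end up repooling a host that someone else intentionally depooled.
set -e

SELECTOR="name=$(hostname -f),service=cdn"

# confctl ... get prints the current object as JSON; the exact formatting is
# an assumption here, so match loosely.
if confctl select "${SELECTOR}" get | grep -q '"pooled": "no"'; then
    echo "ERROR: ${SELECTOR} is already depooled; refusing to restart/repool." >&2
    echo "Someone may be mid-reimage or mid-maintenance on this host." >&2
    exit 1
fi

# ...continue with the normal depool / restart / repool sequence...
```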
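For problem 2, a rough sketch of how the ATS restart wrapper might branch on the single-backend setting. How the script learns the hiera value (a puppet-managed flag file here), the unit name, and the sleep values are all assumptions; the point is that in single-backend mode both cdn and ats-be come out of the pool together:

```
#!/bin/bash
# Sketch: pick which confctl keys to depool around an ATS restart, based on
# profile::cache::varnish::frontend::single_backend.
set -e

HOST="$(hostname -f)"

# A flag file dropped by puppet is one plausible way to expose the hiera
# value to the script (assumption, not current code).
if [ -e /etc/cache-single-backend ]; then
    # Single-backend: the local varnish (and thus haproxy) can't serve
    # anything without a working local ATS, so take the whole host out.
    KEYS="cdn ats-be"
else
    # Multi-backend: cross-node varnishes can route around this ATS instance,
    # so depooling ats-be alone is sufficient.
    KEYS="ats-be"
fi

for key in ${KEYS}; do
    confctl select "name=${HOST},service=${key}" set/pooled=no
done

sleep 30
systemctl restart trafficserver.service   # unit name is an assumption
sleep 30

for key in ${KEYS}; do
    confctl select "name=${HOST},service=${key}" set/pooled=yes
done
```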