Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'
Closed, ResolvedPublic

Description

  1. The current names are daemon-specific and tricky to rename ('ats-tls' is currently the name for an haproxy-based service).
  2. We historically had separate conftool keys for the varnish-fe cache (port 80) and the TLS terminator (port 443) so that they could be depooled independently, as they were technically run by two separate daemons.
    • However, using them independently only works in one direction: you could in theory depool and stop haproxy's port 443 service while varnish-fe continues serving port 80.
    • Going the other way, any need to depool/stop varnish-fe also requires depooling the haproxy TLS termination, as it implicitly depends on varnish-fe to handle any real requests.
    • Given that port 443 is by far the more important of the two in the modern era, it's at best confusing to allow them to be depooled separately this way.
  3. The current forward-looking plan is to move port 80 up to haproxy anyway (T323557), as the function of port 80 for the cache clusters is very deterministic (redirect or deny), and haproxy is far more efficient and resilient at such a task (and at handling high connection volume in general, as it doesn't have a thread-per-client-connection scaling model like varnish). When this happens, we'd want both keys to be in sync in all cases anyway, since you then couldn't operate on either daemon without affecting both ports' traffic.
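
The one-directional dependency in point 2 can be sketched as a small truth table. This is a toy model only, not conftool or pybal code: the function names are hypothetical, and "pooled" is treated as a plain boolean per legacy key.

```python
from itertools import product


def strands_port_443(varnish_fe_pooled: bool, ats_tls_pooled: bool) -> bool:
    """Return True if port 443 stays pooled while varnish-fe is unavailable.

    Toy model (assumption, not conftool code): haproxy on port 443 hands
    every real request to the local varnish-fe, so 'ats-tls' pooled with
    'varnish-fe' depooled/stopped leaves TLS traffic pointed at a node
    that cannot serve it.
    """
    return ats_tls_pooled and not varnish_fe_pooled


def legacy_state_table() -> dict:
    """Enumerate all four legacy key combinations and flag the unsafe one."""
    return {
        (fe, tls): ("unsafe" if strands_port_443(fe, tls) else "ok")
        for fe, tls in product([True, False], repeat=2)
    }


if __name__ == "__main__":
    for (fe, tls), verdict in legacy_state_table().items():
        print(f"varnish-fe pooled={fe!s:5} ats-tls pooled={tls!s:5} -> {verdict}")
```

Only one of the four combinations is flagged: 'ats-tls' pooled while 'varnish-fe' is not. A single key makes that combination impossible to express.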

For all of these reasons, we should replace the 'varnish-fe' and 'ats-tls' keys with a singular 'cdn' key and transition pybal and various supporting scripts to the new scheme.

The 'ats-be' key will still exist independently for now, as it continues to serve an independent purpose (the chashing between cache nodes for the backend). There's some related work to build on this later, though: 'ats-be' will fold into the 'cdn' key over time as well, assuming we're successful with the single-backend model in T288106. While we're in transition and testing, we'll probably need scripts and tools to unify the pool state of 'ats-be' and 'cdn' anyway, just for the single-backend case (because in the single-backend model, stopping an ats-be necessarily impacts the traffic of the haproxy and varnish instances on the same node, since the local ats-be is their *only* backend cache). Still, simplifying and unifying the two front-edge keys first makes the most sense.
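
As a rough illustration of what "unifying the pool state" could mean for single-backend nodes, here is a minimal sketch. The function name and the rule it applies (pool both keys only when both are pooled) are assumptions made for illustration, not the behaviour of any existing script.

```python
from typing import Dict


def unify_pool_state(states: Dict[str, bool], single_backend: bool) -> Dict[str, bool]:
    """Return the desired pooled state for the 'cdn' and 'ats-be' keys.

    Assumed rule (hypothetical): on a single-backend node the local ats-be
    is the only backend cache for haproxy/varnish, so the node is treated
    as pooled only when both keys are pooled. On other nodes the two keys
    stay independent and are returned unchanged.
    """
    if not single_backend:
        return dict(states)
    both = states["cdn"] and states["ats-be"]
    return {"cdn": both, "ats-be": both}


if __name__ == "__main__":
    # Depooling ats-be on a single-backend node drags 'cdn' down with it.
    print(unify_pool_state({"cdn": True, "ats-be": False}, single_backend=True))
    # On a multi-backend node the same input is left alone.
    print(unify_pool_state({"cdn": True, "ats-be": False}, single_backend=False))
```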

Event Timeline

Change 863336 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] [WIP] Add 'cdn' conftool service to all caches

https://gerrit.wikimedia.org/r/863336

Change 863337 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] [WIP] Switch pybal + scripts to 'cdn' service

https://gerrit.wikimedia.org/r/863337

Change 863338 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Remove legacy varnish-fe + ats-tls conftool keys

https://gerrit.wikimedia.org/r/863338

Change 863339 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/cookbooks@master] Switch roll-restart-varnish to 'cdn' service

https://gerrit.wikimedia.org/r/863339

BBlack updated the task description.

Change 863336 merged by BBlack:

[operations/puppet@production] Add 'cdn' conftool service to all caches

https://gerrit.wikimedia.org/r/863336

Change 863337 merged by BBlack:

[operations/puppet@production] Switch pybal + scripts to 'cdn' service

https://gerrit.wikimedia.org/r/863337

Mentioned in SAL (#wikimedia-operations) [2022-12-08T00:29:28Z] <bblack> lvs4010: restart pybal to test etcd key changes - T324336

Change 863339 merged by BBlack:

[operations/cookbooks@master] Switch roll-restart-varnish to 'cdn' service

https://gerrit.wikimedia.org/r/863339

Mentioned in SAL (#wikimedia-operations) [2022-12-08T00:47:06Z] <bblack> lvsNNNN: restart pybal to apply etcd key changes on all "secondary" lvs at all sites - T324336 (5 hosts, ulsfo completed previously)

Mentioned in SAL (#wikimedia-operations) [2022-12-08T01:00:41Z] <bblack> lvsNNNN: restart pybal to apply etcd key changes on all "high-traffic2" lvs at all sites - T324336

Mentioned in SAL (#wikimedia-operations) [2022-12-08T01:05:24Z] <bblack> lvsNNNN: restart pybal to apply etcd key changes on all "high-traffic1" lvs at all sites - T324336

Change 863338 merged by BBlack:

[operations/puppet@production] Remove legacy varnish-fe + ats-tls conftool keys

https://gerrit.wikimedia.org/r/863338

BBlack claimed this task.

This is completed now. AFAIK all relevant scripts/automations/etc. were updated to match. The conftool service keys for cacheproxy nodes are now just 'cdn', which controls pooling of the front-edge ports 80/443 towards pybal, and 'ats-be', which controls pooling of varnish->ats-be traffic for cross-node chashing (except in ulsfo and eqsin, which have switched to a single-backend model and don't do this part anymore).
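
For anyone with a local script that still passes the legacy service names, the rename amounts to a simple translation. The sketch below is hypothetical (the mapping constant and function name are not taken from the merged patches) and only illustrates the key-name change described in this task.

```python
from typing import Iterable, List

# Legacy conftool service names and the key that replaced them.
# 'ats-be' is unchanged, so it is deliberately absent from the mapping.
LEGACY_TO_CURRENT = {"varnish-fe": "cdn", "ats-tls": "cdn"}


def translate_services(services: Iterable[str]) -> List[str]:
    """Map legacy service names to current ones, dropping duplicates.

    Order of first appearance is preserved; unknown names pass through.
    """
    result: List[str] = []
    for name in services:
        current = LEGACY_TO_CURRENT.get(name, name)
        if current not in result:
            result.append(current)
    return result


if __name__ == "__main__":
    print(translate_services(["varnish-fe", "ats-tls", "ats-be"]))
```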