
Experiment with single backend CDN nodes
Open, Medium, Public

Description

We want to evaluate the practical performance implications of using a single, local cache backend instead of spreading the whole dataset across multiple nodes with consistent hashing (c-hashing).

In order to do so, we need to change the current Puppetization to allow taking one host out of c-hash and using it exclusively as a local backend. This could be done with a hiera setting, say cache::single_backend_fqdn, excluding that hostname from the pool of backends for a given cache cluster/DC. Additionally, on the cache::single_backend_fqdn host itself, only localhost must be used as the cache backend.

Dashboards such as cache-hosts-comparison can then be used to observe the impact on hitrate/TTFB on various nodes. Because the rest of the cluster will have N-1 backend nodes instead of N, there may be some backend hitrate change across the board to take into account when interpreting the results.

The procedure for enabling the experiment on one node ($host) is as follows (a consolidated shell sketch follows the list):

  • Depool the host from all user traffic with sudo -i depool
  • (Optional, but useful for evaluating hitrate): stop trafficserver.service, empty the ATS backend cache with traffic_server -C clear_cache, start trafficserver.service
  • Set cache::single_backend_fqdn: $host in hiera for the DC/cluster the host is part of (e.g. ulsfo/text for host=cp4027)
  • Run puppet on all cache nodes in the DC/cluster. Ensure that $host is removed from the list of backends on all varnish instances in the DC/cluster with sudo -i varnishadm -n frontend backend.list
  • Ensure that varnish on $host points to localhost and that the node behaves well: https://wikitech.wikimedia.org/wiki/Varnish#Force_your_requests_through_a_specific_Varnish_frontend
  • Repool the host for user traffic with sudo -i pool. Ensure that $host is not listed in /etc/varnish/directors.frontend.vcl on any DC/cluster node
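
Putting the steps together, a minimal shell sketch of the enable procedure (the hostname cp4027 is illustrative, and the hiera edit is shown only as a comment since that change goes through the usual puppet review/deploy workflow, not a local edit on the host):

# On the experiment host itself (example: cp4027):
sudo -i depool                             # remove the host from user-facing traffic
sudo systemctl stop trafficserver.service  # optional: start from an empty ATS backend cache
sudo traffic_server -C clear_cache
sudo systemctl start trafficserver.service

# In hiera for the host's DC/cluster (e.g. ulsfo/text), set:
#   cache::single_backend_fqdn: cp4027.ulsfo.wmnet
# then run puppet on all cache nodes in the DC/cluster.

# On every other node in the DC/cluster, cp4027 must no longer appear as a backend:
sudo -i varnishadm -n frontend backend.list
grep cp4027 /etc/varnish/directors.frontend.vcl    # should print nothing

# On cp4027 itself, the only varnish backend should be the local node; once verified:
sudo -i pool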

Disabling the experiment:

  • Depool the host from all user traffic with sudo -i depool
  • Unset cache::single_backend_fqdn in hiera for the DC/cluster the host is part of
  • Run puppet on all cache nodes in the DC/cluster. Ensure that $host is added to the list of backends on all varnish instances in the DC/cluster with sudo -i varnishadm -n frontend backend.list
  • Ensure that varnish on $host points to all nodes in the DC/cluster and that the node behaves well: https://wikitech.wikimedia.org/wiki/Varnish#Force_your_requests_through_a_specific_Varnish_frontend
  • Repool the host for user traffic with sudo -i pool. Ensure that $host is listed in /etc/varnish/directors.frontend.vcl on all DC/cluster nodes

Details

Repo                 Branch        Lines +/-
operations/puppet    production    +9 -9
operations/puppet    production    +23 -0
operations/puppet    production    +2 -0
operations/puppet    production    +14 -0
operations/puppet    production    +4 -22
operations/puppet    production    +10 -0
operations/puppet    production    +1 -3
operations/puppet    production    +2 -0
operations/puppet    production    +28 -33
operations/puppet    production    +17 -10
operations/puppet    production    +2 -3
operations/puppet    production    +0 -2
operations/puppet    production    +0 -2
operations/puppet    production    +2 -0
operations/puppet    production    +2 -0
operations/puppet    production    +3 -1
operations/puppet    production    +26 -24
operations/puppet    production    +17 -2
operations/puppet    production    +39 -33

Event Timeline

There are a very large number of changes, so older changes are hidden.
ema triaged this task as Medium priority. Aug 5 2021, 8:51 AM

Change 710224 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: single backend experiment

https://gerrit.wikimedia.org/r/710224

Change 710236 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: refactor dynamic_backend_caches logic

https://gerrit.wikimedia.org/r/710236

Change 710244 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable single backend experiment on cp4027

https://gerrit.wikimedia.org/r/710244

Change 710236 merged by Ema:

[operations/puppet@production] cache: refactor dynamic_backend_caches logic

https://gerrit.wikimedia.org/r/710236

Change 710224 merged by Ema:

[operations/puppet@production] cache: single backend experiment

https://gerrit.wikimedia.org/r/710224

Change 710973 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: use confd only if backend list is on etcd

https://gerrit.wikimedia.org/r/710973

Change 710973 merged by Ema:

[operations/puppet@production] cache: use confd only if backend list is on etcd

https://gerrit.wikimedia.org/r/710973

Change 726912 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: exclude single backend experiment from pooled ATS backends

https://gerrit.wikimedia.org/r/726912

To document this somewhere with syntax highlighting: when the experiment is not running anywhere, the patch above changes the Go template as follows:

--- /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl.orig
+++ /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl
@@ -1,7 +1,7 @@
 new cache_local = directors.shard();
 new cache_local_random = directors.random();
 
-{{range $node := ls "/conftool/v1/pools/esams/cache_text/ats-be/"}}{{ $key := printf "/conftool/v1/pools/esams/cache_text/ats-be/%s" $node }}{{ $data := json (getv $key) }}{{ if eq $data.pooled "yes"}}
+{{range $node := ls "/conftool/v1/pools/esams/cache_text/ats-be/"}}{{ $key := printf "/conftool/v1/pools/esams/cache_text/ats-be/%s" $node }}{{ $data := json (getv $key) }}{{ if and (eq $data.pooled "yes") (ne $node "") }}
 cache_local.add_backend(be_{{ $parts := split $node "." }}{{ join $parts "_" }});
 cache_local_random.add_backend(be_{{ $parts := split $node "." }}{{ join $parts "_" }}, {{ $data.weight }});
 {{end}}{{end}}

Change 726912 merged by Ema:

[operations/puppet@production] cache: exclude single backend experiment from pooled ATS backends

https://gerrit.wikimedia.org/r/726912

Mentioned in SAL (#wikimedia-operations) [2021-10-12T10:23:42Z] <ema> depool/repool ats-be on cp4028 to verify updates to /etc/varnish/directors.frontend.vcl on cp4027 keep on working fine T288106

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:35:51Z] <ema> cp4021: depool to enable single backend experiment T288106

Change 710244 merged by Ema:

[operations/puppet@production] cache: enable single backend experiment on cp4021

https://gerrit.wikimedia.org/r/710244

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:41:49Z] <ema> cp4021: stop ats-be and clear its cache T288106

After setting cache::single_backend_fqdn: cp4021.ulsfo.wmnet in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes; see for instance cp4022:

root@cp4022:~# varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4022_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4023_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4024_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4025_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4026_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4033_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4034_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT

Except for cp4021 itself:

root@cp4021:~# varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-2cad684a-8940-437b-9279-f982d60126a0.be_cp4021_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:43:54 GMT

I have cleared the ats-be cache on cp4021, forced my client to go through that node, and verified that images are served correctly. We can now repool the node for user traffic.
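
For reference, one way to force requests through a specific frontend, from a client that can resolve and reach the .wmnet name, is curl's --connect-to option (a hedged sketch; the image path is just an example, and the wikitech page linked in the procedure above has the canonical instructions):

# Send a request for upload.wikimedia.org straight to cp4021's frontend instead of the
# usual GeoDNS/LVS endpoint, then inspect the cache-related response headers.
curl -sv -o /dev/null \
    --connect-to upload.wikimedia.org:443:cp4021.ulsfo.wmnet:443 \
    https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg \
    2>&1 | grep -iE '^< (x-cache|age|server)'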

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:56:29Z] <ema> cp4021: repool w/ single backend experiment enabled T288106

Change 743910 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable single backend experiment on cp3051

https://gerrit.wikimedia.org/r/743910

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:22:49Z] <ema> cp3051: depool to enable single backend experiment T288106

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:23:55Z] <ema> cp3051: stop ats-be and clear its cache T288106

Change 743910 merged by Ema:

[operations/puppet@production] cache: enable single backend experiment on cp3051

https://gerrit.wikimedia.org/r/743910

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:35:12Z] <ema> cp3051: repool w/ single backend experiment enabled T288106

Change 749131 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] Revert "cache: enable single backend experiment on cp4021"

https://gerrit.wikimedia.org/r/749131

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:14:35Z] <ema> cp4021: depool to revert single backend experiment T288106

Change 749131 merged by Ema:

[operations/puppet@production] Revert "cache: enable single backend experiment on cp4021"

https://gerrit.wikimedia.org/r/749131

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:29:12Z] <ema> cp4021: pool with single backend experiment reverted T288106

Change 749132 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] Revert "cache: enable single backend experiment on cp3051"

https://gerrit.wikimedia.org/r/749132

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:45:41Z] <ema> cp3051: depool to revert single backend experiment T288106

Change 749132 merged by Ema:

[operations/puppet@production] Revert "cache: enable single backend experiment on cp3051"

https://gerrit.wikimedia.org/r/749132

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:50:48Z] <ema> cp3051: pool with single backend experiment reverted T288106

Although we did briefly discuss the results of this experiment within Traffic, I don't think we ever publicly disclosed our analysis.

In terms of user-perceived latency, using a single local cache backend in the upload@esams cluster results in an increase in TTFB from 36 to 41.5 milliseconds:

Screenshot from 2022-02-09 09-30-57.png

The amount of data fetched from the origins increases 3x:

Screenshot from 2022-02-09 09-40-36.png

The latter data point is especially concerning, given that those fetches happen over backhaul links. The conclusion is that, at least for the upload@esams cluster, which is likely the worst case in terms of the amount and type of traffic served, we cannot simply use a single local backend instead of spreading the dataset across all available ATS backend instances in the cluster.

Further work is needed to determine the following:

  • How does the single backend architecture perform with slower, larger disks? The total sum of backend disk size in esams is currently 12T, and SATA disks of at least that capacity are certainly an option that could be considered. I've got 40T on my workstation at home. :)
  • How about the text cluster? The type and patterns of traffic are different, and so would probably be the results.

Change 817298 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::varnish::frontend: Drop confd_experiment_fqdn

https://gerrit.wikimedia.org/r/817298

Change 845648 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Clean up outdated commentary on requestctl

https://gerrit.wikimedia.org/r/845648

Change 845649 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Split confd file definitions

https://gerrit.wikimedia.org/r/845649

Change 845650 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] single_backend mode for production varnishes

https://gerrit.wikimedia.org/r/845650

Change 845713 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Remove confd_experiment_fqdn support

https://gerrit.wikimedia.org/r/845713

Change 845648 merged by BBlack:

[operations/puppet@production] Clean up outdated commentary on requestctl

https://gerrit.wikimedia.org/r/845648

Change 845651 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] cp4045: enable single_backend mode

https://gerrit.wikimedia.org/r/845651

Change 845649 merged by BBlack:

[operations/puppet@production] Split confd file definitions

https://gerrit.wikimedia.org/r/845649

Change 845713 merged by BBlack:

[operations/puppet@production] Remove confd_experiment_fqdn support

https://gerrit.wikimedia.org/r/845713

Change 845650 merged by BBlack:

[operations/puppet@production] single_backend mode for production varnishes

https://gerrit.wikimedia.org/r/845650

Change 845651 merged by BBlack:

[operations/puppet@production] cp4045: enable single_backend mode

https://gerrit.wikimedia.org/r/845651

Change 849598 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] ulsfo: single_backend for all new cache nodes

https://gerrit.wikimedia.org/r/849598

Change 849598 merged by BBlack:

[operations/puppet@production] ulsfo: single_backend for all new cache nodes

https://gerrit.wikimedia.org/r/849598

Change 817298 abandoned by BBlack:

[operations/puppet@production] P:cache::varnish::frontend: Drop confd_experiment_fqdn

Reason:

Already done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/845713

https://gerrit.wikimedia.org/r/817298

BBlack subscribed.

Updates! Since this ticket was last active, there's been progress on various fronts with this issue:

  • We still believe there are some great benefits to the single-backend model in terms of resiliency and scaling: it avoids several inter-related issues around the concentration of frontend miss/pass traffic (see also pass_random and related bits) on backend nodes even under normal conditions, avoids the cascading depool effect at the backend layer in scenarios like heavy image linking, and gets rid of the only true reliability dependency between the cache nodes.
  • The original experimental results from the upload cluster indicated it wasn't feasible to do this with our current storage sizes. The switch from chashing to single-backend eliminates our storage set multiplier, making the effective cache set size 1/8th what it was under chashing, and this causes too many misses and a large increase in applayer traffic.
  • We decided that, going forward, we would try to increase the per-node cache storage to offset that effect and be able to do single-backend on all new cache nodes with acceptable tradeoffs.
  • ulsfo, eqsin, and esams are all on the calendar for hardware warranty refresh this FY (ulsfo+eqsin in Q1/2, esams nearer the end of the year), which offers an opportunity to move forward on this in half our datacenters in a single year (and it would be a big missed opportunity in our refresh timeline if we fail to address this storage issue now).
  • Lacking the time, resources, and supply-chain speediness to actually test multiple potential new hardware configurations (there are several possible disk configurations that could provide the necessary performance and/or size) before it was time to order, we did our best at guesstimating the various tradeoffs, made some hard choices on total cost, and placed orders for both ulsfo + eqsin.
  • The new hardware has several changes (see also: https://wikitech.wikimedia.org/wiki/Traffic_cache_hardware ), but the most relevant bits here are:
    • We stuck with the same style of NVMe drives we were using before (known-great on random access perf, 4K native block size, etc).
    • The size of the drives went up from the 1.6TB model to the 6.4TB model (4x larger).
    • As a cost compromise, the upload cluster gets two of the new disks, while text only gets one.
    • In net, this means the new-hardware upload nodes can go single-backend with no expected downsides: the effective storage size and the disk performance should be ~identical to our previous hardware's chashed config (see the back-of-the-envelope sketch after this list).
    • For the text case, we get the same disk performance guarantees, but we opted for half the storage as a cost compromise (the disks are very expensive!). We think this will be an acceptable tradeoff for text:
      • We believe the shape of cache utilization on text is different from upload's, and text doesn't gain as much benefit from large backend storage to begin with.
      • The new nodes also have 33% more RAM for the frontend cache, which should reduce reliance on the backend layer for overall hitrate, especially for text (upload's frontend object-size cutoff effect is more pronounced, so extra frontend RAM helps it less).
      • If the results prove to be poor, we'll circle back and decide how best to use the hardware (there are a few ways we can "fix" things in software if necessary, or we could opt to buy additional disks for ulsfo+eqsin and argue for the cost hit).
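
To make the storage arithmetic above concrete, a back-of-the-envelope sketch in shell (the legacy figure of one 1.6TB cache drive per node and the count of ~8 upload backends per DC are assumptions inferred from the numbers in this task, not authoritative specs):

# Legacy chashed upload layer: ~8 backend nodes x ~1.6TB each, forming one effective cache set.
echo "legacy chashed set size:        $(( 8 * 1600 )) GB"   # ~12.8TB, in line with the ~12T quoted above
# Single-backend on legacy hardware: each frontend only uses its local node, so the
# effective set shrinks to ~1/8th -- hence the hitrate drop and ~3x origin traffic observed.
echo "legacy single-backend set size: 1600 GB"
# New upload nodes carry two 6.4TB drives, so a single local backend roughly matches
# the old chashed total; new text nodes get one drive, i.e. half that.
echo "new upload node (2 x 6.4TB):    $(( 2 * 6400 )) GB"
echo "new text node   (1 x 6.4TB):    6400 GB"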

Current stuff:

  • The patches merged above re-work the experimental single-backend configuration from ema into a working production config for the new hardware.
  • There's a new hieradata config key, profile::cache::varnish::frontend::single_backend: true, that enables this; it should only be set on new-hardware nodes (see the sketch after this list).
  • As we transition each datacenter to new hardware and the single-backend model, we set this per-host on new hosts until all the legacy hosts are decommed; after that, we can move it to per-DC hieradata.
  • It will take a few years to transition all DCs to the new model on their natural warranty boundaries, which will give us plenty of time to course-correct as we go. drmrs will be the last one to transition, as its current cp nodes are only ~1 year old at this point.
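
As a hedged illustration of how the key is applied (the hieradata path below is an assumption about the repository layout, and cp4045 is just the host used earlier in this task):

# In a checkout of operations/puppet, enable the mode for one new-hardware host:
cat >> hieradata/hosts/cp4045.yaml <<'EOF'
profile::cache::varnish::frontend::single_backend: true
EOF
# Once all legacy hosts in a DC are decommissioned, the same key can move to the
# DC-level hieradata instead of being repeated per host.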

Update: ulsfo is repooled as of this morning, with all new hardware on the new configuration and the "single-backend" mode enabled for both clusters. We'll be keeping an eye on hitrates here, and then trying to follow the same pattern in the upcoming eqsin hardware transition.

akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam.

Change 907912 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Increase varnish max_connections to ats-be on eqsin|ulsfo

https://gerrit.wikimedia.org/r/907912

Change 907912 merged by Vgutierrez:

[operations/puppet@production] hiera: Increase varnish max_connections to ats-be on eqsin|ulsfo

https://gerrit.wikimedia.org/r/907912

Change 908245 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts

https://gerrit.wikimedia.org/r/908245

Change 908245 merged by Vgutierrez:

[operations/puppet@production] hiera: merge: hash for profile::cache::varnish::frontend::cache_be_opts

https://gerrit.wikimedia.org/r/908245

Change 948581 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: enable single backend on esams (post knams migration)

https://gerrit.wikimedia.org/r/948581

Change 948581 abandoned by Ssingh:

[operations/puppet@production] hiera: enable single backend on esams and switch to F4-U hardware config

Reason:

no longer required

https://gerrit.wikimedia.org/r/948581

Change 953700 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Fix cache_upload timeouts in single-backend sites

https://gerrit.wikimedia.org/r/953700

Change 953700 merged by Vgutierrez:

[operations/puppet@production] Fix cache_upload timeouts in single-backend sites

https://gerrit.wikimedia.org/r/953700