Experiment with single backend CDN nodes
Open, Medium, Public

Description

We want to evaluate the practical performance implications of using a single, local cache backend instead of spreading the whole dataset across multiple nodes with c-hashing.

To do so, we need to change the current Puppetization so that one host can be taken out of the c-hash pool and used exclusively as a local backend. This could be done with a hiera setting, say cache::single_backend_fqdn, excluding that hostname from the pool of backends for a given cache cluster/DC. Additionally, the host named by cache::single_backend_fqdn must itself use only localhost as its cache backend.
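The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the intended behaviour, not the actual Puppet code: the function name backends_for and its arguments are hypothetical, and an unset cache::single_backend_fqdn is modelled as None.

```python
def backends_for(host, cluster_nodes, single_backend_fqdn=None):
    """Return the cache backends that `host` should use (illustrative only)."""
    if not single_backend_fqdn:
        # Experiment disabled: every node c-hashes across the whole cluster.
        return sorted(cluster_nodes)
    if host == single_backend_fqdn:
        # The experiment host only talks to its own local backend.
        return ["localhost"]
    # Every other node drops the experiment host from its pool.
    return sorted(n for n in cluster_nodes if n != single_backend_fqdn)


nodes = ["cp4021.ulsfo.wmnet", "cp4022.ulsfo.wmnet", "cp4023.ulsfo.wmnet"]
print(backends_for("cp4022.ulsfo.wmnet", nodes, "cp4021.ulsfo.wmnet"))
print(backends_for("cp4021.ulsfo.wmnet", nodes, "cp4021.ulsfo.wmnet"))
```

With the setting unset, all hosts get the full list back, which is the pre-experiment behaviour.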

Dashboards such as cache-hosts-comparison can then be used to observe the impact on hitrate/TTFB on various nodes. Because the remaining hosts share N-1 backend nodes instead of N, backend hitrate may change across the board, which should be taken into account when interpreting the results.
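To get a feel for the N-1 effect, here is a rough sketch of how many objects change owner when one node leaves a hashed pool. It uses rendezvous (highest-random-weight) hashing as a stand-in for the c-hashing actually done by Varnish, so the numbers are only indicative. The node names and keys are made up.

```python
import hashlib


def owner(key, nodes):
    """Pick the node with the highest hash score for this key (rendezvous hashing)."""
    def score(node):
        digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)


def remapped_fraction(keys, nodes, removed):
    """Fraction of keys whose owner changes when `removed` leaves the pool."""
    remaining = [n for n in nodes if n != removed]
    moved = sum(1 for k in keys if owner(k, nodes) != owner(k, remaining))
    return moved / len(keys)


# Hypothetical 8-node cluster and synthetic object names.
nodes = [f"cp40{n}" for n in range(21, 29)]
keys = [f"/object/{i}" for i in range(10000)]
fraction = remapped_fraction(keys, nodes, "cp4021")
print(f"{fraction:.1%} of keys moved to another backend")
```

Only the keys previously owned by the removed node move, so roughly 1/N of the dataset is redistributed over the remaining N-1 backends.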

The procedure for enabling the experiment on one node ($host) is as follows:

  • Depool the host from all user traffic with sudo -i depool
  • (Optional, but useful when evaluating hitrate) Stop trafficserver.service, empty the ATS backend cache with traffic_server -C clear_cache, then start trafficserver.service
  • Set cache::single_backend_fqdn: $host in hiera for the DC/cluster the host is part of (e.g. for host=cp4027, ulsfo/text)
  • Run puppet on all cache nodes in the DC/cluster. Ensure that $host is removed from the list of backends on all varnish instances in the DC/cluster with sudo -i varnishadm -n frontend backend.list
  • Ensure that varnish on $host points to localhost and that the node behaves well: https://wikitech.wikimedia.org/wiki/Varnish#Force_your_requests_through_a_specific_Varnish_frontend
  • Repool the host for user traffic with sudo -i pool. Ensure that $host is not listed in /etc/varnish/directors.frontend.vcl on any DC/cluster node

Disabling the experiment:

  • Depool the host from all user traffic with sudo -i depool
  • Unset cache::single_backend_fqdn in hiera for the DC/cluster the host is part of
  • Run puppet on all cache nodes in the DC/cluster. Ensure that $host is added to the list of backends on all varnish instances in the DC/cluster with sudo -i varnishadm -n frontend backend.list
  • Ensure that varnish on $host points to all nodes in the DC/cluster and that the node behaves well: https://wikitech.wikimedia.org/wiki/Varnish#Force_your_requests_through_a_specific_Varnish_frontend
  • Repool the host for user traffic with sudo -i pool. Ensure that $host is listed in /etc/varnish/directors.frontend.vcl on all DC/cluster nodes
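The "is $host listed in /etc/varnish/directors.frontend.vcl" checks in both procedures could be scripted along these lines. This is a minimal sketch: it assumes the file contains add_backend(be_<fqdn with dots replaced by underscores>) lines, as produced by the confd template quoted later in this task, and the helper names are hypothetical.

```python
import re


def listed_backends(vcl_text):
    """Return the set of backend identifiers passed to add_backend() in the VCL."""
    return set(re.findall(r"add_backend\(be_(\w+)", vcl_text))


def host_listed(vcl_text, fqdn):
    """True if the given FQDN appears as a backend in the director definitions."""
    return fqdn.replace(".", "_") in listed_backends(vcl_text)


# Synthetic example input; on a real node, read the file contents instead.
example = """
new cache_local = directors.shard();
cache_local.add_backend(be_cp4022_ulsfo_wmnet);
cache_local_random.add_backend(be_cp4022_ulsfo_wmnet, 1);
"""
print(host_listed(example, "cp4022.ulsfo.wmnet"))
print(host_listed(example, "cp4021.ulsfo.wmnet"))
```

When enabling the experiment, host_listed should be false for $host on every node; after disabling it, it should be true everywhere.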

Event Timeline

ema triaged this task as Medium priority. Aug 5 2021, 8:51 AM

Change 710224 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: single backend experiment

https://gerrit.wikimedia.org/r/710224

Change 710236 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: refactor dynamic_backend_caches logic

https://gerrit.wikimedia.org/r/710236

Change 710244 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable single backend experiment on cp4027

https://gerrit.wikimedia.org/r/710244

Change 710236 merged by Ema:

[operations/puppet@production] cache: refactor dynamic_backend_caches logic

https://gerrit.wikimedia.org/r/710236

Change 710224 merged by Ema:

[operations/puppet@production] cache: single backend experiment

https://gerrit.wikimedia.org/r/710224

Change 710973 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: use confd only if backend list is on etcd

https://gerrit.wikimedia.org/r/710973

Change 710973 merged by Ema:

[operations/puppet@production] cache: use confd only if backend list is on etcd

https://gerrit.wikimedia.org/r/710973

Change 726912 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: exclude single backend experiment from pooled ATS backends

https://gerrit.wikimedia.org/r/726912

To document this somewhere with syntax highlighting: when the experiment is not running anywhere, the patch above changes the Go template as follows:

--- /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl.orig
+++ /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl
@@ -1,7 +1,7 @@
 new cache_local = directors.shard();
 new cache_local_random = directors.random();
 
-{{range $node := ls "/conftool/v1/pools/esams/cache_text/ats-be/"}}{{ $key := printf "/conftool/v1/pools/esams/cache_text/ats-be/%s" $node }}{{ $data := json (getv $key) }}{{ if eq $data.pooled "yes"}}
+{{range $node := ls "/conftool/v1/pools/esams/cache_text/ats-be/"}}{{ $key := printf "/conftool/v1/pools/esams/cache_text/ats-be/%s" $node }}{{ $data := json (getv $key) }}{{ if and (eq $data.pooled "yes") (ne $node "") }}
 cache_local.add_backend(be_{{ $parts := split $node "." }}{{ join $parts "_" }});
 cache_local_random.add_backend(be_{{ $parts := split $node "." }}{{ join $parts "_" }}, {{ $data.weight }});
 {{end}}{{end}}
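For readers less familiar with confd templates, the patched condition corresponds roughly to the following Python. It is a loose re-implementation for illustration only: the pool dictionary stands in for the etcd data under /conftool/v1/pools/..., render_directors is a hypothetical name, and sorting the nodes is an assumption made for deterministic output.

```python
def render_directors(pool, excluded_fqdn=""):
    """Emit add_backend lines for pooled nodes, skipping the experiment host.

    `excluded_fqdn` is the empty string when the experiment is not running,
    which mirrors the (ne $node "") comparison in the diff above.
    """
    lines = []
    for node in sorted(pool):
        data = pool[node]
        if data["pooled"] == "yes" and node != excluded_fqdn:
            ident = "be_" + node.replace(".", "_")
            lines.append(f"cache_local.add_backend({ident});")
            lines.append(f"cache_local_random.add_backend({ident}, {data['weight']});")
    return "\n".join(lines)


# Made-up pool state for three nodes.
pool = {
    "cp3050.esams.wmnet": {"pooled": "yes", "weight": 1},
    "cp3051.esams.wmnet": {"pooled": "yes", "weight": 1},
    "cp3052.esams.wmnet": {"pooled": "no", "weight": 1},
}
print(render_directors(pool, "cp3051.esams.wmnet"))
```

Since no node name is ever empty, the extra comparison is a no-op while the experiment is disabled.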

Change 726912 merged by Ema:

[operations/puppet@production] cache: exclude single backend experiment from pooled ATS backends

https://gerrit.wikimedia.org/r/726912

Mentioned in SAL (#wikimedia-operations) [2021-10-12T10:23:42Z] <ema> depool/repool ats-be on cp4028 to verify updates to /etc/varnish/directors.frontend.vcl on cp4027 keep on working fine T288106

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:35:51Z] <ema> cp4021: depool to enable single backend experiment T288106

Change 710244 merged by Ema:

[operations/puppet@production] cache: enable single backend experiment on cp4021

https://gerrit.wikimedia.org/r/710244

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:41:49Z] <ema> cp4021: stop ats-be and clear its cache T288106

After setting cache::single_backend_fqdn: cp4021.ulsfo.wmnet in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for instance cp4022:

root@cp4022:~# varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4022_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4023_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4024_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4025_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4026_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4033_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4034_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT

Except for cp4021 itself:

root@cp4021:~# varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-2cad684a-8940-437b-9279-f982d60126a0.be_cp4021_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:43:54 GMT
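Listings like the two above can be compared mechanically. The sketch below parses backend.list output in the column layout shown here (name, admin, probe state, and so on). Other Varnish versions may format it differently, and parse_backend_list is a hypothetical helper.

```python
def parse_backend_list(output):
    """Map backend identifier (text after '.be_') to its probe state."""
    backends = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 'vcl-<id>.be_<name>' row.
        if len(fields) < 3 or ".be_" not in fields[0]:
            continue
        name = fields[0].split(".be_", 1)[1]
        backends[name] = fields[2]
    return backends


sample = """Backend name                   Admin      Probe                Last updated
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4022_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
vcl-13b6a7e0-ba08-4804-b583-9b17de9bcb67.be_cp4023_ulsfo_wmnet probe      Healthy             5/5 Thu, 18 Nov 2021 09:37:54 GMT
"""
backends = parse_backend_list(sample)
print(sorted(backends))
print("cp4021_ulsfo_wmnet" in backends)
```

On the experiment host the parsed result should contain exactly one entry, and on every other node it should not contain the experiment host at all.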

I have cleared the ats-be cache on cp4021, forced my client to go through that node, and verified that images are served correctly. We can now repool the node for user traffic.

Mentioned in SAL (#wikimedia-operations) [2021-11-18T09:56:29Z] <ema> cp4021: repool w/ single backend experiment enabled T288106

Change 743910 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable single backend experiment on cp3051

https://gerrit.wikimedia.org/r/743910

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:22:49Z] <ema> cp3051: depool to enable single backend experiment T288106

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:23:55Z] <ema> cp3051: stop ats-be and clear its cache T288106

Change 743910 merged by Ema:

[operations/puppet@production] cache: enable single backend experiment on cp3051

https://gerrit.wikimedia.org/r/743910

Mentioned in SAL (#wikimedia-operations) [2021-12-08T10:35:12Z] <ema> cp3051: repool w/ single backend experiment enabled T288106

Change 749131 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] Revert "cache: enable single backend experiment on cp4021"

https://gerrit.wikimedia.org/r/749131

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:14:35Z] <ema> cp4021: depool to revert single backend experiment T288106

Change 749131 merged by Ema:

[operations/puppet@production] Revert "cache: enable single backend experiment on cp4021"

https://gerrit.wikimedia.org/r/749131

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:29:12Z] <ema> cp4021: pool with single backend experiment reverted T288106

Change 749132 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] Revert "cache: enable single backend experiment on cp3051"

https://gerrit.wikimedia.org/r/749132

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:45:41Z] <ema> cp3051: depool to revert single backend experiment T288106

Change 749132 merged by Ema:

[operations/puppet@production] Revert "cache: enable single backend experiment on cp3051"

https://gerrit.wikimedia.org/r/749132

Mentioned in SAL (#wikimedia-operations) [2021-12-21T08:50:48Z] <ema> cp3051: pool with single backend experiment reverted T288106

Although we did briefly discuss the results of this experiment within Traffic, I don't think we ever publicly disclosed our analysis.

In terms of user-perceived latency, using a single local cache backend in the upload@esams cluster increases TTFB from 36 to 41.5 milliseconds (about 15%):

Screenshot from 2022-02-09 09-30-57.png

The amount of data fetched from the origins increases 3x:

Screenshot from 2022-02-09 09-40-36.png

The latter data point is especially concerning, given that those fetches happen over backhaul links. The conclusion is that, at least for the upload@esams cluster, which is likely the worst case in terms of amount and type of traffic served, we cannot simply use a single local backend instead of spreading the dataset across all available ATS backend instances in the cluster.

Further work is needed to determine the following:

  • How does the single backend architecture perform with slower, larger disks? The total sum of backend disk size in esams is currently 12T, and SATA disks of at least that capacity are certainly an option that could be considered. I've got 40T on my workstation at home. :)
  • How about the text cluster? The type and patterns of traffic are different, so the results would probably differ too.

Change 817298 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::varnish::frontend: Drop confd_experiment_fqdn

https://gerrit.wikimedia.org/r/817298