Revisit varnish dynamic backends mechanism
Open, Medium, Public

Description

This task uses upload@eqsin as an example, but applies to all DCs/clusters.

We currently use confd in production to generate the list of origin servers (ATS backend instances) used by Varnish frontends. The list is generated using the template /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl, which looks for all pooled nodes defined in etcd under /conftool/v1/pools/eqsin/cache_upload/ats-be/ and writes the VCL file /etc/varnish/directors.frontend.vcl, which looks like this:

new cache_local = directors.shard();
new cache_local_random = directors.random();

cache_local.add_backend(be_cp5001_eqsin_wmnet);
cache_local_random.add_backend(be_cp5001_eqsin_wmnet, 100);
[...]

While the file above defines which cache backends are pooled, the list of all available backends is in /etc/varnish/wikimedia-common_upload-frontend.inc.vcl, and is driven by the cache::nodes['upload']['eqsin'] hiera setting.

# Generated list of cache backend hosts for director consumption
backend be_cp5001_eqsin_wmnet {
        .host = "cp5001.eqsin.wmnet";
        .port = "3128";
        .connect_timeout = 5s;
        .first_byte_timeout = 35s;
        .between_bytes_timeout = 60s;
        .max_connections = 50000;
        .probe = varnish;
}
[...]

The assumption is that each hostname defined in etcd has a matching entry in hiera. If that is not the case (i.e. a host is defined in etcd but not in puppet/hiera), confd still generates /etc/varnish/directors.frontend.vcl successfully, but the backend symbol for that host (e.g. be_cp5001_eqsin_wmnet) is not defined anywhere, so reloading the VCL or restarting Varnish fails. The issue goes unnoticed until the first VCL reload triggered by a puppet change, which may happen several days after etcd and hiera first diverged.
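The mismatch described above can in principle be detected without reloading Varnish, by comparing the symbols referenced in the directors file with the backends that are actually declared. The following is a minimal sketch in Python, assuming both files have the shapes shown in the excerpts above. The function names and regular expressions are illustrative and not part of any existing tooling.

```python
import re

# Matches declarations such as "backend be_cp5001_eqsin_wmnet {".
BACKEND_DEF_RE = re.compile(r"^\s*backend\s+(\w+)\s*\{", re.MULTILINE)
# Matches calls such as "cache_local.add_backend(be_cp5001_eqsin_wmnet, 100);"
# and captures only the backend symbol, with or without a weight argument.
ADD_BACKEND_RE = re.compile(r"\.add_backend\(\s*(\w+)")


def defined_backends(backends_vcl):
    """Return the set of backend symbols declared in the VCL text."""
    return set(BACKEND_DEF_RE.findall(backends_vcl))


def referenced_backends(directors_vcl):
    """Return the set of backend symbols added to any director."""
    return set(ADD_BACKEND_RE.findall(directors_vcl))


def undefined_backends(directors_vcl, backends_vcl):
    """Return symbols that are pooled in a director but never declared."""
    missing = referenced_backends(directors_vcl) - defined_backends(backends_vcl)
    return sorted(missing)
```

Feeding it the contents of /etc/varnish/directors.frontend.vcl and /etc/varnish/wikimedia-common_upload-frontend.inc.vcl would return an empty list when etcd and hiera agree, and the names of the offending symbols otherwise.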

Further, even after the missing node is added to cache::nodes, the state file /var/run/reload-vcl-state still reads KO instead of OK.

We should think of ways to simplify/improve the mechanism, and have a working Icinga check alerting as soon as the issue arises.
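As a starting point for the alerting side, a check could map the contents of the state file to a monitoring status. The sketch below is written in Python and is not the existing check_vcl_reload. It assumes the file holds the literal string OK or KO, as described above, and uses the standard Nagios/Icinga plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical.

```python
from pathlib import Path

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Path taken from the task description.
STATE_FILE = Path("/var/run/reload-vcl-state")


def check_state(path):
    """Return (exit_code, message) for the given reload state file."""
    try:
        state = path.read_text().strip()
    except OSError as exc:
        # A missing or unreadable file is reported as UNKNOWN, not OK.
        return UNKNOWN, "UNKNOWN: cannot read {}: {}".format(path, exc)
    if state == "OK":
        return OK, "OK: {} reads OK".format(path)
    return CRITICAL, "CRITICAL: {} reads {!r}".format(path, state)


def main():
    """Print the status line and return the plugin exit code."""
    code, message = check_state(STATE_FILE)
    print(message)
    return code
```

A plugin wrapper would end with `sys.exit(main())`. Note that such a check only fires after a reload has already failed, so it would not on its own catch a mismatch between etcd and hiera before the next reload.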

Event Timeline

Change 692249 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: fix check_vcl_reload state check

https://gerrit.wikimedia.org/r/692249

ema triaged this task as Medium priority. May 17 2021, 7:58 AM

Change 692249 merged by Ema:

[operations/puppet@production] varnish: fix check_vcl_reload state check

https://gerrit.wikimedia.org/r/692249

Mentioned in SAL (#wikimedia-operations) [2021-05-17T09:06:18Z] <ema> cp_eqsin: run confd-reload-vcl manually to fix /var/run/reload-vcl-state T282880

> and is driven by the cache::nodes['upload']['eqsin'] hiera setting.

Related to this: would it be better to pull this information directly from PuppetDB? The list would then contain all nodes that have run the cache::text or cache::upload role, and we could add further constraints if needed.

I created a quick PoC which would be called with e.g. wmflib::cache::nodes('upload', 'eqsin')

Change 692286 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] wmflib::role_hosts: new function return list of hosts running a role

https://gerrit.wikimedia.org/r/692286

> I created a quick PoC which would be called with e.g. wmflib::cache::nodes('upload', 'eqsin')

After review, the signature has been made more generic, so you can do the following:

wmflib::role_hosts('cache::upload')                      #  Returns all hosts running the cache::upload role
wmflib::role_hosts('cache::upload', 'eqsin')             #  Returns all hosts running the cache::upload role with a hostname matching /eqsin/
wmflib::role_hosts('cache::upload', ['eqiad', 'eqsin'])  #  Returns all hosts running the cache::upload role with a fqdn matching /eqiad|eqsin/
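To illustrate the matching semantics suggested by the comments above, here is a rough Python sketch of the filtering step only. It is not the Puppet implementation: the real function would obtain the host list from PuppetDB, whereas here the list is passed in. The name filter_hosts and the sample hostnames in the usage note are made up for illustration. It assumes the location arguments are joined into a single regular expression.

```python
import re


def filter_hosts(hosts, location=None):
    """Return the sorted hosts whose FQDN matches the location filter.

    location may be None (no filtering), a single pattern, or a list of
    patterns that are joined with "|" into one regular expression.
    """
    if location is None:
        return sorted(hosts)
    patterns = [location] if isinstance(location, str) else list(location)
    regex = re.compile("|".join(patterns))
    # search() matches anywhere in the FQDN, mirroring /eqiad|eqsin/.
    return sorted(host for host in hosts if regex.search(host))
```

For example, given the hypothetical list `["cp5001.eqsin.wmnet", "cp1075.eqiad.wmnet", "cp3050.esams.wmnet"]`, calling `filter_hosts(hosts, ["eqiad", "eqsin"])` keeps the first two entries and drops the esams host.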

Change 709645 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: use wmflib::role_hosts instead of cache::nodes

https://gerrit.wikimedia.org/r/709645

Bump - we should revisit this, but perhaps after finishing the cache role name cleanup (text vs text_envoy vs text_haproxy...).