This task uses upload@eqsin as an example, but applies to all DCs/clusters.
We currently use confd in production to generate the list of origin servers (ATS backend instances) used by Varnish frontends. confd renders the template /etc/confd/templates/_etc_varnish_directors.frontend.vcl.tmpl, which looks up all pooled nodes defined in etcd under /conftool/v1/pools/eqsin/cache_upload/ats-be/, and writes the result to /etc/varnish/directors.frontend.vcl. The generated VCL file looks like this:
  new cache_local = directors.shard();
  new cache_local_random = directors.random();
  cache_local.add_backend(be_cp5001_eqsin_wmnet);
  cache_local_random.add_backend(be_cp5001_eqsin_wmnet, 100);
  [...]
The file above defines which cache backends are pooled. The list of all available backends, on the other hand, is in /etc/varnish/wikimedia-common_upload-frontend.inc.vcl, and is driven by the cache::nodes['upload']['eqsin'] hiera setting:
  # Generated list of cache backend hosts for director consumption
  backend be_cp5001_eqsin_wmnet {
      .host = "cp5001.eqsin.wmnet";
      .port = "3128";
      .connect_timeout = 5s;
      .first_byte_timeout = 35s;
      .between_bytes_timeout = 60s;
      .max_connections = 50000;
      .probe = varnish;
  }
  [...]
The assumption is that for each hostname defined in etcd there is a matching entry in hiera. If that is not the case (i.e. a host is defined in etcd but not in puppet/hiera), confd still generates /etc/varnish/directors.frontend.vcl successfully, but the backend symbol for that host (e.g. be_cp5001_eqsin_wmnet) is not defined anywhere, so reloading the VCL or restarting Varnish fails. The issue goes unnoticed until the first VCL reload triggered by a puppet change, which may happen several days after etcd and hiera first fell out of sync.
Further, even after the missing node is added to cache::nodes, the state file /var/run/reload-vcl-state still reads KO instead of OK.
We should think of ways to simplify/improve the mechanism, and have a working Icinga check alerting as soon as the issue arises.
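As a starting point for such a check, here is a minimal sketch in Python that compares the two files described above: it collects every symbol passed to add_backend() in the directors file and reports those that have no matching "backend <name> {" definition. The function names undefined_backends and run_check are hypothetical, and the regular expressions assume the VCL has the same shape as the snippets quoted in this task. The return codes follow the usual Nagios/Icinga plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Matches e.g. cache_local.add_backend(be_cp5001_eqsin_wmnet) and
# cache_local_random.add_backend(be_cp5001_eqsin_wmnet, 100).
# Assumption: backends are only ever pooled via add_backend() calls.
ADD_BACKEND_RE = re.compile(r"\.add_backend\(\s*([A-Za-z_][A-Za-z0-9_]*)")

# Matches e.g. "backend be_cp5001_eqsin_wmnet {".
BACKEND_DEF_RE = re.compile(r"\bbackend\s+([A-Za-z_][A-Za-z0-9_]*)\s*\{")


def undefined_backends(directors_vcl, backends_vcl):
    """Return the sorted backend symbols that are pooled but never defined."""
    referenced = set(ADD_BACKEND_RE.findall(directors_vcl))
    defined = set(BACKEND_DEF_RE.findall(backends_vcl))
    return sorted(referenced - defined)


def run_check(directors_path, backends_path):
    """Hypothetical check entry point: returns (exit_code, message)."""
    try:
        with open(directors_path, encoding="utf-8") as f:
            directors_vcl = f.read()
        with open(backends_path, encoding="utf-8") as f:
            backends_vcl = f.read()
    except OSError as exc:
        # Unreadable input: we cannot tell whether the VCL is consistent.
        return 3, "UNKNOWN: %s" % exc

    missing = undefined_backends(directors_vcl, backends_vcl)
    if missing:
        return 2, "CRITICAL: pooled but undefined backends: " + ", ".join(missing)
    return 0, "OK: all pooled backends are defined"
```

On an upload@eqsin frontend this would be called as run_check("/etc/varnish/directors.frontend.vcl", "/etc/varnish/wikimedia-common_upload-frontend.inc.vcl"), printing the message and exiting with the returned code. Unlike waiting for the next VCL reload, a check like this would fire as soon as confd writes a directors file that references a host missing from hiera.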