
integrate (pybal|varnish)->varnish backend config/state with etcd or similar
Closed, Resolved (Public)

Description

We need something like an etcd cluster (or a similar alternative) controlling the active cache lists at the various layers (pybal/LVS -> varnish, varnish -> varnish). Ideally the puppet nodelists would populate the basic nodelists in etcd (currently in hieradata/common/cache/*.yaml), and the data structure would allow nodes to be depooled at runtime via etcd updates, independently of that. We'd then wrap some tooling around it to easily depool a given cache node globally, in both the frontend and backend senses. From there it would become much easier to script daemon/host restarts with automatic depooling, and nodes could also self-(de|re)pool around clean reboots on their own via initscripts -> etcd.
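For illustration, the runtime depool described above would reduce to a single key write in etcd. A minimal Python sketch of the key layout and value shape; the `/conftool/v1/pools` prefix and the hostname are hypothetical placeholders, not a final schema:

```python
import json

# Hypothetical etcd key layout (illustrative only, not the final schema):
#   /conftool/v1/pools/<dc>/<cluster>/<service>/<host> -> {"pooled": ..., "weight": N}
ETCD_BASE = "/conftool/v1/pools"

def pool_key(dc, cluster, service, host):
    """Build the etcd key under which one node's pool state would live."""
    return "/".join([ETCD_BASE, dc, cluster, service, host])

def depool_value(weight):
    """JSON value marking a node depooled while preserving its weight,
    so a later repool restores the same hardware-based weighting."""
    return json.dumps({"pooled": "no", "weight": weight})

# Depooling a (hypothetical) text-cluster varnish backend in eqiad is then one
# etcd write of depool_value() at pool_key(); a repool flips "pooled" back.
key = pool_key("eqiad", "cache_text", "varnish-be", "cp1052.eqiad.wmnet")
```

An initscript hook for clean reboots would call the same two helpers: depool on shutdown, repool once the services are confirmed up.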

In code terms, the pybal integration could be direct, while varnish would probably need data-update-triggered regeneration of a VCL fragment plus a reload-vcl run.

Related Objects

Event Timeline

BBlack raised the priority of this task from to Low.
BBlack updated the task description.
BBlack added projects: acl*sre-team, Traffic.
BBlack added subscribers: BBlack, MoritzMuehlenhoff, Joe.

So, given we chose to go ahead with etcd, we will use confd to write a single VCL fragment containing the backend info, and our traditional scripts to reload it.
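A minimal confd sketch of that setup. The file paths, etcd keys, and check/reload commands here are assumptions for illustration; the real values would come from our puppetization:

```toml
# /etc/confd/conf.d/varnish-backends.toml  (illustrative sketch)
[template]
src        = "varnish-backends.tmpl"             # Go template rendering one backend per pooled node
dest       = "/etc/varnish/directors.backend.vcl"
keys       = ["/conftool/v1/pools/eqiad/cache_text/varnish-be"]  # hypothetical key prefix
check_cmd  = "varnishd -C -f /etc/varnish/wikimedia.vcl"         # hypothetical syntax check
reload_cmd = "/usr/share/varnish/reload-vcl"                     # the traditional reload script
```

confd watches the `keys` prefix, re-renders `dest` on any change, and only runs `reload_cmd` if `check_cmd` passes, which gives us the data-update-triggered VCL regeneration described earlier.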

An alternative approach for pybal would be to do the same thing as for varnish: generate config files with confd and have pybal pick them up from the local FS via file://, possibly at a shorter polling interval like 10s, or immediately via inotify.
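Pybal's native server-list format is one Python-literal dict per line, so a confd-generated pool file for file:// consumption could look like the sketch below (hostnames and path are illustrative):

```python
# /etc/pybal/pools/text  -- sketch of a confd-rendered pybal pool file;
# one entry per line, with 'enabled' driven by the etcd "pooled" flag
{'host': 'cp1052.eqiad.wmnet', 'weight': 10, 'enabled': True}
{'host': 'cp1053.eqiad.wmnet', 'weight': 10, 'enabled': False}
```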

For the Varnish switch: I am verifying that all hosts are represented correctly in the generated lists.

So far I've verified the text cluster, and the lists are 1:1 with what we get from puppet.

I've applied all the custom hardware-based weighting that matters at all levels for nginx/varnish-* pools.

I'm auditing that data too, but at the confctl level rather than the output-file level. It all looked correct for nodelists, pooled=yes, and weights, with only a few exceptions:

Findings that actually need cleaning:

  • dc=esams,cluster=cache_text,service=varnish-be
    • bad entry (which I created when testing a tool): {"cp3011.esams.wmnex": {"pooled": "no", "weight": 128}}
  • dc=codfw,cluster=cache_text,service=varnish-be
    • bad entry (same, but accidental): {"co2001.codfw.wmnet": {"pooled": "no", "weight": 100}}

Totally invalid sub-trees (the source data has since been corrected, but the useless/pointless keys still exist in the data); the ones I know of are:

  • cluster=cache_bits,service=varnish-be (all dcs)
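Cleaning those up should be a couple of delete operations at the confctl level. An illustrative sketch only; the exact confctl flags are recalled from memory and may differ in the version we're running, so treat the syntax as an assumption:

```shell
# Remove the two bad entries (illustrative confctl invocations):
confctl --tags dc=esams,cluster=cache_text,service=varnish-be \
    --action delete cp3011.esams.wmnex
confctl --tags dc=codfw,cluster=cache_text,service=varnish-be \
    --action delete co2001.codfw.wmnet
# ...plus deleting the stale cluster=cache_bits,service=varnish-be keys in each dc.
```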

Change 223029 had a related patch set uploaded (by Giuseppe Lavagetto):
varnish: enable dynamic directors for a subset of ulsfo hosts

https://gerrit.wikimedia.org/r/223029

Change 223030 had a related patch set uploaded (by Giuseppe Lavagetto):
varnish: enable dynamic directors in ulsfo

https://gerrit.wikimedia.org/r/223030

Change 223029 merged by Giuseppe Lavagetto:
varnish: enable dynamic directors for a subset of ulsfo hosts

https://gerrit.wikimedia.org/r/223029

Change 223030 merged by Giuseppe Lavagetto:
varnish: enable dynamic directors in ulsfo

https://gerrit.wikimedia.org/r/223030

Change 223312 had a related patch set uploaded (by BBlack):
varnish: enable dynamic directors in esams

https://gerrit.wikimedia.org/r/223312

Change 223312 merged by BBlack:
varnish: enable dynamic directors in esams

https://gerrit.wikimedia.org/r/223312

Change 224649 had a related patch set uploaded (by BBlack):
varnish: default dynamic_directors true (changes eqiad)

https://gerrit.wikimedia.org/r/224649

Change 224649 merged by BBlack:
varnish: default dynamic_directors true (changes eqiad)

https://gerrit.wikimedia.org/r/224649