
Alert on Varnish high thread count
Closed, Resolved (Public)

Description

We want to be alerted if the thread count is too high. varnish_main_threads already exists, but we don't have any way to compare it to the configured maximum values. To that end, we'll need to expose the configured maximums as we did in T292815 for ATS.
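For illustration, the desired alert could look something like the following Prometheus rule sketch. The metric names (`varnish_main_threads`, `varnish_param_thread_pool_max`) and the 80% threshold are assumptions for the sake of example, not the exact rule that was eventually merged:

```yaml
# Hypothetical alerting rule; assumes a varnish_param_thread_pool_max
# gauge has been exported alongside the existing varnish_main_threads.
groups:
  - name: varnish
    rules:
      - alert: VarnishHighThreadCount
        expr: varnish_main_threads / varnish_param_thread_pool_max > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Varnish thread usage above 80% of configured maximum"
```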

Event Timeline

Change 863406 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] varnish: Export runtime params for Prometheus

https://gerrit.wikimedia.org/r/863406

Change 863406 merged by BCornwall:

[operations/puppet@production] varnish: Export runtime params for Prometheus

https://gerrit.wikimedia.org/r/863406

Change 877263 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Add Varnish thread percent usage rule

https://gerrit.wikimedia.org/r/877263

Change 878166 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] varnish: Alert on high thread count

https://gerrit.wikimedia.org/r/878166

Unfortunately, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406/ has caused logspam every ten minutes in /var/log/messages.

03:27 <vgutierrez> brett: BTW.. as a side effect of https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406 varnish params get now dumped on /var/log/messages every 10 minutes
05:05 <bblack> ugh that's annoying
05:09 <bblack> doesn't look like there's any easy flag to disable that either
05:09 <bblack> because it's the server side of the CLI connection doing the logging.  There is a parameter to turn it off, but it's global.
05:14 <bblack> executing twice for the two items is at least less-spammy
05:33 <vgutierrez> yep.. we can cut the output in half by reusing the varnishadm output
05:43 <vgutierrez> hmm it's actually being reused right now
05:53 <bblack> yeah it pulls two different params
06:02 <vgutierrez> so.. the number of threads can be pulled from varnishstat
06:02 <vgutierrez> (current threads)
06:02 <vgutierrez> MAIN.threads                         500          .   Total number of threads
06:03 <vgutierrez> and the max one can be fetched from our puppetization instead of using varnishadm
06:32 <bblack> yeah this is all about the config side (varnishadm vs puppet).  We went through these arguments before (brett + I), but the logfile spam, we didn't consider or realize.
06:34 <bblack> just in like "data basics" sorts of terms: it makes the most sense to consider puppet the single source of truth for config, and have puppet send the puppet-configured value to both the real config files /and/ to prometheus (for X/max abilities in grafana or alerting)
06:34 <bblack> but then there's a few counterarguments (of varying degrees) for the dynamic varnishadm method:
06:35 <bblack> *) I think there was an argument about a static, unchanging prometheus input file causing staleness alerts (and even if you fix that in puppetization: how do you know in general such a file is being regenerated when it should be, if the normal status quo is for the timestamp to be stale?)
06:36 <bblack> *) Technically, all these params can be modified at runtime with varnishadm.  Therefore if we only report the static config value to prom, if someone goes and edits thread_pool_max in response to an incident or whatever, the data/graphs become wrong.
06:37 <bblack> *) Currently, the values in question about threads (thread_pools + thread_pools_max) aren't even arguments or hieradata.  One is unspecified at all (we use the varnish default implicitly), and the other is a hardcoded string in a systemd file at present.
06:39 <bblack> anyways, it's not a very clear decision, the tradeoffs get a bit bikesheddy
06:40 <bblack> actually s/hardcoded string/templated from #cpus/ I think
06:40 <bblack> which is probably the wrong way to be doing it, given varnish's threading design, but it's what we have right now
06:41 <bblack> yeah:
06:41 <bblack> -p thread_pool_min=250 -p thread_pool_max=<%= @processorcount * 250 -%>
06:41 <bblack> (processorcount doubled on the new hosts, so the wrongness is starting to matter)
06:45 <bblack> ---
06:45 <bblack> anyways, if we want to try going the other way (on the argument that the logfile spam is the last straw knocking down the viability of this path)
06:46 <bblack> it's basically:
06:47 <bblack> elevate processcount*250 calculation to some variable that can be output for both cases, create thread_pools=2 somewhere in hieradata and pass it around, and fix the prom staleness thing, and accept that the /max part will be wrong if edited via varnishadm for temporary reasons.
06:48 <bblack> (and maybe document that thread_pools is at the implicit default and not intentionally "tuned", so it will be less-confusing when it causes some problem in varnish-8.x or whatever.
06:48 <bblack> )
06:49 <bblack> or I guess, for the thread_pools part, you could also make the argument to avoid parameterizing it explicitly for our systemd unit, and instead just writing a fixed "2" in the prom file and leaving some comment about it in puppet somewhere (since that implicit default hasn't changed in a long time and there's no good reason to think we would ever need to)
06:51 <bblack> (we could also just fold it up into one param for prometheus at that point, honestly): varnish_param_threads_max = thread_pools_max*2
06:51 <bblack> )
06:52 <bblack> all solutions have some kind of pain, and we can't predict the future anyways, etc

Specifically, varnishadm is causing the main varnish service to spit out all parameters, so this isn't a simple fix within the new prometheus_varnish_params.service.
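The approach taken by the follow-up patches (generating the params file from Puppet-known values instead of querying varnishadm) can be sketched roughly as follows. The function, metric names, and the per-CPU multiplier are illustrative assumptions, not the actual Puppet template:

```python
# Sketch: emit Varnish thread parameters in the Prometheus node_exporter
# textfile format, using values already known to Puppet instead of
# querying varnishadm (which made varnishd log every parameter to
# /var/log/messages). All names and values here are illustrative.

def render_varnish_params(processor_count: int, per_cpu_threads: int = 250,
                          thread_pools: int = 2) -> str:
    # Mirrors the old systemd template: thread_pool_max = #cpus * 250
    thread_pool_max = processor_count * per_cpu_threads
    lines = [
        "# HELP varnish_param_thread_pool_max Configured thread_pool_max",
        "# TYPE varnish_param_thread_pool_max gauge",
        f"varnish_param_thread_pool_max {thread_pool_max}",
        "# HELP varnish_param_thread_pools Configured thread_pools",
        "# TYPE varnish_param_thread_pools gauge",
        f"varnish_param_thread_pools {thread_pools}",
    ]
    # node_exporter's textfile collector requires a trailing newline.
    return "\n".join(lines) + "\n"

print(render_varnish_params(48))
```

Writing this file from Puppet sidesteps the varnishadm CLI logging entirely, at the cost (noted in the chat above) of the exported value drifting if someone changes thread_pool_max at runtime via varnishadm.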

Change 878180 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] varnish: Revert export of Prometheus params

https://gerrit.wikimedia.org/r/878180

Change 878201 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] varnish: Template out thread pool settings

https://gerrit.wikimedia.org/r/878201

Change 878201 merged by BCornwall:

[operations/puppet@production] varnish: Template out thread pool settings

https://gerrit.wikimedia.org/r/878201

Change 879660 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Generate varnish params file

https://gerrit.wikimedia.org/r/879660

Change 878180 merged by BCornwall:

[operations/puppet@production] varnish: Revert export of Prometheus params

https://gerrit.wikimedia.org/r/878180

Change 879829 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Properly set old script as absent

https://gerrit.wikimedia.org/r/879829

Change 879829 merged by BCornwall:

[operations/puppet@production] prometheus: Properly set old script as absent

https://gerrit.wikimedia.org/r/879829

Change 879848 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] node: Exclude varnish params file from stale check

https://gerrit.wikimedia.org/r/879848

Change 879660 merged by BCornwall:

[operations/puppet@production] prometheus: Generate varnish params file

https://gerrit.wikimedia.org/r/879660

Change 877263 merged by BCornwall:

[operations/puppet@production] prometheus: Add Varnish thread percent usage rule

https://gerrit.wikimedia.org/r/877263

Change 879848 merged by BCornwall:

[operations/alerts@master] node: Exclude varnish params file from stale check

https://gerrit.wikimedia.org/r/879848

Change 878166 merged by BCornwall:

[operations/alerts@master] varnish: Alert on high thread count

https://gerrit.wikimedia.org/r/878166

Change 957680 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Fix thread_pool_max on esams, eqsin, ulsfo

https://gerrit.wikimedia.org/r/957680

@Vgutierrez Thanks for your patch fixing thread_pool_max; IIRC @BBlack had advised the flat 12000 max threads because of the arbitrary nature of processorcount. Is this patch to say that you'd prefer the old CPU-based value over overriding the default?

To clarify and expand on my position about this thread count parameter (which is really just a side issue; the ticket itself is fundamentally complete):

  1. Varnish's threading model is one that I'm not fond of to begin with (it's approximately thread-per-connection). We can't change that, but just saying. The models I'm more fond of are closer to thread-per-cpu (or a very small multiple thereof), with some kind of eventloop system multiplexing many connections or work items per thread.
  2. Because varnish's threads are modeled this way, it really doesn't make sense to derive from or relate to the CPU count when determining the thread count. If you were to impose any hardware-related limit on the thread count, it would probably be related to the maximum amount of per-thread stack + heap allocations you're willing to suffer out of the total available host memory. (Whereas in the other kind of threading model, it totally makes sense to base the calculation on CPU core counts).
  3. The real effect, as far as we care, of tuning varnish's maximum thread count, is that it's an indirect limit on the maximum amount of concurrent connections/transactions allowed to flow through varnish, implemented in a confusing way that mixes up the concerns of frontend-facing and backend-facing connections+txns. It's on that basis that it should be tuned.
  4. We don't know what the "right" value is for this anyways, but we do know that (a) under "normal" conditions, we use orders of magnitude fewer threads than the current 24K limit (12K per pool, 2 pools) and (b) under some adverse traffic scenarios, we quickly spin up the full 24K, but mostly to our own detriment. Spinning up a ton of excess threads is not going to fix the problem being induced by the traffic.
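As a back-of-the-envelope illustration of point 2: in a thread-per-connection model, the natural hardware-derived ceiling comes from memory, not CPU count. All figures below (the memory budget and per-thread cost) are assumed for illustration, not measured values:

```python
# Rough illustration: a memory-derived thread ceiling for a
# thread-per-connection server. The budget and per-thread overhead
# figures are illustrative assumptions only.

def max_threads_by_memory(memory_budget_bytes: int,
                          per_thread_bytes: int) -> int:
    """Upper bound on threads given a memory budget and an estimated
    per-thread cost (stack plus heap overhead)."""
    return memory_budget_bytes // per_thread_bytes

# e.g. willing to spend 4 GiB on thread overhead at ~256 KiB per thread:
budget = 4 * 1024**3
per_thread = 256 * 1024
print(max_threads_by_memory(budget, per_thread))  # 16384
```

A calculation like this would at least tie the limit to something the threading model actually consumes, unlike the CPU-count multiplier.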

Therefore, for now, I don't advise going back to the CPU multiplier model of deriving this, and I think it's best left alone at the current fixed settings unless/until we come up with better rationales for how to tune this (perhaps after solving some limits-based issues in haproxy + ats first!).

Closing due to lack of response. Please re-open if you'd like to re-ignite discussion. Thanks!