Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters
Closed, Declined, Public

Description

As observed before, and most recently in https://wikitech.wikimedia.org/wiki/Incident_documentation/20170322-AuthDNS , sometimes the random splay of our cron entries for puppet agent runs conspires against us, stacking servers that are critically redundant with each other close together in time. We should come up with some way to alleviate this issue. The wmflib cron_splay parser function addresses a similar problem in a much-reduced scope and may serve as a starting point for how we could address this, or there could be other creative solutions that are simpler in nature.

Event Timeline

BBlack renamed this task from Fix the general problem of randomly-bad puppet agent cron timings within redundancy clusters to Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters. Mar 22 2017, 7:22 PM

I was trying to think of a way to do this that isn't quite as stateful as current cron_splay, but I haven't thought of a good one yet. If we assume we're trying to just extend the cron_splay mechanism to cover a more-general case like this, it would need the entire nodelist the global cron is applied to, as well as some way to notice which nodes are part of a shared cluster (e.g. name of applied role class?), and some way to identify datacenter/site (current cron_splay uses NNNN from hostname, which doesn't apply to all clusters).

Given those inputs and the host being currently-compiled for, the logic would go something like this:

$cluster_hosts = [list of all hosts with the same role as this host, including this host itself]
$cluster_hosts_interleaved = [above list, re-interleaved/randomized on a per-DC basis like cron_splay does with NNNN numbers, but based on $::site]
... generate cron offset of this host, based on the stable ordering above ...

And then the interval support would need fixing to support e.g. 30-minute intervals and other such arbitrary values. Currently it only does "hourly", "daily", and "weekly".
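
The logic above could be sketched roughly as follows. This is only an illustration in Python, not the actual cron_splay code: the function names, the round-robin interleaving across sites, and the hostnames in the usage note are all assumptions made for the example. It takes the hosts of one cluster grouped by site, builds a stable interleaved ordering, and spreads the hosts evenly over an arbitrary interval in minutes.

```python
from itertools import zip_longest


def interleave_by_site(hosts_by_site):
    """Return a stable ordering that alternates between sites (assumed scheme).

    Hosts are sorted within each site and sites are visited in sorted order,
    so the result depends only on the set of hosts, not on input order.
    """
    columns = [sorted(hosts_by_site[site]) for site in sorted(hosts_by_site)]
    ordered = []
    for row in zip_longest(*columns):
        # Sites with fewer hosts run out first; skip their empty cells.
        ordered.extend(host for host in row if host is not None)
    return ordered


def cluster_offset(host, hosts_by_site, interval_minutes):
    """Minute offset of `host` within the interval, based on its slot."""
    ordered = interleave_by_site(hosts_by_site)
    slot = ordered.index(host)
    # Spread the slots evenly across the interval (integer minutes).
    return (slot * interval_minutes) // len(ordered)
```

For a hypothetical cluster `{"eqiad": ["dns1001", "dns1002"], "codfw": ["dns2001"], "esams": ["dns3001"]}` and a 30-minute interval, every host gets its own slot and no two hosts share a minute, regardless of which site they are in. The ordering only changes when the cluster membership changes.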

I agree with the principle, but we should also take into account the total distribution against the puppetmasters to avoid congestion, and be careful with the per-DC basis.

A few caveats that come to mind:

  1. to be deterministic (i.e. not change the crontab timing at each puppet run, at least as long as the size of the cluster stays the same) we could end up with an algorithm that, for any given cluster, always starts the first host at 0, for example, causing congestion on the puppetmasters at specific times in the hour. We should ensure the global distribution stays even when the cluster distribution is also taken into account.
  2. we have puppetmasters only in eqiad and codfw, so $::site doesn't help for the global distribution, because hosts in the other sites go to another site's puppetmaster.
    • the counterargument here could be that the other sites are small enough that we can assume they cannot create congestion on the puppetmasters as long as they are generally distributed
  3. if we re-interleave/randomize on a per-DC basis, we might still not solve the DNS hosts problem: since they are in different DCs, they might again end up with the same time, if I have understood your logic above correctly.
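
One possible way to address the first caveat, sketched in Python under stated assumptions: give each cluster a deterministic base offset derived from a hash of its name, so that different clusters do not all place their first host at minute 0. The function names, the use of SHA-256, and the cluster name in the usage note are hypothetical choices for illustration; `slot` is assumed to be the host's position in some stable ordering of its cluster.

```python
import hashlib


def cluster_base_offset(cluster_name, interval_minutes):
    """Deterministic per-cluster starting minute in [0, interval_minutes)."""
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % interval_minutes


def host_minute(cluster_name, slot, cluster_size, interval_minutes):
    """Minute for the host at `slot`, shifted by the cluster's base offset."""
    base = cluster_base_offset(cluster_name, interval_minutes)
    step = (slot * interval_minutes) // cluster_size
    # Wrap around so the result stays inside the interval.
    return (base + step) % interval_minutes
```

With a hypothetical three-host cluster named "authdns" and a 30-minute interval, the three hosts stay 10 minutes apart, but the whole pattern is rotated by a cluster-specific amount. This does not guarantee an even global load on the puppetmasters; it only avoids every cluster starting at the same minute.
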
jbond subscribed.

Although the principle still stands, I wonder whether this is still something we want to modify, or whether time has told us that we can basically live without it?

joanna_borun changed the task status from Invalid to Declined.