Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters
Closed, Declined · Public

Assigned To

None

Authored By

	BBlack
	Mar 22 2017, 7:00 PM

Description

As observed before, and most-recently in https://wikitech.wikimedia.org/wiki/Incident_documentation/20170322-AuthDNS , sometimes the random splay of our cron entries for puppet agent runs conspires against us, stacking servers which are critically-redundant with each other close together in the time axis. We should come up with some way to alleviate this issue. The wmflib cron_splay parser function addresses a similar problem in a much reduced scope and may serve as a starting point for how we could address this, or there could be other creative solutions that are simpler in nature.

Related Objects

Mentioned In: T171191: Should puppet auto-restart slapd?

Event Timeline

BBlack created this task. Mar 22 2017, 7:00 PM

Restricted Application added a subscriber: Aklapper. Mar 22 2017, 7:00 PM

BBlack renamed this task from Fix the general problem of randomly-bad puppet agent cron timings within redundancy clusters to Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters. Mar 22 2017, 7:22 PM

BBlack added a subscriber: Volans. Mar 23 2017, 5:15 PM

I was trying to think of a way to do this that isn't quite as stateful as current cron_splay, but I haven't thought of a good one yet. If we assume we're trying to just extend the cron_splay mechanism to cover a more-general case like this, it would need the entire nodelist the global cron is applied to, as well as some way to notice which nodes are part of a shared cluster (e.g. name of applied role class?), and some way to identify datacenter/site (current cron_splay uses NNNN from hostname, which doesn't apply to all clusters).

Given those inputs and the host being currently-compiled for, the logic would go something like this:

$cluster_hosts = [list of all hosts with the same role as this host, including this host itself]
$cluster_hosts_interleaved = [above list, re-interleaved/randomized on a per-DC basis like cron_splay does with NNNN numbers, but based on $::site]
... generate cron offset of this host, based on the stable ordering above ...

And then the interval support would need fixing, to support e.g. 30-minute intervals and other such arbitrary values. Currently it only does "hourly", "daily", and "weekly".
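The steps above could be sketched roughly as follows, in Python rather than as a Puppet parser function. This is only an illustration under assumptions: `interleave_by_site` and `cluster_offset_minutes` are hypothetical names, the cluster membership is passed in as a plain dict keyed by site, and hosts are ordered by hostname within a site instead of whatever ordering cron_splay actually uses. It is not the wmflib implementation.

```python
from itertools import zip_longest


def interleave_by_site(hosts_by_site):
    """Return a stable ordering that round-robins across sites.

    hosts_by_site maps a site name (e.g. the value of $::site) to the
    hosts of one cluster in that site. Sorting makes the result
    independent of input order, so it is stable between runs.
    """
    columns = [sorted(hosts_by_site[site]) for site in sorted(hosts_by_site)]
    ordered = []
    for row in zip_longest(*columns):
        ordered.extend(host for host in row if host is not None)
    return ordered


def cluster_offset_minutes(host, hosts_by_site, interval_minutes):
    """Minute offset of `host` within an arbitrary interval.

    Slots are spread evenly over the interval, so the interval is not
    limited to hourly/daily/weekly values.
    """
    ordered = interleave_by_site(hosts_by_site)
    slot = ordered.index(host)
    return (slot * interval_minutes) // len(ordered)


# Hypothetical cluster: four hosts over three sites, 30-minute interval.
cluster = {
    "eqiad": ["dns1001", "dns1002"],
    "codfw": ["dns2001"],
    "esams": ["dns3001"],
}
order = interleave_by_site(cluster)
offsets = [cluster_offset_minutes(h, cluster, 30) for h in order]
```

With the hypothetical cluster above, the ordering is `dns2001, dns1001, dns3001, dns1002` and the offsets are 0, 7, 15 and 22 minutes, so no two hosts of the cluster share a slot. The result only changes when the cluster membership changes.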

I agree with the principle, but we should also take into account the overall distribution of runs across the puppetmasters to avoid congestion, and be careful with the per-DC basis.

A few caveats that come to mind:

- To be deterministic (i.e. not change the crontab timing at each puppet run, at least as long as the size of the cluster stays the same), we could end up with an algorithm that, for any given cluster, always starts the first host at 0, for example, causing congestion on the puppetmasters at specific times in the hour. We should make sure the global distribution stays even when the cluster distribution is also taken into account.
- We have puppetmasters only in eqiad and codfw, so $::site doesn't help with the global distribution, because hosts in the other sites go to another site's puppetmaster.
  - The counter-argument here could be that the other sites are small enough that we can assume they cannot create congestion on the puppetmasters, as long as they are generally distributed.
- If we re-interleave/randomize on a per-DC basis, we might still not solve the DNS hosts problem: since they are in different DCs, they might again end up with the same time, if I have correctly understood your logic above.
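One possible way to address the first and last caveats, sketched in Python under assumptions: each cluster gets a stable phase derived from a hash of its name, so that not every cluster starts at minute 0, and the slots are assigned over the whole cluster regardless of site, so that hosts in different DCs never collide. The names `cluster_phase` and `host_minute` are hypothetical, and the sketch does not model which puppetmaster a host talks to.

```python
import hashlib


def cluster_phase(cluster_name, interval_minutes):
    """Stable per-cluster starting minute, derived from the cluster name."""
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % interval_minutes


def host_minute(cluster_name, ordered_hosts, host, interval_minutes):
    """Minute within the interval at which `host` runs the agent.

    ordered_hosts is the stable ordering of every host in the cluster,
    across all sites. Slots are evenly spaced, then shifted by the
    cluster's phase and wrapped around the interval.
    """
    slot = ordered_hosts.index(host)
    spacing = interval_minutes / len(ordered_hosts)
    phase = cluster_phase(cluster_name, interval_minutes)
    return (phase + int(slot * spacing)) % interval_minutes


# Hypothetical three-host cluster spanning three sites, 30-minute interval.
hosts = ["dns1001", "dns2001", "dns3001"]
minutes = [host_minute("authdns", hosts, h, 30) for h in hosts]
```

Here the three hosts always end up 10 minutes apart, wrapping around the 30-minute interval, while the absolute minutes depend on the cluster name. Whether hashing cluster names spreads the load on the puppetmasters evenly enough in practice would still need to be checked against the real set of clusters.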

fgiunchedi mentioned this in T171191: Should puppet auto-restart slapd?. Jul 21 2017, 10:10 AM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:08 PM

Krinkle added a project: Sustainability (Incident Followup). Sep 28 2021, 9:46 PM

jbond edited projects, added Puppet; removed SRE. Nov 4 2022, 11:36 AM

jbond subscribed.

Restricted Application added a project: Infrastructure-Foundations. Nov 4 2022, 11:36 AM

Although the principle still stands, I wonder whether this is still something we want to change, or whether time has shown that we can basically live without it.

Volans moved this task from Backlog to E_TOO_BIG_MAYBE_OKR? on the SRE-Sprint-Week-Sustainability-March2023 board. Mar 20 2023, 12:00 PM

joanna_borun edited projects, added Puppet-Core; removed Puppet. Jun 12 2023, 2:51 PM

joanna_borun closed this task as Invalid. Feb 12 2024, 3:56 PM

joanna_borun changed the task status from Invalid to Declined.
