
Review check_puppetrun frequency
Closed, Resolved · Public

Description

"Puppet last run" status is currently checked with a 1 minute check interval, 1 minute retry interval, and 3 max check. So it should alert about 3 minutes after a failed puppet run. It's also checking many times in-between runs. Tuning this could help reduce the number of service checks icinga needs to perform.

  • How often are run failure alerts actionable, i.e. not false positives related to maintenance or issues we are already aware of?
  • Could the check interval be aligned with the puppet run interval of 30 minutes? They would not be synchronized, but perhaps feedback of recent run failure is sufficient.
  • Could this check alert after some number of consecutive run failures? This may help reduce non-actionable alerts where someone was already actively fixing the issue. A longer time window could allow the fix to happen before alerting.

Event Timeline

Indeed, 1 minute may be a bit excessive. I'm also not sure of the point of doing 3 checks spaced by 1 minute before alerting either -- that feels useless unless I'm missing something.

Setting the interval to 30 minutes wouldn't be great though: since the two intervals (puppet run and icinga's) are randomly allocated and not synchronized with each other, this means that we'll have up to 30 minutes of delay between failure and alerting (or, equivalently, between recovery and alerting).

An alternative option could be to make this check passive, with a freshness threshold of around 35m and the data pushed directly by run-puppet-agent/puppet-run after each run. If the check is stale (no data received by icinga within the threshold) then an active check can be performed automatically, allowing us (I think) to keep the current warning/critical logic for the last puppet run.

So far I've seen only frack checks being passive (for obvious reasons), so I don't know if passive checks within prod were discarded for some reason I'm missing.
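
For context, a rough sketch (Python; the command file path, host, and service names are assumptions, not our actual setup) of the kind of passive submission this would imply. The PROCESS_SERVICE_CHECK_RESULT external command has to land in the command file on the icinga server itself, so each agent would still need some transport (NSCA or similar) to get it there, and the service definition would need check_freshness enabled with a freshness_threshold of ~35 minutes:

```python
import time

# Assumed location of icinga's external command file (a FIFO on the icinga server).
ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

def submit_passive_puppet_result(host, status, output, service="puppet last run"):
    """Record a passive result for the puppet service.
    status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    line = "[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{svc};{status};{out}\n".format(
        ts=int(time.time()), host=host, svc=service, status=status, out=output)
    # This only works locally on the icinga server; agents would need
    # NSCA or another transport to deliver the line after each puppet run.
    with open(ICINGA_CMD_FILE, "w") as cmd_file:
        cmd_file.write(line)

# e.g. called by run-puppet-agent/puppet-run once the run finishes:
# submit_passive_puppet_result("mw1001", 0, "OK: puppet ran 30 seconds ago")
```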

> An alternative option could be to make this check passive, with a freshness threshold of around 35m and the data pushed directly by run-puppet-agent/puppet-run after each run. If the check is stale (no data received by icinga within the threshold) then an active check can be performed automatically, allowing us (I think) to keep the current warning/critical logic for the last puppet run.

No, please let's not go there. Passive checks in icinga are already a pain, and we had pretty much the described setup in the past (there was SNMP involved and SNMP traps sent to the host, which added to the pain, but still) and managed to run away from it a few years ago. I am all for diminishing the amount of false positives, and this is probably gonna go the other way.

> Indeed, 1 minute may be a bit excessive. I'm also not sure of the point of doing 3 checks spaced by 1 minute before alerting either -- that feels useless unless I'm missing something.

It's about avoiding false positives from transient network failures and making sure the failure is not a transient one.

> Setting the interval to 30 minutes wouldn't be great though: since the two intervals (puppet run and icinga's) are randomly allocated and not synchronized with each other, this means that we'll have up to 30 minutes of delay between failure and alerting (or, equivalently, between recovery and alerting).

Agreed. We can increase it from 1min (it's a bit excessive) but I would not go above 5 mins.

In T173427#3591926, @akosiaris wrote:
> It's about avoiding false positives from transient network failures and making sure the failure is not a transient one.

This, but about puppet runs, is what I was looking to accomplish (crudely) with the 30 minute interval. Avoid critical alerting on transient puppet issues by requiring multiple consecutive run failures.

> In T173427#3591926, @akosiaris wrote:
>> It's about avoiding false positives from transient network failures and making sure the failure is not a transient one.
>
> This, but about puppet runs, is what I was looking to accomplish (crudely) with the 30 minute interval. Avoid critical alerting on transient puppet issues by requiring multiple consecutive run failures.

This is not gonna be so easy. Transient puppet issues take at least a puppet run interval to resolve and are unfortunately indistinguishable from human-introduced issues, which are something we want to catch, and catch relatively fast, as they are potentially blocking multiple people.

To give an example of why these are currently indistinguishable, imagine the following 2 scenarios:

  • A compilation failure issue introduced due to a bad change by a human
  • The known but not currently solvable "Attempt to assign to a reserved variable name: 'trusted'" failure, aka T153246

Ideally we would like to alert quickly on the 1st and ignore the 2nd, but we can't really do that in a nice way: aside from parsing the puppet agent stdout and figuring out which of the 2 cases above we are dealing with, there isn't really another way of detecting it, and that approach is error prone and bound to break at some point. Instead we kind of choose to ignore it until we are ready to upgrade to newer puppetdb and terminus versions, which reportedly fix it.

To answer a few of the questions posted above as best as I can:

> How often are run failure alerts actionable, i.e. not false positives related to maintenance or issues we are already aware of?

I don't think we have numbers on those. It would be nice if we did.

> Could the check interval be aligned with the puppet run interval of 30 minutes? They would not be synchronized, but perhaps feedback of recent run failure is sufficient.

No, the check_interval cannot be aligned with the puppet run. The scheduling of checks is internal to icinga and the scheduling of puppet runs is our own implementation. We could go around and force a rescheduling of the check on the icinga server, but that would require every server to be able to submit commands to the icinga server, which is not wise from a security point of view.
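
(For illustration only: the forced reschedule would mean every host getting a line like the one below into the icinga server's external command file right after its own puppet run, and it is exactly that remote command submission path that is the security concern. The host and service names are placeholders.)

```python
import time

# SCHEDULE_FORCED_SVC_CHECK;<host_name>;<service_description>;<check_time>
# Every agent would need a way to deliver this to the icinga server's
# command file after its own puppet run, i.e. remote command submission
# from every server in the fleet.
now = int(time.time())
print("[{0}] SCHEDULE_FORCED_SVC_CHECK;mw1001;puppet last run;{0}".format(now))
```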

> Could this check alert after some number of consecutive run failures? This may help reduce non-actionable alerts where someone was already actively fixing the issue. A longer time window could allow the fix to happen before alerting.

The summary/last_run YAML files carry no history between runs, so it would be up to the check to store that information. And statefulness in icinga checks introduces complexity, and with it bugs that, at least in my experience, tend to increase the false positives. In any case, for this to be useful we would have to align the checks with the puppet runs anyway, which is already pretty difficult as I pointed out above.
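
(To make the statefulness point concrete, here is a sketch, not a proposal, of what a consecutive-failure check would have to carry around; the file paths and the "any failure event means a failed run" simplification are assumptions.)

```python
import yaml  # PyYAML

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # describes the last run only
STATE = "/var/tmp/check_puppetrun.failures"              # state the check itself has to own

def consecutive_failures(threshold=3):
    """Return (count, alert). The summary file only describes the most
    recent run, so counting consecutive failures means persisting a
    counter between check invocations."""
    with open(SUMMARY) as f:
        summary = yaml.safe_load(f) or {}
    run_failed = summary.get("events", {}).get("failure", 0) > 0  # simplification

    try:
        with open(STATE) as f:
            count = int(f.read().strip() or 0)
    except (IOError, ValueError):
        count = 0

    count = count + 1 if run_failed else 0
    with open(STATE, "w") as f:
        f.write(str(count))

    # NB: the check fires more often than puppet runs, so a real version
    # would also have to remember time['last_run'] to avoid counting the
    # same failed run several times -- exactly the kind of extra state
    # and edge cases described above.
    return count, count >= threshold
```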

I think the proposal is to bump the check interval from 1 minute to 5 minutes, right? Any other actionables here?

Yes, let's just bump the check interval a bit. 5 mins sounds OK, and given the on-demand nature of check scheduling for failing services on hosts, we can be pretty sure we won't lose the state anyway.

Change 381982 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] check_puppetrun: Execute less often

https://gerrit.wikimedia.org/r/381982

Change 381982 merged by Alexandros Kosiaris:
[operations/puppet@production] check_puppetrun: Execute less often

https://gerrit.wikimedia.org/r/381982

Change merged, resolving.