Page MenuHomePhabricator

wait_for_optimal() should ignore acked alerts
Open, In Progress, MediumPublic

Description

I was running a cookbook which internally uses wait_for_optimal() (sre.maps.roll-restart), but all the hosts had a long-standing Icinga alert which was marked as acknowledged in Icinga (regarding Postgres replication lag). This leads to a situation that SREBatchBase attempts to retry wait_for_optimal() 15 times until one needs to approve manually to proceed.

I think it would be better if alerts marked as acknowlegded were not accounted for in wait_for_optimal().

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Interesting use case, I can see both pros and cons of doing that. I need also to check how easy is to detect the ack from the data exported.
Would "ask the user" be an option here? Basically doing:

  • poll for optimal, but if there are ACK'ed alerts stop as soon as all but the ACK'ed alerts are green
  • tell the user that there are ACK'ed alerts non green (with a list) and ask what to do: continue, wait for the ACK'ed too, abort
  • wait poll again waiting until it's all green

Interesting use case, I can see both pros and cons of doing that. I need also to check how easy is to detect the ack from the data exported.
Would "ask the user" be an option here? Basically doing:

  • poll for optimal, but if there are ACK'ed alerts stop as soon as all but the ACK'ed alerts are green
  • tell the user that there are ACK'ed alerts non green (with a list) and ask what to do: continue, wait for the ACK'ed too, abort
  • wait poll again waiting until it's all green

Agreed, that would also be a good solution and might also catch cases where the acked issue is known to be temporary (or where we believe it would resolve after a reboot issued via SREBatchBase)

In spicerack we'll add a skip_acked=False to the wait_for_optimal and "acked" properties to HostStatus and HostsStatus datatypes.
When skip_acked we'll disregard the regular "optimal" check and do a new check that specifically looks for non-acked alerts. Setting skip_acked will also skip recheck_failed_services(), as we except the wait_for_optimal to always be called at least once with the new argument set to False. This should become more obvious when used in the cookbooks.

If encounter a optimal == False, but have acked alerts, an exception is raised, which can be handled in the cookbooks.

In the cookbooks, when encountering the exception that we're in a non optimal state, but have acked alerts, the user will be prompted to ignore the acked alerts or abort. If the user chooses to ignore the alerts, rerun wait_for_optimal with skip_acked set to True.

Change 840128 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/software/spicerack@master] icinga: allow wait_for_optimal to ignore ack'ed alerts.

https://gerrit.wikimedia.org/r/840128

I thought i would bring my response here.

Setting skip_acked will also skip recheck_failed_services()

Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call recheck_failed_services(), if we dont call this method between retries then there the retry decorator is just wasting cycles when we call wait_for_optimal(True) . If we want to first check for all services (with retry) and then do some one of check for only unacknowledged checks then it would be better to move check to an instance method and call it without the retry decorator for the skip_acked=False scenario. e.g.

class IcingaHosts:

  def check(self, skip_acked=False):
      # move the current check comand here
  

  def wait_for_optimal(self, skip_acked=False):
     self.recheck_failed_services()
    # uill leave it to riccardo to make the following prettier
    retry(tries=15,delay=timedelta(seconds=3),backoff_mode="linear",exceptions=(IcingaError,)(self.check)(skip_acked)

Then in the cookbook we just do the following

icinag.wait_for_optimal()
#  
icinga.check(False)

as we except the wait_for_optimal to always be called at least once with the new argument set to False.

I dont see why, we either care about acked alerts or we dont, and in most case i think we probably dont care. As such i would say we can change the default behaviour in cookbooks such as reimage, to just call wait_for_optimal(True), exclude acked alerts from the checking logic and completely ignore acked alerts.

I thought i would bring my response here.

Setting skip_acked will also skip recheck_failed_services()

Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call recheck_failed_services(), if we dont call this method between retries then there the retry decorator is just wasting cycles when we call wait_for_optimal(True) . If we want to first check for all services (with retry) and then do some one of check for only unacknowledged checks then it would be better to move check to an instance method and call it without the retry decorator for the skip_acked=False scenario. e.g.

ignore all this i hit send and then ralised its check and not wait_for_optimal being decorated Doh

as we except the wait_for_optimal to always be called at least once with the new argument set to False.

I dont see why, we either care about acked alerts or we dont, and in most case i think we probably dont care. As such i would say we can change the default behaviour in cookbooks such as reimage, to just call wait_for_optimal(True), exclude acked alerts from the checking logic and completely ignore acked alerts.

TBH i think this is still valid. The prosed solution is to

  • call wait_for_optimal(False)
  • if errors prompt the user to continue
  • call wait_for_optimal(True)

however im not sure how this is that much different then the current solution of

  • call wait_for_optimal()
  • if error prompt the user to continue

It seems to me that the original proposal was to make it quicker and potential reduce the manual prompts for hosts that have acknowledged alerts which to me seems like the right thing to do

I understand your concerns

Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call recheck_failed_services(), if we dont call this method between retries then there the retry decorator is just wasting cycles when we call wait_for_optimal(True)

The original idea was that we don't want to ignore ack'ed alerts blindly, so that a cookbook would first best effort try to wait for optimal and if only ack'ed alerts are there ask for confirmation to the user if it's ok to proceed or not.
Now in order to reach that we have two options, one is to put the whole logic inside spicerack, that actually would be easier, but would make wait_for_optimal an interactive method and I'm not sure we want that.
The other option is to expose something that allows a cookbook to catch an exception and show the interactive part.
We went for the latter, but that's of course up for discussion.

If we want to keep things simpler, we could at this point just make wait_for_optimal() raise a different exception based on the ack'ed or modify the IcingaCheckError one to have a acked_only:bool parameter that a cookbook could check.
Then the cookbook will just call a plain wait_for_optimal() and then decide what to do.
Even without changing the behaviour of the wait_for_optimal() the cookbook could just check icinga_hosts.get_status().unacknowledged_failed_services and decide what to do based on that one.

SLyngshede-WMF changed the task status from Open to In Progress.Oct 11 2022, 12:49 PM
SLyngshede-WMF moved this task from Backlog to In Progress on the Infrastructure-Foundations board.

The original idea was that we don't want to ignore ack'ed alerts blindly

Im not sure this was the original idea going from the task description

edit: to avoid putting words in simons mouth or missintepriting them,. i have come across issues before where alerts there are acked, i know they are acked , i know that's fine and i still want to be able to reimage a host without any manual steps or waiting the additional ~45 seconds for the wait_for_optimal to fail because of failed acked services. happy to discuss this in a separate task if people feel its a different ask

I think it would be better if alerts marked as acknowlegded were not accounted for in wait_for_optimal().

TBH i agree with this statement i think that in most cases we should ignore acked alerts blindly, thats the reason we ack them in the first place

so that a cookbook would first best effort try to wait for optimal and if only ack'ed alerts are there ask for confirmation to the user if it's ok to proceed or not.

i dont really see how that is much of an improvement to how it works now, in both cases:

  • you have to wait for all the retries
  • you have to preform some manual action

Change 840128 merged by Slyngshede:

[operations/software/spicerack@master] icinga: allow wait_for_optimal to ignore ack'ed alerts.

https://gerrit.wikimedia.org/r/840128

Spicerack v6.2.1 ships with this new feature, and it works as expected. One small improvements that we should add below.
I think we should add support for skip_acked=True to recheck_failed_services() too so that we when forcing a recheck we do it only for the failing services that have not been ack'ed if skip_acked=True.