VictorOps behavior on long-ack'd incidents
Closed, Resolved, Public

Description

Related to T258336 (db1082 crashed), and specifically to db1082's BBU, which failed (and paged) on Jul 18th and then failed again on Jul 25th, but on the 25th no page was issued via VO.

In fact the pages from the 25th deduplicated into the existing incidents on VO which were acknowledged but not resolved, specifically:

Normally incidents resolve themselves in VO when Icinga issues the recovery. In this case, however, notifications were disabled for db1082 shortly after the incident, so the recoveries never made it to VO to resolve the incidents of the 18th.

Non-exhaustive list of solutions (not mutually exclusive):

  1. "disable notifications" should still issue recoveries
  2. auto-resolve ack'd incidents after a threshold (options for 1h-24h are built in to VO)
  3. remember to manually resolve VO incidents
  4. auto-retrigger ack'd incidents after a threshold (options for 1h-24h are built in to VO). This will cause the alert to page again, until resolved.

Event Timeline

Is there a way to send reminders about unresolved incidents? Ideally not via a page :)
I am not sure about options #1 and #2.

Option #1 would defeat the whole point of disabling notifications, which is to avoid disturbing the pager.
Some incidents can take a long time to resolve, such as broken hardware that needs replacement, like the BBU issue. That raises the question of whether, once the initial issue is mitigated, the incident itself is resolved and what remains is just follow-up. In that case a reminder would be useful, something like:
"Hey, it's been 3 days since this incident, are you sure this is still an incident, or can it be closed and followed up somewhere else?"

A reminder might work! We'll ask VO about that possibility, e.g. an email when an incident stays open for more than X hours.

herron updated the task description.

I've updated the description to outline the auto-retrigger and auto-resolve options as VO offers them today.

IMO a good near-term balance (that errs on the side of being noisy) is a combination of #3 and #4.

Long-acked alerts would re-page after a timeout (maxes out at 24h afaict), and the attention this draws should result in manual resolution (#3) happening.

I've reached out to VO support: there is no such possibility at the moment (i.e. a report via email), although they have logged a feature request. We could still implement it ourselves using the VO API, though.
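For illustration, the selection logic behind such a home-grown reminder could look roughly like the sketch below. It deliberately leaves out fetching incidents from the VO API and sending the email, and the field names (`id`, `phase`, `started`) are hypothetical placeholders, not the actual shape of a VO API response.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_acked(incidents: List[Dict[str, str]], now: datetime,
                max_age: timedelta) -> List[Dict[str, str]]:
    """Return ack'd incidents that started more than max_age ago.

    The keys "phase" and "started" are assumed names for this sketch.
    """
    stale = []
    for incident in incidents:
        if incident["phase"] != "ACKED":
            continue  # only ack'd-but-unresolved incidents need a nudge
        started = datetime.fromisoformat(incident["started"])
        if now - started > max_age:
            stale.append(incident)
    return stale


def reminder_text(incident: Dict[str, str], now: datetime) -> str:
    """Build the body of a (non-paging) reminder for one stale incident."""
    started = datetime.fromisoformat(incident["started"])
    days = (now - started).days
    return (f"Incident {incident['id']} has been ack'd but unresolved for "
            f"{days} days: is it still an incident, or can it be closed "
            "and followed up elsewhere?")
```

A periodic job could call `stale_acked(...)` with the threshold X and deliver `reminder_text(...)` over email or chat, so the reminder never touches the pager.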

akosiaris triaged this task as Medium priority. Aug 7 2020, 8:54 AM
fgiunchedi renamed this task from "db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones" to "VictorOps behavior on long-ack'd incidents". Aug 11 2020, 8:43 AM

The current thinking is to try option #4: ack'd incidents in VO that haven't been resolved within X hours will re-trigger, using X = 12. The normal workflow is something like this:

  1. Icinga issues a CRITICAL to VO
  2. VO opens (or deduplicates) an incident, and escalates as needed
  3. The incident gets ack'd by the folk(s) that got paged
  4. The incident is worked on
  5. Icinga issues a RECOVERY which resolves the ack'd incident

In some cases the ack'd incident stays open because no RECOVERY arrives, notably when notifications are disabled for the host/service before the RECOVERY can come in. For the most part, though, incidents are auto-resolved under normal circumstances.
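The workflow above, and the way a missing RECOVERY interacts with deduplication, can be sketched as a toy state model. This is an illustration only, not VO's implementation: the `Pager` and `Incident` classes, their method names, the `replay` helper and the time of day are all hypothetical, and deduplication is simplified to "one open incident per entity".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Incident:
    entity: str
    state: str = "TRIGGERED"  # TRIGGERED -> ACKED; resolved ones are dropped
    acked_at: Optional[datetime] = None


@dataclass
class Pager:
    # Hypothetical stand-in for VO; None disables re-triggering.
    retrigger_after: Optional[timedelta] = None
    open_incidents: Dict[str, Incident] = field(default_factory=dict)
    pages: List[str] = field(default_factory=list)

    def critical(self, entity: str) -> None:
        # A CRITICAL for an entity with an open incident is deduplicated:
        # no new incident and no new page.
        if entity in self.open_incidents:
            return
        self.open_incidents[entity] = Incident(entity)
        self.pages.append(entity)

    def ack(self, entity: str, now: datetime) -> None:
        incident = self.open_incidents[entity]
        incident.state = "ACKED"
        incident.acked_at = now

    def recovery(self, entity: str) -> None:
        # A RECOVERY resolves (here: drops) the open incident, if any.
        self.open_incidents.pop(entity, None)

    def tick(self, now: datetime) -> None:
        # Option #4: ack'd incidents older than the threshold page again.
        if self.retrigger_after is None:
            return
        for incident in self.open_incidents.values():
            if (incident.state == "ACKED"
                    and now - incident.acked_at >= self.retrigger_after):
                incident.state = "TRIGGERED"
                incident.acked_at = None
                self.pages.append(incident.entity)


def replay(pager: Pager) -> int:
    """Replay the db1082 sequence; return how many pages went out."""
    first = datetime(2020, 7, 18)   # time of day is a placeholder
    pager.critical("db1082/BBU")    # Jul 18th: opens an incident and pages
    pager.ack("db1082/BBU", first)
    # Notifications get disabled here, so recovery() is never called.
    pager.tick(first + timedelta(hours=24))
    pager.critical("db1082/BBU")    # Jul 25th: deduplicated, no page
    return len(pager.pages)
```

In this model, `replay(Pager())` sends a single page because the second failure is swallowed by the still-open ack'd incident, while `replay(Pager(retrigger_after=timedelta(hours=24)))` sends a second page once the ack is a day old, which is the nudge towards manual resolution that option #4 relies on.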

Are we considering the retrigger to be something implemented for the prod SRE rotation only? On WMCS, we seem pretty ok with manually resolving (small group and all), and a second page wouldn't be something we'd be happy about, for sure. Our paging schedule is very "follow the sun except when you cannot" so a retrigger would likely page people who are sleeping for something that we just decided to fix in the morning, perhaps.

Good question re: SRE rotation only, I forgot to specify that the setting is unfortunately global per organization, hence it'll apply to all "wikimedia" rotations (the names VO uses for these things are "pop out of ack" and "auto resolve" for reference: https://help.victorops.com/knowledge-base/auto-resolve-pop-ack/).

I see your point re: triggering after 12h, perhaps we can start with 24h re-trigger (the max permitted) and see how we go.

@Bstorm how does the incident workflow look on your side ATM? I'm asking also because I see incidents for WMCS ack'd but not resolved from Jul 31st, and I would have expected those to be auto-resolved by icinga instead.

The "incidents" in there right now are all from imaging new hosts. Everything we page on seems to go off when we do that. I believe @Andrew was just leaving that until the paging stops for sure (and I'm not sure that was 100% over with?). We expect false alarms because we have a lot of hosts that we'd rather get false alarms from than have them not page. That's us fighting with icinga and the build script more than anything.

That said, our general rotation is to page the people working. If not acked, it will eventually page people awake. Then if *still* not acked, it will page everyone. So any retrigger in the middle of the night in both Europe and western North America will wake the whole team, potentially. 24 hours is less likely to cause a problem for us because at least it will be the same time! 🙂 If there's risk of retrigger, we would probably be as motivated to "resolve" as we are currently to "ack". I don't think it is common for us to leave alerts acked for long periods either way?

Thanks for the explanation! It seems to me that 24h would work, also given that long-ack'd incidents are not supposed to be the norm either way.

It seems to me we're good to go, so the plan is to turn on the "Pop-Out-Of-Ack after 24h" option early next week (i.e. Aug 17th).

Mentioned in SAL (#wikimedia-operations) [2020-08-18T07:45:19Z] <godog> VictorOps ack'd incidents will re-trigger after 24h if not resolved - T259465

fgiunchedi changed the task status from Open to Stalled. Aug 18 2020, 7:50 AM

The change is active now for the 'wikimedia' organization, stalling the task while waiting to see how this pans out!

fgiunchedi claimed this task.

This policy change has been in place for more than 6 months now and seems to work well (i.e. no incidents left acknowledged).