ErrorBudgetBurn
Open, Needs Triage, Public

Description

Common information

  • alertname: ErrorBudgetBurn
  • recorder: thanos-rule@main
  • revision: 1
  • service: mpic
  • severity: warning
  • slo: xlab-standalone-event-system-success-rate-v1
  • source: thanos
  • team: experiment-platform

Firing alerts


  • alertname: ErrorBudgetBurn
  • exhaustion: 2w
  • long: 1d
  • recorder: thanos-rule@main
  • revision: 1
  • service: mpic
  • severity: warning
  • short: 2h
  • slo: xlab-standalone-event-system-success-rate-v1
  • source: thanos
  • team: experiment-platform

  • alertname: ErrorBudgetBurn
  • exhaustion: 4w
  • long: 4d
  • recorder: thanos-rule@main
  • revision: 1
  • service: mpic
  • severity: warning
  • short: 6h
  • slo: xlab-standalone-event-system-success-rate-v1
  • source: thanos
  • team: experiment-platform

Event Timeline

Note that Phaultfinder will keep creating duplicate tasks unless it can find a task with the exact title in the exact projects it wants. Cross-ref T395942

@RLazarus, would it be possible to add something to the subject line to make these sufficiently unique? I was thinking the slo field could go into the title. And does Phaultfinder then just add new entries as a new comment or something? I'm wondering whether this could become something of a perma-ticket that Experiment Platform keeps open and drags from sprint to sprint, moving it around the board whenever work is needed to analyze what's going on and possibly take further action.

I had renamed T412467: ErrorBudgetBurn (sticky-headers, part 2), which is why this duplicate was created. Before that, I had also renamed T412448: ErrorBudgetBurn (sticky-headers), which has a different (exhaustion, long, short) tuple, although I think that one is not resurfacing at present because the underlying velocity for that one was contained.

Oh, I see it updates the task description. Neat. I'd seen phaultfinder before but didn't understand its behavior beyond filing a ticket.

@dr0ptp4kt (And cc @herron) Good question!

As Pppery noted (not tagging to respect the unsubscribe) it's intended that re-firing alerts will go to the same task as long as there's one still open -- the thinking there is, it's not unusual for multiple alerts to fire in succession with a single underlying cause, and you'd probably rather not be flooded with new tasks about it, so instead new updates are added to the task description (see e.g. @phaultfinder's update just above this comment.)

The expected workflow is that when the underlying problem is fixed, the task gets resolved -- so any new alert represents a new problem and opens a new task, which is usually better for getting your attention.

That's the theory, anyway. And worth noting that all the above applies production-wide, for any alert that gets routed to phabricator, not just these new SLO burn alerts. And, in theory, theory is the same as practice, but in practice, sometimes they're different. :) So that's why it is the way it is, but we should find a workflow that makes sense for you.

A couple of specific mechanical points -- alerts will get deduped into any open task with the same title in the same project.

  • "Same title" means, as you found, renaming the task will always mean a new task gets generated on the next alert. (That can be good, if you want to repurpose the task for long-running followup work, while still getting a new task when something new comes up.)
  • "Open" means its status -- so e.g. when you moved T412448 into the Done column on the workboard, but didn't set its status to Resolved, it's still considered an open task for this purpose, and the only reason it didn't get reused was its new name. That matters because it might be easy to miss an alert that way -- if you get an SLO alert, fix the bug, slide the task to Done, and two weeks later a new alert shows up on the same task, are you sure you'd notice it?
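The matching rules above can be sketched roughly as follows. This is only an illustrative model of the behavior described in this thread, not Phaultfinder's actual code: the `Task` class, the `find_reusable_task` and `route_alert` helpers, and the status strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a Phabricator task."""
    title: str
    projects: frozenset
    status: str = "open"        # task status, e.g. "open" or "resolved"
    column: str = "Backlog"     # workboard column; deliberately ignored below
    updates: List[str] = field(default_factory=list)


def find_reusable_task(tasks, title, projects) -> Optional[Task]:
    """Return an open task with exactly this title and exactly these projects."""
    for task in tasks:
        if (task.status == "open"
                and task.title == title
                and task.projects == frozenset(projects)):
            return task
    return None


def route_alert(tasks, title, projects, alert) -> Task:
    """Append the alert to a matching open task, or file a new task."""
    task = find_reusable_task(tasks, title, projects)
    if task is None:
        task = Task(title=title, projects=frozenset(projects))
        tasks.append(task)
    task.updates.append(alert)
    return task
```

In this model, moving a task to the Done column changes only `column`, so the next alert still lands on it. Renaming the task or setting `status` to "resolved" is what causes a new task to be filed.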

I think we can customize which fields go into the title, to add the slo field in there as you suggest, but I don't know how exactly -- I see in alertmanager.yml.erb that there's a title urlparam but I can't immediately tell what its format is. @herron, any ideas?

Last thing, the exact format of the alert might change slightly as we shift the implementation from Pyrra to Sloth, but all the above is still basically true in the Sloth world. Thanks again for volunteering to try this all out.

Thanks @RLazarus! @herron thanks in advance for any guidance on the question on task titling from Alertmanager rules.

And, in theory, theory is the same as practice, but in practice, sometimes they're different.

:)

the only reason it didn't get reused was its new name

And the fact that it was in a milestone not the main project: T395942

I think we can customize which fields go into the title, to add the slo field in there as you suggest, but I don't know how exactly -- I see in alertmanager.yml.erb that there's a title urlparam but I can't immediately tell what its format is. @herron, any ideas?

{{ Label }} essentially

SLO alerts have a different set of labels that don't fully map to the default pattern, so I've opened T412965: Tune phaultfinder task titles for SLO alerts as a broader task to improve the title formatting on these.

Thank you @Pppery and @herron for the additional context and task filing.

Just a heads up: I'm going to attempt to silence the warning until after the holiday break, mainly so that people get no more messages than strictly necessary. I'll be OoO after Friday and back after the holiday break. We're seesawing around the 99.9% SLO, so we are in okay shape (although higher would be better, of course!). @RLazarus, I'll loop you in on a related IM where I'm trying to dig a little more into whether there are additional diagnostic pieces to consider.
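For anyone reading the (exhaustion, long, short) tuples in the description against the 99.9% figure, here is a rough sketch of the usual burn-rate arithmetic. It assumes a 4-week SLO window, which is not stated anywhere in this task, and the function names are hypothetical. The idea is that a burn-rate alert of this kind typically fires when the error rate over both the long and short windows exceeds the burn-rate factor times the error budget.

```python
from datetime import timedelta

SLO_WINDOW = timedelta(weeks=4)  # assumption: not stated in this task
OBJECTIVE = 0.999                # the 99.9% SLO mentioned in the thread


def burn_rate_factor(exhaustion, window=SLO_WINDOW):
    """How many times faster than 'exactly on budget' the budget is burning."""
    return window / exhaustion


def error_rate_threshold(exhaustion, objective=OBJECTIVE, window=SLO_WINDOW):
    """Error rate that would exhaust the whole budget within `exhaustion`."""
    budget = 1.0 - objective
    return burn_rate_factor(exhaustion, window) * budget


# The two exhaustion values from the firing alerts in the description.
for label, exhaustion in [("2w", timedelta(weeks=2)), ("4w", timedelta(weeks=4))]:
    print(f"exhaustion={label}: "
          f"factor={burn_rate_factor(exhaustion):.1f}, "
          f"error rate threshold={error_rate_threshold(exhaustion):.4%}")
```

Under that assumed window, the 4w alert corresponds to burning the budget at exactly the sustainable rate, which would be consistent with hovering right around the objective.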