
Lift Wing alerting
Open, Needs Triage, Public

Description

As an engineer,

I want to enable alerts for inference-services deployed on Lift Wing,
so that I can be notified when something goes wrong and act upon it.

The initial focus could be more on the "how" than the "what". That means we can implement some alerts on our stack and decide on the appropriate communication channel (e.g. a specific IRC channel, Slack, etc.), and afterwards we can work on the specific alerts we want to implement as well as define severity levels (warning vs. critical).
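For the channel and severity question, a minimal Alertmanager routing sketch is below, assuming the usual team/severity labels; the receiver names are purely illustrative and not the actual production configuration:

route:
  routes:
    # Illustrative only: route ML-team warnings to a task-creating receiver
    # and critical alerts to an IRC-notifying receiver. Both receiver names
    # are placeholders.
    - match:
        team: ml
        severity: warning
      receiver: ml-team-task
    - match:
        team: ml
        severity: critical
      receiver: ml-team-irc

Keeping the channel decision in routing (rather than in each alert rule) means the rules themselves only need to carry team and severity labels.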

Event Timeline

As part of the initial investigation we can create a group/team for the alerts in the operations/alerts repo and then create a dummy alert to test the capabilities offered by our stack and how we want to interact with alerts as a team, rather than optimizing and discussing the query that will trigger the alert.
So an initial proposal could be:

  • Create a team team-ml in the alerts repo
  • Create an alert that fires when the SLO error budget for a revscoring model (it could be any model) drops below 50%
  • Create an alert of severity warning when the SLO error budget drops below 70%
  • Automatically trigger task creation on our Phabricator board in any of these cases.
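To make the proposal concrete, here is a rough sketch of what such a rule file in operations/alerts could look like, assuming standard Prometheus rule-file syntax; the slo:error_budget_remaining:ratio recording rule is hypothetical and would have to be defined (or replaced by whatever SRE settles on):

groups:
  - name: liftwing_slo
    rules:
      # Hypothetical recording rule: fraction of the quarterly error budget
      # still available for Lift Wing.
      - alert: LiftWingErrorBudgetLow
        expr: slo:error_budget_remaining:ratio{service="liftwing"} < 0.5
        for: 15m
        labels:
          team: ml
          severity: critical
        annotations:
          summary: "Lift Wing SLO error budget has dropped below 50%"
      - alert: LiftWingErrorBudgetWarning
        expr: slo:error_budget_remaining:ratio{service="liftwing"} < 0.7
        for: 15m
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "Lift Wing SLO error budget has dropped below 70%"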

It would help if there were some Prometheus metric in staging that we could set manually in order to test that alerts are triggered correctly. I'll investigate whether this is an option.
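An alternative that does not require a settable metric is a rule whose expression always returns a result, e.g. vector(1); it fires as soon as it is deployed, which is enough to exercise routing and the Phabricator task creation (the names below are placeholders):

groups:
  - name: team_ml_test
    rules:
      # vector(1) always produces a sample, so this alert fires immediately;
      # it is only useful for verifying routing and notifications.
      - alert: MLTeamTestAlert
        expr: vector(1)
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "Dummy alert to verify team-ml alert routing"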

I think that we should coordinate with SRE (@RLazarus for example) before proceeding further with SLO alarming; we don't want to deviate from the SRE recommendations :)

My 2c: having an alert that fires at a certain threshold of budget burned is not useful, and it can't be "resolved", since the budget will never increase until we reset the time window (that is another problem with how the budget burned is calculated). We should concentrate on the rate of budget decrease: if it is too "steep" it can highlight an outage (and when we resolve the outage, the alert will recover as well).

One thing that is still not clear to me is what time window to use for the calculation, since the official one at the moment is a sliding window of 3 months (following quarters, but shifted earlier by a month). Coding this time window into alerts will need some thinking with SRE, which is why I'd propose not doing this alone :)

Sure! I agree; we'll follow what SRE does regarding SLO alarming.

We have some plans for SLO-based alerting in the pipeline, but nothing implemented yet.

The summary is that @elukey is exactly right, as ever: we'll alert on error budget burn rate. If your SLO allows for an error ratio of X%, then that adds up to Y total errors over the course of a quarter, so if you're consuming that error budget at a rate higher than (Y / 1 quarter) then you'll eventually violate your SLO, and you should be alerted before that happens so that you can correct it.

There are a couple of practical complications. One is that computing Y is difficult and involves foreknowledge of the future: the error rate SLO is a percentage, so the number of allowed errors in the quarter depends on how many total requests you'll serve. So to do this correctly, we don't pick a specific number for error rate and use it as an alerting threshold; instead we make projections for error rate and total traffic, calculate when the error budget will be exhausted if nothing changes, and alert if it's too soon.

Another complication is that the time window is really important. There are two different thresholds for action -- if your whole error budget will be consumed within a week, you should get a ticket alert or something similarly low-urgency; that doesn't need to be dealt with over a weekend. But if your error budget will be consumed within a couple hours, you should get paged so you can fix it right away. So we'll have a couple of different alerts with different thresholds, different time windows, and (@isarantopoulos is also exactly right) different channels and severities.
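As a shape-only illustration of those two alerts (every metric name below is a hypothetical recording rule, and the exact thresholds and windows are precisely the part to work out with SRE): compute the projected time until the budget runs out from the remaining budget and the current error rate, then file a task if it is under a week and page if it is under a couple of hours.

groups:
  - name: liftwing_error_budget_burn
    rules:
      # Hypothetical recording rules:
      #   liftwing:error_budget_remaining:errors - allowed errors left in the current quarter
      #   liftwing:errors:rate1h / rate5m        - current error rate (errors per second)
      # remaining / rate = projected seconds until the budget is exhausted
      # if nothing changes.
      - alert: LiftWingErrorBudgetExhaustionSoon
        expr: (liftwing:error_budget_remaining:errors / liftwing:errors:rate1h) < 7 * 24 * 3600
        for: 1h
        labels:
          team: ml
          severity: warning  # low urgency: task / IRC, no page
        annotations:
          summary: "Lift Wing error budget projected to run out within a week"
      - alert: LiftWingErrorBudgetExhaustionImminent
        expr: (liftwing:error_budget_remaining:errors / liftwing:errors:rate5m) < 2 * 3600
        for: 10m
        labels:
          team: ml
          severity: critical  # page
        annotations:
          summary: "Lift Wing error budget projected to run out within two hours"

The traffic-projection part (how many allowed errors the remaining quarter actually corresponds to) is hidden inside the recording rule, which is where most of the complexity mentioned above would live.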

Anyway TL;DR we're working on it, and we intend to support this centrally so that teams don't have to roll their own, but we don't have anything pre-built to offer you yet.

Change 958072 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/puppet@production] alertmanager: create ml team alerts

https://gerrit.wikimedia.org/r/958072

@RLazarus thanks a lot! We can wait and be the first beta testers of the new alerts if that's ok with you!

@isarantopoulos The first alert that I would add is based on these metrics. When the Kafka consumer lag (in this case, for Change Prop) starts to increase, it means that something is going on and Lift Wing is misbehaving (returning 500s and forcing CP to retry, etc.). What do you think of this as a first use case?

@elukey We could do that, but I'll need to find the appropriate query for the alert.
On top of that I was thinking that we could add an alert when a pod reaches 80% of its memory limit.

> @elukey We could do that, but I'll need to find the appropriate query for the alert.

The metric is kafka_burrow_partition_lag{group=~"cpjobqueue-ORESFetchScoreJob"} (Thanos query URL)
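A sketch of a first alert on that metric, with a placeholder threshold and duration that would need tuning against normal ChangeProp lag for this consumer group:

groups:
  - name: liftwing_changeprop
    rules:
      # Threshold and duration are placeholders, not tuned values.
      - alert: LiftWingChangePropConsumerLag
        expr: sum(kafka_burrow_partition_lag{group=~"cpjobqueue-ORESFetchScoreJob"}) > 1000
        for: 15m
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "ChangeProp consumer lag for ORESFetchScoreJob is high; Lift Wing may be returning errors"

Alerting on the rate of increase (e.g. deriv() over the lag) instead of a static threshold is another option, closer to the "starts to increase" framing above.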

> On top of that I was thinking that we could add an alert when a pod reaches 80% of its memory limit.

I'd use 90%, but yes, it seems like a good idea. The metric could be something like:

container_memory_usage_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"}

See the Thanos query.
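Assuming the limit is exposed via the matching cAdvisor metric container_spec_memory_limit_bytes, a sketch of the 90% alert could look like the following (the prometheus="k8s-mlserve" label comes from the Thanos view and may not be present where the rule is evaluated, so it is dropped here):

groups:
  - name: liftwing_memory
    rules:
      # Memory usage as a fraction of the configured limit for the kserve container.
      - alert: LiftWingKserveContainerMemoryHigh
        expr: |
          (
            container_memory_usage_bytes{namespace=~"revscoring.*", container="kserve-container"}
            /
            container_spec_memory_limit_bytes{namespace=~"revscoring.*", container="kserve-container"}
          ) > 0.9
        for: 10m
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "kserve-container memory usage is above 90% of its limit"

container_memory_working_set_bytes may track OOM behaviour more closely than container_memory_usage_bytes (which includes page cache), so it could be worth considering as well.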

Change 958072 merged by Elukey:

[operations/puppet@production] alertmanager: create ml team alerts

https://gerrit.wikimedia.org/r/958072