
Lift Wing alerting
Closed, Resolved · Public · 5 Estimated Story Points

Description

As an engineer,

I want to enable alerts for the inference services deployed on Lift Wing,
so that I can be notified when something goes wrong and act upon it.

The initial focus could be more on "how" to do it rather than "what". That means that we can implement some alerts on our stack and decide on the appropriate communication channel (e.g. a specific IRC channel, Slack, etc.), and afterwards we can work on the specific alerts we want to implement as well as define severity levels (warning vs. critical).

Event Timeline

As part of the initial investigation we can create a group/team for the alerts in the operations/alerts repo and then create a dummy alert, to test the capabilities offered by our stack and how we want to interact with alerts as a team, rather than optimizing and discussing the query that will trigger the alert.
So an initial proposal could be:

  • Create a team-ml team in the alerts repo
  • Create an alert that fires when the SLO error budget for a revscoring model (could be any model) drops below 50%
  • Create an alert of severity warning when the SLO error budget drops below 70%
  • Automatically trigger task creation in our Phabricator board in any of these cases.
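
For reference, rules in the operations/alerts repo are standard Prometheus alerting rules, grouped by team. A minimal sketch of the dummy SLO alert proposed above could look like the following; the slo:errorbudget_remaining:ratio recording rule, the thresholds, and the labels are placeholders for illustration only and would need to be defined together with SRE:

groups:
  - name: team-ml-slo
    rules:
      - alert: LiftWingErrorBudgetLow
        # Hypothetical recording rule: the real SLO metrics do not exist yet.
        expr: slo:errorbudget_remaining:ratio{service=~"revscoring.*"} < 0.5
        for: 15m
        labels:
          team: ml
          severity: critical
        annotations:
          summary: "SLO error budget for a revscoring model has dropped below 50%"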

It would help if there were some Prometheus metric in staging that we could set manually, in order to test that alerts are triggered correctly. I'll investigate whether this is an option.

I think we should coordinate with SRE (@RLazarus, for example) before proceeding further with SLO alerting; we don't want to diverge from the SRE recommendations :)

My 2c: having an alert that fires at a certain threshold of budget burned is not useful, and it can't be "resolved" since the budget will never increase until we reset the time window (that is another problem with how the burned budget is calculated). We should concentrate on the rate of budget decrease: if it is too "steep" it can highlight an outage (and when we resolve the outage, the alert will recover as well).

One thing that is still not clear to me is what time window to use for the calculation, since the official one at the moment is a sliding window of 3 months (following quarters, but shifted earlier by a month). Coding this time window into alerts will need some thinking with SRE, which is why I'd propose not doing this alone :)

Sure! I agree, we'll follow what SRE does regarding SLO alerting.

We have some plans for SLO-based alerting in the pipeline, but nothing implemented yet.

The summary is that @elukey is exactly right, as ever: we'll alert on error budget burn rate. If your SLO allows for an error ratio of X%, then that adds up to Y total errors over the course of a quarter, so if you're consuming that error budget at a rate higher than (Y / 1 quarter) then you'll eventually violate your SLO, and you should be alerted before that happens so that you can correct it.
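
As a concrete illustration with made-up numbers: if the SLO allows a 0.1% error ratio and the service handles roughly 100 million requests in a quarter, then Y is about 100,000 errors, so consuming errors at a sustained rate faster than roughly 100,000 per quarter (around 45 per hour on average) will eventually violate the SLO.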

There are a couple of practical complications. One is that computing Y is difficult and involves foreknowledge of the future: the error rate SLO is a percentage, so the number of allowed errors in the quarter depends on how many total requests you'll serve. So to do this correctly, we don't pick a specific number for error rate and use it as an alerting threshold; instead we make projections for error rate and total traffic, calculate when the error budget will be exhausted if nothing changes, and alert if it's too soon.

Another complication is that the time window is really important. There are two different thresholds for action -- if your whole error budget will be consumed within a week, you should get a ticket alert or something similarly low-urgency; that doesn't need to be dealt with over a weekend. But if your error budget will be consumed within a couple hours, you should get paged so you can fix it right away. So we'll have a couple of different alerts with different thresholds, different time windows, and (@isarantopoulos is also exactly right) different channels and severities.
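
To illustrate the multi-window, multi-severity pattern described above, here is a sketch of what such rules could look like, assuming a hypothetical 0.1% error-ratio SLO and hypothetical recording rules; the burn-rate factors, severity labels, and names are all illustrative and not necessarily what SRE will implement:

groups:
  - name: team-ml-burnrate
    rules:
      - alert: LiftWingErrorBudgetFastBurn
        # Error ratio over the last hour is far above the sustainable rate
        # (the factor here is arbitrary): high urgency, page someone.
        expr: slo:error_ratio:rate1h{service=~"revscoring.*"} > 14 * 0.001
        for: 5m
        labels:
          team: ml
          severity: page
        annotations:
          summary: "Lift Wing is burning its error budget fast enough to exhaust it very soon"
      - alert: LiftWingErrorBudgetSlowBurn
        # Error ratio over the last six hours is moderately above the
        # sustainable rate: low urgency, a ticket is enough.
        expr: slo:error_ratio:rate6h{service=~"revscoring.*"} > 2 * 0.001
        for: 30m
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "Lift Wing will exhaust its error budget early at the current burn rate"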

Anyway TL;DR we're working on it, and we intend to support this centrally so that teams don't have to roll their own, but we don't have anything pre-built to offer you yet.

Change 958072 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/puppet@production] alertmanager: create ml team alerts

https://gerrit.wikimedia.org/r/958072

@RLazarus thanks a lot! We can wait, and we'd be happy to be the first beta-testers of the new alerts if that's ok with you!

@isarantopoulos the first alert that I would add is based on these metrics. When the Kafka consumer lag (in this case, Change Prop's) starts to increase, it means that something is going on and Lift Wing is misbehaving (returning 500s and forcing CP to retry, etc.). What do you think of this as a first use case?

@elukey We could do that, but I'll need to find the appropriate query for the alert.
On top of that I was thinking that we could add an alert when a pod reaches 80% of its memory limit.

> We could do that, but I'll need to find the appropriate query for the alert.

The metric is kafka_burrow_partition_lag{group=~"cpjobqueue-ORESFetchScoreJob"} (Thanos query URL)

> On top of that I was thinking that we could add an alert when a pod reaches 80% of its memory limit.

I'd use 90%, but yes, it seems like a good idea. The metric could be something like:

container_memory_usage_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"}

See the Thanos query.

Change 958072 merged by Elukey:

[operations/puppet@production] alertmanager: create ml team alerts

https://gerrit.wikimedia.org/r/958072

Change 962056 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/alerts@master] ml-alerts: add alert for increased ORESFetchScoreJob

https://gerrit.wikimedia.org/r/962056

isarantopoulos changed the task status from Open to In Progress. Sep 29 2023, 3:32 PM

I started by adding an alert for the following query, which I borrowed from the Jobqueue Grafana dashboard:

max(avg_over_time(cpjobqueue_normal_rule_processing{quantile="0.5",service="cpjobqueue", rule=~"ORESFetchScoreJob-mediawiki-job-ORESFetchScoreJob"}[15m]) * 1000) by (rule) > 1000

@elukey regarding the kafka_burrow_partition_lag: what threshold would be good for an alert? Above 0 for 15-30 minutes, or something similar?

> I started by adding an alert for the following query, which I borrowed from the Jobqueue Grafana dashboard:

> max(avg_over_time(cpjobqueue_normal_rule_processing{quantile="0.5",service="cpjobqueue", rule=~"ORESFetchScoreJob-mediawiki-job-ORESFetchScoreJob"}[15m]) * 1000) by (rule) > 1000

> @elukey regarding the kafka_burrow_partition_lag: what threshold would be good for an alert? Above 0 for 15-30 minutes, or something similar?

I would personally not add alerts on the rule processing: getting the right throughput threshold may be tricky and prone to error over time as the number of events increases/decreases. The lag should be fine in theory, since it is a clear sign that changeprop is slowed down (which is what we want to alert on, IIUC).

For the lag I would use a value like 100, sustained for more than an hour, to avoid alerting on spikes etc. I checked the metric in Thanos and we constantly have a lag above zero that auto-resolves and never stays up for more than a few minutes.
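
Putting that threshold and duration together with the metric, a rule along these lines could be a starting point (the expression, duration, and labels here are only a sketch; the rule that was eventually merged may differ):

groups:
  - name: team-ml-changeprop
    rules:
      - alert: ORESFetchScoreJobKafkaLag
        # Sustained consumer lag means Change Prop is falling behind, which
        # usually indicates Lift Wing is returning errors or responding slowly.
        expr: kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"} > 100
        for: 1h
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "Change Prop consumer lag for ORESFetchScoreJob has been above 100 for more than an hour"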

Ok! I have updated the alert by adding the Kafka consumer lag.
I also added one for container memory, using the following query to reflect 90% memory usage:

(container_memory_usage_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"} / container_spec_memory_limit_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"}) * 100 > 90
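
Wrapped as an alerting rule with a short "for:" clause so that brief spikes don't fire it, the query above might look roughly like this (alert name, duration, and labels are illustrative):

groups:
  - name: team-ml-memory
    rules:
      - alert: InfServiceHighMemoryUsage
        # Fires when a kserve-container has been above 90% of its memory limit
        # for a sustained period, i.e. it is close to being OOM-killed.
        expr: >-
          (container_memory_usage_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"}
          / container_spec_memory_limit_bytes{namespace=~"revscoring.*", container="kserve-container", prometheus="k8s-mlserve"})
          * 100 > 90
        for: 5m
        labels:
          team: ml
          severity: warning
        annotations:
          summary: "Inference service kserve-container memory usage is above 90% of its limit"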

Before proceeding with these alerts I suggest we also create some runbooks (even if they are super simple at the beginning). I can create them on the Lift Wing page on Wikitech.

Change 963724 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/alerts@master] team-ml: add alert for memory spike in inf services

https://gerrit.wikimedia.org/r/963724

For the Kafka lag: when I try the query kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"} in Thanos I see 2 topics for codfw (one of them for the retry job):

topic="codfw.mediawiki.job.ORESFetchScoreJob" and topic="codfw.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob".
Shall I use a sum by (exported_cluster)?

I have tried to set the alerting rule to sum by (exported_cluster) (kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}) > 100,
which would bring one entry per cluster (instead of two: one for the initial job and one for the retry), but I can't get the unit tests to pass, as the alert doesn't fire.
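
For what it's worth, the alerts in operations/alerts are exercised with promtool-style unit tests, and one common reason for an alert not firing in a test is that the synthetic series does not stay above the threshold for the whole "for:" duration before the chosen eval_time. A rough sketch of a test for a rule like the one above (file name, input series, and expected labels are illustrative):

rule_files:
  - alerts_ml.yaml   # illustrative file name for the rule under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The lag stays at 200 (above the 100 threshold) for two hours, so a
      # 1h "for:" clause is satisfied well before the evaluation time below.
      - series: 'kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob", topic="codfw.mediawiki.job.ORESFetchScoreJob", exported_cluster="main-codfw"}'
        values: '200+0x120'
    alert_rule_test:
      - eval_time: 90m
        alertname: ORESFetchScoreJobKafkaLag
        exp_alerts:
          - exp_labels:
              exported_cluster: main-codfw
              team: ml
              severity: warning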

Change 962056 merged by jenkins-bot:

[operations/alerts@master] team-ml: add alert for Kafka consumer lag for ores extension

https://gerrit.wikimedia.org/r/962056

calbon set the point value for this task to 5. Nov 2 2023, 7:05 PM
calbon triaged this task as Medium priority. Nov 2 2023, 7:25 PM

Change 975736 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/alerts@master] team-ml: add site to the ORES alert's dashboard link

https://gerrit.wikimedia.org/r/975736

Change 975736 merged by jenkins-bot:

[operations/alerts@master] team-ml: add site to the ORES alert's dashboard link

https://gerrit.wikimedia.org/r/975736

Change 963724 merged by jenkins-bot:

[operations/alerts@master] team-ml: add alert for memory spike in inf services

https://gerrit.wikimedia.org/r/963724

We have two alerts related to Lift Wing at the moment:

  • ORESFetchScoreJobKafkaLag: when this fires, it means there is a lag between messages landing in the Kafka topics and the rate at which Changeprop consumes them.
  • InfServiceHighMemoryUsage: this alert fires when the memory utilization of the kserve-container of an Inference Service is above 90% of the container limit for more than 5 minutes.

A more thorough description can be found in the runbooks.