
Create basic alerts for isvcs to catch outages
Open, Needs Triage, Public, 1 Estimated Story Points

Description

At the moment we don't have reliable monitoring and alerting to detect when a service/isvc is completely down, as happened in T362503.

Progress on the SLO dashboard/alerting front depends on tasks like T351390, so we should have something to use in the meantime.

Overall steps:

  • Decide what metric(s) to alert on, the simpler the better.
  • Add generic monitors that check all isvcs

Event Timeline

Probably something like:

(sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))/
(sum by (destination_canonical_service) (rate(istio_requests_total{}[5m])))

The [5m] could likely be shorter, depending on what we put in the "for" clause of the alert definition. As for the acceptable rate of non-200 responses (and whether we want to treat 400s and 300s as non-errors), that is up for discussion, of course.
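For reference, a minimal sketch of what such a rule could look like in Prometheus alerting-rule syntax; the group name, alert name, 5% threshold, 15m "for" clause, and labels are all placeholders still to be decided:

groups:
  - name: lw_isvc_availability            # placeholder group name
    rules:
      - alert: LiftWingIsvcHighErrorRate  # placeholder alert name
        expr: |
          (sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))
          /
          (sum by (destination_canonical_service) (rate(istio_requests_total[5m])))
          > 0.05
        for: 15m                          # placeholder; how long the error rate must persist before firing
        labels:
          team: ml                        # placeholder routing labels
          severity: warning
        annotations:
          summary: "High non-200 rate for {{ $labels.destination_canonical_service }}"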

If the above query is too heavy, we could use multiple rules with explicit destination_canonical_service selectors. I am not sure how much that would help, but we might try it. It would be a lot more maintenance load, though. OTOH, it would give us the option to have different thresholds for different services.
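As an illustration, a per-service rule with an explicit selector might look like the following; the service name "example-isvc" and the 10% threshold are made-up placeholders:

(sum (rate(istio_requests_total{destination_canonical_service="example-isvc", response_code!="200"}[5m])))/
(sum (rate(istio_requests_total{destination_canonical_service="example-isvc"}[5m]))) > 0.10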

I've experimented a bit on Thanos, and arrived at this query:

(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code!~"(2|3|4).."}[5m])))/
(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m])))

Even with an evaluation/graphing window of a week, this still completes in ~5s, which I think is reasonable for our use case (we'll likely have a "for" clause on the order of an hour or less, and a query over that window takes ~2s to evaluate).

The ruwiki outage is clearly visible in the graph, with an error (500s) rate of >0.5 (i.e. 50%).

The above query could be modified to also alert on high rates of 3xx and 4xx codes (which are usually benign, but 90% of responses having such a code would indicate either a service problem or a malfunctioning client/attack).
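A possible variant for that case, reusing the same structure; the 0.9 threshold simply mirrors the 90% figure above and would need tuning:

(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code=~"(3|4).."}[5m])))/
(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m]))) > 0.9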

There are two kinds of Istio metrics: the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the gateway metrics, since a misbehaving sidecar should be clearly visible there as well, and it would reduce the volume of metrics pulled even further. The gateway metrics should be distinguishable from the rest via the kubernetes_namespace="istio-system" label.
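If we go that route, the earlier query could be restricted to gateway-sourced series along these lines (assuming the kubernetes_namespace label is present on these metrics as described):

(sum by (destination_canonical_service) (rate(istio_requests_total{kubernetes_namespace="istio-system", destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code!~"(2|3|4).."}[5m])))/
(sum by (destination_canonical_service) (rate(istio_requests_total{kubernetes_namespace="istio-system", destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m])))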

Change #1021417 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/alerts@master] team-ml: Add alerting rule for high error rate in LW services

https://gerrit.wikimedia.org/r/1021417

Change #1021417 merged by jenkins-bot:

[operations/alerts@master] team-ml: Add alerting rule for high error rate in LW services

https://gerrit.wikimedia.org/r/1021417

klausman set the point value for this task to 1. Tue, Apr 23, 2:07 PM
klausman set Final Story Points to 1.