
Create basic alerts for isvcs to catch outages
Open, Needs Triage, Public, 1 Estimated Story Points

Description

At the moment we don't have reliable monitoring and alerting to detect when a service/isvc is completely down, as happened in T362503.

Progress on the SLO dashboard/alerting front depends on tasks like T351390, so we should have something to use in the meantime.

Overall steps:

  • Decide what metric(s) to alert on, the simpler the better.
  • Add generic monitors that check all isvcs

Event Timeline

Probably something like:

(sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))/
(sum by (destination_canonical_service) (rate(istio_requests_total{}[5m])))

The [5m] could likely be shorter, depending on what we put in the "for" clause of the alert definition. As for the acceptable rate of non-200 responses (and whether we want to treat 400s and 300s as non-errors), that is up for discussion, of course.
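For reference, a minimal sketch of what such a rule could look like in Prometheus alerting-rule syntax; the group name, alert name, 5% threshold, 15m "for" clause, and labels are all placeholders still to be decided:

groups:
  - name: lw_isvc_availability            # placeholder group name
    rules:
      - alert: LiftWingIsvcHighErrorRate  # placeholder alert name
        expr: |
          (sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))
          /
          (sum by (destination_canonical_service) (rate(istio_requests_total[5m])))
          > 0.05
        for: 15m                          # placeholder; how long the error rate must persist before firing
        labels:
          team: ml                        # placeholder routing labels
          severity: warning
        annotations:
          summary: "High non-200 rate for {{ $labels.destination_canonical_service }}"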

If the above query is too heavy, we could use multiple rules with explicit destination_canonical_service selectors. I am not sure how much that would help, but we might try it. It would be a lot more maintenance load, though. OTOH, it would give us the option to have different thresholds for different services.
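As an illustration, a per-service rule with an explicit selector might look like the following; the service name "example-isvc" and the 10% threshold are made-up placeholders:

(sum (rate(istio_requests_total{destination_canonical_service="example-isvc", response_code!="200"}[5m])))/
(sum (rate(istio_requests_total{destination_canonical_service="example-isvc"}[5m]))) > 0.10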

I've experimented a bit on Thanos, and arrived at this query:

(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code!~"(2|3|4).."}[5m])))/
(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m])))

Even with an evaluation/graphing window of a week, this still completes in ~5s, which I think is reasonable for our use case (we'll likely have a "for" clause on the order of an hour or less, and a query over that window takes ~2s to evaluate).

The ruwiki outage is clearly visible in the graph, with an error (500s) rate of >0.5 (i.e. 50%).

The above query could be modified to also alert on high rates of 3xx and 4xx codes (which are usually benign, but 90% of responses having such a code would indicate either a service problem or a malfunctioning client/attack).
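A possible variant for that case, reusing the same structure; the 0.9 threshold simply mirrors the 90% figure above and would need tuning:

(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code=~"(3|4).."}[5m])))/
(sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m]))) > 0.9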

There are two kinds of Istio metrics: the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the gateway metrics, since a misbehaving sidecar should be clearly visible there as well, and it would reduce the volume of metrics pulled even further. The gateway metrics should be distinguishable from the rest via the kubernetes_namespace="istio-system" label.
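If we go that route, the earlier query could be restricted to gateway-sourced series along these lines (assuming the kubernetes_namespace label is present on these metrics as described):

(sum by (destination_canonical_service) (rate(istio_requests_total{kubernetes_namespace="istio-system", destination_canonical_service!="unknown", prometheus="k8s-mlserve", response_code!~"(2|3|4).."}[5m])))/
(sum by (destination_canonical_service) (rate(istio_requests_total{kubernetes_namespace="istio-system", destination_canonical_service!="unknown", prometheus="k8s-mlserve"}[5m])))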

Change #1021417 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/alerts@master] team-ml: Add alerting rule for high error rate in LW services

https://gerrit.wikimedia.org/r/1021417

Change #1021417 merged by jenkins-bot:

[operations/alerts@master] team-ml: Add alerting rule for high error rate in LW services

https://gerrit.wikimedia.org/r/1021417

klausman set the point value for this task to 1. Tue, Apr 23, 2:07 PM
klausman set Final Story Points to 1.