
Define SLI/SLO for Lift Wing
Closed, Resolved (Public)

Description

Now that Lift Wing is increasingly taking the shape of something production-ready, we should probably think about defining SLIs and SLOs for the inference service.

The end result of this task should be:

Event Timeline

https://docs.google.com/document/d/1NspQtkfyuD_kiYCgms1gRZeFFiAaetnk/edit <- My thoughts so far, comments here or on the doc welcome.

There is nothing in the way of concrete numbers in the doc yet, mainly because:

  1. I am unsure what the services currently can do, latency- and load-wise
  2. It is unclear what the consumer/customer expectations are/will be

I think we could start by comparing with what others have done before us; see https://wikitech.wikimedia.org/wiki/SLO#Published_SLOs (maybe the API Gateway one?)

We could also involve SRE more directly, in order to understand what the preferred path forward is.

Should we start creating our own SLO page based on the template linked above? https://wikitech.wikimedia.org/wiki/SLO/template_instructions contains good instructions about what to follow, etc.

Yes, my plan was to elaborate on my write-up a bit (it's mostly for sorting my thoughts), and then use the template you mentioned and develop that into something like the API GW SLO (with plenty of SRE input).

Luca and I had a longer discussion via mail and IRC, about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not.

For now, the consensus is that we should treat services running on LW as black boxes, i.e. whatever the end-to-end latency may be is what counts as the SLI/SLO.

If, in the future, we need more differentiated SLIs, we can always define service-specific SLO thresholds (and custom SLIs, of course). I have updated the SLO draft accordingly.
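
For illustration, a black-box latency SLI could then be expressed directly on the mesh-level Istio histogram discussed further down in this task, along the lines of the sketch below. The 500ms threshold and the service name are placeholders, not agreed values:

# Fraction of requests to one isvc that complete within 500ms, measured at the
# Istio mesh level so the service is treated as a black box (end-to-end latency).
sum(rate(istio_request_duration_milliseconds_bucket{destination_canonical_service="arwiki-articletopic-predictor-default", le="500"}[5m]))
/
sum(rate(istio_request_duration_milliseconds_count{destination_canonical_service="arwiki-articletopic-predictor-default"}[5m]))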

https://grafana.wikimedia.org/goto/x7S0HpjVk?orgId=1 I've started an SLO dashboard here. It only has one metric (latency) so far, but it's a start.

Please also check things like https://wikitech.wikimedia.org/wiki/Grafana#Grizzly; IIRC SRE offers a template for SLO dashboards, so that we all have the same one, etc.

I am working in Grafana/Thanos directly for now because it's a shorter change-and-try loop for finding the right metrics than going through Grizzly. Even with templating, we still need specific metrics to put into slo_definitions.libsonnet. Arguably, that's the most difficult part, due to the sheer number of metrics surfaced even just by our own Prometheus instance.

But I agree that the final version of the SLO dashboard should be done in Grizzly.

I made some progress on the experimental dashboard (https://grafana.wikimedia.org/goto/VSolQfj4k?orgId=1). I now have a better mental model/grasp of request counts (and 200 vs. non-200 responses). The current setup is not ML-specific but rather Istio-specific, so it may not apply to all k8s setups, only to those where Istio is used.

The latency metrics are harder to get right since they're not simple counters but histograms.
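
For example (a sketch; the aggregation labels are just illustrative), a latency percentile can be derived from the histogram buckets with histogram_quantile():

# Approximate p99 latency per inference service, computed from the Istio
# request-duration histogram buckets.
histogram_quantile(0.99,
  sum by (destination_canonical_service, le) (
    rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
  )
)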

I've now also managed to add some latency bucketing. I'm not 100% sure yet that it's what we want, but in any case, it's progress.

Change 942448 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/grafana-grizzly@master] LiftWing/Revscoring: add first latency/availability dashboards

https://gerrit.wikimedia.org/r/942448

While experimenting with the SLO dashboard it became clear that the label cardinality of our input metrics is so high (>10k) that direct computation from the raw metrics is not feasible: the result set is so large that we get at best partial results.

The usual way around this is to have recording rules, which precompute the aggregate numbers at every rule evaluation, so that the SLx queries can then run against those.

So given e.g. this base metric:

istio_request_duration_milliseconds_bucket{
   app="arwiki-articletopic-predictor-default-00006",
   app_kubernetes_io_managed_by="Helm",
   app_wmf="kserve-inference",
   chart="kserve-inference-0.4.2",
   component="predictor",
   connection_security_policy="none",
   destination_app="arwiki-articletopic-predictor-default-00006",
*  destination_canonical_revision="arwiki-articletopic-predictor-default-00006",
*  destination_canonical_service="arwiki-articletopic-predictor-default",
   destination_cluster="Kubernetes",
   destination_principal="unknown",
   destination_service="arwiki-articletopic-predictor-default-00006-private.revscoring-articletopic.svc.cluster.local",
   destination_service_name="arwiki-articletopic-predictor-default-00006-private",
*  destination_service_namespace="revscoring-articletopic",
   destination_version="unknown",
   destination_workload="arwiki-articletopic-predictor-default-00006-deployment",
   destination_workload_namespace="revscoring-articletopic",
   instance="10.194.19.75:15020",
   job="k8s-pods",
   kubernetes_namespace="revscoring-articletopic",
   kubernetes_pod_name="arwiki-articletopic-predictor-default-00006-deployment-5d5q75pt",
*  le="+Inf",
   pod_template_hash="5d566c44f8",
   prometheus="k8s-mlserve",
   release="main",
   reporter="destination",
   request_protocol="http",
*  response_code="200",
*  response_flags="-",
   security_istio_io_tlsMode="istio",
   service_istio_io_canonical_name="arwiki-articletopic-predictor-default",
   service_istio_io_canonical_revision="arwiki-articletopic-predictor-default-00006",
   serving_knative_dev_configuration="arwiki-articletopic-predictor-default",
   serving_knative_dev_configurationGeneration="6",
   serving_knative_dev_configurationUID="2d606aec-5a3f-48a0-ab88-38f2c07f8a05",
   serving_knative_dev_revision="arwiki-articletopic-predictor-default-00006",
   serving_knative_dev_revisionUID="723c47eb-cf86-417c-8ba9-b12d7b203ae5",
   serving_knative_dev_service="arwiki-articletopic-predictor-default",
   serving_knative_dev_serviceUID="789ca7cc-6925-441d-8f32-67afac682cc1",
   serving_kserve_io_inferenceservice="arwiki-articletopic",
   site="codfw",
   source_app="istio-ingressgateway",
   source_canonical_revision="latest",
   source_canonical_service="istio-ingressgateway",
   source_cluster="Kubernetes",
   source_principal="unknown",
   source_version="unknown",
   source_workload="istio-ingressgateway",
   source_workload_namespace="istio-system"
}

we would likely sum by the labels marked with a *. Note that le is the label that distinguishes the latency histogram buckets (it stands for "less than or equal", i.e. the value of the metric is the number of requests that completed in at most le milliseconds). Most of the other labels are redundant or uninteresting.

With the above labels deemed interesting, we'd have recording rules like this:

# SLx: Buckets for latency
record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_bucket{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_bucket{}[90d]))

# SLx: Total requests for latency
record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_count{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_count{}[90d]))

# SLx: Request counts with status
record: destsvc_rev_ns_rc_rf:istio_requests_total:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags) (rate(istio_requests_total{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_requests_total:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags) (rate(istio_requests_total{}[90d]))

We have three base recording rules (buckets, total count, and requests by status code), each doubled into a 5m rate (to display the current graph on the SLO dashboard) and a 90d rate (for the actual SLO).

By convention, the recording rule name is [labels which are retained]:[original metric name]:[description of aggregation, omitting 'sum']

If we used the full label names, this would make for a very long name for the recorded metric, so I did some creative abbreviation, and omitted le, since it's implied in the histogram metric type.

It may seem that we could just sum the counters and then compute the rates at query time (reducing the number of recording rules from 6 to 3), but this has problems with counter resets and the like (the Prometheus blog has an article that covers this).
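
To make the difference concrete, here is a sketch (the destsvc:istio_requests_total:sum rule name is hypothetical, only used for illustration):

# Hypothetical "sum only" recording rule (NOT what we use):
#   record: destsvc:istio_requests_total:sum
#   expr: sum by (destination_canonical_service) (istio_requests_total)
#
# Querying it like this is subtly wrong: whenever a single pod restarts, the
# summed series drops, and rate() misinterprets the drop as a counter reset,
# producing spurious spikes:
rate(destsvc:istio_requests_total:sum[5m])

# The recording rules above instead apply rate() per underlying series first
# (so resets are handled correctly) and only then aggregate:
sum by (destination_canonical_service) (rate(istio_requests_total[5m]))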

Change 948149 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] prometheus: Add recording rules for istio traffic on k8s

https://gerrit.wikimedia.org/r/948149

Change 948149 merged by Klausman:

[operations/puppet@production] prometheus: Add recording rules for istio traffic on k8s

https://gerrit.wikimedia.org/r/948149

Change 942448 abandoned by Klausman:

[operations/grafana-grizzly@master] LiftWing/Revscoring: add first latency/availability dashboards

Reason:

superseded by 952226

https://gerrit.wikimedia.org/r/942448

Change 952226 had a related patch set uploaded (by Elukey; author: Klausman):

[operations/grafana-grizzly@master] LiftWing: add latency/availability SLO dashboards

https://gerrit.wikimedia.org/r/952226

Used https://wikitech.wikimedia.org/wiki/Grafana#Previewing_the_Change_('grr_preview') to get a preview of the dashboards in https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/952226/, and they look good. I am going to wait for the final +1 from Observability, and then I think we'll be ready to merge!

One nit: I increased the SLO for the revert risk and revscoring model servers to 98% (was: 95%), since we should be OK with a tighter threshold.

To put it in numbers: 98% means that we'll have an error budget of 2% over 3 months. We can always revisit the choice, but in theory we should strive for 99% in my opinion.

After a brainbounce with Ilias, we decided to have separate SLOs within the revertrisk namespace (i.e. differentiating between agnostic, multilingual, etc.).

I used the destination_canonical_service label in the grizzly config, that seems good enough for the use case.

Updated the code review and the SLO wikitech page for Lift Wing. The new proposal is the following:

  • We use the 95% SLO for experimental isvcs, like outlink or revert risk multilingual, and we observe how things evolve (namely whether we want a stricter SLO, etc.).
  • We use the 98% SLO for services like revscoring or revert risk, which we consider stable and not frequently changing. We'll aim for 99%+ in the future if possible.

This way we have an easy path for adding SLOs to new services, which can evolve over time if needed.

Change 952226 merged by Elukey:

[operations/grafana-grizzly@master] Lift Wing: add latency/availability SLO dashboards

https://gerrit.wikimedia.org/r/952226

Change 955355 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Add SLO definition for the ORES Legacy service

https://gerrit.wikimedia.org/r/955355

Change 955355 merged by Elukey:

[operations/grafana-grizzly@master] Add SLO definition for the ORES Legacy service

https://gerrit.wikimedia.org/r/955355

I am not 100% sure about the ores-legacy one, but see https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1

The main difference from the others is that I had to count 2xx, 3xx, and 4xx among the "good" responses, not only 2xx, since the calculation was totally off otherwise.
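
For reference, a sketch of what that availability ratio looks like, assuming the ores-legacy traffic is visible via the same Istio request metrics; the destination_canonical_service value here is a placeholder:

# Treat 2xx/3xx/4xx as "good"; everything else (essentially 5xx) counts against the SLO.
sum(rate(istio_requests_total{destination_canonical_service="ores-legacy", response_code=~"[234].."}[5m]))
/
sum(rate(istio_requests_total{destination_canonical_service="ores-legacy"}[5m]))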

I tried to review the exact metrics that we used in our SLO/SLI calculations, and something doesn't feel right. For istio we have the following recording rules:

- record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags, site, prometheus) (rate(istio_request_duration_milliseconds_bucket{}[5m]))
# SLx: Total requests for latency
- record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags, site, prometheus) (rate(istio_request_duration_milliseconds_count{}[5m]))
# SLx: Request counts with status
- record: destsvc_rev_ns_rc_rf:istio_requests_total:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags, site, prometheus) (rate(istio_requests_total{}[5m]))

Let's pick the SLO budget calculation, which is the following (from the repo's guidelines):

// To compute error budget remaining (*_slo_query), use one of these formulas,
// depending on whether a timeseries is available for good_events or bad_events.
//
//
//                                  ⎛  bad events  ⎞
//                                  ⎜ ──────────── ⎟
//                                  ⎝ total events ⎠
// pct error budget remaining = 1 - ────────────────
//                                    error budget
//
//                            = 1 - (bad_events / total_events) / error_budget
//
//
//                                   ⎛     good events  ⎞
//                                   ⎜ 1 - ──────────── ⎟
//                                   ⎝     total events ⎠
// pct error budget remaining = 1 - ─────────────────────
//                                       error budget
//
//                            = 1 - (1 - good_events / total_events) / error_budget
//
//
// - Substitute the appropriate timeseries for good/bad and total events, and the appropriate numeric
//   value for error budget.
// - Express the error budget as a fraction, not a percentage. For example, if the service has a
//   99.9% SLO target, fill in 0.001 for the error budget.
// - For the duration selector, we want one quarter; approximate it as "[90d]".

If we pick "total events" in our settings, we do:

(sum by (destination_service_namespace) (avg_over_time(destsvc_rev_ns_rc_rf:istio_requests_total:rate5m{response_code=~"...", site=~"$site", prometheus=~"k8s-mlserve.*", destination_service_namespace=~"revscoring.*"}[90d])))

I picked up the previous work-in-progress code review and took those values for granted, but an avg of a rate doesn't feel correct to me. I would have thought something like:

sum by (destination_service_namespace) (increase(istio_requests_total{response_code=~"...", site=~"eqiad", prometheus=~"k8s-mlserve.*", destination_service_namespace=~"revscoring.*"}[90d]))

The difference is:

{destination_service_namespace="revscoring-articletopic"}
19.819579877921402
{destination_service_namespace="revscoring-editquality-damaging"}
52.772535973215454
{destination_service_namespace="revscoring-editquality-goodfaith"}
81.24742727101449
{destination_service_namespace="revscoring-editquality-reverted"}
12.446285469673908
{destination_service_namespace="revscoring-articlequality"}
28.11326819083035
{destination_service_namespace="revscoring-draftquality"}
7.069326442998623
{destination_service_namespace="revscoring-drafttopic"}
6.05662922395137

vs

{destination_service_namespace="revscoring-articletopic"}
85587048.2284664
{destination_service_namespace="revscoring-editquality-damaging"}
313775622.43407017
{destination_service_namespace="revscoring-editquality-goodfaith"}
316124751.0298485
{destination_service_namespace="revscoring-editquality-reverted"}
73911111.41972806
{destination_service_namespace="revscoring-articlequality"}
102892004.36752039
{destination_service_namespace="revscoring-draftquality"}
25082004.62260486
{destination_service_namespace="revscoring-drafttopic"}
44900191.82625936

So something is off....
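
For reference, a sketch of what an increase-based recording rule could look like, following the same naming convention as above (this is illustrative, not necessarily the exact rule that landed in puppet):

- record: destsvc_rev_ns_rc_rf:istio_requests_total:increase5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags, site, prometheus) (increase(istio_requests_total{}[5m]))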

Change 955958 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] WIP: improve Lift Wing's SLO/SLI calculations

https://gerrit.wikimedia.org/r/955958

Change 956841 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos: add increase-based rec rules for Istio

https://gerrit.wikimedia.org/r/956841

Opened T346144 as related change for the dashboards :)

Change 956841 merged by Elukey:

[operations/puppet@production] profile::thanos: add increase-based rec rules for Istio

https://gerrit.wikimedia.org/r/956841

Change 955958 abandoned by Elukey:

[operations/grafana-grizzly@master] WIP: improve Lift Wing's SLO/SLI calculations

Reason:

https://gerrit.wikimedia.org/r/955958

Change 958897 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Improve ML team's SLO calculations

https://gerrit.wikimedia.org/r/958897

Change 958897 merged by Elukey:

[operations/grafana-grizzly@master] Improve ML team's SLO calculations

https://gerrit.wikimedia.org/r/958897

Last changes applied, we should be good to close!

Change 959695 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] slo_definitions: improve Lift Wing's service SLO/SLI calculations

https://gerrit.wikimedia.org/r/959695

Change 959695 merged by Elukey:

[operations/grafana-grizzly@master] slo_definitions: improve Lift Wing's service SLO/SLI availability

https://gerrit.wikimedia.org/r/959695