
Define SLI/SLO for Lift Wing
Closed, Resolved (Public)

Description

Now that Lift Wing is increasingly taking the shape of something production-ready, we should probably think about defining SLIs and SLOs for the inference service.

The end result of this task should be:

Event Timeline

https://docs.google.com/document/d/1NspQtkfyuD_kiYCgms1gRZeFFiAaetnk/edit <- My thoughts so far, comments here or on the doc welcome.

There is nothing in the way of concrete numbers in the doc yet, mainly because:

  1. I am unsure what the services currently can do, latency- and load-wise
  2. It is unclear what the consumer/customer expectations are/will be

I think we could start by comparing with what others have done before us; see https://wikitech.wikimedia.org/wiki/SLO#Published_SLOs (maybe the API Gateway one?)

We could also involve SRE more directly, in order to understand what the preferred path forward is.

Should we start creating our own SLO page based on the template linked above? https://wikitech.wikimedia.org/wiki/SLO/template_instructions contains good instructions about what to follow, etc.

Yes, my plan was to elaborate on my write-up a bit (it's mostly for sorting my thoughts), and then use the template you mentioned and develop that into something like the API GW SLO (with plenty of SRE input).

Luca and I had a longer discussion via mail and IRC, about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not.

For now, the consensus is that we should treat services running on LW as black boxes, i.e. whatever the end-to-end latency may be is what counts as the SLI/SLO.

If, in the future, we need more differentiated SLIs, we can always define service-specific SLO thresholds (and custom SLIs, of course). I have updated the SLO draft accordingly.
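
For illustration, a black-box latency SLI could then be expressed directly on the mesh-level Istio histogram discussed further down in this task, along the lines of the sketch below. The 500ms threshold and the service name are placeholders, not agreed values:

# Fraction of requests to one isvc that complete within 500ms, measured at the
# Istio mesh level so the service is treated as a black box (end-to-end latency).
sum(rate(istio_request_duration_milliseconds_bucket{destination_canonical_service="arwiki-articletopic-predictor-default", le="500"}[5m]))
/
sum(rate(istio_request_duration_milliseconds_count{destination_canonical_service="arwiki-articletopic-predictor-default"}[5m]))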

https://grafana.wikimedia.org/goto/x7S0HpjVk?orgId=1 I've started an SLO dashboard here. It only has one metric (latency) so far, but it's a start.

Please also check things like https://wikitech.wikimedia.org/wiki/Grafana#Grizzly; IIRC SRE offers a template for SLO dashboards, so that we all have the same one, etc.

I am working in Grafana/Thanos directly for now because it's a shorter change-and-try loop for finding the right metrics than going through Grizzly. Even with templating, we still need specific metrics to put into slo_definitions.libsonnet. Arguably, that's the most difficult part, due to the sheer number of metrics surfaced even just by our own Prometheus instance.

But I agree that the final version of the SLO dashboard should be done in Grizzly.

I made some progress on the experimental dashboard (https://grafana.wikimedia.org/goto/VSolQfj4k?orgId=1). I now have a better mental model/grasp of request counts (and 200 vs. non-200 responses). The current setup is not ML-specific but rather Istio-specific, so it may not apply to all k8s setups, only to those where Istio is used.

The latency metrics are harder to get right since they're not simple counters but histograms.
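
For example (a sketch; the aggregation labels are just illustrative), a latency percentile can be derived from the histogram buckets with histogram_quantile():

# Approximate p99 latency per inference service, computed from the Istio
# request-duration histogram buckets.
histogram_quantile(0.99,
  sum by (destination_canonical_service, le) (
    rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
  )
)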

I've now also managed to add some latency bucketing. I'm not 100% sure yet that it's what we want, but in any case, it's progress.

Change 942448 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/grafana-grizzly@master] LiftWing/Revscoring: add first latency/availability dashboards

https://gerrit.wikimedia.org/r/942448

While experimenting with the SLO dashboard it became clear that the label cardinality of our input metrics is so high (>10k) that direct computation from the raw metrics is not feasible: the result set is so large that we get at best partial results.

The usual way around this is to have recording rules, which precompute the aggregate numbers at every rule evaluation, so that the SLx queries can then run against those.

So given e.g. this base metric:

istio_request_duration_milliseconds_bucket{
   app="arwiki-articletopic-predictor-default-00006",
   app_kubernetes_io_managed_by="Helm",
   app_wmf="kserve-inference",
   chart="kserve-inference-0.4.2",
   component="predictor",
   connection_security_policy="none",
   destination_app="arwiki-articletopic-predictor-default-00006",
*  destination_canonical_revision="arwiki-articletopic-predictor-default-00006",
*  destination_canonical_service="arwiki-articletopic-predictor-default",
   destination_cluster="Kubernetes",
   destination_principal="unknown",
   destination_service="arwiki-articletopic-predictor-default-00006-private.revscoring-articletopic.svc.cluster.local",
   destination_service_name="arwiki-articletopic-predictor-default-00006-private",
*  destination_service_namespace="revscoring-articletopic",
   destination_version="unknown",
   destination_workload="arwiki-articletopic-predictor-default-00006-deployment",
   destination_workload_namespace="revscoring-articletopic",
   instance="10.194.19.75:15020",
   job="k8s-pods",
   kubernetes_namespace="revscoring-articletopic",
   kubernetes_pod_name="arwiki-articletopic-predictor-default-00006-deployment-5d5q75pt",
*  le="+Inf",
   pod_template_hash="5d566c44f8",
   prometheus="k8s-mlserve",
   release="main",
   reporter="destination",
   request_protocol="http",
*  response_code="200",
*  response_flags="-",
   security_istio_io_tlsMode="istio",
   service_istio_io_canonical_name="arwiki-articletopic-predictor-default",
   service_istio_io_canonical_revision="arwiki-articletopic-predictor-default-00006",
   serving_knative_dev_configuration="arwiki-articletopic-predictor-default",
   serving_knative_dev_configurationGeneration="6",
   serving_knative_dev_configurationUID="2d606aec-5a3f-48a0-ab88-38f2c07f8a05",
   serving_knative_dev_revision="arwiki-articletopic-predictor-default-00006",
   serving_knative_dev_revisionUID="723c47eb-cf86-417c-8ba9-b12d7b203ae5",
   serving_knative_dev_service="arwiki-articletopic-predictor-default",
   serving_knative_dev_serviceUID="789ca7cc-6925-441d-8f32-67afac682cc1",
   serving_kserve_io_inferenceservice="arwiki-articletopic",
   site="codfw",
   source_app="istio-ingressgateway",
   source_canonical_revision="latest",
   source_canonical_service="istio-ingressgateway",
   source_cluster="Kubernetes",
   source_principal="unknown",
   source_version="unknown",
   source_workload="istio-ingressgateway",
   source_workload_namespace="istio-system"
}

we would likely sum by the labels marked with a *. Note that le is the label that distinguishes the latency histogram buckets (it stands for "less than or equal", i.e. the value of the metric is the number of requests that completed in at most le milliseconds). Most of the other labels are redundant or uninteresting.

With the above labels deemed interesting, we'd have recording rules like this:

# SLx: Buckets for latency
record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_bucket{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_bucket{}[90d]))

# SLx: Total requests for latency
record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_count{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags) (rate(istio_request_duration_milliseconds_count{}[90d]))

# SLx: Request counts with status
record: destsvc_rev_ns_rc_rf:istio_requests_total:rate5m
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags) (rate(istio_requests_total{}[5m]))

record: destsvc_rev_ns_rc_rf:istio_requests_total:rate90d
expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags) (rate(istio_requests_total{}[90d]))

We have three base recording rules (buckets, total count, and requests by status code), each doubled into a 5m rate (to display the current graph on the SLO dashboard) and a 90d rate (for the actual SLO).

By convention, the recording rule name is [labels which are retained]:[original metric name]:[description of aggregation, omitting 'sum']

If we used the full label names, this would make for a very long name for the recorded metric, so I did some creative abbreviation, and omitted le, since it's implied in the histogram metric type.

It may seem that we could just sum the counters and then compute the rates at query time (reducing the number of recording rules from 6 to 3), but this has problems with counter resets and the like (the Prometheus blog has an article that covers this).
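
To make the difference concrete, here is a sketch (the destsvc:istio_requests_total:sum rule name is hypothetical, only used for illustration):

# Hypothetical "sum only" recording rule (NOT what we use):
#   record: destsvc:istio_requests_total:sum
#   expr: sum by (destination_canonical_service) (istio_requests_total)
#
# Querying it like this is subtly wrong: whenever a single pod restarts, the
# summed series drops, and rate() misinterprets the drop as a counter reset,
# producing spurious spikes:
rate(destsvc:istio_requests_total:sum[5m])

# The recording rules above instead apply rate() per underlying series first
# (so resets are handled correctly) and only then aggregate:
sum by (destination_canonical_service) (rate(istio_requests_total[5m]))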

Change 948149 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] prometheus: Add recording rules for istio traffic on k8s

https://gerrit.wikimedia.org/r/948149

Change 948149 merged by Klausman:

[operations/puppet@production] prometheus: Add recording rules for istio traffic on k8s

https://gerrit.wikimedia.org/r/948149

Change 942448 abandoned by Klausman:

[operations/grafana-grizzly@master] LiftWing/Revscoring: add first latency/availability dashboards

Reason:

superseded by 952226

https://gerrit.wikimedia.org/r/942448

Change 952226 had a related patch set uploaded (by Elukey; author: Klausman):

[operations/grafana-grizzly@master] LiftWing: add latency/availability SLO dashboards

https://gerrit.wikimedia.org/r/952226

Used https://wikitech.wikimedia.org/wiki/Grafana#Previewing_the_Change_('grr_preview') to get a preview of the dashboards in https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/952226/, and they look good. I am going to wait for the final +1 from Observability, and then I think we'll be ready to merge!

One nit: I increased the SLO for the revert risk and revscoring model servers to 98% (was: 95%), since we should be OK with a tighter threshold.

To put it in numbers: 98% means that we'll have an error budget of 2% over 3 months. We can always revisit the choice, but in theory we should strive for 99% in my opinion.

After a brainbounce with Ilias, we decided to have separate SLOs within the revertrisk namespace (i.e. differentiating between agnostic, multilingual, etc.).

I used the destination_canonical_service label in the grizzly config, that seems good enough for the use case.

Updated the code review and the SLO wikitech page for Lift Wing. The new proposal is the following:

  • We use the 95% SLO for experimental isvcs, like outlink or revert risk multilingual, and we observe how things evolve (namely whether we want a stricter SLO, etc.).
  • We use the 98% SLO for services like revscoring or revert risk, which we consider stable and not frequently changing. We'll aim for 99%+ in the future if possible.

This way we have an easy path for adding SLOs to new services, which can evolve over time if needed.

Change 952226 merged by Elukey:

[operations/grafana-grizzly@master] Lift Wing: add latency/availability SLO dashboards

https://gerrit.wikimedia.org/r/952226

Change 955355 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Add SLO definition for the ORES Legacy service

https://gerrit.wikimedia.org/r/955355

Change 955355 merged by Elukey:

[operations/grafana-grizzly@master] Add SLO definition for the ORES Legacy service

https://gerrit.wikimedia.org/r/955355

I am not 100% sure about the ores-legacy one, but see https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1

The main difference from the others is that I had to count 2xx, 3xx, and 4xx among the "good" responses, not only 2xx, since the calculation was totally off otherwise.
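
For reference, a sketch of what that availability ratio looks like, assuming the ores-legacy traffic is visible via the same Istio request metrics; the destination_canonical_service value here is a placeholder:

# Treat 2xx/3xx/4xx as "good"; everything else (essentially 5xx) counts against the SLO.
sum(rate(istio_requests_total{destination_canonical_service="ores-legacy", response_code=~"[234].."}[5m]))
/
sum(rate(istio_requests_total{destination_canonical_service="ores-legacy"}[5m]))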

I tried to review the exact metrics that we used in our SLO/SLI calculations, and something doesn't feel right. For istio we have the following recording rules:

- record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_bucket:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags, site, prometheus) (rate(istio_request_duration_milliseconds_bucket{}[5m]))
# SLx: Total requests for latency
- record: destsvc_rev_ns_rc_rf:istio_request_duration_milliseconds_count:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, response_code, response_flags, site, prometheus) (rate(istio_request_duration_milliseconds_count{}[5m]))
# SLx: Request counts with status
- record: destsvc_rev_ns_rc_rf:istio_requests_total:rate5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags, site, prometheus) (rate(istio_requests_total{}[5m]))

Let's pick the SLO budget calculation, which is the following (from the repo's guidelines):

// To compute error budget remaining (*_slo_query), use one of these formulas,
// depending on whether a timeseries is available for good_events or bad_events.
//
//
//                                  ⎛  bad events  ⎞
//                                  ⎜ ──────────── ⎟
//                                  ⎝ total events ⎠
// pct error budget remaining = 1 - ────────────────
//                                    error budget
//
//                            = 1 - (bad_events / total_events) / error_budget
//
//
//                                   ⎛     good events  ⎞
//                                   ⎜ 1 - ──────────── ⎟
//                                   ⎝     total events ⎠
// pct error budget remaining = 1 - ─────────────────────
//                                       error budget
//
//                            = 1 - (1 - good_events / total_events) / error_budget
//
//
// - Substitute the appropriate timeseries for good/bad and total events, and the appropriate numeric
//   value for error budget.
// - Express the error budget as a fraction, not a percentage. For example, if the service has a
//   99.9% SLO target, fill in 0.001 for the error budget.
// - For the duration selector, we want one quarter; approximate it as "[90d]".

If we pick "total events" in our settings, we do:

(sum by (destination_service_namespace) (avg_over_time(destsvc_rev_ns_rc_rf:istio_requests_total:rate5m{response_code=~"...", site=~"$site", prometheus=~"k8s-mlserve.*", destination_service_namespace=~"revscoring.*"}[90d])))

I picked up the previous work-in-progress code review and took those values for granted, but an avg of a rate doesn't feel correct to me. I would have thought something like:

sum by (destination_service_namespace) (increase(istio_requests_total{response_code=~"...", site=~"eqiad", prometheus=~"k8s-mlserve.*", destination_service_namespace=~"revscoring.*"}[90d]))

The difference is:

{destination_service_namespace="revscoring-articletopic"}
19.819579877921402
{destination_service_namespace="revscoring-editquality-damaging"}
52.772535973215454
{destination_service_namespace="revscoring-editquality-goodfaith"}
81.24742727101449
{destination_service_namespace="revscoring-editquality-reverted"}
12.446285469673908
{destination_service_namespace="revscoring-articlequality"}
28.11326819083035
{destination_service_namespace="revscoring-draftquality"}
7.069326442998623
{destination_service_namespace="revscoring-drafttopic"}
6.05662922395137

vs

{destination_service_namespace="revscoring-articletopic"}
85587048.2284664
{destination_service_namespace="revscoring-editquality-damaging"}
313775622.43407017
{destination_service_namespace="revscoring-editquality-goodfaith"}
316124751.0298485
{destination_service_namespace="revscoring-editquality-reverted"}
73911111.41972806
{destination_service_namespace="revscoring-articlequality"}
102892004.36752039
{destination_service_namespace="revscoring-draftquality"}
25082004.62260486
{destination_service_namespace="revscoring-drafttopic"}
44900191.82625936

So something is off....
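
For reference, a sketch of what an increase-based recording rule could look like, following the same naming convention as above (this is illustrative, not necessarily the exact rule that landed in puppet):

- record: destsvc_rev_ns_rc_rf:istio_requests_total:increase5m
  expr: sum by (destination_canonical_service, destination_canonical_revision, destination_service_namespace, response_code, response_flags, site, prometheus) (increase(istio_requests_total{}[5m]))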

Change 955958 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] WIP: improve Lift Wing's SLO/SLI calculations

https://gerrit.wikimedia.org/r/955958

Change 956841 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos: add increase-based rec rules for Istio

https://gerrit.wikimedia.org/r/956841

Opened T346144 as related change for the dashboards :)

Change 956841 merged by Elukey:

[operations/puppet@production] profile::thanos: add increase-based rec rules for Istio

https://gerrit.wikimedia.org/r/956841

Change 955958 abandoned by Elukey:

[operations/grafana-grizzly@master] WIP: improve Lift Wing's SLO/SLI calculations

Reason:

https://gerrit.wikimedia.org/r/955958

Change 958897 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Improve ML team's SLO calculations

https://gerrit.wikimedia.org/r/958897

Change 958897 merged by Elukey:

[operations/grafana-grizzly@master] Improve ML team's SLO calculations

https://gerrit.wikimedia.org/r/958897

Last changes applied, we should be good to close!

Change 959695 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] slo_definitions: improve Lift Wing's service SLO/SLI calculations

https://gerrit.wikimedia.org/r/959695

Change 959695 merged by Elukey:

[operations/grafana-grizzly@master] slo_definitions: improve Lift Wing's service SLO/SLI availability

https://gerrit.wikimedia.org/r/959695