
Calculate tone check model service metrics for fixed calendar window
Closed, ResolvedPublic

Description

In T390706: Create SLO dashboard for tone (peacock) check model we defined an SLO for the tone check model along with its corresponding dashboards using Pyrra.
However, as noted in {T403729#11200918}, we cannot use this to report on the specific SLIs for a fixed calendar window.
In the context of the running A/B test for Tone Check, we defined some leading indicators in T394463: [A/B Test] Report on Tone Check leading indicators, among them the two SLO metrics: service availability and latency. We would like to be able to report on these metrics by October 1st.
To overcome the above limitations, we would like to explore extracting the following metrics for a fixed calendar window by writing queries in Thanos on top of the already published Prometheus metrics.

  1. The percentage of all requests that receive a non-error (non-5xx) response.
  2. The percentage of all successful requests (2xx) that complete within 1000 milliseconds (1 sec), measured on the server side.

The above metrics should be calculated from the Istio metrics, as the latency metrics reported by KServe do not take into account additional overhead that might be added along the way.

The queries should be reusable so that we can report on specific time periods when needed. If we switch to another tool (T404171: Evaluate Sloth as a possible replacement for Pyrra) that covers this need, then this will remain an ad hoc request that we will not use further. Otherwise, we can evaluate turning these queries into a Grafana dashboard that we can use for multiple services.

Event Timeline

Starting from the Istio Grafana dashboard that presents the p90 latency of the service, I came up with the following query, which gives us the percentage of requests that receive a 2xx response within 1000 ms:

100 * (
  sum(
    rate(istio_request_duration_milliseconds_bucket{
      source_workload_namespace="istio-system",
      app="istio-ingressgateway",
      destination_service_namespace=~"edit-check",
      destination_service_name=~"edit-check-predictor.*",
      response_code=~"2..",
      le="1000"
    }[20d])
  )
  /
  sum(
    rate(istio_request_duration_milliseconds_count{
      source_workload_namespace="istio-system",
      app="istio-ingressgateway",
      destination_service_namespace=~"edit-check",
      destination_service_name=~"edit-check-predictor.*",
      response_code=~"2.."
    }[20d])
  )
)

The result of the above query is 97.72%.
If this is correct, we could use it in a Grafana panel, which would let us define absolute time ranges instead of only the relative ones used in the queries (e.g. 20d).

@isarantopoulos it is not that easy :)

Passing Grafana time ranges down to PromQL, which doesn't support fixed dates, requires some tricks. For example, Sloth does the following to achieve a "fixed" window running from the first day of the calendar month (as opposed to "now", which is what Grafana uses) to the end of it:

"description": "This graph shows the month error budget burn down chart (starts the 1st until the end of the month)",
"expr": "

1 - ( sum_over_time(
                 (  slo:sli_error:ratio_rate1h{sloth_service=\"${service}\",sloth_slo=\"${slo}\"}
                   * on() group_left() ( month() == bool vector(${__to:date:M}))[32d:1h]) / on(sloth_id) 

                  (slo:error_budget:ratio{sloth_service=\"${service}\",sloth_slo=\"${slo}\"} *on() group_left() (24 * days_in_month())))",
                   "legendFormat": "Remaining error budget (month)",

It uses recording rules, but the idea is that it takes 1-hour diffs (the rate1h) for the past 32d from "now" and discards whatever doesn't belong to the current month (via month() etc.).
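To illustrate the masking idea in isolation, here is a minimal sketch, assuming the edit-check namespace selector from the queries above and hard-coding October (month number 10) in place of Grafana's ${__to:date:M} variable. It approximates the total number of requests that fall inside the current calendar month:

# Hypothetical sketch of the calendar-month mask: (month() == bool 10)
# yields 1 for evaluation steps inside October and 0 otherwise, so
# multiplying by it zeroes out the 1-hour increases that fall outside
# the month before sum_over_time() adds them up.
sum_over_time(
  (
    sum(increase(istio_requests_total{destination_workload_namespace="edit-check"}[1h]))
    * on() group_left() (month() == bool 10)
  )[32d:1h]
)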

Also note that the raw Istio metrics are high cardinality, so measuring big-enough windows without recording rules will be very expensive.

@isarantopoulos it is not that easy :)

I assumed so :)

Although this won't allow us to calculate over a fixed window, wouldn't looking in this direction give us the % of 2xx requests over the last 20 days? This is the main thing to tackle at this point so we can report by Oct 1st.
That said, I understand that this kind of calculation still yields results over a 20-day sliding window, as each value is calculated from the previous 20 days, so it definitely needs some work. Should we use increase instead of rate, since we basically want to calculate increasing values?
cc: @klausman

I think this would work:

(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check", response_code="200"}[21d])
	)
)/(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check"}[21d])
	)
)*100

A few notes:

  • increase() handles counter-resets gracefully. From the docs:

increase(v range-vector) calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
[...]
increase should only be used with counters (for both floats and histograms). It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability. Use rate in recording rules so that increases are tracked consistently on a per-second basis.

  • The result here is an instant vector, i.e. only a number, not a graph. A graph would show the value as it was at various times in the past.
  • I have deliberately omitted site="eqiad" and similar selectors, as the SLO doesn't really care about sites. They can of course be added back to see where SLO budget burns happen.
  • This currently selects all services in the e-c NS and sums by canonical service. If, e.g., we did this for revert risk, we'd get four separate numbers for the four different pods (ML vs. LA, with the predictor pod separate).
  • This assumes that the base to compute the 200-rate against is _all_ status codes, including the client errors that usually have a code of 0, as well as 400s. If that is not the right base rate, the second subquery must be adjusted accordingly.
  • This query is not exactly light, but even with a 21d time frame it takes maybe 2-3 seconds, so I think it is good enough for now.
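On the increase/rate equivalence quoted from the docs above: with the selectors shortened for readability, the following two expressions give the same result for a 21-day window, since 21d is 21 * 86400 = 1814400 seconds:

# increase(v[w]) is rate(v[w]) scaled by the window length in seconds,
# so these two forms are interchangeable for w = 21d.
increase(istio_requests_total{destination_workload_namespace="edit-check"}[21d])

rate(istio_requests_total{destination_workload_namespace="edit-check"}[21d]) * 1814400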

Thank you for the clarification! The above query covers the availability SLI (item 1 in the task description). I tried to tackle #2, which is more difficult since we did face increased latencies.
I agree with all the points except the following:

This assumes that the base to compute the 200-rate against is _all_ status codes

Our base is indeed _all_ status codes, but the rate we want is non-5xx. So the above query can be transformed into the following:

(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check", response_code!~"5.."}[21d])
	)
)/(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check"}[21d])
	)
)*100

The result of the above is 100% (which is also what the SLO dashboards tell us).
Regarding #2, "The percentage of all successful requests (2xx) that complete within 1000 milliseconds": what would the equivalent query be?

For latency, we'd use something like this:

(	
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_bucket{prometheus=~"k8s-mlserve", destination_workload_namespace="edit-check", response_code="200", le="1000"}[21d])
	)
) / (
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_count{prometheus=~"k8s-mlserve", destination_workload_namespace="edit-check", response_code="200"}[21d])
	)
) * 100

This assumes that we only care about the latency of 200-code responses, so it could be switched to use response_code!~"5.." as well. Though of course the question is whether 3xx and 4xx latencies really should enter into this SLO.
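For reference, the non-5xx variant described above would be a sketch like the following, with the same structure as the query above and only the response_code matcher changed:

(
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_bucket{prometheus=~"k8s-mlserve", destination_workload_namespace="edit-check", response_code!~"5..", le="1000"}[21d])
	)
) / (
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_count{prometheus=~"k8s-mlserve", destination_workload_namespace="edit-check", response_code!~"5.."}[21d])
	)
) * 100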

As for the discrepancy (~97% vs. ~99%): I just ran the equivalent of my query (using `increase` etc.), but instead of looking at the edit-check destination namespace I used the selectors from your query (source_workload_namespace="istio-system", app="istio-ingressgateway", destination_service_namespace=~"edit-check", destination_service_name=~"edit-check-predictor.*"), and now my query agrees with the ~97% result from your initial query:

(	
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_bucket{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2..", 
			le="1000"
		}[20d])
	)
) / (
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_count{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2.."
		}[20d])
	)
) * 100

This currently returns 97.46508947581462. One can also offset a query into the past; for example, the following covers the 20-day window ending one week ago. PromQL can't do absolute times, as Luca mentioned, but at least for manual queries this can be used to look at values/windows in the past.

(	
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_bucket{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2..", 
			le="1000"
		}[20d] offset 1w)
	)
) / (
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_count{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2.."
		}[20d] offset 1w)
	)
) * 100

Importantly, our query for the success rate (the first one I added to this task) would also use the istio-system NS selectors as above, for consistency.

Another small note: in theory, one can omit all the qualifiers on the second query (on istio_request_duration_milliseconds_count) and add ignoring(le) after the /, but that would force Prometheus to gather all time series for the second part and thus time out or run out of memory, so we must repeat the field selectors.
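For illustration, the shape described above would look roughly like this. This is a sketch only, since as noted it is too expensive to run without repeating the selectors on the denominator:

# Without the sum by (...), each bucket series keeps its `le` label
# while the matching _count series has none, hence ignoring(le) for
# the one-to-one vector matching. The unselected denominator is what
# forces Prometheus to fetch every _count series.
increase(istio_request_duration_milliseconds_bucket{
	prometheus=~"k8s-mlserve",
	destination_workload_namespace="edit-check",
	le="1000"
}[21d])
/ ignoring(le)
increase(istio_request_duration_milliseconds_count[21d])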

OK! So I'm pasting the modified queries for the availability and latency metrics using the last 21d.
The first one results in 99.99% availability:

(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",response_code="200"}[21d])
	)
)/(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*"}[21d])
	)
)*100

and the one for latency, which shows that 97.04% of requests complete within 1000 ms:

(	
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_bucket{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2..", 
			le="1000"
		}[21d])
	)
) / (
	sum by (destination_canonical_service) (
		increase(istio_request_duration_milliseconds_count{
			prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",
			response_code=~"2.."
		}[21d])
	)
) * 100
isarantopoulos claimed this task.
isarantopoulos moved this task from Unsorted to 2025-2026 Q1 Done on the Machine-Learning-Team board.