
liftwing SLO performance issues
Closed, Resolved · Public

Authored By tappof, Feb 26 2025, 2:50 PM

Description

Related to task T302995: Transition to Pyrra for SLO Visualization and Management, I ran some tests.

As a recap, applying the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1102346, where the Liftwing SLOs were moved away from pre-configured recording rules, led to performance issues on the Thanos side.

(4 screenshots attached)

I executed the entire set of queries generated by the Pyrra config for every liftwing SLO using the /query API via curl through the thanos-query-frontend endpoint.

I ran the same test directly on the eqiad k8s-mlserve Prometheus instance (prometheus1006) to gather additional feedback on performance, specifically regarding query execution time and resource consumption on the host machine (Ed. This is also to explore the possibility of configuring Thanos Query to request the most recent four weeks of data directly from the Prometheus sidecar).

The purpose of this task is also to collect the necessary data to open an issue directly on the Pyrra repository, asking for guidance.

thanos (curl, /query, query-frontend, titan1002), execution time in seconds per query:

3.569	(sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[3d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[3d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[3d]))

3.817	sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[3d]))

12.694	sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12d]))

12.972 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d]))

48.040	sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12w]))

63.559	sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12w]))

94.796	sum by (destination_service_namespace, response_code, site) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12w]))

(4 screenshots attached)

prometheus (curl, /query, /k8s-mlserve, prometheus1006), execution time in seconds per query:

3.761	sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4).."}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[3d]))

3.780	(sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[3d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[3d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[3d]))

13.811	(sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12d]))

15.887	sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4).."}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[12d]))

51.368	sum by (destination_service_namespace, response_code) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12w]))

75.709	sum by (destination_service_namespace, response_code) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[12w]))

115.120	sum by (destination_service_namespace, response_code) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[12w]))

(4 screenshots attached)

As a result, it turns out that the most expensive SLOs are (clearly):

  • liftwing-revscoring-availability-${datacenter}
  • liftwing-revscoring-latency-${datacenter}

Moreover, despite the slower query evaluation time, computing directly on the Prometheus instance appears to be less resource-intensive.

Event Timeline

So as a workaround next step, I'm thinking we can create stripped-down copies of high-cardinality metrics like istio_requests_total, along the lines of istio_requests_total_stripped and istio_request_duration_milliseconds_bucket_stripped, using relabel/scrape configs. Recording rules of the form sum without (a, b, c) (istio_requests_total) would get us into rate(sum()) issues with the Pyrra-generated recording rules.

In other words, copy the in-scope high-cardinality metrics to a second metric name, and use a relabel config (e.g. labelkeep) to opt in only the labels required for the SLO to the _stripped metric. It'd be similar in approach to SLO-specific recording-rule metrics, but done in the relabel config to avoid conflicts with the Pyrra-generated recording rules.

I'll plan to prototype this, but also am happy if anyone is interested in experimenting with that and beats me to it, or has a more straightforward workaround in mind.
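A rough sketch of what this could look like in plain Prometheus configuration (job names and the label list are hypothetical; our production config is managed via Puppet). Note that relabeling alone can't fork a series, since each scraped sample passes through the metric_relabel_configs pipeline exactly once, so duplicating a metric under a second name would need a second scrape job against the same targets:

```yaml
scrape_configs:
  # Existing job: metrics ingested as-is, nothing changes here.
  - job_name: istio-gateway
    # ... usual scrape settings ...

  # Hypothetical second job scraping the same targets: rename the in-scope
  # metrics to *_stripped and keep only the labels the SLO queries need.
  - job_name: istio-gateway-slo
    # ... same targets as istio-gateway ...
    metric_relabel_configs:
      # Rename only the metrics in scope for the SLO.
      - source_labels: [__name__]
        regex: '(istio_requests_total|istio_request_duration_milliseconds_bucket)'
        target_label: __name__
        replacement: '${1}_stripped'
      # Drop everything else so this job only ingests the stripped copies.
      - source_labels: [__name__]
        regex: '.*_stripped'
        action: keep
      # Opt in only the labels the SLO actually uses. __name__ must be
      # included, otherwise labelkeep would drop the metric name itself.
      - regex: '(__name__|instance|job|destination_service_namespace|response_code|le)'
        action: labelkeep
```

The trade-off is that every in-scope target gets scraped twice, which is the price of keeping the original metrics untouched.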

Wouldn't stripping down labels without aggregating data lead to losing data?

Folks please note that we are already stripping some labels from the native Istio metrics, see the Prometheus k8s config:

'metric_relabel_configs' => [
    {
        'action'        => 'labeldrop',
        'regex'         => "(${[
            'connection_security_policy',
            'controller_revision_hash',
            'istio_io_rev',
            'source_canonical_revision',
            'source_principal',
            'source_version',
            'install_operator_istio_io_owning_resource',
            'operator_istio_io_component',
            'service_istio_io_canonical_revision',
            'sidecar_istio_io_inject',
            'source_cluster',
            'destination_principal',
            'destination_version',
            'destination_cluster',
        ].join('|')})",
    },
],

There may be more to clean up, but as Tiziano was saying, we need to be careful about what we drop to avoid losing data.

Yes, although for the purposes of e.g. the citoid latency SLO, we need just a few labels: source_workload_namespace, destination_canonical_service, le, site, along with the typical instance, job, etc.

So my thinking is to for example copy istio_request_duration_milliseconds_bucket to istio_request_duration_milliseconds_bucket_stripped and apply a labelkeep filter to retain only what we absolutely need.

Losing data shouldn't be an issue since it'd be a second metric. The original istio_request_duration_milliseconds_bucket would be left as-is.

@herron ack yes, for citoid it may be a viable solution, for Lift Wing's metrics I am not sure since there are more labels (due to extra Kubernetes layers like Knative etc..). Should we start with citoid, test and then expand?


Getting back to this with a fresh mind. I am not 100% sure that dropping labels alone would work, because some of them are related to revisions of the code, etc. For example, in the ML use case, every time they deploy a new version one label value changes (like the revision counter), and the old time series stops. To do things the right way we'd need to sum and then drop the labels, like we do now, but that doesn't work. This is what Tiziano mentioned above, IIUC.

Change #1136978 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::prometheus::k8s: drop two more labels in Istio metrics

https://gerrit.wikimedia.org/r/1136978

It is worth mentioning that other tools like https://github.com/slok/sloth give more flexibility when adding metrics and recording rules; I am wondering if Pyrra has any of that. Worst case, we could provide a patch to skip the "standard" creation of recording rules and rely only on the ones that we create. It should be possible in my opinion; it is a fair use case.


@herron This is just to further clarify my concern.

While reading the Prometheus documentation about labeldrop and labelkeep, I came across this note:

Care must be taken with labeldrop and labelkeep to ensure that metrics are still uniquely labeled once the labels are removed.

This raised a question for me: What happens to our metrics if, by removing labels without aggregation, we end up with multiple series that are no longer uniquely identifiable? How are we supposed to manage this situation?

On the other hand, if we remove labels that are identical across all series, we’re technically reducing the number of labels — but not actually reducing the cardinality.

A talk I attended at the last FOSDEM comes to mind:
https://fosdem.org/2025/schedule/event/fosdem-2025-5726-apache-flink-and-prometheus-better-together-to-improve-the-efficiency-of-your-observability-platform-at-scale.

I’ve already discussed it with Filippo — as it stands, it’s quite complex, not only to study and understand, but also to implement and maintain. On the other hand, it’s a general-purpose solution.

The talk was essentially about ingesting large volumes of high-cardinality IoT data into Prometheus.
They used Apache Flink to aggregate the data before sending it to Prometheus. Flink itself provides a way to push data into Prometheus or directly into a TSDB.

This approach allows us to aggregate data before it reaches Pyrra and expose it as a counter.
As we've seen, our current aggregation method may lead to the rate(sum(...)) issue, since the recording rule produces a gauge metric.
However, exposing the aggregated data as a counter helps us avoid this behavior.
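The rate(sum(...)) problem mentioned above can be sketched with two recording rules (the rule group, rule names, and the pod label are hypothetical, for illustration only):

```yaml
groups:
  - name: example  # hypothetical rule group
    rules:
      # Problematic: the sum of counters is not itself a well-behaved counter.
      # A pod restart (counter reset) or series churn makes the aggregate jump,
      # so rate(job:istio_requests:sum[5m]) would misinterpret those jumps.
      - record: job:istio_requests:sum
        expr: sum without (pod) (istio_requests_total)

      # Safe: compute per-series rates first, then aggregate. The result is a
      # gauge, which can be summed or averaged further but never rate()'d
      # again - which is exactly the conflict with Pyrra's generated rules,
      # since they expect to apply rate()/increase() to a real counter.
      - record: job:istio_requests:rate5m
        expr: sum without (pod) (rate(istio_requests_total[5m]))
```

Aggregating upstream (e.g. in Flink) and re-exposing the result as a genuine counter sidesteps this, because Pyrra's generated rules would then see a metric they can legitimately rate().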

To be clear: I'm not saying this is the solution (nor that these are the tools we’d want to use), but it might be something worth sharing with you and thinking through together.

Change #1136978 merged by Elukey:

[operations/puppet@production] profile::prometheus::k8s: drop two more labels in Istio metrics

https://gerrit.wikimedia.org/r/1136978

Change #1138313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::prometheus::k8s: drop istio gateway labels for ML

https://gerrit.wikimedia.org/r/1138313

I have some high-level thoughts about these metrics; I think that we are focusing on a specific use case that can be misleading. I'll try to explain the various use cases so we can discuss how to proceed.

As a starter, metrics starting with istio_ can be generated by two sources:

  1. The Istio sidecars running on ML pods, the equivalent of the "mesh" sidecar that we use on Wikikube for most of our services. This is ML-specific and we don't care about it for SLO calculations. We can use selectors like source_workload="istio-ingressgateway", source_workload_namespace="istio-system" to filter out unwanted time series.
  2. The Istio gateway pods, which run on both ML and Wikikube clusters. They are the subjects of the SLI/SLO targets.

Then, more specifically, the ML use case produces Istio gateway metrics with many more labels, because of a K8s framework called "Knative" that forces custom ones. I realized that we don't drop those for gateways, only for sidecars; see https://gerrit.wikimedia.org/r/1138313.
Moreover, a grouping like destination_service_namespace=~"revscoring.*" takes into account several model servers, something close to 100 pods, versus other services that run with a couple of them. The number of time series returned can vary enormously, so I want to be sure that we focus on the right use case.

The Wikikube use case produces far fewer time series; see for example:

istio_requests_total{destination_service_namespace="citoid", destination_workload="citoid-production", source_workload="istio-ingressgateway", source_workload_namespace="istio-system"}

To summarize, for the same metric we have various combinations of labels that can trigger big results, like:

  • istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."} - the heaviest, a ton of pods.
  • istio_requests_total{destination_service_namespace="revert-risk", destination_workload="revert-risk-predictor", source_workload="istio-ingressgateway", source_workload_namespace="istio-system"} - somewhat smaller, but with a lot of labels (ML-specific workload)
  • istio_requests_total{destination_service_namespace="citoid", destination_workload="citoid-production", source_workload="istio-ingressgateway", source_workload_namespace="istio-system"}

I am trying to understand which of the use cases above is the right/acceptable one, namely which one could be rendered by Pyrra without issues on the Thanos side. Even if none of the above is the answer, we should aim for a reasonable target for the number of time series returned in a given time window, to figure out what is best to do.

Last but not least, istio_requests_total doesn't take buckets and histograms into account; the Istio latency metrics return even more time series.

Change #1138674 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra: avoid Istio recording rules for SLOs

https://gerrit.wikimedia.org/r/1138674

Change #1138313 merged by Elukey:

[operations/puppet@production] profile::prometheus::k8s: drop istio gateway labels for ML

https://gerrit.wikimedia.org/r/1138313

Change #1138674 merged by Elukey:

[operations/puppet@production] profile::pyrra: avoid Istio recording rules for SLOs

https://gerrit.wikimedia.org/r/1138674

I've updated the citoid Pyrra definition to use the "raw" Istio gateway metrics, and from https://thanos.wikimedia.org/rules it seems that we are seeing at most 7-8s for 12w queries. We could probably retry with a smaller case for Lift Wing and see how it goes (I again dropped some labels that were probably generating high cardinality).

I've created T392886 as subtask to reduce the cardinality of the bucket-related metrics :)

Change #1142613 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem::slos: add test for revertrisk LA

https://gerrit.wikimedia.org/r/1142613

Change #1142613 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem::slos: add test for revertrisk LA

https://gerrit.wikimedia.org/r/1142613

We added the revertrisk pilot in https://slo.wikimedia.org/?search=revertrisk, using Istio's "raw" metrics with a more performant selection. The resulting Thanos rules have a max of 9s of aggregated running time, which is not perfect but seems very promising :)

Change #1147713 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Remove Lift Wing related dashboards from Grizzly

https://gerrit.wikimedia.org/r/1147713

Change #1147713 merged by Elukey:

[operations/grafana-grizzly@master] Remove Lift Wing related dashboards from Grizzly

https://gerrit.wikimedia.org/r/1147713

Lift Wing Grizzly dashboards have been bulk deleted within Grafana as well

Change #1147803 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::prometheus: remove istio recording rules

https://gerrit.wikimedia.org/r/1147803

Change #1147803 merged by Elukey:

[operations/puppet@production] profile::prometheus: remove istio recording rules

https://gerrit.wikimedia.org/r/1147803

Change #1155316 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::pyrra::filesystem::slos::istio: default to 4w

https://gerrit.wikimedia.org/r/1155316

Change #1155316 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem::slos::istio: default to 4w

https://gerrit.wikimedia.org/r/1155316

Optimistically resolving, as we've tuned the window for the Istio SLOs to 4w (from 12w).