Following up on task T302995: Transition to Pyrra for SLO Visualization and Management, I ran some tests.
As a recap, applying the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1102346, which moved the Liftwing SLOs away from pre-configured recording rules, led to performance issues on the Thanos side.
I executed the entire set of queries generated by the Pyrra config for every Liftwing SLO using the /query API via curl through the thanos-query-frontend endpoint.
I then ran the same test directly on the eqiad k8s-mlserve Prometheus instance (prometheus1006) to gather additional feedback on performance, specifically query execution time and resource consumption on the host machine. (This also helps explore the possibility of configuring Thanos Query to request the most recent four weeks of data directly from the Prometheus sidecar.)
The purpose of this task is also to collect the necessary data to open an issue directly on the Pyrra repository, asking for guidance.
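For reference, the timing loop can be reproduced with a minimal Python sketch equivalent to the curl runs, assuming the standard Prometheus /api/v1/query instant-query endpoint (which thanos-query-frontend also exposes); the endpoint URL below is a placeholder, not the real titan1002/prometheus1006 address:

```python
import time
import urllib.parse
import urllib.request

# Placeholder endpoint -- substitute the real thanos-query-frontend
# or Prometheus instance URL.
BASE_URL = "http://thanos-query-frontend.example:9090"

def build_query_url(base, promql):
    # Instant-query endpoint of the Prometheus HTTP API,
    # exposed by both Prometheus and thanos-query-frontend.
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def time_query(base, promql):
    """Run one instant query and return wall-clock time in seconds."""
    url = build_query_url(base, promql)
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return time.monotonic() - start

# Example usage (requires network access to the endpoint):
# t = time_query(BASE_URL, 'sum(rate(istio_requests_total{site=~"eqiad"}[3d]))')
# print(f"{t:.3f}")
```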
thanos (curl, /query, query-frontend, titan1002) — execution time in seconds:
3.569 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[3d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[3d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[3d]))
3.817 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[3d]))
12.694 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12d]))
12.972 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d]))
48.040 sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12w]))
63.559 sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12w]))
94.796 sum by (destination_service_namespace, response_code, site) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12w]))

prometheus (curl, /query, /k8s-mlserve, prometheus1006) — execution time in seconds:
3.761 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4).."}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[3d]))
3.780 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[3d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[3d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[3d]))
13.811 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12d]))
15.887 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4).."}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[12d]))
51.368 sum by (destination_service_namespace, response_code) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2.."}[12w]))
75.709 sum by (destination_service_namespace, response_code) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2.."}[12w]))
115.120 sum by (destination_service_namespace, response_code) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"..."}[12w]))

As a result, the most expensive SLOs are clearly:
liftwing-revscoring-availability-${datacenter}
liftwing-revscoring-latency-${datacenter}
Moreover, despite the slower query evaluation time, running the queries directly on the Prometheus instance appears to be less resource-intensive.
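On the idea of serving the most recent weeks directly from the Prometheus sidecar: one possible direction (a sketch, not a tested configuration) is the sidecar's --min-time flag, which limits the time range its Store API advertises, so Thanos Query would fetch recent samples from Prometheus rather than from object storage. The exact deployment wiring for our setup is an assumption here:

```
# Hypothetical sketch: restrict the sidecar's Store API to the last
# four weeks; flag names are from upstream Thanos documentation.
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --min-time=-4w \
  ...
```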