
Include long-retention Prometheus data from Thanos into Grafana queries
Open, MediumPublic

Description

I'm not sure if this is a bug, feature request, or a "how to" documentation request.

What I'm trying to do:

Analyze data in Grafana for recent periods (e.g. the past 2 days), past periods (e.g. August 2023), and high-level trends (e.g. the past 2 years).

For example:

Actual:

Querying "7 days" or less takes a few seconds but generally works.
Querying "last 30 days" starts to cause some panels to time out.
Querying "last 90 days" times out every time.
Querying "last 6 months" times out every time.
Querying "1 August 2023 - 30 August 2023" shows No data.
Querying "1 Jan 2023 - 30 Jan 2023" shows No data.

Expected

From chatting with O11y folks, it is my understanding that when selecting the Thanos datasource, it is meant to transparently switch between raw/recent data and archived long-retention from more than a year ago. But, it has not worked for me in practice.

Are there criteria or limitations in how a dashboard query should be written in order for Thanos to consider archived data?

I've not been able to deduce what the criteria would be, but I suspect some exist. On occasion, after a lot of mangling of the query, I have seen a few disparate data points show up from more than a year ago. I could not get it to plot a continuous (hourly) line, but it felt like there was a way; I just couldn't find it.

With Graphite-backed data this works naturally and responds nearly instantly no matter the query. This makes intuitive sense, given that Graphite gradually reduces resolution of the data, and most queries don't specify a preferred resolution, so it ends up with a reasonable default that works.

In addition to reducing resolution, Graphite's aggregation generally also throws out all statistical meaning for timing metrics (averages of unweighted averages, etc.). It's terrible and I've blogged about it. I love and am sold on the Prometheus histogram model, with its simplicity in distilling everything down to a counter that can be safely aggregated and works independently of any specific resolution (e.g. T175087: Create a navtiming processor for Prometheus).
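As a sketch of why that model holds up (the metric name below is hypothetical, for illustration only): a Prometheus histogram is just a set of counters per bucket, so it can be summed across instances and re-derived at any query resolution without losing statistical meaning:

# Hypothetical metric name, for illustration only.
# p75 response-start time across all instances; rate() over bucket counters
# aggregates safely, and histogram_quantile() works at any resolution.
histogram_quantile(0.75, sum(rate(navtiming_responsestart_seconds_bucket[5m])) by (le))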

Other information

I've read these tasks:

And:

Thanos keeps 54 weeks of raw metric data, and 5 years for 5m and 1h resolution under normal circumstances.
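(For reference, retention like this is typically set via the Thanos compactor's retention flags; the following is a sketch that merely restates the numbers above, not the actual puppet configuration:)

# thanos compact (sketch, not the actual configuration)
--retention.resolution-raw=54w
--retention.resolution-5m=5y
--retention.resolution-1h=5y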

Event Timeline

Thank you for reaching out @Krinkle. I've looked into this a bit, and it is definitely somewhere between a bug and a documentation improvement.

I have added a new test datasource Thanos T371102 with two changes:

  1. max_source_resolution=auto (i.e. instruct Thanos to pick resolution automatically based on query parameters)
  2. set the scrape interval to 1m, which is the standard we use. The latter also effectively sets the minimum resolution that will be displayed (e.g. rate() will display points at minimum two minutes apart, since rate() requires at least two points)

With these changes in place, the last missing bit is having an adaptive interval for rate(): in Grafana that is $__rate_interval. In other words, something like this for queries: rate(....[$__rate_interval]).

This works in most cases, though not all: specifically, rate() calculations can break when merging results from Thanos and Prometheus (we query Prometheus for the last 15 days of data, Thanos for the rest), since I suspect there will be a discontinuity in the values that rate() gets. At any rate (hah!), please do some tests too with the datasource above and $__rate_interval, and let us know.

re: capacity, 1h data is available since Thanos has been deployed, however to avoid overly expensive queries pulling in a lot of data we're currently limiting query ranges to max 12 months.

Change #1058106 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: set timeinterval 60s for Thanos

https://gerrit.wikimedia.org/r/1058106

In practice, many timeseries I see from Prometheus seem to have data every 10-15 seconds. It'd be nice if zooming in can still surface that as-is, or at least points every 30s or 1min, not a minimum of 2min.

Afaik the long-term data in Thanos is at a granularity of 5min and 1h, so something somewhere still has to be able to take the request from Grafana for "I'd like X data points from t0 to t1 at X or more interval" and figure out the closest data it has for it. Why does that work better with timeInterval=60s? What is the default when timeInterval is not set?

In practice, many timeseries I see from Prometheus seem to have data every 10-15 seconds. It'd be nice if zooming in can still surface that as-is, or at least points every 30s or 1min, not a minimum of 2min.

That's fair re: minimum 2m. I hadn't considered that Thanos in codfw/eqiad does read from two Prometheus instances, each scraping every 60s, which may (to be confirmed) result in actually higher resolution. And the other thing I hadn't considered is that we're summing the rate of many different metrics, each with different points/timestamps.

Afaik the long-term data in Thanos is at a granularity of 5min and 1h, so something somewhere still has to be able to take the request from Grafana for "I'd like X data points from t0 to t1 at X or more interval" and figure out the closest data it has for it. Why does that work better with timeInterval=60s? What is the default when timeInterval is not set?

That's indeed Thanos' "query" component (i.e. the datasource we use in Grafana), and it supports automatically picking the source data resolution (cfr https://thanos.io/tip/components/query.md/#auto-downsampling) via query string for selected queries, or always enabled via command line. Historically we've kept auto downsampling disabled because it (used to?) break rate calculations, and it still does to some extent when going from Prometheus to Thanos data at the 15-day boundary (cfr my previous comment).

The default time interval is 15s AFAICS, set by Grafana if the datasource doesn't specify it. With the latest considerations in mind I think timeInterval=30s by default would work better. I've already adjusted the test Thanos datasource to 30s to test; what do you think?

I'm not sure if the below is due to downsampling, but if it is, then setting a timeInterval might not be enough to make the data source work:

# Date: last 1 year
# Source: Thanos T371102
sum(irate(varnish_resourceloader_resp{site=~"$site"}[5m])) by (site)

# Defaults: Step=Auto, Max data points = Auto (700), Min interval = Auto (30s), Actual interval: 12h

Screenshot 2024-07-31 at 17.13.54.png (1×2 px, 214 KB)

I suspect this is because the [5m] window is somehow either just missing the per-5m Thanos retention data points most of the time, or the effective interval of 12h is making it unable to find a nearby data point to compare against. I'm not sure what to make of the occasional 100 million blip. Counter resets gone wrong?

Trying to adjust to this by using $__rate_interval doesn't seem to help. Grafana documents $__rate_interval as "like" $__interval but with a minimum of 4x the scrape interval, to accommodate the range lookups that the Prometheus rate() and irate() functions need in order to find a second data point to extrapolate from.
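For reference, Grafana's documentation gives the interpolation roughly as (hedging on exact details):

# $__rate_interval = max($__interval + scrape_interval, 4 * scrape_interval)
# e.g. scrape_interval=1m, $__interval=15s  ->  max(1m15s, 4m)    = 4m
# e.g. scrape_interval=1m, $__interval=12h  ->  max(12h + 1m, 4m) = 12h1m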

# Date: 1 - 8 June 2023
# Source: Thanos T371102
sum(irate(varnish_resourceloader_resp{site=~"$site"}[$__rate_interval])) by (site)

# Defaults: Step=Auto, Max data points = Auto (900), Min interval = Auto (15s), Actual interval: 10m

Screenshot 2024-07-31 at 17.22.17.png (1×1 px, 201 KB)

This doesn't work either. I'm guessing because even though Thanos is delegating to the correct 5m or 1h aggregation source, Thanos as a whole still (mostly, correctly) tells Grafana that its underlying scrape is <1min, which means $__rate_interval is still too small.

By overriding various other settings I can eventually get something to show up, if I force it to look behind [2h] and reduce the number of fetched data points by a lot. But this isn't useful in practice, because no dashboard like this would be useful for recent data. And you'd never be able to get to a range like "1-8 June 2023" without first having a functioning "last 2 years" step to get there and zoom in. And then when you zoom in further, the per-5min retention data should be coming through, but isn't, since the query would then hardcode 2h.

I tried to get the 1-8 June 2023 query to show the data from per-5min retention with a look behind of 10m, but nothing seems to come through. It seems at minimum I have to give it [2h] when looking at older data.

Update from a brainstorm Timo and I had: setting the time/scrape interval to 30s is going to be beneficial for $__rate_interval users, since they'll get a minimum of 2m, which is guaranteed to find at least two data points given the 1m scrape that Prometheus does.
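A quick sanity check of the arithmetic behind that:

# minimum $__rate_interval = 4 * timeInterval = 4 * 30s = 2m
# a 2m window always spans at least two samples of a 1m scrape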

I'll also be setting max_source_resolution=auto for the default Thanos datasource, which instructs Thanos to look for the appropriate downsampling as needed, with no adverse effects observed so far.
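For reference, max_source_resolution is a parameter of the Thanos query API; roughly, the datasource ends up issuing range queries along these lines (URL abbreviated, values illustrative):

# GET /api/v1/query_range?query=...&start=...&end=...&step=300&max_source_resolution=auto
# "auto" lets Thanos pick raw, 5m, or 1h blocks based on the query step.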

One other thing worth pointing out is that under normal circumstances (i.e. no storage space pressure) we would keep data (raw and downsampled) for as long as possible. Keeping only downsampled data for longer isn't the intended use case according to upstream, though we do it because it is certainly better than no data.

andrea.denisse subscribed.

Change #1058106 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: set timeinterval 30s for Thanos

https://gerrit.wikimedia.org/r/1058106

Mentioned in SAL (#wikimedia-operations) [2025-02-24T10:07:55Z] <godog> set grafana thanos datasource interval to 30s - T371102

Change to set 30s interval is live, please let me know what you think @Michael @Krinkle

Thank you, that is helpful! While it is still necessary to set the min-interval at least to two minutes to reduce the noise coming from all the intermittent labels (instance, pod-id, etc.), at least now the data will not be suddenly "gone" when zooming in.

You are welcome.

I'm not sure about the exact use case or dashboard, though for short-lived labels like pod the recommendation is to essentially drop them, for example by summing across other dimensions (e.g. sum by (kubernetes_namespace) (...)) or summing without the labels (e.g. sum without (instance) (...)).

Yes. What I was getting at is that all these extra labels are being sampled at slightly different points in time so that when naively summing over them, I get something that looks like a metric with per-second resolution:

image.png (445×1 px, 30 KB)

But, as I understand it, that is just noise coming from these extra labels, while the underlying sampling interval is still 60s. And so only a 2m interval would start to give us meaningful information beyond the noise from summing over a bunch of subsets of the same metric.

Thank you for clarifying, now I get what you mean!

fgiunchedi claimed this task.

I'm optimistically resolving; please reach out @Michael and @Krinkle if something is amiss

Prometheus still feels a bit more sluggish to me for long time ranges (e.g. 30 sec for displaying last 30 days vs. <10 sec for Grafana). Definitely much better than before though (when it took minutes and usually a few metrics just timed out).

Would you mind sharing the URLs you are looking at in this case?

The Prometheus version of our dashboard is https://grafana.wikimedia.org/d/dea70b0f-cb36-4c85-8798-15781cb2c14a/authentication-metrics-prometheus?orgId=1
The Grafana version (ATM - soon to be replaced with the Prometheus version) is https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1

Ok, that explains it; the former dashboard should be much faster now after https://phabricator.wikimedia.org/T390672#10717790

I was too hasty in resolving this, we're missing max_source_resolution=auto for the thanos datasource

Change #1135948 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: set max_source_resolution=auto for thanos ds

https://gerrit.wikimedia.org/r/1135948

I'll be prioritizing the thanos upgrade in T383966: Upgrade Thanos to 0.38.0 which according to https://github.com/thanos-io/thanos/pull/7012 should be able to infer the source resolution by itself

@Krinkle would you mind trying again? We're running on Thanos 0.38 now, and I was able to get 1y of data from the Thanos datasource for this query: sum(irate(varnish_resourceloader_resp{site=~"eqiad"}[$__rate_interval])) by (site). See also Explore at https://grafana.wikimedia.org/goto/o1Gy-gxHg

Please note that currently Thanos is not going to serve queries for timespans longer than 1y, to protect against memory explosion, though that's a limit we'll be revisiting.

This appears to be a short URL to the homepage. Trying it at https://grafana-rw.wikimedia.org/explore, I indeed get a result after approx. 20 seconds. However, that appears to be too slow because when I load the rest of the dashboard, all panels time out, including the panel that performs the above query.

https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&from=now-1y&to=now

Screenshot 2025-05-12 at 20.27.01.png (1×870 px, 68 KB)

Indeed, I spot-checked a few panels and I could see rate() arguments of e.g. 5m, whereas they should be $__rate_interval for Grafana to send the right aggregation, and in turn for Thanos to fetch downsampled data

Change #1135948 abandoned by Filippo Giunchedi:

[operations/puppet@production] grafana: set max_source_resolution=auto for thanos ds

Reason:

Not actively working on this

https://gerrit.wikimedia.org/r/1135948