I'm not sure whether this is a bug, a feature request, or a "how to" documentation request.
What I'm trying to do:
Analyze data in Grafana for recent periods (e.g. the past 2 days), past periods (e.g. August 2023), and high-level overviews (e.g. the past 2 years).
For example:
- https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s
- https://grafana.wikimedia.org/d/000000066/resourceloader
Actual:
Querying "7 days" or more takes a few seconds but generally works.
Querying "last 30 days" starts to cause some panels to time out.
Querying "last 90 days" times out every time.
Querying "last 6 months" times out every time.
Querying "1 August 2023 - 30 August 2023" shows No data.
Querying "1 Jan 2023 - 30 Jan 2023" shows No data.
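For context on why long ranges could be expensive if served from raw data, a back-of-the-envelope sketch (the 1-minute scrape interval is an assumption; actual intervals vary):

```python
# Rough estimate of how many raw samples per series a range query must
# read if no downsampled blocks are used. Assumes a 1-minute scrape
# interval, which is an assumption for illustration only.
SCRAPE_INTERVAL_S = 60
DAY = 86_400

def raw_samples(range_seconds: int, series: int = 1) -> int:
    """Raw samples touched by a query over the given range."""
    return series * range_seconds // SCRAPE_INTERVAL_S

print(raw_samples(7 * DAY))    # 10080
print(raw_samples(90 * DAY))   # 129600
```

At 90 days the per-series sample count is roughly 13x that of a 7-day query, which would be consistent with the timeouts above if downsampled blocks are never consulted.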
Expected:
From chatting with O11y folks, my understanding is that the Thanos datasource is meant to transparently switch between raw recent data and archived long-retention data for queries reaching back more than a year. In practice, this has not worked for me.
Are there criteria or limitations in how a dashboard query should be written in order for Thanos to consider archived data?
I've not been able to deduce what the criteria would be, but I suspect some exist. On occasion, after a lot of mangling of the query, I have seen a few disparate data points show up from more than a year ago. I could not get it to plot a continuous (hourly) line, but it felt like there was a way; I just couldn't find it.
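One thing that might be worth testing is whether the Thanos query API serves downsampled blocks when asked explicitly. Thanos' `query_range` endpoint accepts a `max_source_resolution` parameter (e.g. `5m` or `1h`) that controls which resolution of blocks it may read. A minimal sketch of such a request, bypassing Grafana; the host name and metric are placeholders:

```python
# Hypothetical sketch: build a Thanos query_range request with an explicit
# max_source_resolution, to check whether downsampled (1h) blocks are
# served for an old time range. The host and metric name are assumptions.
from urllib.parse import urlencode

params = {
    "query": "sum(rate(http_requests_total[1h]))",
    "start": "2023-08-01T00:00:00Z",
    "end": "2023-08-30T00:00:00Z",
    "step": "3600",
    # "auto" is the default; "1h" forces consideration of the coarsest blocks.
    "max_source_resolution": "1h",
}
url = "https://thanos.example.org/api/v1/query_range?" + urlencode(params)
print(url)
```

If this returns data where the Grafana panel shows "No data", that would point at how the datasource sets (or fails to set) the resolution parameter rather than at the queries themselves.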
With Graphite-backed data this works naturally and responds nearly instantly no matter the query. This makes intuitive sense, given that Graphite gradually reduces resolution of the data, and most queries don't specify a preferred resolution, so it ends up with a reasonable default that works.
- https://grafana.wikimedia.org/d/QLtC93rMz/backend-pageview-timing?from=now-2y&to=now
- https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?from=now-6M&to=now
In addition to reducing resolution, Graphite's aggregation generally also throws away all statistical meaning for timing metrics (averages of unweighted averages, etc.). It's terrible, and I've blogged about it. I love and am sold on the Prometheus histogram model, with its simplicity of distilling everything down to counters that can be safely aggregated and work independently of any specific resolution (e.g. T175087: Create a navtiming processor for Prometheus).
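To illustrate the "average of unweighted averages" problem, and why sum/count counters (as in Prometheus histograms) aggregate safely, a small example with made-up numbers:

```python
# Why averaging per-series averages destroys timing statistics, while
# sum/count counters aggregate correctly. Sample values are made up.
samples_a = [10.0] * 100   # 100 fast requests, 10 ms each
samples_b = [500.0] * 2    # 2 slow requests, 500 ms each

# Graphite-style: average the per-series averages, unweighted.
avg_of_avgs = (sum(samples_a) / len(samples_a)
               + sum(samples_b) / len(samples_b)) / 2

# Prometheus-style: keep sum and count as counters, aggregate, then divide.
total_sum = sum(samples_a) + sum(samples_b)
total_count = len(samples_a) + len(samples_b)
true_mean = total_sum / total_count

print(avg_of_avgs)           # 255.0
print(round(true_mean, 1))   # 19.6
```

The counters can be summed across instances, time windows, or resolutions without changing the answer, which is what makes downsampling statistically safe in this model.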
Other information
I've read these tasks:
- T311690: Shorten Thanos retention
- T351927: Decide and tweak Thanos retention
- T357747: Capacity planning/estimation for Thanos
And: