As experienced in T260241 (Grafana/Thanos serves 503s for long-time-window requests), long-time-range queries via Thanos perform poorly, or don't return results at all.
This is due to a combination of factors, namely:
- Thanos sidecar requires Prometheus compaction to be turned off for block uploads to happen. With compaction off (i.e. each block in the Prometheus local TSDB covers only 2h), Prometheus has to open many blocks for long-range queries (e.g. >7d); this also affects the performance of dashboards not yet migrated to Thanos.
- For queries going through Thanos query there is no splitting/chunking of queries and no caching of results.
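For reference, the compaction-off requirement above corresponds to pinning the Prometheus min/max block durations to 2h, so that blocks never get compacted past the size the sidecar uploads. This is a sketch with illustrative paths; the exact flag set in our puppetization may differ:

```shell
# Sketch: sidecar-compatible Prometheus invocation with local compaction
# effectively disabled: min == max block duration means every block stays
# at the 2h size that thanos sidecar uploads. Path is illustrative.
prometheus \
  --storage.tsdb.path=/srv/prometheus/data \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h
```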
Possible mitigations / fixes:
- Upgrade Prometheus; it's unclear if it'll make a difference, but we might want to move to 2.20 (from 2.7.1 now)
- To turn compaction back on in Prometheus, one solution would be to flip things around and remote-write (i.e. push) from Prometheus to Thanos (using thanos receive instead of the sidecar). The main drawback is that if Prometheus can't send metrics to Thanos for more than two hours, that data won't be written to Thanos at all.
- For slow Thanos queries, Thanos will ship a query-frontend component, which will help by splitting queries and caching results
- Re-enable compactions fleetwide on Prometheus
- Upgrade Thanos to >= 0.16.0 to get query-frontend component
- Deploy query-frontend to benefit from query splitting and results caching
- Add dashboards and alerting for query-frontend
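On the Prometheus side, the remote-write alternative mentioned above would look roughly like the fragment below. The receiver hostname and port are hypothetical; thanos receive accepts remote-write on its /api/v1/receive endpoint:

```yaml
# prometheus.yml fragment (sketch): push samples to thanos receive
# instead of having the sidecar upload blocks. Hostname is hypothetical.
remote_write:
  - url: http://thanos-receive.example.org:19291/api/v1/receive
```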
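Once Thanos is at >= 0.16.0, deploying query-frontend in front of the existing querier could look roughly like this. Addresses and cache sizing are illustrative, and flag names are from upstream Thanos, so they should be checked against the version we actually deploy:

```shell
# Sketch: query-frontend splits long-range queries into 24h sub-queries
# and caches results in an in-memory cache, proxying to the existing
# thanos query endpoint (hypothetical address).
thanos query-frontend \
  --http-address=0.0.0.0:10902 \
  --query-frontend.downstream-url=http://thanos-query.example.org:10902 \
  --query-range.split-interval=24h \
  --query-range.response-cache-config="type: IN-MEMORY
config:
  max_size: 512MB"
```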