
Grafana/Thanos serves 503s for long-time-window requests
Closed, Resolved (Public)

Description

I'm struggling to get good results from Thanos queries for large amounts of history, e.g. a "last 90 days" view on the Host overview dashboard:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-90d&to=now&var-server=mw1359&var-datasource=thanos&var-cluster=api_appserver

Screenshot_20200812_094528.png (950×1 px, 66 KB)

A few of the graph panels load, but only a few.

The error message shown on hover is: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>503 Service Unavailable</title> </head><body> <h1>Service Unavailable</h1> <p>The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.</p> </body></html>

This looks like a minimal boilerplate 503. I'm not sure exactly where it's coming from, although I suspect it is generated internally in Grafana after some failure in talking to Thanos.

Event Timeline

I have also seen that myself quite often. Getting data older than 30 days is very useful, especially for capacity planning.

From a first look, I believe this is a combination of factors: Prometheus currently struggles with long time range queries because compactions are turned off (for upload to Thanos), and there is not yet enough overlapping data in Thanos.

I'm not sure there's any immediate mitigation besides temporarily having Prometheus serve less data and Thanos more. That said, we're ~5 weeks away from having full overlap between Thanos and Prometheus data (cf. T260053).

Change 620656 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d

https://gerrit.wikimedia.org/r/620656

Change 620656 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d

https://gerrit.wikimedia.org/r/620656
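
As a rough illustration of what such a limit implies for a long query, the sketch below splits a requested time range at a 15-day cutoff: the recent part is what the Thanos sidecar / Prometheus would still answer, and the older part is left to the rest of Thanos. This is only a sketch under assumptions: the function and constant names are hypothetical and come from neither Thanos nor the puppet change, and only the 15d figure is taken from the patch title.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 15 days mirrors the limit named in the patch title. The names
# here are hypothetical and not taken from Thanos or the puppet change.
RECENT_WINDOW = timedelta(days=15)


def partition_range(start, end, now, recent_window=RECENT_WINDOW):
    """Split [start, end] at now - recent_window.

    Returns (older, recent); each is a (start, end) tuple, or None when
    the requested range does not reach into that part.
    """
    cutoff = now - recent_window
    older = (start, min(end, cutoff)) if start < cutoff else None
    recent = (max(start, cutoff), end) if end > cutoff else None
    return older, recent


# A "last 90 days" dashboard view, as in the task description.
now = datetime(2020, 8, 24, tzinfo=timezone.utc)
older, recent = partition_range(now - timedelta(days=90), now, now)
```

For the 90-day view, `older` covers the first 75 days and `recent` the last 15, so most of the range no longer depends on Prometheus.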

With the latest change I'm able to query data for the last 90d via Thanos on the host dashboard! Performance could still be better, and improvements are being made to Thanos for that. Namely, Thanos will be shipping the "query-frontend" component from Cortex, which splits a query and issues the parts concurrently (though this has not been released yet).
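
The split-and-run-concurrently idea can be sketched roughly as follows. This is not the query-frontend implementation, just a minimal Python illustration under assumptions: `split_range`, `query_split` and `fake_fetch` are hypothetical names, the one-day interval and worker count are arbitrary choices, and `fake_fetch` stands in for a real backend call.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone


def split_range(start, end, interval):
    """Return consecutive (start, end) sub-ranges covering [start, end)."""
    parts = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)
        parts.append((cursor, upper))
        cursor = upper
    return parts


def query_split(fetch, start, end, interval=timedelta(days=1), workers=4):
    """Run fetch(sub_start, sub_end) concurrently and merge in time order."""
    parts = split_range(start, end, interval)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so chunks stay chronological.
        chunks = pool.map(lambda part: fetch(*part), parts)
        return [sample for chunk in chunks for sample in chunk]


def fake_fetch(sub_start, sub_end):
    """Hypothetical stand-in for a backend call: one sample per sub-range."""
    return [(sub_start, 1.0)]


end = datetime(2020, 8, 24, tzinfo=timezone.utc)
samples = query_split(fake_fetch, end - timedelta(days=90), end)
```

Each sub-query touches only a small slice of history, so a single slow or failing slice does not have to hold up the whole 90-day request.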

fgiunchedi lowered the priority of this task from High to Medium. Aug 24 2020, 12:56 PM

Lowering priority as the main issue has been mitigated; there is still work to do to improve performance, though.

fgiunchedi claimed this task.

I'm resolving this task as the original issue is mitigated and queries work (albeit slowly). Please see the follow-up on further improvements/mitigations at T261281: Improve performance of Thanos (+ Prometheus).