
Grafana/Thanos serves 503s for long-time-window requests
Closed, Resolved · Public

Description

I'm struggling to get good results from Thanos queries for large amounts of history, e.g. a "last 90 days" view on the Host overview dashboard:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-90d&to=now&var-server=mw1359&var-datasource=thanos&var-cluster=api_appserver

A few of the graph panels load, but only a few.

The error message shown on hover is: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>503 Service Unavailable</title> </head><body> <h1>Service Unavailable</h1> <p>The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.</p> </body></html>. This looks like a minimal boilerplate 503. I'm not sure exactly where it's coming from, although I suspect it is generated internally by Grafana after some failure in talking to Thanos.

Event Timeline

CDanis created this task. Aug 12 2020, 1:51 PM
Restricted Application added a subscriber: Aklapper. Aug 12 2020, 1:51 PM

I have also seen that myself quite often. Getting data older than 30 days is very useful, especially for capacity planning.

jijiki added a subscriber: jijiki. Aug 12 2020, 1:56 PM

From a first look at this, I believe it is a combination of two factors: Prometheus currently struggles with long-time-range queries because compactions are turned off (for upload to Thanos), and there is not yet enough overlapping data in Thanos.

I'm not sure there's any immediate mitigation, though, besides temporarily having Prometheus serve less data and Thanos more. We're ~5 weeks away from having full overlap between Thanos and Prometheus data (cf. T260053).

fgiunchedi triaged this task as High priority. Aug 17 2020, 8:34 AM

Change 620656 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d

https://gerrit.wikimedia.org/r/620656

Change 620656 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d

https://gerrit.wikimedia.org/r/620656
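
As a rough illustration of what a 15-day limit means for a long dashboard query, here is a minimal Python sketch. It is a simplified model, not the content of the patch and not how Thanos routes queries internally: the names `route_range`, `prometheus_sidecar`, and `thanos_long_term` are hypothetical, and the only value taken from this task is the 15d window from the patch title.

```python
from typing import Dict, Tuple

DAY = 24 * 60 * 60
# Window taken from the patch title ("last 15d"); everything else is assumed.
SIDECAR_WINDOW = 15 * DAY


def route_range(
    start: int, end: int, now: int, window: int = SIDECAR_WINDOW
) -> Dict[str, Tuple[int, int]]:
    """Split [start, end) at the cutoff `now - window`.

    Data older than the cutoff is attributed to a hypothetical long-term
    backend, newer data to the Prometheus sidecar. Timestamps are Unix seconds.
    """
    cutoff = now - window
    routes: Dict[str, Tuple[int, int]] = {}
    older_end = min(end, cutoff)
    if start < older_end:
        routes["thanos_long_term"] = (start, older_end)
    newer_start = max(start, cutoff)
    if newer_start < end:
        routes["prometheus_sidecar"] = (newer_start, end)
    return routes
```

Under this model, a "last 90 days" query would touch Prometheus only for the most recent 15 days, with the remaining 75 days coming from Thanos.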

With the latest change I'm able to query data for the last 90d via Thanos on the host dashboard! Performance could still be better, and improvements to Thanos are underway for that. Namely, Thanos will be shipping the "query-frontend" component from Cortex, which splits a query and issues the parts concurrently (though this has not been released yet).
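
The splitting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the query-frontend implementation: `split_range`, `query_split`, and the `fetch` callable are hypothetical names, and the one-day split interval is an assumed default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Sample = Tuple[int, float]  # (Unix timestamp in seconds, value)

DAY = 24 * 60 * 60


def split_range(start: int, end: int, interval: int = DAY) -> List[Tuple[int, int]]:
    """Cut [start, end) into consecutive half-open sub-ranges.

    Boundaries are aligned to multiples of `interval`, so the first and
    last sub-ranges may be shorter than the rest.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ranges: List[Tuple[int, int]] = []
    cursor = start
    while cursor < end:
        boundary = (cursor // interval + 1) * interval
        nxt = min(boundary, end)
        ranges.append((cursor, nxt))
        cursor = nxt
    return ranges


def query_split(
    fetch: Callable[[int, int], List[Sample]],
    start: int,
    end: int,
    interval: int = DAY,
    max_workers: int = 8,
) -> List[Sample]:
    """Run `fetch` on each sub-range concurrently and concatenate the results."""
    parts = split_range(start, end, interval)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in input order, so output stays time-ordered.
        chunks = list(pool.map(lambda r: fetch(r[0], r[1]), parts))
    return [sample for chunk in chunks for sample in chunk]
```

In this sketch, a 90-day range becomes 90 one-day sub-queries that can run in parallel, instead of one large query that may time out.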

fgiunchedi lowered the priority of this task from High to Medium. Mon, Aug 24, 12:56 PM

Lowering priority as the main issue has been mitigated. There is still work to do to improve performance, though.

fgiunchedi closed this task as Resolved. Wed, Aug 26, 9:43 AM
fgiunchedi claimed this task.

I'm resolving this task as the original issue is mitigated and queries work (albeit slowly). Please see the follow-up on further improvements/mitigations at T261281: Improve performance of Thanos (+ Prometheus)