As experienced in T260241 (Grafana/Thanos serves 503s for long-time-window requests), long-time-range queries via Thanos perform poorly, or don't return results at all.
This is due to a combination of factors, namely:
- Thanos sidecar requires Prometheus compaction to be turned off for block uploads to happen. With compaction off (i.e. each block in the Prometheus local TSDB covers only 2h), Prometheus has to open many blocks for long-range queries (e.g. >7d); this also affects the performance of dashboards not yet migrated to Thanos.
- For queries going through Thanos query there is no splitting/chunking of queries and no caching of results.
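For reference, the compaction-off requirement above corresponds to pinning the Prometheus min/max block durations to 2h, so that blocks never get compacted past the size the sidecar uploads. This is a sketch with illustrative paths; the exact flag set in our puppetization may differ:

```shell
# Sketch: sidecar-compatible Prometheus invocation with local compaction
# effectively disabled: min == max block duration means every block stays
# at the 2h size that thanos sidecar uploads. Path is illustrative.
prometheus \
  --storage.tsdb.path=/srv/prometheus/data \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h
```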
Possible mitigations / fixes:
- Upgrade Prometheus; it's unclear if it'll make a difference, but we might want to move to 2.20 (from 2.7.1 now)
- To turn compaction back on in Prometheus, one solution would be to flip things around and remote-write (i.e. push) from Prometheus to Thanos (using thanos receive instead of the sidecar). The main drawback is that if Prometheus can't send metrics to Thanos for more than two hours, that data won't be written to Thanos at all.
- For slow Thanos queries, Thanos will ship a query-frontend component, which will help by splitting queries and caching results
- Re-enable compactions fleetwide on Prometheus
- Upgrade Thanos to >= 0.16.0 to get query-frontend component
- Deploy query-frontend to benefit from query splitting and results caching
- Add dashboards and alerting for query-frontend
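On the Prometheus side, the remote-write alternative mentioned above would look roughly like the fragment below. The receiver hostname and port are hypothetical; thanos receive accepts remote-write on its /api/v1/receive endpoint:

```yaml
# prometheus.yml fragment (sketch): push samples to thanos receive
# instead of having the sidecar upload blocks. Hostname is hypothetical.
remote_write:
  - url: http://thanos-receive.example.org:19291/api/v1/receive
```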
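Once Thanos is at >= 0.16.0, deploying query-frontend in front of the existing querier could look roughly like this. Addresses and cache sizing are illustrative, and flag names are from upstream Thanos, so they should be checked against the version we actually deploy:

```shell
# Sketch: query-frontend splits long-range queries into 24h sub-queries
# and caches results in an in-memory cache, proxying to the existing
# thanos query endpoint (hypothetical address).
thanos query-frontend \
  --http-address=0.0.0.0:10902 \
  --query-frontend.downstream-url=http://thanos-query.example.org:10902 \
  --query-range.split-interval=24h \
  --query-range.response-cache-config="type: IN-MEMORY
config:
  max_size: 512MB"
```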