Improve performance of Thanos (+ Prometheus)
Closed, Resolved, Public

Description

As experienced in T260241 ("Grafana/Thanos serves 503s for long-time-window requests"), long-time-range queries via Thanos perform poorly (or don't return results at all).

This is due to a combination of factors, namely:

  1. Thanos sidecar requires Prometheus compaction to be turned off for block uploads to happen. With compaction off (i.e. each block in the Prometheus local TSDB covers 2h), Prometheus has to read a lot of blocks for long-range queries (e.g. >7d), and this also affects the performance of dashboards not yet migrated to Thanos.
  2. For queries going through Thanos query, there's no splitting/chunking of queries and no caching of results
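
To put the first factor in numbers, here is a rough back-of-the-envelope sketch. The function name and the 54h compacted block span are illustrative assumptions, not values taken from our configuration; the only figures from this task are the 2h block size and the >7d query range.

```python
import math
from datetime import timedelta


def blocks_to_cover(query_range: timedelta, block_span: timedelta) -> int:
    """Number of fixed-size blocks needed to cover a query range that
    starts on a block boundary (an unaligned range may touch one more)."""
    return math.ceil(query_range / block_span)


# With compaction off, every local TSDB block covers 2h.
uncompacted = blocks_to_cover(timedelta(days=7), timedelta(hours=2))

# Hypothetical compacted block span, chosen only for comparison.
compacted = blocks_to_cover(timedelta(days=7), timedelta(hours=54))

print(uncompacted, compacted)
```

Under these assumptions a 7d query reads 84 uncompacted blocks, versus a handful once blocks are compacted.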

Mitigations / fixes possible

  1. Upgrade Prometheus. It's unclear whether it'll make a difference, but we might want to move to 2.20 (from 2.7.1 now)
  2. To turn compactions back on in Prometheus, one solution would be to flip things around and remote write (i.e. push) from Prometheus to Thanos (using thanos receive instead). The main drawback I see is that if Prometheus for some reason can't send metrics to Thanos for more than two hours, that data won't be written to Thanos at all.
  3. For slow Thanos queries, Thanos will ship a query frontend, which will help by splitting queries and caching results

Plan

  • Re-enable compactions fleetwide on Prometheus
  • Upgrade Thanos to >= 0.16.0 to get query-frontend component
  • Deploy query-frontend to benefit from query splitting and results caching
  • Add dashboards and alerting for query-frontend
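
As an illustration of what query splitting and results caching buy us, here is a minimal sketch. It is not how Thanos query-frontend is implemented: the class and function names, the one-day split interval and the backend callable are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

DAY = 24 * 60 * 60  # assumed split interval, in seconds


def split_range(start: int, end: int, interval: int = DAY) -> List[Tuple[int, int]]:
    """Split [start, end) into sub-ranges aligned to multiples of `interval`."""
    parts = []
    cursor = start
    while cursor < end:
        boundary = (cursor // interval + 1) * interval
        nxt = min(boundary, end)
        parts.append((cursor, nxt))
        cursor = nxt
    return parts


class CachingFrontend:
    """Hypothetical frontend: splits a range query and caches each sub-result."""

    def __init__(self, backend: Callable[[str, int, int], list]):
        self.backend = backend
        self.cache: Dict[Tuple[str, Tuple[int, int]], list] = {}
        self.backend_calls = 0

    def query_range(self, query: str, start: int, end: int) -> list:
        results = []
        for sub in split_range(start, end):
            key = (query, sub)
            if key not in self.cache:
                # Only sub-ranges not seen before reach the backend.
                self.backend_calls += 1
                self.cache[key] = self.backend(query, sub[0], sub[1])
            results.extend(self.cache[key])
        return results


def fake_backend(query: str, start: int, end: int) -> list:
    """Stand-in for the real querier; returns one marker per sub-range."""
    return [(start, end)]


frontend = CachingFrontend(fake_backend)
frontend.query_range("up", 0, 3 * DAY)    # three sub-queries hit the backend
frontend.query_range("up", DAY, 4 * DAY)  # only the newest day is a cache miss
```

Because sub-ranges are aligned to the split interval, two overlapping dashboard queries share cache entries for the days they have in common.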

Event Timeline

Change 633971 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: selectively enable Prometheus compaction

https://gerrit.wikimedia.org/r/633971

Change 633972 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: re-enable compaction for prometheus[12]003

https://gerrit.wikimedia.org/r/633972

The last two patches re-enable compaction in Prometheus and leave the rest unchanged. I asked upstream what the recommended solution is in our case, and simply re-enabling compaction should work as expected.

Change 633971 merged by Filippo Giunchedi:
[operations/puppet@production] profile: selectively enable Prometheus compaction

https://gerrit.wikimedia.org/r/633971

Change 634198 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add thanos-bucket-web explorer

https://gerrit.wikimedia.org/r/634198

Change 634199 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add thanos bucket-web to frontend

https://gerrit.wikimedia.org/r/634199

Change 634198 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add thanos-bucket-web explorer

https://gerrit.wikimedia.org/r/634198

Change 634199 merged by Filippo Giunchedi:
[operations/puppet@production] role: add thanos bucket-web to frontend

https://gerrit.wikimedia.org/r/634199

Change 633972 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: re-enable compaction for prometheus[12]003

https://gerrit.wikimedia.org/r/633972

Mentioned in SAL (#wikimedia-operations) [2020-10-19T08:01:12Z] <godog> re-enable compaction for prometheus[12]003 - T261281

Change 635249 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: disable compaction check in sidecar

https://gerrit.wikimedia.org/r/635249

Change 635249 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: disable compaction check in sidecar

https://gerrit.wikimedia.org/r/635249

Change 636362 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: re-enable compaction by default

https://gerrit.wikimedia.org/r/636362

Mentioned in SAL (#wikimedia-operations) [2020-10-27T08:15:48Z] <godog> update thanos-fe2002 to thanos 0.16.0 - T261281

Mentioned in SAL (#wikimedia-operations) [2020-10-28T07:40:40Z] <godog> update thanos-fe1002 to thanos 0.16.0 - T261281

Mentioned in SAL (#wikimedia-operations) [2020-11-02T08:40:44Z] <godog> upgrade thanos to 0.16 in codfw/eqiad - T261281

Change 638036 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: use systemd overrides for query/store/compact

https://gerrit.wikimedia.org/r/638036

Mentioned in SAL (#wikimedia-operations) [2020-11-02T11:06:00Z] <godog> upgrade thanos to 0.16.0 on prometheus hosts - T261281

Change 638110 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: configure memcached size via hiera

https://gerrit.wikimedia.org/r/638110

Change 638119 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add query-frontend

https://gerrit.wikimedia.org/r/638119

Change 638120 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos query-frontend jobs

https://gerrit.wikimedia.org/r/638120

Change 638121 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add query_frontend to thanos frontend

https://gerrit.wikimedia.org/r/638121

Change 638122 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] pontoon: use frontends for query_frontend memcache

https://gerrit.wikimedia.org/r/638122

Change 638110 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: configure memcached size via hiera

https://gerrit.wikimedia.org/r/638110

Change 638036 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: use systemd overrides for query/store/compact

https://gerrit.wikimedia.org/r/638036

Change 636362 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: re-enable compaction by default

https://gerrit.wikimedia.org/r/636362

Mentioned in SAL (#wikimedia-operations) [2020-11-03T08:32:33Z] <godog> Prometheus re-enable compactions - T261281

Change 638119 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add query-frontend

https://gerrit.wikimedia.org/r/638119

Change 638122 merged by Filippo Giunchedi:
[operations/puppet@production] pontoon: use frontends for query_frontend memcache

https://gerrit.wikimedia.org/r/638122

Change 638120 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos query-frontend jobs

https://gerrit.wikimedia.org/r/638120

Change 638121 merged by Filippo Giunchedi:
[operations/puppet@production] role: add query_frontend to thanos frontend

https://gerrit.wikimedia.org/r/638121

Mentioned in SAL (#wikimedia-operations) [2020-11-09T09:06:38Z] <godog> enable thanos query-frontend on thanos-fe hosts - T261281

Status update: query-frontend is serving queries with in-memory caching (1GB, to start with)
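
For readers unfamiliar with size-bounded in-memory caching, here is a minimal sketch of the idea. It is an assumption-laden illustration (a least-recently-used policy with a byte budget, and a hypothetical `BoundedCache` class), not a description of how query-frontend manages its cache internally.

```python
from collections import OrderedDict
from typing import Optional


class BoundedCache:
    """Hypothetical LRU cache that evicts entries once a byte budget is exceeded."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self.items.get(key)
        if value is not None:
            # Mark the entry as most recently used.
            self.items.move_to_end(key)
        return value

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.max_bytes:
            return  # larger than the whole budget, never cached
        old = self.items.pop(key, None)
        if old is not None:
            self.used -= len(old)
        self.items[key] = value
        self.used += len(value)
        # Evict least recently used entries until we fit the budget again.
        while self.used > self.max_bytes:
            _, evicted = self.items.popitem(last=False)
            self.used -= len(evicted)


cache = BoundedCache(max_bytes=10)  # tiny budget for the example; ours is 1GB
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")           # "a" becomes the most recently used entry
cache.put("c", b"cccc")  # over budget, so "b" is evicted
```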

Change 641729 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add query-frontend alerts

https://gerrit.wikimedia.org/r/641729

Change 641729 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add query-frontend alerts

https://gerrit.wikimedia.org/r/641729

fgiunchedi claimed this task.
fgiunchedi updated the task description.

This is complete! I've left out trying memcache, since the in-memory caching seems to work well for now.