Deploy Thanos (Prometheus long-term storage) stateful components
Closed, Resolved, Public

Description

We have Thanos hardware coming in (T249538 + T249539) and will need to deploy Thanos on it, more specifically:

  • Thanos store gateway on frontends: this component exposes the underlying object storage as Thanos' standard StoreAPI. The query component (T233956: Deploy Thanos (long-term storage) stateless components: sidecar and query) will know about the store(s) and query them too.
  • Set up the object storage (i.e. Swift). This is the standard frontend/backend Swift deployment we already run in production. Puppet changes to the existing Swift classes will be required to cater for additional clusters.
    • Since this is a "green field" project, we should experiment with a multi-region Swift deployment. In other words, the Swift cluster will span eqiad and codfw, with 4 copies of the data and read affinity.
    • The object storage will need to be reachable by the Thanos sidecar too (running on Prometheus hosts), thus the authenticated API will need to be available over TLS.
  • Set up the Thanos compactor. This component needs to run as a singleton and requires access to the object storage.
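
To illustrate what "read affinity" means for a cluster spanning two sites, here is a minimal sketch that orders replicas by a region priority map, modelled on the `r<region>=<priority>` syntax of Swift's `read_affinity` proxy setting (lower number is preferred). The region numbers (1 for eqiad, 2 for codfw), the replica names, and the even two-per-site split of the 4 copies are assumptions for illustration only, not the deployed configuration. The parser handles region-only entries and ignores zones.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    name: str    # hypothetical replica label, not a real hostname
    region: int  # assumed numbering: 1 = eqiad, 2 = codfw


def parse_read_affinity(setting: str) -> Dict[int, int]:
    """Parse a region-only 'r1=100, r2=200' string into {region: priority}."""
    priorities: Dict[int, int] = {}
    for item in setting.split(","):
        key, _, value = item.strip().partition("=")
        priorities[int(key.lstrip("r"))] = int(value)
    return priorities


def order_for_read(replicas: List[Replica], priorities: Dict[int, int]) -> List[Replica]:
    """Put replicas in preferred regions first; unlisted regions go last."""
    fallback = max(priorities.values(), default=0) + 1
    # sorted() is stable, so replicas in the same region keep their order.
    return sorted(replicas, key=lambda r: priorities.get(r.region, fallback))


# A proxy in eqiad prefers region 1 and falls back to region 2.
replicas = [
    Replica("codfw-a", 2),
    Replica("eqiad-a", 1),
    Replica("codfw-b", 2),
    Replica("eqiad-b", 1),
]
ordered = order_for_read(replicas, parse_read_affinity("r1=100, r2=200"))
print([r.name for r in ordered])
```

A proxy in codfw would use the mirrored priorities (`r2=100, r1=200`), so each site reads from its local copies first and only crosses sites when the local ones are unavailable.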

Reference diagrams (on gdocs now, to be published on wikitech once finalized)

Logical: https://docs.google.com/drawings/d/1FhE7_vBtqCao2qnDKUe9rNdNnoHT9NU-Fg815I-MIf0/edit?usp=sharing
Deployment: https://docs.google.com/drawings/d/1IBSrreH8UPXKMbRLF_-hW3kJiQ-gbTgQCXumOh6oN04/edit?usp=sharing

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -0
operations/puppet  production  +4 -2
operations/puppet  production  +3 -0
operations/puppet  production  +15 -1
operations/puppet  production  +5 -1
operations/puppet  production  +3 -3
operations/puppet  production  +5 -6
operations/puppet  production  +4 -0
operations/puppet  production  +2 -0
operations/puppet  production  +7 -0
operations/puppet  production  +1 -0
operations/dns     master      +9 -0
operations/puppet  production  +3 -0
operations/puppet  production  +9 -7
operations/dns     master      +4 -0
operations/puppet  production  +10 -0
operations/puppet  production  +10 -0
operations/puppet  production  +5 -1
operations/dns     master      +8 -0
operations/puppet  production  +8 -27
operations/puppet  production  +2 -0
operations/puppet  production  +2 -2
operations/puppet  production  +9 -1
operations/puppet  production  +10 -0
operations/puppet  production  +299 -0
operations/puppet  production  +4 -0
operations/puppet  production  +299 -0
operations/puppet  production  +3 -9
operations/puppet  production  +5 -12
operations/puppet  production  +10 -0
operations/puppet  production  +15 -0
operations/puppet  production  +55 -43
operations/puppet  production  +8 -11
operations/puppet  production  +11 -1
operations/puppet  production  +47 -1
operations/puppet  production  +11 -1
operations/puppet  production  +1 -1
operations/puppet  production  +8 -1
operations/puppet  production  +3 -1
operations/puppet  production  +117 -0
operations/puppet  production  +27 -0
operations/puppet  production  +156 -2
operations/dns     master      +2 -0
operations/puppet  production  +3 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +3 -5
operations/puppet  production  +1 -1
operations/puppet  production  +3 -0
operations/puppet  production  +34 -0
operations/puppet  production  +4 -3
operations/dns     master      +2 -0
operations/puppet  production  +36 -0
operations/puppet  production  +0 -5
operations/puppet  production  +10 -0
operations/puppet  production  +1 -1
operations/puppet  production  +264 -0
operations/puppet  production  +296 -0
operations/puppet  production  +3 -1
operations/dns     master      +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 598480 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Add thanos-swift discovery records

https://gerrit.wikimedia.org/r/598480

Change 598480 merged by Filippo Giunchedi:
[operations/dns@master] Add thanos-swift discovery records

https://gerrit.wikimedia.org/r/598480

Change 598698 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add store and compact jobs

https://gerrit.wikimedia.org/r/598698

Change 598711 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: allow setting min/max block duration

https://gerrit.wikimedia.org/r/598711

Change 598712 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable Thanos upload for Prometheus k8s-staging

https://gerrit.wikimedia.org/r/598712

Change 597019 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add Store Gateway

https://gerrit.wikimedia.org/r/597019

Change 598698 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add store and compact jobs

https://gerrit.wikimedia.org/r/598698

Change 597072 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add thanos::compact

https://gerrit.wikimedia.org/r/597072

Change 598956 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: don't enable compact when not needed

https://gerrit.wikimedia.org/r/598956

Change 598956 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: don't enable compact when not needed

https://gerrit.wikimedia.org/r/598956

Change 598971 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: monitor compact metrics only for enabled host

https://gerrit.wikimedia.org/r/598971

Change 598971 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: monitor compact metrics only for enabled host

https://gerrit.wikimedia.org/r/598971

Change 598979 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: use default to https for dispersion

https://gerrit.wikimedia.org/r/598979

Change 598979 merged by Filippo Giunchedi:
[operations/puppet@production] swift: use default to https for dispersion

https://gerrit.wikimedia.org/r/598979

Change 598711 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: allow setting min/max block duration

https://gerrit.wikimedia.org/r/598711

Change 597071 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add objstore support to sidecar

https://gerrit.wikimedia.org/r/597071

Change 598712 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable Thanos upload for Prometheus k8s-staging

https://gerrit.wikimedia.org/r/598712

Mentioned in SAL (#wikimedia-operations) [2020-05-28T13:49:45Z] <godog> roll-restart prometheus k8s-staging to enable thanos upload - T252186

Change 599342 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: move analytics to profile

https://gerrit.wikimedia.org/r/599342

Mentioned in SAL (#wikimedia-operations) [2020-05-29T12:15:13Z] <godog> roll-restart to upgrade thanos to 0.13.0rc0 - T252186 T233956

Change 599342 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: move analytics to profile

https://gerrit.wikimedia.org/r/599342

Change 601326 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable Thanos upload for analytics

https://gerrit.wikimedia.org/r/601326

Change 601346 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: refactor stats_reporter into a profile

https://gerrit.wikimedia.org/r/601346

Change 601346 merged by Filippo Giunchedi:
[operations/puppet@production] swift: refactor stats_reporter into a profile

https://gerrit.wikimedia.org/r/601346

Change 601657 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: enable swift stats reporting

https://gerrit.wikimedia.org/r/601657

Change 601657 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: enable swift stats reporting

https://gerrit.wikimedia.org/r/601657

Change 601326 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable Thanos upload for analytics

https://gerrit.wikimedia.org/r/601326

Change 602082 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP thanos: add alerts

https://gerrit.wikimedia.org/r/602082

Change 602398 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: move services instance to profile

https://gerrit.wikimedia.org/r/602398

Change 602401 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: move global instance to profile

https://gerrit.wikimedia.org/r/602401

Change 602409 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: merge ops instance role into profile

https://gerrit.wikimedia.org/r/602409

Change 602398 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: move services instance to profile

https://gerrit.wikimedia.org/r/602398

Change 602401 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: move global instance to profile

https://gerrit.wikimedia.org/r/602401

Change 602082 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add alerts for Thanos components

https://gerrit.wikimedia.org/r/602082

Change 602633 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add alerts for Thanos components

https://gerrit.wikimedia.org/r/602633

Change 602669 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: bail on check_command containing newlines

https://gerrit.wikimedia.org/r/602669

Change 602715 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable Thanos upload for k8s

https://gerrit.wikimedia.org/r/602715

Change 602716 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable Thanos upload for services

https://gerrit.wikimedia.org/r/602716

Change 602717 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable Thanos upload for ops in esams

https://gerrit.wikimedia.org/r/602717

Change 602669 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: bail on check_command containing newlines

https://gerrit.wikimedia.org/r/602669

Change 602633 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add alerts for Thanos components

https://gerrit.wikimedia.org/r/602633

Change 602716 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable Thanos upload for services

https://gerrit.wikimedia.org/r/602716

Change 603394 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: enable bulk and slo middlewares for s3api compat

https://gerrit.wikimedia.org/r/603394

Change 603394 merged by Filippo Giunchedi:
[operations/puppet@production] swift: enable bulk and slo middlewares for s3api compat

https://gerrit.wikimedia.org/r/603394

Change 603441 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Fix Thanos compact alert threshold

https://gerrit.wikimedia.org/r/603441

Change 603441 merged by Filippo Giunchedi:
[operations/puppet@production] Fix Thanos compact alert threshold

https://gerrit.wikimedia.org/r/603441

Change 603457 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: allow NaN for Thanos compact/query errors

https://gerrit.wikimedia.org/r/603457

Change 603457 merged by Filippo Giunchedi:
[operations/puppet@production] profile: allow NaN for Thanos compact/query errors

https://gerrit.wikimedia.org/r/603457

Change 602409 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: merge ops instance role into profile

https://gerrit.wikimedia.org/r/602409

Change 604314 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Add v6 for thanos-be1*

https://gerrit.wikimedia.org/r/604314

Change 604314 merged by Filippo Giunchedi:
[operations/dns@master] Add v6 for thanos-be1*

https://gerrit.wikimedia.org/r/604314

Mentioned in SAL (#wikimedia-operations) [2020-06-10T09:31:06Z] <godog> configure thanos-be1* HDDs as raid0 - T252186

Change 604383 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: add thanos-be1* to thanos::backend

https://gerrit.wikimedia.org/r/604383

Change 604383 merged by Filippo Giunchedi:
[operations/puppet@production] site: add thanos-be1* to thanos::backend

https://gerrit.wikimedia.org/r/604383

Change 602715 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable Thanos upload for k8s

https://gerrit.wikimedia.org/r/602715

Change 602717 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable Thanos upload for ops in esams

https://gerrit.wikimedia.org/r/602717

Change 604623 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: remove swift-container-sharder unit

https://gerrit.wikimedia.org/r/604623

Change 604664 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] templates: add PTR for thanos-swift / thanos-query

https://gerrit.wikimedia.org/r/604664

Change 604664 merged by Filippo Giunchedi:
[operations/dns@master] templates: add PTR for thanos-swift / thanos-query

https://gerrit.wikimedia.org/r/604664

Change 605177 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable thanos upload in ops eqsin/ulsfo/codfw

https://gerrit.wikimedia.org/r/605177

Change 605178 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable thanos upload in ops eqiad

https://gerrit.wikimedia.org/r/605178

Change 604623 merged by Filippo Giunchedi:
[operations/puppet@production] swift: remove swift-container-sharder unit

https://gerrit.wikimedia.org/r/604623

Change 605177 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable thanos upload in ops eqsin/ulsfo/codfw

https://gerrit.wikimedia.org/r/605177

Change 605591 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: optional read affinity proxy setting

https://gerrit.wikimedia.org/r/605591

Change 605592 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable read affinity for thanos-swift

https://gerrit.wikimedia.org/r/605592

Change 605598 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] templates: add ipv6 for thanos-be2*

https://gerrit.wikimedia.org/r/605598

Change 605598 merged by Filippo Giunchedi:
[operations/dns@master] templates: add ipv6 for thanos-be2*

https://gerrit.wikimedia.org/r/605598

Change 605178 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable thanos upload in ops eqiad

https://gerrit.wikimedia.org/r/605178

Change 605591 merged by Filippo Giunchedi:
[operations/puppet@production] swift: optional read affinity proxy setting

https://gerrit.wikimedia.org/r/605591

Change 605592 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable read affinity for thanos-swift

https://gerrit.wikimedia.org/r/605592

fgiunchedi claimed this task.

This is complete!

Change 605915 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: add explicit ordering for /var/log/swift

https://gerrit.wikimedia.org/r/605915

Change 605915 merged by Filippo Giunchedi:
[operations/puppet@production] swift: add explicit ordering for /var/log/swift

https://gerrit.wikimedia.org/r/605915

Change 605920 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: stop requiring Package swift for log directory

https://gerrit.wikimedia.org/r/605920

Change 605920 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop requiring Package swift for log directory

https://gerrit.wikimedia.org/r/605920

Change 605949 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: pass min / max time to store

https://gerrit.wikimedia.org/r/605949

Change 605950 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: use object storage for data older than 15d

https://gerrit.wikimedia.org/r/605950

Change 605960 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: fix Thanos sidecar Prometheus connection alert

https://gerrit.wikimedia.org/r/605960

Change 605960 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: fix Thanos sidecar Prometheus connection alert

https://gerrit.wikimedia.org/r/605960

Change 606135 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: pass min_time to Thanos sidecar

https://gerrit.wikimedia.org/r/606135

Change 605949 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: pass min / max time to store

https://gerrit.wikimedia.org/r/605949

Change 606135 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: pass min_time to Thanos sidecar

https://gerrit.wikimedia.org/r/606135

Change 605950 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: use object storage for data older than 15d

https://gerrit.wikimedia.org/r/605950
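
The min/max time changes above split the queryable time range between the store gateway (object storage, older data) and the sidecars (recent data on Prometheus hosts). The sketch below models that split for a single query window, assuming a 15-day boundary as in the change title. The function and its return shape are hypothetical and only illustrate the idea; in Thanos the selection is done by the query component based on the time ranges each StoreAPI advertises, and the exact flag values used in these patches are not shown on this page.

```python
from datetime import datetime, timedelta, timezone

CUTOFF = timedelta(days=15)  # boundary taken from the change title


def split_query(start: datetime, end: datetime, now: datetime) -> dict:
    """Split [start, end] into the part served from object storage (store)
    and the part served from Prometheus via the sidecar."""
    boundary = now - CUTOFF
    # Data older than the boundary comes from the store gateway.
    store = (start, min(end, boundary)) if start < boundary else None
    # Data newer than the boundary comes from the sidecar.
    sidecar = (max(start, boundary), end) if end > boundary else None
    return {"store": store, "sidecar": sidecar}


now = datetime(2020, 6, 30, tzinfo=timezone.utc)
last_30_days = split_query(now - timedelta(days=30), now, now)
last_day = split_query(now - timedelta(days=1), now, now)
print(last_30_days)
print(last_day)
```

A 30-day query touches both components, while a query over the last day is answered by the sidecar alone.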

Change 606400 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: fix min/max time options parsing

https://gerrit.wikimedia.org/r/606400

Change 606400 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: fix min/max time options parsing

https://gerrit.wikimedia.org/r/606400

Change 610768 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: set concurrency to 1

https://gerrit.wikimedia.org/r/610768

Change 610768 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: set concurrency to 1

https://gerrit.wikimedia.org/r/610768