Deploy Thanos (long-term storage) stateless components: sidecar and query
Closed, Resolved (Public)

Description

This task tracks the deployment of the stateless Thanos components. The big win is a query endpoint that can reach all Prometheus instances and merge/deduplicate results as needed.

Outline of what's needed:

  • Thanos Debian package
  • Prometheus instances need to advertise unique external_labels
    • Need to add labels: instance (or name, or something like that) plus replica (A or B)
    • The labels above need to be filtered out before ingestion by our global instance for backwards compatibility
  • Deploy Thanos sidecar alongside each Prometheus instance (save for global)
    • Needs two ports, one each for the http and grpc interfaces, likely as offsets of the instance's own port
  • Deploy Thanos query component
    • Deploy on thanos-fe2* hosts
    • Deploy on thanos-fe1* hosts
    • Needs two ports for http+grpc, and labels that are considered for deduplication (replica in our case)
    • Needs to locate and reach all other Thanos sidecars
    • Deploy behind LVS and can be active/active (i.e. discovery DNS records) following https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service.
  • Configure and test datasource in Grafana
  • Audit and document how to port dashboards to Thanos (T256954)
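
The replica-based deduplication mentioned in the outline can be illustrated with a small sketch. This is not Thanos's actual algorithm (which operates on samples over time); it only shows the idea that two series differing solely in the `replica` external label collapse into one. The `prometheus` label name, the lowercase replica values, and the sample data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, float]  # (label set, latest sample value)


def strip_labels(labels: Labels, drop: FrozenSet[str]) -> Labels:
    """Return a copy of `labels` without the label names listed in `drop`."""
    return {k: v for k, v in labels.items() if k not in drop}


def deduplicate(series: List[Series], replica_labels: FrozenSet[str]) -> List[Series]:
    """Collapse series that are identical once the replica labels are ignored.

    The first series seen for each identity wins; this is a simplification
    of what a real query layer would do.
    """
    seen: Dict[FrozenSet[Tuple[str, str]], Series] = {}
    for labels, value in series:
        identity = strip_labels(labels, replica_labels)
        key = frozenset(identity.items())
        if key not in seen:
            seen[key] = (identity, value)
    return list(seen.values())


# Hypothetical samples: the same metric scraped by replicas "a" and "b".
samples: List[Series] = [
    ({"__name__": "up", "job": "node", "prometheus": "ops", "replica": "a"}, 1.0),
    ({"__name__": "up", "job": "node", "prometheus": "ops", "replica": "b"}, 1.0),
    ({"__name__": "up", "job": "node", "prometheus": "k8s", "replica": "a"}, 1.0),
]

merged = deduplicate(samples, frozenset({"replica"}))
```

The same `strip_labels` helper mirrors, conceptually, the filtering needed before ingestion by the global instance: dropping the newly added external labels leaves the label sets as they were before the change.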

Details

Repo                    Branch                   Lines +/-
operations/puppet       production               +4 -1
operations/puppet       production               +2 -2
operations/puppet       production               +6 -6
operations/puppet       production               +36 -6
operations/puppet       production               +2 -2
operations/dns          master                   +4 -0
operations/puppet       production               +10 -0
operations/puppet       production               +4 -0
operations/puppet       production               +5 -2
operations/puppet       production               +4 -0
operations/puppet       production               +1 -0
operations/puppet       production               +11 -0
operations/puppet       production               +4 -4
operations/puppet       production               +3 -8
operations/puppet       production               +2 -2
operations/puppet       production               +1 -1
operations/puppet       production               +6 -0
operations/puppet       production               +41 -0
operations/puppet       production               +10 -0
operations/puppet       production               +6 -0
operations/puppet       production               +34 -0
operations/dns          master                   +2 -0
operations/puppet       production               +5 -0
operations/puppet       production               +1 -1
operations/puppet       production               +4 -4
operations/puppet       production               +10 -0
operations/puppet       production               +10 -10
operations/puppet       production               +32 -1
operations/puppet       production               +100 -0
operations/puppet       production               +10 -2
operations/puppet       production               +81 -0
operations/debs/thanos  debian/buster-wikimedia  +684 -0
operations/puppet       production               +29 -6
operations/puppet       production               +1 -2
operations/puppet       production               +10 -0

Event Timeline

Change 594919 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos sidecar to k8s instance

https://gerrit.wikimedia.org/r/594919

Change 594919 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos sidecar to k8s instances

https://gerrit.wikimedia.org/r/594919

Change 595138 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: fix query class arguments and path

https://gerrit.wikimedia.org/r/595138

Change 595138 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: fix query class arguments and path

https://gerrit.wikimedia.org/r/595138

Change 595473 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: assign thanos::frontend to thanos-fe2*

https://gerrit.wikimedia.org/r/595473

Change 595473 merged by Filippo Giunchedi:
[operations/puppet@production] site: assign thanos::frontend to thanos-fe2*

https://gerrit.wikimedia.org/r/595473

Change 595489 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] wmnet: allocate thanos-query.svc addresses

https://gerrit.wikimedia.org/r/595489

Change 595491 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] conftool-data: add thanos-query

https://gerrit.wikimedia.org/r/595491

Change 595493 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add thanos-query to service::catalog

https://gerrit.wikimedia.org/r/595493

Change 595494 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add lvs addresses to frontend

https://gerrit.wikimedia.org/r/595494

Change 595491 merged by Filippo Giunchedi:
[operations/puppet@production] conftool-data: add thanos-query

https://gerrit.wikimedia.org/r/595491

Change 595489 merged by Filippo Giunchedi:
[operations/dns@master] wmnet: allocate thanos-query.svc addresses

https://gerrit.wikimedia.org/r/595489

Change 595493 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add thanos-query to service::catalog

https://gerrit.wikimedia.org/r/595493

Change 595494 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add lvs addresses to frontend

https://gerrit.wikimedia.org/r/595494

Change 596147 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos::sidecar to services and analytics

https://gerrit.wikimedia.org/r/596147

Change 596147 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos::sidecar to services and analytics

https://gerrit.wikimedia.org/r/596147

Change 597245 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add thanos::httpd to proxy thanos-query

https://gerrit.wikimedia.org/r/597245

Change 597245 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add thanos::httpd to proxy thanos-query

https://gerrit.wikimedia.org/r/597245

Change 597481 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: open port 80 for thanos-query

https://gerrit.wikimedia.org/r/597481

Change 597481 merged by Filippo Giunchedi:
[operations/puppet@production] profile: open port 80 for thanos-query

https://gerrit.wikimedia.org/r/597481

Change 597482 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: fix thanos-query httpd proxypass

https://gerrit.wikimedia.org/r/597482

Change 597482 merged by Filippo Giunchedi:
[operations/puppet@production] profile: fix thanos-query httpd proxypass

https://gerrit.wikimedia.org/r/597482

Change 597485 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: move thanos-query service to port 80

https://gerrit.wikimedia.org/r/597485

Change 597485 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: move thanos-query service to port 80

https://gerrit.wikimedia.org/r/597485

Mentioned in SAL (#wikimedia-operations) [2020-05-20T12:28:04Z] <godog> roll-restart pybal on codfw low-traffic - T233956

Change 597557 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: fix thanos swift healthcheck

https://gerrit.wikimedia.org/r/597557

Change 597557 merged by Filippo Giunchedi:
[operations/puppet@production] profile: fix thanos swift healthcheck

https://gerrit.wikimedia.org/r/597557

The https/envoy part is up; however, check_http is failing while curl works:

icinga1001:~$ /usr/lib/nagios/plugins/check_http -H thanos-swift.discovery.wmnet -S -I 10.192.0.192 -u /healthcheck
CRITICAL - Cannot make SSL connection.
icinga1001:~$ curl --resolve thanos-swift.discovery.wmnet:443:10.192.0.192 https://thanos-swift.discovery.wmnet/healthcheck
OK

It turns out this is due to the SNI requirement for Thanos (profile::tlsproxy::envoy::sni_support: 'strict'):

icinga1001:~$ /usr/lib/nagios/plugins/check_http -H thanos-swift.discovery.wmnet -S --sni -I 10.192.0.192 -u /healthcheck
HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.155 second response time |time=1.154771s;;;0.000000;10.000000 size=279B;;;0

I think we (in decreasing order of preference) could:

  1. default check_http with --sni for the https cases
  2. relax the sni requirement on the envoy/thanos side
  3. add yet another specialized icinga command definition for https+sni
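
Option 1 can be sketched as a small helper that always adds `--sni` whenever TLS is requested. This is only an illustration of the idea, not the actual Icinga command definition; the function name `check_http_args` is hypothetical, and only flags that appear in the commands above are used.

```python
from typing import List

CHECK_HTTP = "/usr/lib/nagios/plugins/check_http"


def check_http_args(host: str, address: str, uri: str, tls: bool = True) -> List[str]:
    """Build a check_http command line, adding --sni whenever TLS is used."""
    args = [CHECK_HTTP, "-H", host]
    if tls:
        # -S enables TLS; --sni sends the host name in the TLS handshake,
        # which a server with strict SNI support requires.
        args += ["-S", "--sni"]
    args += ["-I", address, "-u", uri]
    return args


cmd = check_http_args("thanos-swift.discovery.wmnet", "10.192.0.192", "/healthcheck")
print(" ".join(cmd))
```

With the default `tls=True` this reproduces the working invocation shown above; with `tls=False` neither `-S` nor `--sni` is added.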

Change 599040 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: rename Thanos jobs

https://gerrit.wikimedia.org/r/599040

Change 599040 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: rename Thanos jobs

https://gerrit.wikimedia.org/r/599040

Change 599059 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: provision Thanos datasource

https://gerrit.wikimedia.org/r/599059

Change 599059 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: provision Thanos datasource

https://gerrit.wikimedia.org/r/599059

Mentioned in SAL (#wikimedia-operations) [2020-05-29T12:15:13Z] <godog> roll-restart to upgrade thanos to 0.13.0rc0 - T252186 T233956

Change 601320 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: use_remote_address: true for Envoy on Thanos

https://gerrit.wikimedia.org/r/601320

Change 601320 merged by Filippo Giunchedi:
[operations/puppet@production] role: use_remote_address: true for Envoy on Thanos

https://gerrit.wikimedia.org/r/601320

Change 604009 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add SyslogIdentifier=%N to systemd services

https://gerrit.wikimedia.org/r/604009

Change 604009 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add SyslogIdentifier=%N to systemd services

https://gerrit.wikimedia.org/r/604009

Change 604433 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: add thanos-fe1* to frontends

https://gerrit.wikimedia.org/r/604433

Change 604433 merged by Filippo Giunchedi:
[operations/puppet@production] site: add thanos-fe1* to frontends

https://gerrit.wikimedia.org/r/604433

Change 604608 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] conftool-data: add thanos-fe eqiad

https://gerrit.wikimedia.org/r/604608

Change 604608 merged by Filippo Giunchedi:
[operations/puppet@production] conftool-data: add thanos-fe eqiad

https://gerrit.wikimedia.org/r/604608

Change 604613 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add eqiad for thanos-query / thanos-swift

https://gerrit.wikimedia.org/r/604613

Change 604613 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add eqiad for thanos-query / thanos-swift

https://gerrit.wikimedia.org/r/604613

Change 604664 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] templates: add PTR for thanos-swift / thanos-query

https://gerrit.wikimedia.org/r/604664

Change 604664 merged by Filippo Giunchedi:
[operations/dns@master] templates: add PTR for thanos-swift / thanos-query

https://gerrit.wikimedia.org/r/604664

This is complete! The Thanos frontend is available in codfw and eqiad.

Change 605859 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: page on thanos swift/query failure

https://gerrit.wikimedia.org/r/605859

Change 605859 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: page on thanos swift/query failure

https://gerrit.wikimedia.org/r/605859

Change 607031 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: import availability aggregation rules from Prometheus global

https://gerrit.wikimedia.org/r/607031

Change 607031 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: import availability aggregation rules from Prometheus global

https://gerrit.wikimedia.org/r/607031

Change 607256 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: move global availability rules to new names

https://gerrit.wikimedia.org/r/607256

Change 607256 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: move global availability rules to new names

https://gerrit.wikimedia.org/r/607256

Change 607783 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: set consistency-delay on store

https://gerrit.wikimedia.org/r/607783

Change 608319 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: switch to new names for global availability metrics

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608319

Change 608319 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: switch to new names for global availability metrics

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608319

Change 607783 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: set consistency-delay on store

https://gerrit.wikimedia.org/r/c/operations/puppet/+/607783

fgiunchedi claimed this task.
fgiunchedi updated the task description.

Resolving; porting dashboards to Thanos is tracked in T256954: Port Prometheus dashboards to Thanos.