Deploy Thanos (long-term storage) stateless components: sidecar and query
Open, Needs Triage, Public

Description

This task tracks the deployment of the stateless Thanos components (sidecar and query). The main benefit is a single query endpoint that can reach all Prometheus instances and merge and deduplicate results as needed.

Outline of what's needed:

  • Thanos Debian package
  • Prometheus instances need to advertise unique external_labels
    • Need to add labels: instance (or name, or something like that) plus replica (A or B)
    • The labels above need to be filtered out before ingestion by our global instance for backwards compatibility
  • Deploy Thanos sidecar alongside each Prometheus instance (save for global)
    • Needs two ports, one each for the HTTP and gRPC interfaces, likely as an offset from the instance's own port
  • Deploy Thanos query component
    • Deploy on thanos-fe2* hosts
    • Deploy on thanos-fe1* hosts
    • Needs two ports for http+grpc, and labels that are considered for deduplication (replica in our case)
    • Needs to locate and reach all other Thanos sidecars
    • Deploy behind LVS; can be active/active (i.e. with discovery DNS records), following https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service.
  • Configure and test datasource in Grafana
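
The deduplication mentioned in the outline can be sketched roughly as follows. This is a minimal Python illustration, not the algorithm Thanos itself uses: it assumes the replica label is literally named "replica" (as in the outline), uses a hypothetical "site" label to stand in for the instance/name label, and lets the first replica seen win for any given timestamp.

```python
from collections import defaultdict

# Label name taken from the outline ("replica (A or B)"); treat it as an assumption.
REPLICA_LABEL = "replica"


def series_key(labels, ignore=(REPLICA_LABEL,)):
    """Return a hashable identity for a series, ignoring the replica label."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignore))


def deduplicate(series_list):
    """Merge series that differ only by the replica label.

    Each input series is a (labels, samples) pair, where samples is a list of
    (timestamp, value) tuples. For each timestamp the first replica seen wins,
    which is a simplification chosen for this sketch.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        key = series_key(labels)
        for ts, value in samples:
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two replicas of the same (hypothetical) Prometheus instance with overlapping data.
replica_a = ({"__name__": "up", "site": "codfw", "replica": "a"}, [(10, 1.0), (20, 1.0)])
replica_b = ({"__name__": "up", "site": "codfw", "replica": "b"}, [(20, 1.0), (30, 1.0)])
result = deduplicate([replica_a, replica_b])
```

Here the two replica series collapse into one series covering timestamps 10, 20 and 30. This is why each Prometheus instance needs a unique replica label in the first place: without it the query layer could not tell which series are copies of each other.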

Details

Project                 Branch                   Lines +/-  Subject
operations/puppet       production               +11 -0
operations/puppet       production               +4 -4
operations/puppet       production               +3 -8
operations/puppet       production               +2 -2
operations/puppet       production               +1 -1
operations/puppet       production               +6 -0
operations/puppet       production               +41 -0
operations/puppet       production               +10 -0
operations/puppet       production               +6 -0
operations/puppet       production               +34 -0
operations/dns          master                   +2 -0
operations/puppet       production               +5 -0
operations/puppet       production               +1 -1
operations/puppet       production               +4 -4
operations/puppet       production               +10 -0
operations/puppet       production               +10 -10
operations/puppet       production               +32 -1
operations/puppet       production               +100 -0
operations/puppet       production               +10 -2
operations/puppet       production               +81 -0
operations/debs/thanos  debian/buster-wikimedia  +684 -0
operations/puppet       production               +29 -6
operations/puppet       production               +1 -2
operations/puppet       production               +10 -0

Event Timeline

Change 539342 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: drop Thanos labels from global Prometheus

https://gerrit.wikimedia.org/r/539342

Change 539342 merged by Filippo Giunchedi:
[operations/puppet@production] role: drop Thanos labels from global Prometheus

https://gerrit.wikimedia.org/r/539342

Change 541374 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: fix Prometheus global metric relabel config

https://gerrit.wikimedia.org/r/541374

Change 541374 merged by Filippo Giunchedi:
[operations/puppet@production] role: fix Prometheus global metric relabel config

https://gerrit.wikimedia.org/r/541374

Change 585468 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: additional external_labels for Thanos

https://gerrit.wikimedia.org/r/585468

fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. Apr 2 2020, 10:48 AM

Change 585468 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: additional external_labels for Thanos

https://gerrit.wikimedia.org/r/585468

fgiunchedi updated the task description. Apr 6 2020, 7:45 AM

Change 586312 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] modules: add thanos-sidecar define and profile

https://gerrit.wikimedia.org/r/586312

Change 586313 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos-sidecar to prometheus@ops

https://gerrit.wikimedia.org/r/586313

Change 586314 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add Thanos query

https://gerrit.wikimedia.org/r/586314

Change 586315 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: scrape thanos sidecar/query metrics

https://gerrit.wikimedia.org/r/586315

fgiunchedi moved this task from Inbox to Backlog on the observability board. Apr 6 2020, 12:33 PM

Change 587252 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/thanos@debian/buster-wikimedia] debian: first commit

https://gerrit.wikimedia.org/r/587252

Change 587252 merged by Filippo Giunchedi:
[operations/debs/thanos@debian/buster-wikimedia] debian: first commit

https://gerrit.wikimedia.org/r/587252

Change 586312 merged by Filippo Giunchedi:
[operations/puppet@production] modules: add thanos-sidecar define and profile

https://gerrit.wikimedia.org/r/586312

Change 586313 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos-sidecar to prometheus@ops

https://gerrit.wikimedia.org/r/586313

Change 586314 merged by Filippo Giunchedi:
[operations/puppet@production] Add Thanos query

https://gerrit.wikimedia.org/r/586314

Change 586315 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: scrape thanos sidecar/query metrics

https://gerrit.wikimedia.org/r/586315

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Apr 27 2020, 12:12 PM
fgiunchedi updated the task description. Thu, May 7, 9:44 AM

Change 594914 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: rename thanos::query in thanos::frontend

https://gerrit.wikimedia.org/r/594914

Change 594914 merged by Filippo Giunchedi:
[operations/puppet@production] role: rename thanos::query in thanos::frontend

https://gerrit.wikimedia.org/r/594914

Change 594919 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos sidecar to k8s instance

https://gerrit.wikimedia.org/r/594919

Change 594919 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos sidecar to k8s instances

https://gerrit.wikimedia.org/r/594919

Change 595138 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: fix query class arguments and path

https://gerrit.wikimedia.org/r/595138

fgiunchedi updated the task description. Fri, May 8, 9:06 AM

Change 595138 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: fix query class arguments and path

https://gerrit.wikimedia.org/r/595138

Change 595473 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: assign thanos::frontend to thanos-fe2*

https://gerrit.wikimedia.org/r/595473

Change 595473 merged by Filippo Giunchedi:
[operations/puppet@production] site: assign thanos::frontend to thanos-fe2*

https://gerrit.wikimedia.org/r/595473

fgiunchedi updated the task description. Mon, May 11, 8:48 AM

Change 595489 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] wmnet: allocate thanos-query.svc addresses

https://gerrit.wikimedia.org/r/595489

Change 595491 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] conftool-data: add thanos-query

https://gerrit.wikimedia.org/r/595491

fgiunchedi updated the task description. Mon, May 11, 9:41 AM

Change 595493 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add thanos-query to service::catalog

https://gerrit.wikimedia.org/r/595493

Change 595494 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thanos: add lvs addresses to frontend

https://gerrit.wikimedia.org/r/595494

Change 595491 merged by Filippo Giunchedi:
[operations/puppet@production] conftool-data: add thanos-query

https://gerrit.wikimedia.org/r/595491

Change 595489 merged by Filippo Giunchedi:
[operations/dns@master] wmnet: allocate thanos-query.svc addresses

https://gerrit.wikimedia.org/r/595489

Change 595493 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add thanos-query to service::catalog

https://gerrit.wikimedia.org/r/595493

Change 595494 merged by Filippo Giunchedi:
[operations/puppet@production] thanos: add lvs addresses to frontend

https://gerrit.wikimedia.org/r/595494

fgiunchedi updated the task description. Tue, May 12, 3:45 PM

Change 596147 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add thanos::sidecar to services and analytics

https://gerrit.wikimedia.org/r/596147

Change 596147 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add thanos::sidecar to services and analytics

https://gerrit.wikimedia.org/r/596147

fgiunchedi updated the task description. Wed, May 13, 8:16 AM

Change 597245 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add thanos::httpd to proxy thanos-query

https://gerrit.wikimedia.org/r/597245

Change 597245 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add thanos::httpd to proxy thanos-query

https://gerrit.wikimedia.org/r/597245

Change 597481 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: open port 80 for thanos-query

https://gerrit.wikimedia.org/r/597481

Change 597481 merged by Filippo Giunchedi:
[operations/puppet@production] profile: open port 80 for thanos-query

https://gerrit.wikimedia.org/r/597481

Change 597482 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: fix thanos-query httpd proxypass

https://gerrit.wikimedia.org/r/597482

Change 597482 merged by Filippo Giunchedi:
[operations/puppet@production] profile: fix thanos-query httpd proxypass

https://gerrit.wikimedia.org/r/597482

Change 597485 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: move thanos-query service to port 80

https://gerrit.wikimedia.org/r/597485

Change 597485 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: move thanos-query service to port 80

https://gerrit.wikimedia.org/r/597485

Mentioned in SAL (#wikimedia-operations) [2020-05-20T12:28:04Z] <godog> roll-restart pybal on codfw low-traffic - T233956

Change 597557 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: fix thanos swift healthcheck

https://gerrit.wikimedia.org/r/597557

Change 597557 merged by Filippo Giunchedi:
[operations/puppet@production] profile: fix thanos swift healthcheck

https://gerrit.wikimedia.org/r/597557

fgiunchedi added a comment. Edited Wed, May 20, 3:19 PM

The https/envoy part is up; however, check_http is failing while curl works:

icinga1001:~$ /usr/lib/nagios/plugins/check_http -H thanos-swift.discovery.wmnet -S -I 10.192.0.192 -u /healthcheck
CRITICAL - Cannot make SSL connection.
icinga1001:~$ curl --resolve thanos-swift.discovery.wmnet:443:10.192.0.192 https://thanos-swift.discovery.wmnet/healthcheck
OK

It turns out this is due to the SNI requirement for Thanos (profile::tlsproxy::envoy::sni_support: 'strict'). With --sni the check passes:

icinga1001:~$ /usr/lib/nagios/plugins/check_http -H thanos-swift.discovery.wmnet -S --sni -I 10.192.0.192 -u /healthcheck
HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.155 second response time |time=1.154771s;;;0.000000;10.000000 size=279B;;;0

I think we could (in decreasing order of preference):

  1. make check_http default to --sni for the https cases
  2. relax the SNI requirement on the envoy/thanos side
  3. add yet another specialized icinga command definition for https+SNI
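
To illustrate option 1, here is a minimal Python sketch of a helper that always adds --sni whenever TLS is requested. The function name check_http_args and its parameters are hypothetical and only for illustration; the plugin path and flags are the ones from the commands above, and this is not how the icinga command definitions are actually generated.

```python
def check_http_args(hostname, address, uri, use_tls=True):
    """Build an argument list for the Nagios check_http plugin.

    Hypothetical helper: SNI is enabled by default for every TLS check,
    so strict-SNI backends do not need a separate command definition.
    """
    args = ["/usr/lib/nagios/plugins/check_http", "-H", hostname]
    if use_tls:
        # -S enables TLS; --sni sends the hostname during the TLS handshake.
        args += ["-S", "--sni"]
    args += ["-I", address, "-u", uri]
    return args


command = check_http_args("thanos-swift.discovery.wmnet", "10.192.0.192", "/healthcheck")
print(" ".join(command))
```

The resulting command line matches the working invocation shown above, while plain HTTP checks (use_tls=False) stay unchanged.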

Change 599040 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: rename Thanos jobs

https://gerrit.wikimedia.org/r/599040

Change 599040 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: rename Thanos jobs

https://gerrit.wikimedia.org/r/599040

Change 599059 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: provision Thanos datasource

https://gerrit.wikimedia.org/r/599059

Change 599059 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: provision Thanos datasource

https://gerrit.wikimedia.org/r/599059

Mentioned in SAL (#wikimedia-operations) [2020-05-29T12:15:13Z] <godog> roll-restart to upgrade thanos to 0.13.0rc0 - T252186 T233956