
Transition to Pyrra for SLO Visualization and Management
Open, Medium, Public

Description

Today we manage SLO dashboards using an in-house jsonnet template, which is rendered and deployed to Grafana using grafana-grizzly. Since we established this process, Pyrra (https://pyrra.dev), a self-contained SLO management tool, has seen much active development. It offers several benefits, including improved (dedicated) SLO visualization, search, labeling, automation of recording rules, integrated multi-burn-rate alerting, and more.

This task initially served as a placeholder to explore this (hence the patch history) and is being expanded to serve as a tracking task for Pyrra deployment and integration.

High level checklist, in rough order:

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +16 -1
operations/puppet | production | +3 -0
operations/puppet | production | +31 -0
operations/puppet | production | +47 -33
operations/puppet | production | +3 -3
operations/puppet | production | +1 -0
operations/alerts | master | +0 -41
operations/puppet | production | +24 -24
operations/puppet | production | +354 -0
operations/puppet | production | +71 -0
operations/puppet | production | +60 -0
operations/puppet | production | +10 -0
operations/grafana-grizzly | master | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +1 -1
operations/puppet | production | +34 -0
operations/grafana-grizzly | master | +8 -3
operations/puppet | production | +59 -59
operations/puppet | production | +1 -1
operations/puppet | production | +49 -4
operations/dns | master | +6 -1
operations/puppet | production | +36 -0
operations/puppet | production | +40 -0
operations/puppet | production | +12 -11
operations/puppet | production | +41 -0
operations/puppet | production | +38 -35
operations/puppet | production | +68 -60
operations/puppet | production | +16 -8
operations/puppet | production | +68 -0
operations/puppet | production | +150 -141
operations/puppet | production | +38 -1
operations/puppet | production | +2 -2
operations/puppet | production | +32 -0
operations/puppet | production | +3 -0
operations/puppet | production | +1 -1
operations/puppet | production | +36 -0
operations/puppet | production | +25 -0
operations/puppet | production | +3 -3
operations/puppet | production | +4 -3
operations/puppet | production | +2 -1
operations/puppet | production | +1 -1
operations/puppet | production | +7 -0
operations/dns | master | +4 -0
operations/puppet | production | +15 -0
operations/puppet | production | +3 -2
operations/puppet | production | +5 -4
operations/puppet | production | +3 -1
operations/puppet | production | +1 -1
operations/puppet | production | +0 -3
operations/puppet | production | +1 -1
operations/puppet | production | +3 -0
operations/puppet | production | +10 -0
operations/puppet | production | +32 -0
operations/puppet | production | +71 -0
operations/debs/pyrra | master | +227 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 974148 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos: add new istio recording rule

https://gerrit.wikimedia.org/r/974148

Change 974149 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem: add Lift Wing pilot

https://gerrit.wikimedia.org/r/974149

Change 974148 merged by Elukey:

[operations/puppet@production] profile::thanos: add new istio recording rule

https://gerrit.wikimedia.org/r/974148

Change 974149 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem: add Lift Wing pilot

https://gerrit.wikimedia.org/r/974149

Change 974496 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem: improve/fix lift wing pilot

https://gerrit.wikimedia.org/r/974496

Change 974496 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem: improve/fix lift wing pilot

https://gerrit.wikimedia.org/r/974496

Change 983950 had a related patch set uploaded (by Dwisehaupt; author: Dwisehaupt):

[operations/dns@master] Add dyna record for community-crm

https://gerrit.wikimedia.org/r/983950

Change 983951 had a related patch set uploaded (by Dwisehaupt; author: Dwisehaupt):

[operations/puppet@production] Set the cdn to pass requests for community-crm

https://gerrit.wikimedia.org/r/983951

Change 967950 merged by Herron:

[operations/puppet@production] pyrra: onboard varnish-requests as pilot SLO

https://gerrit.wikimedia.org/r/967950

herron renamed this task from Explore Pyrra for SLO Visualization and Management to Transition to Pyrra for SLO Visualization and Management. May 6 2024, 3:06 PM
herron updated the task description.

Change #1028524 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: separate slo definitions from filesystem class

https://gerrit.wikimedia.org/r/1028524

Change #1028524 merged by Herron:

[operations/puppet@production] pyrra: separate slo definitions from filesystem class

https://gerrit.wikimedia.org/r/1028524

Change #1028555 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard etcd request/latency SLOs

https://gerrit.wikimedia.org/r/1028555

herron updated the task description.

Change #1028555 merged by Herron:

[operations/puppet@production] pyrra: onboard etcd request/latency SLOs

https://gerrit.wikimedia.org/r/1028555

Change #1028854 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: varnish: workaround site grouping limitation

https://gerrit.wikimedia.org/r/1028854

Change #1028854 merged by Herron:

[operations/puppet@production] pyrra: varnish: workaround site grouping limitation

https://gerrit.wikimedia.org/r/1028854

Change #1028864 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: etcd: add generic rules workaround

https://gerrit.wikimedia.org/r/1028864

Change #1028864 merged by Herron:

[operations/puppet@production] pyrra: etcd: add generic rules workaround

https://gerrit.wikimedia.org/r/1028864

Change #1028881 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: logstash: add generic rules workaround

https://gerrit.wikimedia.org/r/1028881

Change #1028881 merged by Herron:

[operations/puppet@production] pyrra: logstash: add generic rules workaround

https://gerrit.wikimedia.org/r/1028881

Change #1029634 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard haproxy slo from grizzly

https://gerrit.wikimedia.org/r/1029634

Change #1029634 merged by Herron:

[operations/puppet@production] pyrra: onboard haproxy slo from grizzly

https://gerrit.wikimedia.org/r/1029634

Change #1029654 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: varnish: add cluster

https://gerrit.wikimedia.org/r/1029654

Change #1029654 merged by Herron:

[operations/puppet@production] pyrra: varnish: add cluster

https://gerrit.wikimedia.org/r/1029654

Change #1030227 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: trafficserver: onboard slo from grizzly

https://gerrit.wikimedia.org/r/1030227

Change #1030227 merged by Herron:

[operations/puppet@production] pyrra: trafficserver: onboard slo from grizzly

https://gerrit.wikimedia.org/r/1030227

Change #1031527 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: linkrecommendation: onboard slo from grizzly

https://gerrit.wikimedia.org/r/1031527

Change #1031527 merged by Herron:

[operations/puppet@production] pyrra: linkrecommendation: onboard slo from grizzly

https://gerrit.wikimedia.org/r/1031527

Change #961132 abandoned by Herron:

[operations/dns@master] pyrra add service dns entries

Reason:

ended up piggybacking on thanos-web for this

https://gerrit.wikimedia.org/r/961132

Change #961129 abandoned by Herron:

[operations/puppet@production] services: add pyrra conftool-data and service stub entry

Reason:

ended up piggybacking on thanos-web for this

https://gerrit.wikimedia.org/r/961129

Change #961130 abandoned by Herron:

[operations/puppet@production] pyrra: use load balancing

Reason:

ended up piggybacking on thanos-web for this

https://gerrit.wikimedia.org/r/961130

Change #1051439 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: add liftwing SLOs

https://gerrit.wikimedia.org/r/1051439

Change #1051439 merged by Herron:

[operations/puppet@production] pyrra: add liftwing SLOs

https://gerrit.wikimedia.org/r/1051439

Change #1054617 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard wdqs request SLO

https://gerrit.wikimedia.org/r/1054617

Change #1077966 had a related patch set uploaded (by Herron; author: Herron):

[operations/grafana-grizzly@master] add links to SLOs migrated to pyrra

https://gerrit.wikimedia.org/r/1077966

Change #1077966 merged by Herron:

[operations/grafana-grizzly@master] add links to SLOs migrated to pyrra

https://gerrit.wikimedia.org/r/1077966

Change #1101083 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard wdqs-availability

https://gerrit.wikimedia.org/r/1101083

Change #1101083 merged by Herron:

[operations/puppet@production] pyrra: onboard wdqs-availability

https://gerrit.wikimedia.org/r/1101083

Change #1101099 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: switch wdqs-availability ratio type

https://gerrit.wikimedia.org/r/1101099

Change #1101099 merged by Herron:

[operations/puppet@production] pyrra: switch wdqs-availability ratio type

https://gerrit.wikimedia.org/r/1101099

Change #1101113 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: wdqs-availability invert query

https://gerrit.wikimedia.org/r/1101113

Change #1101113 merged by Herron:

[operations/puppet@production] pyrra: wdqs-availability invert query

https://gerrit.wikimedia.org/r/1101113

Change #1101114 had a related patch set uploaded (by Herron; author: Herron):

[operations/grafana-grizzly@master] add pyrra note for wdqs-availability

https://gerrit.wikimedia.org/r/1101114

Change #1101114 merged by Herron:

[operations/grafana-grizzly@master] add pyrra note for wdqs-availability

https://gerrit.wikimedia.org/r/1101114

Change #1101558 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos: add bool_gauge recording rules for search/wdqs update lag slos

https://gerrit.wikimedia.org/r/1101558

Change #1101560 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard wdqs/serach update lag slos

https://gerrit.wikimedia.org/r/1101560

Change #1101558 merged by Herron:

[operations/puppet@production] thanos: add bool_gauge recording rules for search/wdqs update lag slos

https://gerrit.wikimedia.org/r/1101558

Change #1101560 merged by Herron:

[operations/puppet@production] pyrra: onboard wdqs/serach update lag slos

https://gerrit.wikimedia.org/r/1101560

Change #1101896 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard liftwing api ng latency/availability

https://gerrit.wikimedia.org/r/1101896

Change #1101896 merged by Herron:

[operations/puppet@production] pyrra: onboard liftwing api ng latency/availability

https://gerrit.wikimedia.org/r/1101896

Change #1101911 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: onboard liftwing slos

https://gerrit.wikimedia.org/r/1101911

Change #1101911 merged by Herron:

[operations/puppet@production] pyrra: onboard liftwing slos

https://gerrit.wikimedia.org/r/1101911

Change #1102346 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: switch liftwing away from increase5m metrics

https://gerrit.wikimedia.org/r/1102346

Change #1102346 merged by Herron:

[operations/puppet@production] pyrra: switch liftwing away from increase5m metrics

https://gerrit.wikimedia.org/r/1102346

Change #1102366 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] alertmanager: remove manually defined sli missing alert in favor or pyrra provided alert

https://gerrit.wikimedia.org/r/1102366

Change #1102366 merged by Herron:

[operations/alerts@master] alertmanager: remove manually defined sli missing alert in favor or pyrra provided alert

https://gerrit.wikimedia.org/r/1102366

I also took a look at the general performance degradation when switching away from recording rules (i.e. increased CPU and network bandwidth on titan[12]001). Despite the attempts with @herron at optimizing what we have, I'm not seeing a significant change. The patches in question:

  • https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103365?usp=search
  • https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104690?usp=search
  • https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104678?usp=search
  • https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103352?usp=search

I was expecting thanos-query-frontend to split the long-range queries (i.e. 12w) into multiple subqueries. That doesn't happen, though, because the query has the form increase(...[12w]) and its result is a single value. In this case query-frontend doesn't appear to split the query, so it is sent as-is to thanos-query.
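
To illustrate the distinction, here is a minimal Python sketch of time-based query splitting. It is a simplified model and not the query-frontend implementation: the function name `split_range_query`, the one-day split interval, and the example date are all assumptions made for illustration. A range query covering 12 weeks can be cut into per-day pieces, whereas an instant query is evaluated at a single timestamp, so there is nothing to cut regardless of how long the [12w] selector inside the expression is.

```python
from datetime import datetime, timedelta, timezone


def split_range_query(start, end, split_interval):
    """Split the evaluation window [start, end] into consecutive sub-ranges.

    Hypothetical helper: models splitting by time interval only. An instant
    query has start == end, so it always comes back as a single piece.
    """
    if start == end:
        return [(start, end)]
    pieces = []
    cursor = start
    while cursor < end:
        upper = min(cursor + split_interval, end)
        pieces.append((cursor, upper))
        cursor = upper
    return pieces


# Assumed values, for illustration only.
now = datetime(2024, 12, 18, tzinfo=timezone.utc)
twelve_weeks = timedelta(weeks=12)
one_day = timedelta(days=1)

# A range query over the last 12 weeks splits into one piece per day.
range_pieces = split_range_query(now - twelve_weeks, now, one_day)

# An instant query such as increase(...[12w]) is evaluated at one timestamp.
instant_pieces = split_range_query(now, now, one_day)

print(len(range_pieces))    # 84
print(len(instant_pieces))  # 1
```

In this model the 12 weeks of data behind the instant query still have to be read in one go by whatever evaluates it, which matches the load observed on thanos-query.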

IMHO the course going forward should be at least:

  1. Go back to recording rules for liftwing (revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1102346) to get titan resources back in check, then take a closer look at exactly what's broken and what the cause is; maybe we can find a band-aid there.
  2. If we haven't already, reach out to Pyrra upstream and ask what they recommend, if anything, for this case (i.e. SLOs based on very big metrics).

Change #1105037 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-store: enable caching bucket

https://gerrit.wikimedia.org/r/1105037

Thanks @fgiunchedi, that helps explain why cache memory utilization in the frontend is quite a bit lower than I'd expect.

Before we give up on tuning Thanos, I'm hoping we can rule out a couple more options. Since we're likely to run into this again with these metrics, it'd be great to land on a config that speeds these queries up without special-case recording rules etc.

  • I'm curious what improvement enabling the caching bucket might provide in our case: https://thanos.io/v0.30/components/store.md/#caching-bucket
  • I'm also curious about alternate cache backends, since several pieces of documentation and writeups I found focus on the redis or memcached backends. Maybe it's the same, although some guides describe performance improvements simply due to the backend. IMO it'd be worth trying and ruling out; in my mind a simple approach like a local redis and a cache config change would be enough to verify whether it helps or not.

I uploaded a patch above to give the caching bucket a try, interested in your thoughts!

We can certainly try the caching bucket in thanos store, since it is easy to do. I'd be happy to be wrong, although I don't think it's going to help, since the problem in my mind is the quantity of data that the thanos components (thanos-store, thanos-query, thanos-query-frontend) have to process.

Also consider that when we switched away from recording rules, both thanos-store and thanos-query CPU went up 2x-5x in aggregate across eqiad, for example: https://grafana.wikimedia.org/goto/r0SfDoIHR?orgId=1. Since a single service's SLOs are causing such a bump, I don't think the approach is sustainable.

Change #1105037 merged by Herron:

[operations/puppet@production] thanos-store: enable caching bucket

https://gerrit.wikimedia.org/r/1105037

I'm beginning to see some improvements with the caching bucket enabled, and I think there's room for further tuning/improvement. Please see details in https://phabricator.wikimedia.org/T368953#10413075

Change #1105791 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: wdqs match site label with = instead of =~

https://gerrit.wikimedia.org/r/1105791

Change #1105791 merged by Herron:

[operations/puppet@production] pyrra: wdqs match site label with = instead of =~

https://gerrit.wikimedia.org/r/1105791

Change #1105921 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: remove liftwing slos

https://gerrit.wikimedia.org/r/1105921

Change #1105921 merged by Herron:

[operations/puppet@production] pyrra: remove liftwing slos

https://gerrit.wikimedia.org/r/1105921

Mentioned in T368953 as well: the heavy liftwing SLOs have been offboarded for now. They are using a lot of thanos system resources and we think it'll be safest to offboard them during the break. (Apologies for the cross-posts, trying to keep SLO talk here and cache tuning in T368953.)

@fgiunchedi let's meet in Jan to work through options for next steps. Overall I'd like to try and focus on a generalized approach, in other words something that doesn't involve writing/reasoning about special-case recording rules at each SLO onboarding. Maybe stripped-down variants of our heavy metrics or something like that, TBD.

Change #1054617 abandoned by Herron:

[operations/puppet@production] pyrra: onboard wdqs request SLO

https://gerrit.wikimedia.org/r/1054617

Change #1165571 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra-filesystem: clear output file on service stop

https://gerrit.wikimedia.org/r/1165571

Change #1165571 merged by Herron:

[operations/puppet@production] pyrra-filesystem: clear output files on service start

https://gerrit.wikimedia.org/r/1165571

Change #1169234 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] Pyrra-filesystem: purge unmanaged files from config directory

https://gerrit.wikimedia.org/r/1169234

Change #1169234 merged by Elukey:

[operations/puppet@production] Pyrra-filesystem: purge unmanaged files from config directory

https://gerrit.wikimedia.org/r/1169234