
Evaluate Sloth as a possible replacement for Pyrra
Closed, Resolved · Public

Description

In T403729 some questions were raised about visualization issues with Pyrra dashboards. The SLO WG decided to test and evaluate Sloth, compare the results, and decide which tool is best for the job.

Some high level things to test:

Sloth doesn't provide a UI like Pyrra does; it relies entirely on Grafana. The minimal POC should be something like:

  • 1) Build the latest version of Sloth (proper Debian packaging is not needed at this stage).
  • 2) Run sloth (the CLI) with a YAML spec representing one of the current Pyrra-configured SLOs, such as Citoid availability. The output is another YAML file, containing the list of recording rules that Thanos will have to ingest.
  • 3) Instruct Thanos to pick up the YAML file (bonus: the new metrics should carry a label identifying Sloth-related time series, for easier management).
  • 4) Optional (but it would be good): backfill a couple of months of data for the editcheck pilot recording rules.
  • 5) Import Sloth's Grafana JSON dashboard and visualize the data.
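For steps 1 and 2, the commands could look roughly like the following. This is a sketch, not tested here: `sloth generate` with `-i`/`-o` matches Sloth's documented CLI usage, while the build invocation and file names are assumptions.

```
# Build sloth from source (a Go toolchain is assumed; exact build target may differ).
git clone https://github.com/slok/sloth.git
cd sloth
go build ./cmd/sloth

# Generate Prometheus recording/alerting rules from a hypothetical SLO spec file.
./sloth generate -i citoid-slo.yaml -o citoid-rules.yaml
```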

Event Timeline

As a first, very bare-minimum example I created:

version: "prometheus/v1"
service: "citoid"
labels:
  owner: "sre"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Citoid's SLO based on availability for HTTP request responses."
    sli:
      events:
        error_query: sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[{{.window}}]))
        total_query: sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[{{.window}}]))

And the CLI generated this:

# Code generated by Sloth (sloth-helm-chart-0.12.1-114-g25c5ed0): https://github.com/slok/sloth.
# DO NOT EDIT.

groups:
- name: sloth-slo-sli-recordings-citoid-requests-availability
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[5m])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[5m])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 5m
  - record: slo:sli_error:ratio_rate30m
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[30m])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[30m])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 30m
  - record: slo:sli_error:ratio_rate1h
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[1h])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[1h])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 1h
  - record: slo:sli_error:ratio_rate2h
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[2h])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[2h])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 2h
  - record: slo:sli_error:ratio_rate6h
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[6h])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[6h])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 6h
  - record: slo:sli_error:ratio_rate1d
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[1d])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[1d])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 1d
  - record: slo:sli_error:ratio_rate3d
    expr: |
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", response_code=~"5..", prometheus="k8s"}[3d])))
      /
      (sum(rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway", app="istio-ingressgateway", destination_canonical_service=~"citoid(?:-production)?", prometheus="k8s"}[3d])))
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 3d
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}[30d])
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_window: 30d
- name: sloth-slo-meta-recordings-citoid-requests-availability
  rules:
  - record: slo:objective:ratio
    expr: vector(0.995)
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: slo:error_budget:ratio
    expr: vector(1-0.995)
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{sloth_id="citoid-requests-availability",
      sloth_service="citoid", sloth_slo="requests-availability"}
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_service: citoid
      sloth_slo: requests-availability
  - record: sloth_slo_info
    expr: vector(1)
    labels:
      owner: sre
      sloth_id: citoid-requests-availability
      sloth_mode: cli-gen-prom
      sloth_objective: "99.5"
      sloth_service: citoid
      sloth_slo: requests-availability
      sloth_spec: prometheus/v1
      sloth_version: sloth-helm-chart-0.12.1-114-g25c5ed0
- name: sloth-slo-alerts-citoid-requests-availability
  rules:
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (14.4 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (14.4 * 0.005)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate30m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (6 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (6 * 0.005)) without (sloth_window)
      )
    labels:
      category: availability
      routing_key: myteam
      severity: pageteam
      sloth_severity: page
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate2h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (3 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (3 * 0.005)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate6h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (1 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate3d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (1 * 0.005)) without (sloth_window)
      )
    labels:
      category: availability
      severity: slack
      slack_channel: '#alerts-myteam'
      sloth_severity: ticket
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
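To make the meta recording rules above concrete: with a 99.5% objective the error budget is 0.005, and the current burn rate is simply the short-window error ratio divided by that budget. A quick sketch of the arithmetic (the sample error ratio is illustrative, not from production):

```python
# slo:objective:ratio and slo:error_budget:ratio from the generated rules.
objective = 99.5
error_budget = 0.005          # i.e. 1 - 99.5/100

# slo:current_burn_rate:ratio divides the 5m error ratio by the budget.
sli_error_5m = 0.01           # hypothetical: 1% of requests currently failing
burn_rate = sli_error_5m / error_budget
print(burn_rate)              # 2.0: burning budget twice as fast as sustainable
```

A burn rate of 1.0 means the service would exhaust exactly its budget over the 30d period; anything above 1.0 is unsustainable.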
Error Budget calculations

The Grafana dashboard's JSON shows these two expressions:

          "description": "A rolling window of the total period (30d) error budget remaining.",
          "expr": "slo:period_error_budget_remaining:ratio{sloth_service=\"${service}\", sloth_slo=\"${slo}\"} or on() vector(1)",
          "legendFormat": "Remaining error budget (30d window)",

          "description": "This graph shows the month error budget burn down chart (starts the 1st until the end of the month)",
          "expr": "1-(\n  sum_over_time(\n    (\n       slo:sli_error:ratio_rate1h{sloth_service=\"${service}\",sloth_slo=\"${slo}\"}\n       * on() group_left() (\n         month() == bool vector(${__to:date:M})\n       )\n    )[32d:1h]\n  )\n  / on(sloth_id)\n  (\n    slo:error_budget:ratio{sloth_service=\"${service}\",sloth_slo=\"${slo}\"} *on() group_left() (24 * days_in_month())\n  )\n)",
          "legendFormat": "Remaining error budget (month)",

So we have both views: a calendar one and a rolling one, both covering 30d. The calendar one can be expressed differently to display 3 months, for example:

1-(
  sum_over_time(
    (
       slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"}
       * on() group_left() (
         (month() >= bool vector(${__from:date:M})) and
         (month() <= bool vector(${__from:date:M} + 2))
       )
    )[94d:1h]
  )
  / on(sloth_id)
  (
    slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"} 
    *on() group_left() 
    (24 * (days_in_month(time() - 86400*62) + days_in_month(time() - 86400*31) + days_in_month()))
  )
)

The rolling window is implemented via slo:period_error_budget_remaining:ratio, that is:

expr: 1 - slo:period_burn_rate:ratio{sloth_id="citoid-requests-availability",
  sloth_service="citoid", sloth_slo="requests-availability"}

That in turn is:

slo:sli_error:ratio_rate30d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}
/ on(sloth_id, sloth_slo, sloth_service) group_left
slo:error_budget:ratio{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}

And finally:

- record: slo:sli_error:ratio_rate30d
  expr: |
    sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}[30d])
    / ignoring (sloth_window)
    count_over_time(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"}[30d])

  - record: slo:error_budget:ratio
    expr: vector(1-0.995)

This appears to be a calculation that takes only the rolling window into account, nothing outside of it (see the sum_over_time over the slo:sli_error:ratio_rate5m intervals).
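Put differently, sum_over_time divided by count_over_time is just the arithmetic mean of the 5m error ratios inside the 30d window, and the remaining budget is one minus that mean divided by the budget. A sketch with made-up samples:

```python
# Hypothetical 5m error-ratio samples spanning the 30d window.
samples = [0.0, 0.002, 0.004, 0.002]

# sum_over_time / count_over_time == plain average of the samples.
ratio_rate30d = sum(samples) / len(samples)        # ~0.002

error_budget = 0.005                               # slo:error_budget:ratio (1 - 0.995)
period_burn_rate = ratio_rate30d / error_budget    # ~0.4 of the budget consumed
remaining = 1 - period_burn_rate
print(round(remaining, 6))                         # 0.6: 60% of the budget left
```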

Alerting

Adding some alerting rules to the spec generates the following:

- name: sloth-slo-alerts-citoid-requests-availability
  rules:
  - alert: CitoidHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate5m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (14.4 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (14.4 * 0.005)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate30m{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (6 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (6 * 0.005)) without (sloth_window)
      )
    labels:
      category: availability
      routing_key: myteam
      severity: pageteam
      sloth_severity: page
    annotations:
      summary: High error rate on 'citoid' requests responses
      title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
  - alert: CitoidHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate2h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (3 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (3 * 0.005)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate6h{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (1 * 0.005)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate3d{sloth_id="citoid-requests-availability", sloth_service="citoid", sloth_slo="requests-availability"} > (1 * 0.005)) without (sloth_window)
      )
    labels:
      category: availability
      severity: slack
      slack_channel: '#alerts-myteam'
      sloth_severity: ticket
    annotations:
      summary: High error rate on 'citoid' requests responses
      title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.

These are the multi-window error budget burned alerts, similar to what Pyrra creates. Also worth reading: https://sloth.dev/usage/slo-period-windows/
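The page-severity condition above can be read as: both the 5m and 1h error ratios must exceed 14.4 times the error budget, or both the 30m and 6h ratios must exceed 6 times it. A sketch of that check, with hypothetical error ratios (the helper name is ours, not Sloth's):

```python
ERROR_BUDGET = 0.005   # 1 - 0.995, from the citoid spec

def burning(short_ratio, long_ratio, factor, budget=ERROR_BUDGET):
    """Multiwindow check: both windows must exceed factor * budget."""
    threshold = factor * budget
    return short_ratio > threshold and long_ratio > threshold

# Page if burning 14.4x budget over 5m/1h, OR 6x budget over 30m/6h.
page = burning(0.08, 0.075, 14.4) or burning(0.003, 0.002, 6)
print(page)  # True: both 0.08 and 0.075 exceed 14.4 * 0.005 = 0.072
```

The short window makes the alert fast to fire, the long window makes it resistant to brief spikes; requiring both is what keeps the alert precise.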

I found this very interesting GitHub issue about how Sloth optimizes the calculation of the error budget: https://github.com/slok/sloth/issues/618. I haven't worked out whether this will be a problem for us or not, but I am adding it to the task for reference so we can discuss it.


The formula for the "fixed" error budget of the current month could be translated, with some quirks, into something like the following (not tested):

1-(
  sum_over_time(
    (
       slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"}
       * on() group_left() (
         (year() == ${quarter_year}) and 
         (month() >= ${quarter_start_month}) and 
         (month() <= ${quarter_end_month})
       )
    )[94d:1h]
  )
  / on(sloth_id)
  (
    slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"} 
    * on() group_left() 
    (24 * ${quarter_total_days})
  )
)
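The ${quarter_total_days} variable could be computed once per quarter with the standard library rather than by hand (calendar.monthrange is stdlib; the function name here is ours, chosen to match the template variable):

```python
import calendar

def quarter_total_days(year, start_month):
    """Total days in the three months starting at start_month (1, 4, 7, or 10)."""
    return sum(calendar.monthrange(year, m)[1]
               for m in range(start_month, start_month + 3))

print(quarter_total_days(2024, 1))  # 91: Jan 31 + Feb 29 (leap year) + Mar 31
```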

The downside is that we'll need to fill those variables once per quarter, but that seems doable. The main issue with the current calculations is that they should take the "now" timestamp into account, independently of the from/to values one selects in Grafana. This should give us a quarterly SLO burn-down chart, while also keeping the rolling window and alerting in place.

Today I tried to implement the Sloth rolling window for ToneCheck's latency SLO and I came up with:

1 - (
  (
    sum_over_time((
      (
        sum(rate(istio_request_duration_milliseconds_count{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve", response_code=~"2..", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
        -
        sum(rate(istio_request_duration_milliseconds_bucket{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve",response_code=~"2..", le="1000", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
      )
      /
      sum(rate(istio_request_duration_milliseconds_count{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve", response_code=~"2..", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
    )[30d:5m])
    /
    count_over_time((
      (
        sum(rate(istio_request_duration_milliseconds_count{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve", response_code=~"2..", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
        -
        sum(rate(istio_request_duration_milliseconds_bucket{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve",response_code=~"2..", le="1000", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
      )
      /
      sum(rate(istio_request_duration_milliseconds_count{app="istio-ingressgateway",destination_canonical_service="edit-check-predictor",prometheus="k8s-mlserve", response_code=~"2..", source_workload="istio-ingressgateway",source_workload_namespace="istio-system"}[5m]))
    )[30d:5m])
  )
  /
  vector(1-0.9)
)

The resulting graph is not what I expect, so it needs refinement, posting it in here to have it ready next week!
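For reference, the latency SLI in the query above treats 2xx requests slower than 1000 ms as errors: total 2xx requests minus requests in the le="1000" histogram bucket, over total 2xx requests. The per-sample arithmetic, with hypothetical rates:

```python
# Hypothetical per-5m rates from the istio duration histogram (2xx only).
total_rate = 50.0      # istio_request_duration_milliseconds_count: all 2xx requests/s
under_1s_rate = 48.0   # ..._bucket{le="1000"}: cumulative, all requests <= 1000 ms

# Fraction of requests that breached the 1000 ms latency objective.
error_ratio = (total_rate - under_1s_rate) / total_rate
print(error_ratio)     # 0.04: 4% of requests were slower than 1 second
```

Note that Prometheus histogram buckets are cumulative, so the subtraction is what isolates the slow requests.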

herron claimed this task.
herron updated the task description. (Show Details)
herron subscribed.

The SLO WG has decided to proceed with a production rollout of Sloth, which will be tracked in the parent task!