
Reduce Pyrra's multi-dc configurations where it makes sense
Closed, Resolved, Public

Description

Many Pyrra configurations are now split by eqiad and codfw in the rolling view (slo.wikimedia.org). In my opinion this is somewhat confusing, since in most cases SLO owners should have a single view.

Event Timeline

elukey triaged this task as High priority.

Change #1166076 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: remove multi-dc for istio-based SLOs

https://gerrit.wikimedia.org/r/1166076

Change #1166135 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: refactor the filesystem class to be more readable

https://gerrit.wikimedia.org/r/1166135

Change #1166149 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: remove multi-dc for wdqs

https://gerrit.wikimedia.org/r/1166149

elukey moved this task from Backlog to In Progress on the SRE-SLO board.

I think we could do it, but before committing to the change could we expand a bit on rationale and side-effects/use cases?

A quick list off the top of my head

  • Rationalize SLOs that differ today between sites - Some SLOs have different error budget values per site; what do we expect to happen when we merge them?
  • Decide if we can apply this across the board - Can we simplify and remove site/datacenter splitting altogether?
  • Review/document side-effects (more opaque SLO alerts, etc)
  • Document appropriate use cases / examples for datacenters/sites in SLOs

@herron sure! My idea stems from the fact that discovery endpoints, like all k8s services, should have a single SLO, since how we pool/depool and manage the backend capacity behind them is not something the user is concerned about. In these cases merging the error budgets should give us a higher-level view of what the user experiences, and it would also give the SLO owner a simpler view.

There are SLOs that need to have multiple error budgets, for example the traffic ones, but I think in that case it is more a requirement from the SLO owners.

Lemme know your thoughts!
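To make the effect of merging per-site error budgets concrete, here is a minimal sketch in Python. It is not how Pyrra computes budgets internally; the request counts, the 99.9% target, and the function names are all assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def budget_remaining(total: int, errors: int, target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    allowed_errors = total * (1 - target)
    if allowed_errors == 0:
        raise ValueError("no requests or a 100% target: budget is undefined")
    return 1 - errors / allowed_errors


def merge_sites(sites: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum (total, errors) across sites, as a single discovery-level SLO would."""
    total = sum(t for t, _ in sites.values())
    errors = sum(e for _, e in sites.values())
    return total, errors


# Hypothetical numbers: codfw serves little traffic but has as many errors.
TARGET = 0.999
sites = {"eqiad": (900_000, 450), "codfw": (100_000, 450)}

per_site = {name: budget_remaining(t, e, TARGET) for name, (t, e) in sites.items()}
merged_total, merged_errors = merge_sites(sites)
merged = budget_remaining(merged_total, merged_errors, TARGET)

print(per_site)
print(merged)
```

With these made-up numbers the merged budget is still positive while codfw alone has overspent its budget several times over. That matches the user's view of a discovery endpoint, and it is also the kind of opacity that the "more opaque SLO alerts" bullet above asks to document.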

Today I reviewed a sample of our published SLO docs. While some do mention 'datacenter' and specific names like 'eqiad' and 'codfw', I didn't see a case where we explicitly document whether the targets are per-site or for all sites. In the varnish SLO I did find mention of potentially both (per-site and aggregate), which is an interesting case to cover as well. And of course it can vary per SLO. Overall it seems a bit of a grey area that we could clarify. I think simplifying like you describe is worth trying, and IMO as we do it let's update the docs to make the datacenter scope that is being implemented and alerted on clearer.

Change #1166076 merged by Elukey:

[operations/puppet@production] pyrra: remove multi-dc for istio-based SLOs

https://gerrit.wikimedia.org/r/1166076

Change #1166135 merged by Elukey:

[operations/puppet@production] pyrra: refactor the filesystem class to be more readable

https://gerrit.wikimedia.org/r/1166135

Change #1166149 abandoned by Elukey:

[operations/puppet@production] pyrra: remove multi-dc for wdqs

Reason:

Will follow up in a task.

https://gerrit.wikimedia.org/r/1166149

Change #1170271 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] pyrra: Limit istio latency SLI queries to a single app

https://gerrit.wikimedia.org/r/1170271

Change #1170271 merged by Vgutierrez:

[operations/puppet@production] pyrra: Limit istio SLI queries to a single app

https://gerrit.wikimedia.org/r/1170271

We discovered the Pyrra bug https://github.com/pyrra-dev/pyrra/issues/667, which affects all the istio-based SLOs. The Pyrra UI assumes that the metrics are in seconds while ours are in ms, so all the calculations it shows are wrong. It seems to be a UI-only issue, so hopefully we can collaborate with upstream on a fix.
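As a rough illustration of the unit mismatch described above, the sketch below reads the same number once as milliseconds and once as seconds. It is not the Pyrra UI code; the 500 ms threshold and the helper name are hypothetical.

```python
def to_seconds(value: float, unit: str) -> float:
    """Convert a latency value to seconds given the unit it was recorded in."""
    divisors = {"s": 1.0, "ms": 1000.0}
    return value / divisors[unit]


# Hypothetical latency threshold exported by the metric in milliseconds.
threshold = 500.0

actual = to_seconds(threshold, "ms")   # what the metric really means
assumed = to_seconds(threshold, "s")   # what a seconds-only reader believes

print(actual, assumed, assumed / actual)
```

Any reader that assumes seconds is off by a factor of 1000, which is why every derived number ends up wrong even though the stored data is correct.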

Change #1170550 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: simplify multi-dc handling for istio SLOs

https://gerrit.wikimedia.org/r/1170550

Change #1170550 merged by Elukey:

[operations/puppet@production] pyrra: simplify multi-dc handling for istio SLOs

https://gerrit.wikimedia.org/r/1170550

The Grafana calendar dashboard is currently broken for Istio based SLOs like citoid: https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarterly-drilldown?forceLogin&from=2025-06-01T00:00:00.000Z&orgId=1&refresh=30s&timezone=utc&to=2025-08-31T23:59:59.000Z&var-cluster=$__all&var-site=$__all&var-slo=citoid-availability

The main issue seems to be that the old time series, shown all together, produce a very confusing view, so we should try to drop them. @herron do you think that is possible?

I had a chat with Filippo, and https://github.com/thanos-io/thanos/issues/1598#issuecomment-2610564533 seems to tell us that it is not really possible :(

We could try to explore https://thanos.io/tip/operating/modify-objstore-data.md/

We should probably come up with some smart filtering in the dashboard to exclude old values.

We solved this by adding versioning to each SLO, and we are going to roll it out very soon. Closing this task.
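One way versioning can hide old time series is sketched below. The task does not describe the actual implementation, so the `slo` and `version` label names and the sample data are assumptions: each series carries a version, and the dashboard keeps only the newest version per SLO.

```python
from typing import Dict, List

Labels = Dict[str, str]


def keep_latest_version(series: List[Labels]) -> List[Labels]:
    """Keep only the series whose version is the highest seen for their SLO."""
    latest: Dict[str, int] = {}
    for labels in series:
        version = int(labels["version"])
        name = labels["slo"]
        latest[name] = max(latest.get(name, version), version)
    return [s for s in series if int(s["version"]) == latest[s["slo"]]]


# Hypothetical series: version 1 was split by site, version 2 is merged.
series = [
    {"slo": "citoid-availability", "version": "1", "site": "eqiad"},
    {"slo": "citoid-availability", "version": "1", "site": "codfw"},
    {"slo": "citoid-availability", "version": "2"},
]

current = keep_latest_version(series)
print(current)
```

The old series stay in storage, which fits the finding above that deleting them from Thanos is not really possible; they are simply excluded at query time.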