
Update WDQS SLOs to reflect graph split changes
In Progress, Medium, Public

Description

The graph split was officially rolled out in ( ). Unfortunately, the SLO metrics queries are still pointing to the defunct full graph.

Creating this ticket to track updating the SLO metrics queries for the split graphs.

Event Timeline


There have been a few ongoing alerts like this for the Search team over the last 12 days:

FIRING: [2x] SLOMetricAbsent: wdqs-update-lag

Is there something that can be done about that? Thanks.

Change #1148974 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove SLI/SLO for public wdqs

https://gerrit.wikimedia.org/r/1148974

bking changed the task status from Open to In Progress. (May 21 2025, 9:59 PM)
bking assigned this task to RKemper.
bking updated Other Assignee, added: bking.
bking updated the task description.

Change #1148974 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove SLI/SLO for public wdqs

https://gerrit.wikimedia.org/r/1148974

Change #1148979 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: nuke previously absented pyrra update lag

https://gerrit.wikimedia.org/r/1148979

Change #1148979 merged by Ryan Kemper:

[operations/puppet@production] wdqs: nuke previously absented pyrra update lag

https://gerrit.wikimedia.org/r/1148979

Change #1155335 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: fork SLOs for wdqs-main and wdqs-scholarly

https://gerrit.wikimedia.org/r/1155335

Change #1155335 merged by Ryan Kemper:

[operations/puppet@production] wdqs: fork SLOs for wdqs-main and wdqs-scholarly

https://gerrit.wikimedia.org/r/1155335

Mentioned in SAL (#wikimedia-operations) [2025-06-18T19:32:45Z] <ryankemper> T393966 Ran puppet on titan1001 following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155335. Puppet looks happy and I see the new recording rules getting created

Change #1161024 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: absent old availability metric

https://gerrit.wikimedia.org/r/1161024

Change #1161025 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove previously-absented slo

https://gerrit.wikimedia.org/r/1161025

Change #1161024 merged by Ryan Kemper:

[operations/puppet@production] wdqs: absent old availability metric

https://gerrit.wikimedia.org/r/1161024

New SLOs/SLIs are in place and old ones have been fully absented.

Not moving this ticket to done yet; there's a remaining task to update https://wikitech.wikimedia.org/wiki/SLO/WDQS to reflect the new changes, as well as getting it formally approved.

Change #1161025 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove previously-absented slo

https://gerrit.wikimedia.org/r/1161025

Change #1165521 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem::slo: fix WDQS SLI

https://gerrit.wikimedia.org/r/1165521

Change #1165521 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem::slo: fix WDQS SLI

https://gerrit.wikimedia.org/r/1165521

@Gehel @RKemper Hi! A while ago I had a chat with Ryan to figure out how to improve the current WDQS SLOs being reported in Pyrra (slo.wikimedia.org). The traffic servers' metrics are currently used, ending up with one SLO for each DC. We'd prefer to use something closer to the service, like nginx metrics on the wdqs hosts (and possibly having a single SLO without splitting by DC, since this is an active/active service). Are those SLOs used at the moment? Namely, are they periodically checked, etc.? Otherwise I'd propose to remove their config to clean up Pyrra's status, and re-add them when you are ready. Lemme know!

@Gehel I see some moving in the sprints tags, is it being planned/worked on? I can take care of the clean up in case it is needed, so we can re-add the SLOs when we are ready.


I've booked some time with rzl to discuss this further and I'll report back, but at first glance I think shifting to nginx metrics makes sense. If we want to tear down the existing config, doing that early next week would be best.

Met with rzl.

Discussion highlights
  • Went over the general philosophical distinction of whether we place the SLO at the tightest service boundary (nginx or Blazegraph itself) versus at the "user experience boundary" (i.e. trafficserver metrics)
    • We're opting for tighter service boundary, so nginx metrics seems like a good starting place for the new SLO
  • Some discussion of whether to count throttled requests as successes or exclude them entirely (they would never count as failures, since throttling is expected behavior). A compelling point was made that if a massive traffic flood caused throttled requests to dwarf total requests, it could "hide" genuine issues with the real (non-throttled) requests. With that in mind, excluding them seems appropriate. (Note: the current SLO implementation, before these changes, just counts them as successes.)
  • We definitely want to be datacenter agnostic as far as the SLO. Of course datacenter-specific metrics are still readily available, just not part of the SLO
  • Loop in the Wikidata PM (& the Wikidata team more broadly) so they're aware of the SLO and the threshold (95%, i.e. the service can be down for a weekend and not violate the SLO)
Practical implementation stuff

We've got done rate, error rate, and throttle rate available already: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-main&from=now-30m&to=now&timezone=utc&var-graph_type=%289102%7C919%5B35%5D%29

So my first approach will be to directly use those. Briefly, availability = done_rate / (done_rate + error_rate - throttle_rate). (This formula assumes that throttled requests show up in the error rate; TODO: validate this.)
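The formula above could be sketched as a Prometheus recording rule along these lines. This is a minimal sketch, not the merged config: the exact metric names (`*_total` suffixes, the throttle metric) and the rule/record names are assumptions, and it carries over the still-unvalidated assumption that throttled requests show up in the error rate.

```yaml
# Hypothetical recording rule for:
#   availability = done_rate / (done_rate + error_rate - throttle_rate)
# Metric and record names are assumed for illustration, not the real config.
groups:
  - name: wdqs_availability_sketch
    rules:
      - record: wdqs:availability:ratio_rate5m
        expr: |
          sum(rate(blazegraph_queries_done_total[5m]))
          /
          (
              sum(rate(blazegraph_queries_done_total[5m]))
            + sum(rate(blazegraph_queries_error_total[5m]))
            - sum(rate(wdqs_throttled_requests_total[5m]))
          )
```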

Also, in the medium-term, even with us using nginx-level metrics at the core DC instead of the traffic POP, we will likely want to make them conditional on the DC being pooled, so that we can exclude stuff like load tests being run on a depooled DC
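A pooled-DC condition could look roughly like the following. Everything here is hypothetical: `discovery_service_pooled` and its `service`/`site` labels stand in for whatever the DNS Discovery state exporter actually exposes, and the join label is assumed.

```yaml
# Hypothetical sketch: only count a DC's query rate while that DC is
# pooled, so load tests against a depooled DC don't pollute the SLI.
# discovery_service_pooled and the "site" join label are assumed names.
- record: wdqs:queries_done:pooled_rate5m
  expr: |
    sum by (site) (rate(blazegraph_queries_done_total[5m]))
    * on (site) group_left
      (discovery_service_pooled{service="wdqs-main"} == bool 1)
```

The `== bool 1` comparison yields 0 or 1 per site, so multiplying zeroes out depooled sites while keeping the series present.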

@RKemper thanks a lot for the update! Just to clarify, the SLO dashboarding will be done using Pyrra (slo.wikimedia.org); we have specific configs that take raw metrics and create Prometheus recording rules, which are then used in Pyrra-UI/Grafana. So we'd need from you the raw metrics only (for availability: error and total).
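For illustration, a Pyrra filesystem SLO config built from an error/total ratio looks roughly like this. It is a sketch under stated assumptions: the metric names, label matchers, object name, and the 4-week window are placeholders (only the 95% target comes from the discussion above), so the real file will differ.

```yaml
# Hypothetical Pyrra ServiceLevelObjective: metric names, labels,
# and window are assumptions for illustration only.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: wdqs-main-availability
spec:
  target: "95"        # 95% availability, per the agreed threshold
  window: 4w
  indicator:
    ratio:
      errors:
        metric: blazegraph_queries_error_total{cluster="wdqs-main"}
      total:
        metric: blazegraph_queries_done_total{cluster="wdqs-main"}
```

Pyrra then generates the burn-rate recording rules and alerts itself, which is why only the raw error and total metrics are needed.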

Change #1198583 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Prometheus metrics for DNS Discovery service state

https://gerrit.wikimedia.org/r/1198583

Change #1198583 merged by CDanis:

[operations/puppet@production] Prometheus metrics for DNS Discovery service state

https://gerrit.wikimedia.org/r/1198583

Working on the new metrics here. The panel labeled "success rate" is what will ultimately be the SLI. There are still a couple of further changes to make:

  • merge the two datacenters' metrics (they're just separated right now while the final query is getting assembled)
  • subtract throttled requests

@dcausse In this updated version of the SLI we don't want to count throttled requests as either a success or failure, but rather exclude them entirely. However I'm having a bit of trouble understanding how all the pieces fit together.

Briefly, when a request comes in from a user and hits the throttling filter, does that avoid the request ever ultimately hitting blazegraph itself? In other words, we have the metrics of blazegraph_queries_done and blazegraph_queries_error which are scraped by the prometheus blazegraph exporter, and I want to know if blazegraph_queries_error implicitly contains the throttled requests in its count or not.

If it looks like request -> throttling filter -> (if not throttled) blazegraph, then I think those are already excluded and therefore I don't have to do anything. But if the throttled request still makes it to blazegraph and blazegraph issues the 4xx at that point, then they would be included. I think it's the former, but figured you might be able to shed some light here.


The metrics exported by the blazegraph exporter coming from the /Query Engine path are internal blazegraph queries and thus cannot see throttled/banned queries; these should contain what you need for your SLO, IIUC.

Change #1202049 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] (wip) wdqs: add availability sli recording rules

https://gerrit.wikimedia.org/r/1202049

The current iteration of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202049/5/modules/profile/files/thanos/recording_rules.yaml has removed the sum and rate functions, since we can rely on pyrra to compute some intermediate metrics from these SLIs.

Note that since the error_total queries (afaik) show up in done_total, we're using done_total to represent all queries and error_total to represent bad (failed) queries.

Change #1202049 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add availability sli recording rules

https://gerrit.wikimedia.org/r/1202049

RKemper renamed this task from "Update WDQS SLO lag queries to reflect graph split changes" to "Update WDQS SLOs to reflect graph split changes". (Thu, Jan 22, 5:21 PM)

Change #1230399 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: make avail SLOs dc & svc agnostic

https://gerrit.wikimedia.org/r/1230399

Change #1230399 merged by Ryan Kemper:

[operations/puppet@production] wdqs: make avail SLOs dc & svc agnostic

https://gerrit.wikimedia.org/r/1230399

Merged the patch for the new SLO (and corresponding recording rules; I realized Pyrra wants things in terms of total and errors, hence two recording rules instead of one).

At first I was thinking of a combined SLO for both main and scholarly, but on further thought it still makes sense to break them out per service; scholarly gets way less traffic, so uniting the two risks us not noticing unacceptably low performance for scholarly.

Getting a patch up to restore both SLOs instead of the single unified one now.

Change #1230672 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] WDQS: separate avail SLOs per service

https://gerrit.wikimedia.org/r/1230672

Change #1235891 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] pyrra: fix wdqs availability SLO config

https://gerrit.wikimedia.org/r/1235891

Change #1235892 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] pyrra: absent old per-dc wdqs availability configs

https://gerrit.wikimedia.org/r/1235892

Change #1235893 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] pyrra: remove previously absented wdqs avail SLOs

https://gerrit.wikimedia.org/r/1235893

Change #1230672 abandoned by Ryan Kemper:

[operations/puppet@production] WDQS: separate avail SLOs per service

Reason:

taking new approach

https://gerrit.wikimedia.org/r/1230672