Page MenuHomePhabricator

The Edit Check's SLO has burned all its error budget
Open, HighPublic

Description

Hi folks!

In T395444 we have set up the following two dashboards:

These are the SLO specs: https://wikitech.wikimedia.org/wiki/SLO/EditCheck

As you can see from the dashboards, the error budget is gone, so we should figure out if this is a problem with the metrics/settings or if there is something going on with the Edit Check's code.

Event Timeline

elukey triaged this task as High priority.

This smells like a metrics issue to me -- note that on the rolling window dashboard, the "Errors" graph is regularly fluctuating between 0 and 50%, and occasionally past 100% (!) whereas in the calendar window dashboard, the "Error ratio" graph is steady between 0 and 0.03%. Those ought to be the same data, modulo the time window.

EditCheck folks do please take a look, but my hunch is that we're looking for a data problem rather than a service availability problem.

The calendar/grafana dashboard uses the following for Errors:

pyrra_errors_total{slo="$slo", site=~"$site", cluster=~"$cluster"} / pyrra_requests_total{slo="$slo", site=~"$site", cluster=~"$cluster"}

That in turn are recording rules:

- expr: sum(editcheck_sli_presavechecks_available_total)
  labels:
    service: edit-check
    slo: edit-check-pre-save-checks-ratio
    team: sre
  record: pyrra_requests_total

- expr: sum(editcheck_sli_presavechecks_shown_vs_available_total or vector(0))
  labels:
    service: edit-check
    slo: edit-check-pre-save-checks-ratio
    team: sre
  record: pyrra_errors_total

Meanwhile the Pyrra UI uses:

sum(rate(editcheck_sli_presavechecks_shown_vs_available_total[3h])) / scalar(sum(rate(editcheck_sli_presavechecks_available_total[3h]))) > 0

This is the definition of the two main "raw" metrics:

- name: editcheck_slos
  rules:
    - record: editcheck_sli_presavechecks_shown_vs_available_total
      expr: |
        (mediawiki_editcheck_preSaveChecks_total{kind="Available"} -
          ignoring(kind) mediawiki_editcheck_preSaveChecks_total{kind="NotShown"} -
          ignoring(kind) mediawiki_editcheck_preSaveChecks_total{kind="Shown"}
        )
    - record: editcheck_sli_presavechecks_available_total
      expr: sum without (kind) (mediawiki_editcheck_preSaveChecks_total{kind="Available"})

All metrics in https://w.wiki/Fh3R

I may have found some inconsistency in https://grafana.wikimedia.org/goto/NsjxWreNR. The editcheck_sli_presavechecks_shown_vs_available_total recording rule is somehow having a gauge-like behavior rather than a counter one, and I am wondering if this messes up a little our calculations. The underlying metrics seem to follow the counter pattern (https://grafana.wikimedia.org/goto/xewrZ9eHg?orgId=1) so it is not clear to me what's happening. I am wondering if this explains the odd rate() behavior that we are seeing in the Pyrra UI..

From conversation with @DLynch we think https://gerrit.wikimedia.org/r/1196940 addresses a possible underlying cause in EditCheck: if the model fetch takes so long that users abandon the page while it's underway, that will count against the SLO, since we already incremented Available but never increment Shown nor NotShown. The fix adds a 6-second timeout (down from 5 minutes, enforced elsewhere).

(This isn't metric-fudging: the timeout improves EditCheck reliability by degrading gracefully so that the slow fetch doesn't block the user experience. That's good!)

But we don't think this is happening more than very occasionally, so if (still speculative) it's the sole cause of the budget burn, then some statistics issue is responsible for overcounting it against the budget.

Quick status update: https://gerrit.wikimedia.org/r/1196940 merged and is going out on the train, so if it turns out to have an impact it should be visible shortly.

No particular change in pattern since last Thursday when the timeout patch would have reached large wikis.

Rather, based on this chart, we seem to have had basically no errors since the 14th -- which doesn't correlate with any particular code rollout or config change.

CleanShot 2025-10-28 at 10.45.19@2x.png (1,152×470 px, 56 KB)

@DLynch I think that the Errors panel in the Pyrra UI is misleading, as I wrote in T406836#11285640 my suspicion is that the new SLI metrics that we created don't behave as counters, but gauges (since every now and then they go back down to zero) and with Promql's rate this gets visualized as spikes. The Grafana dashboard seems to offer a better view:

dashboard.

The error ratio shows the trend that the service experienced recently - from Sept 9th (more or less) the errors rose above the 0.01% mark, crossing the 99% SLO target and causing the budget to be burned. After the 13th/14th (as also the Pyrra UI shows), the trend is better but we are not back under the 0.01% mark so we are not seeing full recovery (namely a slowly but steady progression to 100% error budget left).

So something seems happening but due to the 99% target, it may not be very visible and clear to spot (since the 0.01% allowance is tight). Is there anything that changed around Sept 9/10th by any chance?

Next steps

  • Editing Team to revisit this issue and decide what (if anything) we think could explain the behavior (read: errors) the dashboard is signaling:

image.png (2,972×628 px, 153 KB)

  • In parallel, @elukey: based on what you helpfully hypothesized in T406836#11285640 and T406836#11322293, it sounds like there could be issues within how these metrics are being calculated. What (if any) additional investigation do you think would need to happen before we could confidently say this is or is not contributing to or explaining the behavior we're seeing?

Next steps

  • Editing Team to revisit this issue and decide what (if anything) we think could explain the behavior (read: errors) the dashboard is signaling:

image.png (2,972×628 px, 153 KB)

In terms of timing, the Editing Team will revisit the above in during the work cycle that begins February 2nd of 16th.

Hi @ppelberg! I think that something changed in the presavechecks_shown_vs_available_total SLI metric:

Screenshot From 2026-01-16 12-31-58.png (5,080×2,084 px, 463 KB)

link: https://w.wiki/HVLN

The ratio between errors and total is around 0.20, and the SLO target is 99%, so the error budget is constantly eroded. I'd suggest to check why the ratio between errors and total changed in the past weeks, once we get it we should be able to go back to a normal state. Lemme know!

Hi Editing Team -- is this still on your plate for the current work cycle? Let us know how we can help.

Hi Editing Team -- is this still on your plate for the current work cycle? Let us know how we can help.

@RLazarus: thank you for following up on this! I've just proposed a time for you, David, and I to meet this Thursday so that we can figure out what to do next here.

Had a chat with @CDanis and @RLazarus -- there's a very compelling chart showing that the increased error rates stopped exactly when the patch turning off the tone check a/b test landed.

(Though the errors started a while after the test started, which is confusing -- could be tied to some other increase in load on the LiftWing model, perhaps because of Growth starting their scanning for their suggested tasks? I'm not sure how to check when that started; might be worth hunting down, because it lining up would make model-server-load the overwhelmingly most-likely cause.)

With this, the theory for what's going on is:

  1. A user with tone check available to them clicks "publish"
  2. The request to the tone model takes a long time to run, which delays the UI progressing to either the presave checks or the save dialog
  3. The user exits out of waiting for the presave moment to load (maybe by backing out from the toolbar and trying again, maybe by reloading the page, maybe by closing the tab entirely).

As such, we have plausible next-steps of:

  1. (optional) Someone find out when Growth started scanning for their suggested task.
  2. (optional) Someone dig into the EditAttemptStep data and see whether there's a visible impact that's confined to people enrolled in the a/b test on abort (complete abandonment) or saveIntent (backing out and trying again) events.
  3. Verify that our code setting a timeout on the tone check actually works.
    • if not, fix that
  4. Assuming it works, decide whether the current timeout of 6s represents an acceptable delay in the process.
    • if yes, consider whether we can make UX adjustments to reduce people backing out during this delay, and potentially adjust the SLO to account for this being deemed allowable (we didn't have an actual async check deployed at the time we defined the metric, so we were just guessing about its effects)
    • if not, pick a shorter timeout and see what impact that has

Testing a lot of this is complicated because of the aforementioned closing of the a/b test, so we're not going to get quick feedback on the change.

Change #1242479 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] Edit check: catch various places where an error could derail things

https://gerrit.wikimedia.org/r/1242479

T418173 conveniently made it incredibly apparent what happens when the model for Tone is outright missing, and so I have a patch that will stop that from breaking things. I'm not certain it's 100% the best approach for the SLO per-se, because it does really represent an outage that's affecting the system... but the patch stops it from actually stopping people saving, so.

Change #1242493 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@wmf/1.46.0-wmf.16] Edit check: catch various places where an error could derail things

https://gerrit.wikimedia.org/r/1242493

Change #1242479 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Edit check: catch various places where an error could derail things

https://gerrit.wikimedia.org/r/1242479

Change #1242493 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@wmf/1.46.0-wmf.16] Edit check: catch various places where an error could derail things

https://gerrit.wikimedia.org/r/1242493

Mentioned in SAL (#wikimedia-operations) [2026-02-23T21:27:39Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1240832|Revert "extension-list: add a bogus extension to test l10n-update" (T411516)]], [[gerrit:1241712|Pre-deploy Comparative Reader Research survey on enwiki (T417829)]], [[gerrit:1241713|Pre-deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1242493|Edit check: catch various places where an error could derail things (T406836 T418

Mentioned in SAL (#wikimedia-operations) [2026-02-23T21:41:34Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1240832|Revert "extension-list: add a bogus extension to test l10n-update" (T411516)]], [[gerrit:1241712|Pre-deploy Comparative Reader Research survey on enwiki (T417829)]], [[gerrit:1241713|Pre-deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1242493|Edit check: catch various places where an error could derail things (T406836 T41

As such, we have plausible next-steps of:

  1. (optional) Someone dig into the EditAttemptStep data and see whether there's a visible impact that's confined to people enrolled in the a/b test on abort (complete abandonment) or saveIntent (backing out and trying again) events.

Filed as T418321

@DLynch: thank you for synthesizing all that we discussed in T406836#11634075.

A resulting question for you: with T418321 and T418287 filed and T418173 resolved, what (if anything) do you think is left to be done before this ticket is ready to be Resolved?

Not sure -- based on those charts we're still seeing errors, so we probably need to look into what's going on some more.

slo.wikimedia.org doesn't exist any more, so we might need to work out what that rolling dashboard link should actually go to now

@DLynch hi! We are going to announce it more broadly, but we have now Grafana-based dashboards. Yours are here.

There are two windows: quarterly and rolling 30 days, so you can check both. The current 30d rolling window is -165%, and we don't show (atm) values less than zero in the graphs. Let's keep it monitored and see if it improves :)

Next steps

  • @DLynch to update ticket with outcome of fixes
  • @dchan to follow up w/ Reuven and communicate priority and clarify next steps

I'm always happy to get involved if you need me, but my colleague @CDanis from the SLO working group is your best contact for the followup here. :)