
Fix EditCheck's SLO metrics and create a dashboard for it
Closed, Resolved (Public)

Description

Hi!

The EditCheck metrics seem to be dropped due to Graphite being read-only, and we don't have anything to measure the SLO stated in https://wikitech.wikimedia.org/wiki/SLO/EditCheck

I had a chat with David on Slack, who in turn posted a comment in #working-with-data, and this was the answer:

https://www.mediawiki.org/wiki/ResourceLoader/Core_modules#mw.track
Looks like you need to prefix with stats.
Also:
Note: statsd.js checks that Prometheus metrics have the correct prefix ("mediawiki_") and suffix ("_total" or "_seconds"). Without these, warnings will be logged or errors will be thrown.
Looks like your usage did not make the list at T350592 for some reason, so maybe that's why?
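
To make the naming rules quoted above concrete, here is a minimal sketch of a name check. The function `checkMetricTopic` is hypothetical and is not the real statsd.js code; it only mirrors the "stats." topic prefix, the "mediawiki_" prefix and the "_total" / "_seconds" suffix mentioned in the note.

```javascript
// Hypothetical helper, NOT the real statsd.js implementation.
// It only encodes the naming rules quoted in the answer above.
function checkMetricTopic( topic ) {
	const problems = [];
	if ( !topic.startsWith( 'stats.' ) ) {
		// Without the topic prefix the remaining rules do not apply.
		problems.push( 'topic must start with "stats."' );
		return problems;
	}
	const name = topic.slice( 'stats.'.length );
	if ( !name.startsWith( 'mediawiki_' ) ) {
		problems.push( 'metric name must start with "mediawiki_"' );
	}
	if ( !/_(total|seconds)$/.test( name ) ) {
		problems.push( 'metric name must end with "_total" or "_seconds"' );
	}
	return problems;
}

// An empty array means the topic follows the quoted rules.
console.log( checkMetricTopic( 'stats.mediawiki_editcheck_preSaveChecks_total' ) );
// A hypothetical old-style topic, used here only to show a failing case.
console.log( checkMetricTopic( 'editcheck.preSaveChecks' ) );
```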

Once that is done I'll create a new dashboard in Grafana and on slo.wikimedia.org for EditCheck, so it will be easier to track the error budget etc.

Lemme know!

Event Timeline

Change #1152386 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] Edit check SLO: migrate old counter stats to statslib

https://gerrit.wikimedia.org/r/1152386

elukey moved this task from Backlog to In Progress on the SRE-SLO board.

@elukey I moved this to our kanban board. David is out on vacation this week. He'll get to that when he is back.

@VPuffetMichel Hi! Is there anybody that can follow up on this task while David is afk? We currently have zero metrics related to the SLO, so we cannot check anything :) In theory there is little work remaining to do in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1152386 (and of course I can help if needed!). Thanks in advance!

Done! The patch is ready to go in my opinion, thanks!

Change #1152386 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Edit check SLO: migrate old counter stats to statslib

https://gerrit.wikimedia.org/r/1152386

@elukey Okay, this has made it to the train for this week, so we should start seeing data come in Tuesday-Thursday.

@DLynch that's great!

Next steps:

  • Wait for the following metrics to pop up in Prometheus: mediawiki_editcheck_preSaveChecks_total (various label values for kind)
  • Create the Pyrra configuration/dashboard based on the above metrics.

Needs a follow-up because the patch isn't logging correctly.

Change #1165583 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] Edit check: fix counter logging for SLO

https://gerrit.wikimedia.org/r/1165583

Change #1165583 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Edit check: fix counter logging for SLO

https://gerrit.wikimedia.org/r/1165583

Change #1165589 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@wmf/1.45.0-wmf.8] Edit check: fix counter logging for SLO

https://gerrit.wikimedia.org/r/1165589

Change #1165589 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@wmf/1.45.0-wmf.8] Edit check: fix counter logging for SLO

https://gerrit.wikimedia.org/r/1165589

Mentioned in SAL (#wikimedia-operations) [2025-07-01T19:16:59Z] <kemayo@deploy1003> Started scap sync-world: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-01T19:19:04Z] <kemayo@deploy1003> kemayo: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-01T19:26:06Z] <kemayo@deploy1003> Finished scap sync-world: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]] (duration: 09m 07s)

Okay, all fixed.

The issue was that we'd merged ve.track( 'stats.mediawiki_editcheck_preSaveChecks_total', { kind: 'Available' } ); when it should have been ve.track( 'stats.mediawiki_editcheck_preSaveChecks_total', 1, { kind: 'Available' } ); because you can't omit the count when providing labels.
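
One way to catch this kind of mistake early is a small guard around the tracking call. The sketch below is only an illustration: `trackCounter` and `fakeTrack` are hypothetical names, and the real statslib handler may validate its arguments differently.

```javascript
// Hypothetical guard around a track function such as ve.track.
// It refuses calls where the count is missing or replaced by the labels object.
function trackCounter( track, topic, count, labels = {} ) {
	if ( typeof count !== 'number' || !Number.isFinite( count ) ) {
		throw new TypeError( 'count must be a number and must come before the labels' );
	}
	track( topic, count, labels );
}

// Stand-in for ve.track so the sketch runs outside MediaWiki.
const recordedCalls = [];
const fakeTrack = ( ...args ) => recordedCalls.push( args );

// Correct form: topic, count, labels.
trackCounter( fakeTrack, 'stats.mediawiki_editcheck_preSaveChecks_total', 1, { kind: 'Available' } );

// Broken form from the first patch: the labels object sits where the count belongs.
let rejected = false;
try {
	trackCounter( fakeTrack, 'stats.mediawiki_editcheck_preSaveChecks_total', { kind: 'Available' } );
} catch ( e ) {
	rejected = e instanceof TypeError;
}
console.log( recordedCalls.length, rejected );
```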

Thanks, I see the metrics now in Prometheus! The next step is for me to create the dashboards; I'll try to do it tomorrow or early next week.

@DLynch Hi! I have a couple of questions for you:

  • This is a preview of the metrics, https://w.wiki/EjUp, could you please check if the current rate is consistent with what you expect as traffic volume?
  • The tool that we use to make SLO dashboards is called Pyrra, and it is the one powering slo.wikimedia.org. It offers a way to create a ratio-based SLO, but the two input parameters are 1) an error metric and 2) the grand total of requests. After checking https://wikitech.wikimedia.org/wiki/SLO/EditCheck I am wondering how to adapt the SLI's specs to Pyrra's requirements. Would it be consistent to set the "Error metric" as mediawiki_editcheck_preSaveChecks_total - mediawiki_editcheck_preSaveChecks_total{kind=~(NotShown|Shown)} and the total as mediawiki_editcheck_preSaveChecks_total?

Thanks in advance for the help!

@elukey I'm not actually sure what rate we're expecting at the moment. That doesn't look implausible, at least.

That wouldn't work for the ratio -- literally anything with a "kind" is a success, and if I'm understanding this query syntax correctly you'd be counting "anything that's not NotShown/Shown" as an error. I think what you need is:

  • Total: mediawiki_editcheck_preSaveChecks_total{kind=Available}
  • Error: mediawiki_editcheck_preSaveChecks_total{kind=Available} - mediawiki_editcheck_preSaveChecks_total{kind~=Available}.

I have realized I was parsing those queries as if they were Lua, but it's probably more likely that =~ means pattern-matching here. In which case the original proposal looks great. :D
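
For anyone comparing the two ratio definitions discussed in these comments, here is a rough arithmetic sketch. The counter values are made up, and it assumes that Available, Shown and NotShown are the only values of the kind label, which this task does not confirm.

```javascript
// Hypothetical counter values per "kind" label; NOT real production data.
const counts = { Available: 100, Shown: 40, NotShown: 58 };

// Sum the counters for a list of kinds (unknown kinds count as 0).
const sumKinds = ( kinds ) => kinds.reduce( ( acc, k ) => acc + ( counts[ k ] || 0 ), 0 );
const allKinds = Object.keys( counts );

// First proposal: total is every kind, errors are everything except Shown/NotShown.
const totalA = sumKinds( allKinds );
const errorsA = totalA - sumKinds( [ 'Shown', 'NotShown' ] );

// Second proposal: total is Available only, errors are Available minus the other kinds.
const totalB = sumKinds( [ 'Available' ] );
const errorsB = totalB - sumKinds( allKinds.filter( ( k ) => k !== 'Available' ) );

console.log( { ratioA: errorsA / totalA, ratioB: errorsB / totalB } );
```

With these made-up numbers the two definitions give different error ratios, so it may be worth confirming which one matches the SLI described on the wikitech page.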

Change #1174748 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos::recording_rules: add two rules for the EditCheck SLO

https://gerrit.wikimedia.org/r/1174748

Change #1174749 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem::slos: add edit-check ratio

https://gerrit.wikimedia.org/r/1174749

Change #1174748 merged by Elukey:

[operations/puppet@production] profile::thanos::recording_rules: add two rules for the EditCheck SLO

https://gerrit.wikimedia.org/r/1174748

Change #1174749 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem::slos: add edit-check ratio

https://gerrit.wikimedia.org/r/1174749

The dashboards are up!

The latter may need more data to be collected, let's wait some days before rechecking.

@DLynch some useful docs: https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts

At the moment the alerts are not enabled :)

@DLynch I was about to close the task but I noticed that the error budget is very much in the red for the last month:
rolling window dashboard.
calendar window dashboard.

Is there anything ongoing? Otherwise the Pyrra calculations may be wrong, in which case we can try to fix them.

Hm. That starts on the 9th, which doesn't conveniently coincide with any config change to launch something... so it would probably have to be something to do with wmf.18 rolling out that wiki on the train. There's not anything inherently suspicious-looking in that release, though. I will dig into it.

@DLynch hi! Any updates? The error budget keeps going down :)

Closed the task in favor of T406836, since the work is done :)