
Create SLO dashboard for tone (peacock) check model
Closed, Resolved · Public · 3 Estimated Story Points

Description

As an ML engineer,
I want to have SLO dashboards and alerts for the peacock detection model
so that I can ensure the model server's performance and availability align with the expected objectives, and so that we can proactively resolve any issues.

As part of this task we should explore whether we can create the new dashboards with Pyrra on slo.wikimedia.org instead of Grafana Grizzly, using a similar approach to this patch.
The process for adding SLOs is described in https://wikitech.wikimedia.org/wiki/SLO/Template_instructions
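For reference, Pyrra SLOs are declared as ServiceLevelObjective resources. A minimal availability sketch, assuming Istio request metrics, might look like the following (the metric and label names here are illustrative assumptions, not the actual config from the patch):

```yaml
# Hypothetical Pyrra SLO config; metric and label names are assumptions.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: tonecheck-availability
spec:
  target: "95"     # 95% availability objective
  window: 4w       # rolling four-week window
  indicator:
    ratio:
      errors:
        metric: istio_requests_total{destination_service_name="tonecheck", response_code=~"5.."}
      total:
        metric: istio_requests_total{destination_service_name="tonecheck"}
```

Pyrra then generates the Prometheus recording rules, alerts, and dashboards from this declaration.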

Details

Other Assignee
achou

Related Objects

Event Timeline

isarantopoulos set the point value for this task to 3.
isarantopoulos renamed this task from Create SLO dashboard for peacock detection model model to Create SLO dashboard for peacock detection mode. Apr 17 2025, 11:26 AM
isarantopoulos renamed this task from Create SLO dashboard for peacock detection mode to Create SLO dashboard for peacock detection model.
isarantopoulos renamed this task from Create SLO dashboard for peacock detection model to Create SLO dashboard for tone check model. May 21 2025, 7:45 AM
isarantopoulos renamed this task from Create SLO dashboard for tone check model to Create SLO dashboard for tone (peacock) check model.
isarantopoulos updated the task description.

The first step is to read https://wikitech.wikimedia.org/wiki/SLO/Template_instructions and create a draft. I am available to have a meeting about it, so we can discuss any doubts.

Example of a past SLO: https://wikitech.wikimedia.org/wiki/SLO/Citoid

Hey @elukey, I see that there is already an SLO/EditCheck wikitech page, last edited in December 2024. Should we create another one?

@gkyziridis naming clash :D That SLO is related to a part of Visual Editor, so we should find a different name. Maybe we could come up with a suffix to indicate that these are ML model servers; no strong preference. Do you have any proposal?

I'd suggest ToneCheck, EditCheck_tone or something similar.

I created an initial page for SLO/ToneCheck. It is still in progress; I have filled it in up to the service level indicators.

In order to avoid confusion with any backend/frontend implementation, shall we specifically mention in the title of the SLO that it is about the model service? I was thinking of something like SLO/ToneCheck Model. What do you think?

The SLO draft is now complete: https://wikitech.wikimedia.org/wiki/SLO/ToneCheck
@elukey, we'd appreciate your feedback when you have time :)

In order to avoid confusion with any backend/frontend implementation, shall we specifically mention in the title of the SLO that it is about the model service? I was thinking of something like SLO/ToneCheck Model. What do you think?

I agree! This means we'll need to move the content to a new page like: https://wikitech.wikimedia.org/wiki/SLO/ToneCheck_Model

Another option is that we could use something like SLO/InferenceService/ToneCheck_Model, so we can put all the ML model servers' SLOs under SLO/InferenceService/ in the future.

I would go for a https://wikitech.wikimedia.org/wiki/SLO/ToneCheck_Model page, I already moved the content over there.
@elukey whenever you have time, cast an eye over it and let us know. If it seems ok we can proceed to the deployment (whenever you have time, of course).

@achou @gkyziridis thank you for working on this!

I have a comment regarding the Success Ratio SLI, which is defined as follows in the draft:

Success ratio SLI: All 200 responses over all requests.
Success Ratio SLO: 95%

I think it is problematic, as it counts only the 200 response codes, which means that if we get malformed requests from the user/service side we will end up eating into the available budget.
In my opinion the Service availability SLI, which is set to 95%, is enough.

I think that we would benefit from describing the SLIs clearly in the final section. Instead of having just the % thresholds, we can add a description, like this:

Service Availability SLO: 95% of all requests return a 200/300/400 response code

Latency SLI, acceptable fraction: 90% of all requests return a response within 1000 milliseconds

The descriptions can be totally different; I'm just giving this as an example, because this is the information we want to be highly visible when someone visits this page.

Thanks a lot for working on this! I'd suggest adding a few more details to:

  1. Organizational - in this case, one of the big things we are trying to figure out is whether the ML team needs support from the SRE team when handling alerts/pages/etc. The ML team has its own SREs (one now, but soon two), so in theory it could sustain the load, but the team needs to state this explicitly in the document. I'd also add a bit more detail about what the service does; there is a single sentence, but it may be difficult to fully grasp for people not working with/on ML daily. For example, it would be nice to add some examples of what it does, how it is supposed to be called, etc.
  2. Architectural - I'd add more details about the Swift dependency, including some info about the S3 endpoint and the Thanos cluster.
  3. Client facing - I don't see the Tone Check project mentioned; it would be useful to add more info about the fact that MediaWiki/Visual Editor (IIUC) will probably call the service, etc. And even more details about what a statbox is, how the service is called from there, etc. Again, nothing too deep, but assume that this document could be read and understood by a wide audience, not only ML-tech-savvy people.
  4. Operational - it would be good to understand how the ML team is going to handle alerts from now on. Is there going to be a way to page team members in an on-call rotation during daytime? If not, what is the ops handling of alerts going to be? At the moment, IIUC, we just check IRC; that may or may not work, and needs to be decided :)
achou updated Other Assignee, added: achou; removed: AikoChou. Jun 10 2025, 3:02 PM

I tried to address the above comments.

  • Organizational: the ML team is solely responsible for this service, so I think this is ok:

The Machine Learning (ML) team is responsible for the development, deployment, and maintenance of the ToneCheck Inference Service.

  • Architectural: I've added a description for the k8s clusters, Swift, and the Docker registry.
  • Client facing: I have included a short description of the Tone Check project and how it is going to be used. I removed the reference to statbox, as I don't think it is relevant, and mention that the service can be used by any internal or external developer.
  • Operational: at the moment we get notified of alerts via IRC and email. A team member on ops rotation will triage incoming alerts during working hours (EU daytime). This seems enough to start with, but we're open to suggestions.

Really great updates!

  • Operational: at the moment we get notified of alerts via IRC and email. A team member on ops rotation will triage incoming alerts during working hours (EU daytime). This seems enough to start with, but we're open to suggestions.

I think that there is value in specifying this, since it is an important bit. Moreover, I'd explicitly note that you don't need SRE support, so issues over the weekend and/or during EU night time will be left pending. I am a bit worried that the 95% SLO target may not be enough if an outage could potentially run for hours and hours, so I'd do a simple test: figure out an average number of requests per day, multiply by the whole quarter, and calculate what percentage of requests would be dropped during an entire EU night. After that I think we'll have a clearer idea of whether 95% is enough or not. Lemme know your thoughts :)
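The back-of-envelope test suggested above can be sketched as follows. The traffic figures are hypothetical placeholders; note that with the uniform-traffic assumption the daily volume cancels out, so only the outage duration matters:

```python
# How much of a quarterly 5% error budget would one full EU-night outage
# consume? Traffic numbers below are hypothetical placeholders.

requests_per_day = 500_000          # assumed average daily request volume
days_per_quarter = 90
night_hours = 10                    # outage lasting an entire EU night

quarter_requests = requests_per_day * days_per_quarter
dropped = requests_per_day * night_hours / 24   # uniform-traffic assumption

dropped_fraction = dropped / quarter_requests   # share of all quarterly requests
budget_used = dropped_fraction / 0.05           # share of the 5% error budget

print(f"{dropped_fraction:.3%} of quarterly requests dropped")
print(f"{budget_used:.1%} of the 5% error budget consumed")
```

Under these assumptions a single night-long outage drops roughly 0.46% of quarterly requests, i.e. about 9% of the 5% budget, so a handful of such incidents per quarter would still fit within the target.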

A couple more comments:

  1. The Reconciliation section should state why you took the trade-off between ideal and realistic, namely whether you have worked it out with Editing, etc. It is important so we know why the SLO targets were chosen.
  2. The choice of non-50x responses seems fine to me, but I am a little worried about the latency target, since 20x vs 30x vs 40x have very different performance profiles (20x are the slowest, of course). A mixture of 40x responses could, for example, improve the overall latency, potentially masking real issues with 20x responses. For the latency use case I'd concentrate on 20x only; lemme know your thoughts!

Thanks a lot for the helpful comments Luca and sorry for the delayed response here.

Regarding operations, I assumed that this would be enough, although it is mentioned in the Organizational section:

The Machine Learning (ML) team is solely responsible for the development, deployment, and maintenance of the ToneCheck Inference Service.

In order to be explicit about this I've added the following in the beginning of the Operational section

The ML team will be solely responsible for troubleshooting this service and no additional support from SREs is needed.

  1. Regarding realistic targets, I have updated the section to give a clear description of how we came to this conclusion, and to explicitly mention that this has been discussed with the Editing team (the primary users for now). We've done that based on the number of hours rather than the number of requests, although the calculations would be exactly the same if traffic doesn't fluctuate much over each 24h window.

5% downtime translates to 108 hours of downtime per quarter, or 36 hours per month, which would allow for downtime during a whole weekend. Judging from the past behaviour of this and other services, this seems to be enough. I think we can start with this and re-adjust in subsequent quarters if needed. I'm thinking that needing an even lower SLO would more likely indicate that we should rethink the whole service.
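The figures above can be double-checked with simple arithmetic (assuming a 90-day quarter and a 30-day month):

```python
# Sanity-check of the downtime allowance implied by a 95% availability target.
allowed = 0.05                 # 5% error budget
hours_per_quarter = 90 * 24    # 2160 hours in a 90-day quarter
hours_per_month = 30 * 24      # 720 hours in a 30-day month

budget_quarter = allowed * hours_per_quarter   # ~108 hours per quarter
budget_month = allowed * hours_per_month       # ~36 hours per month
print(budget_quarter, budget_month)
```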

  1. Latency SLO and non-2xx response codes: this makes sense, and I'll change the description so that we report latency only on 20x, unless there is any objection from anyone else on the team.

After a brief IRC discussion with the team I have updated the page to mention only successful requests for the latency SLI.

Change #1165548 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: add tonecheck Pyrra config

https://gerrit.wikimedia.org/r/1165548

Change #1165548 merged by Elukey:

[operations/puppet@production] pyrra: add tonecheck Pyrra config

https://gerrit.wikimedia.org/r/1165548

@elukey I see that the patch has been merged and the dashboards are now available 🎉 Thank you for all the work and the help.
Availability SLO
Latency SLO
I have two questions:

  1. Does this mean they are ready and we can communicate them to other teams or is there anything else to figure out?
  2. I’m having some difficulty interpreting the dashboard and the error budget calculation. I noticed that in the latency dashboard, the error budget has been fluctuating. For reference, I saw values like -12%, then 3%, and later -16%. I was under the impression that the error budget starts at 100% and only decreases as we consume it. Am I correct or am I still missing something in the way I should be reading the dashboards?
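For what it's worth, a negative error budget is expected whenever the observed error rate exceeds what the SLO allows. A minimal sketch of the conventional "budget remaining" formula (Pyrra's exact computation may differ; the rates below are hypothetical):

```python
# Error-budget remaining over a rolling window: 1.0 means untouched,
# 0.0 means exactly used up, and negative means overspent.

def budget_remaining(observed_error_rate: float, slo_target: float) -> float:
    """Fraction of the error budget left, given the SLO target."""
    allowed_error_rate = 1.0 - slo_target
    return 1.0 - observed_error_rate / allowed_error_rate

# Latency SLO of 90% within the threshold -> 10% of requests may be slow.
print(budget_remaining(0.000, 0.90))  # untouched budget (1.0)
print(budget_remaining(0.100, 0.90))  # budget exactly exhausted (~0)
print(budget_remaining(0.112, 0.90))  # overspent: ~ -0.12, shown as -12%
```

So a dashboard value like -12% simply means the window contained about 12% more slow requests than the budget allowed, and the value can climb back above zero as bad datapoints age out of the rolling window.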

@isarantopoulos I am still working on the latency SLO, since we have a problem between Pyrra and the Istio latency metrics, so for the moment I'd say that you can keep an eye only on the availability one. Due to a Pyrra limitation we are not yet able to backfill data, so the "history" goes only as far back as when the config was added.

We offer two kinds of dashboards, a "Rolling" and a "Fixed/Calendar" one; see https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts. Once everything is settled I can take the ML team through those if you want :)

Got it, thanks! I remember you mentioned the lack of backfilling, but I didn't know how that would be reflected in the dashboard. It makes sense now, thanks for clarifying.
We'll keep an eye on the availability dashboard for now.

Change #1170564 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: fix Istio latency metric config with latency_target_requests_regex

https://gerrit.wikimedia.org/r/1170564

Change #1170564 merged by Elukey:

[operations/puppet@production] pyrra: fix Istio latency metric config with latency_target_requests_regex

https://gerrit.wikimedia.org/r/1170564

@isarantopoulos we have improved Pyrra's default config for latency, but after a chat with Valentin we believe that the current error budget graph is due to past datapoints showing high latency. When Pyrra adds new configs, it automatically creates Prometheus recording rules, which aim to provide a more performant version of a target metric+labels combination. The default is to use 4-week time windows for the recording rules (as big as the rolling window), meaning that every new datapoint's rate() is averaged together with the past 4 weeks of datapoints. This is more precise for rates that take long-term fluctuations into consideration, rather than just the immediate past. High p9X percentiles (like p95) show more latency than the SLO's target tolerates. We think that these past datapoints are contributing to the error budget being exhausted now, because we have very few datapoints and the Prometheus recording rules follow "long trends".
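The averaging effect described above can be illustrated with hypothetical numbers (a simplification: the real recording rules compute rates over histogram data rather than averaging percentiles directly):

```python
# Why a 4-week window reacts slowly: new good datapoints are averaged
# together with weeks of old high-latency data. Numbers are hypothetical.

window_days = 28                 # recording-rule window, matching the SLO window
old_bad_days, new_good_days = 21, 7
p95_old, p95_new = 2.0, 0.4      # seconds; assumed old vs current p95 latency

windowed_p95 = (old_bad_days * p95_old + new_good_days * p95_new) / window_days
print(windowed_p95)  # ~1.6 s: a full week of good latency barely moves it
```

Until the bad datapoints age out of the window (or backfilling replaces them), the windowed value stays dominated by the long trend, which matches the exhausted-budget behaviour observed on the dashboard.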

Next steps:

  • We want to try to backfill data in T400071 to see what changes.
  • We may play with shorter time windows for the recording rules, and see how it goes.

Today I tried to review the graphs in the Tone Check's latency SLO page, and this is what I found:

  • The Duration panel is broken due to https://github.com/pyrra-dev/pyrra/issues/667, because it assumes the underlying metric is in seconds while it is in ms (it is an Istio metric). Fixing it doesn't seem easy, but we'll try to work with upstream.
  • The other panels seemed wrong at first, but I checked the data and I think they make sense.
  • The error budget graph is not intuitive: it shows red, then green, then red. The first red block should be related to the limited data ingested into the Prometheus recording rules, and we hope it will get better with the aforementioned backfilling task. The green block seems legit, and so does the second red one; see below.

Something happened on the 24th, causing the error budget to be consumed:

Screenshot From 2025-07-29 16-26-59.png (261 KB)

It seems inline with the Istio dashboard:

Screenshot From 2025-07-29 16-29-48.png (304 KB)

And I see that Ilias deployed on staging and prod the same day, just earlier on: https://sal.toolforge.org/production?p=0&q=edit-check&d=2025-07-24

Do we know what was happening? At the moment we are just in a discovery phase, but in the bright future something like that (if it is legitimate and I am not missing something) will trigger an alert for rapid error budget exhaustion.

@elukey The 24th was when the train containing the change that turned on running the tone check on many edits behind the scenes (so we can tag revisions) reached most wikis. This would have massively increased the load going to the model, as any edit made on wikis in supported languages by users with <100 edits (this includes logged-out users) is going to be checked. See: T388716.

Hey @elukey, thanks for sharing this issue. I have a question: is this issue blocking the A/B testing?
Tagging @SSalgaonkar-WMF.

Hey @gkyziridis, no, this is something related to the SLO itself; we'll need to review the targets, since I think they are too tight. Nothing that should stop the A/B testing.

Hey @elukey.

And I see that Ilias deployed on staging and prod the same day, just earlier on: https://sal.toolforge.org/production?p=0&q=edit-check&d=2025-07-24
Do we know what was happening? At the moment we are just in a discovery phase, but in the bright future something like that (if it is legitimate and I am not missing something) will trigger an alert for rapid error budget exhaustion.

What I see in the git history is the following:

  • On 23-24 July, autoscaling based on the autoscaling.knative.dev/metric: "rps" metric, with target autoscaling.knative.dev/target: "15" and maxReplicas: 3, was deployed.
  • Two weeks ago (13-14 August) we deployed a new image of the edit-check model with a higher limit on the maximum number of characters in the input text (from max_char_length: int = 1000 to max_char_length: int = 2000).

These are the only two changes that happened during the last month.
My thoughts:

  1. The autoscaling enablement happened exactly on the day you mentioned the issue in the comment above (https://phabricator.wikimedia.org/T390706#11043355), but I am not sure if and how it could cause this problem.
  2. The second change, related to max_char_length, could cause latency issues: with a less strict character limit the server accepts more requests (since fewer are filtered out), and the model processes bigger/longer paragraphs as input. Higher throughput combined with bigger inputs makes additional processing time seem reasonable.

I'm resolving this task, as the work to define the SLO and implement the initial dashboards has concluded. There is an open issue regarding the way Pyrra calculates error budgets (T403729: Pyrra calculations for the Initial error budget value of calendar windows) and an evaluation of Sloth as a possible replacement (T404171: Evaluate Sloth as a possible replacement for Pyrra); when those are concluded we'll revisit the dashboards, but the SLO definition will not change.
We have another task to review the tone check SLO in T403378: Review Tone Check Latency SLO and its targets.
Thank you everyone for the work done here!

To keep archives happy, I added a more detailed explanation of the current limits that Pyrra shows in T403729#11200918. As far as I can tell, the goal of setting up the dashboards and the SLO is done, and we'll iterate from now on to evolve our tooling based on the use case (could be migrating to Sloth or something else).

The last step is to set https://wikitech.wikimedia.org/wiki/SLO/ToneCheck_Model into its final approved state. I'll ask the SLO WG for a review and then report back.