
Review Tone Check Latency SLO and its targets
Closed, Resolved · Public

Authored By: isarantopoulos · Sep 1 2025, 9:50 AM

Description

As an ML engineer,
I want to review the current status of the Tone Check latency SLO defined in T390706: Create SLO dashboard for tone (peacock) check model, so that I can ensure it accurately reflects user experience and reconfigure it to provide actionable insights for improving performance.

As defined in the Tone Check model SLO, we have the following latency SLO:

Latency SLO, acceptable fraction: 90% of all successful requests (2xx) complete within 1000 milliseconds, measured at the server side.

The previous SLO quarter (20250601-20250831) ended on August 31, and we failed to meet the latency target: instead of the targeted 90%, only 79% of requests completed in under 1 second, as shown in the Pyrra dashboard.

As part of this task we would like to:

  • Investigate as much as possible the reasons behind the increased latencies compared to the initial load tests. One issue we have spotted is that the latencies reported by Istio (and the istio sidecar) are completely different from those reported by the kserve inference pods: the latter show very low latencies, while the Istio metrics are much closer to what is actually happening in reality.
  • Revisit both the SLI definition and the SLO, and decide whether we should update one of them or both.

The initial SLI that we had defined was:

Latency SLI, acceptable fraction: The percentage of all successful requests (2xx) that complete within 1000 milliseconds (1 sec), measured at the server side.

In redefining the latency SLO we have the following options:

  1. Change the SLI: increase the latency threshold (the milliseconds defined in the SLI).
  2. Change the SLO: lower the target from the initial 90%.
  3. Change both of the above.
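
For reference, a minimal sketch of how the latency SLI is computed and compared against the SLO target (illustrative only; in practice the numbers come from Istio request-duration metrics via Prometheus/Pyrra, not raw samples):

def latency_sli(latencies_ms, statuses, threshold_ms=1000):
    # Fraction of successful (2xx) requests that completed within threshold_ms.
    successful = [lat for lat, status in zip(latencies_ms, statuses) if 200 <= status < 300]
    if not successful:
        return 1.0
    return sum(1 for lat in successful if lat <= threshold_ms) / len(successful)

# Made-up samples: 7 of 10 requests under 1s -> 70%, which misses a 90% target.
latencies = [250, 430, 980, 1200, 3100, 700, 640, 880, 1500, 910]
statuses = [200] * len(latencies)
sli = latency_sli(latencies, statuses)
print(f"SLI: {sli:.0%}, meets 90% target: {sli >= 0.90}")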

Event Timeline

Hey @elukey, you can hit the model on staging via:

curl -s -X POST \
  "https://inference-staging.svc.codfw.wmnet:30443/v1/models/edit-check:predict" \
  -H "Host: edit-check.edit-check.wikimedia.org" \
  -d '{
    "instances": [
        {
            "lang": "en",
            "check_type": "tone",
            "original_text": "This band was originated at 1992",
            "modified_text": "This band was formed in 1992 in Athens",
            "page_title": "Beatallica"
        }
    ]
}'

I'm just pasting here some notes based on our discussions in IRC so that information doesn't get lost.

Looks like the higher numbers in Istio compared to KServe are probably due to overhead outside of model inference. It could be queueing, time in the batcher (if it's active; it turns out it is not, see T403423: Kserve batcher doesn't seem to be properly configured for edit-check), or just general networking delay.

Also worth noting: this might be tied to autoscaling. If pods are cold or scale up too slowly, requests are queued until a pod starts, which could definitely be adding to the lag and lead to these delays.

We're approaching this one step at a time:

  1. Fix the batching issue.
  2. Experiment: disable autoscaling and set a fixed number of replicas, to see whether the Istio and kserve container metrics are better aligned then.
  3. Based on (2), configure the autoscaling targets and number of replicas, and reach an informed decision on the latency SLO.

The batcher has been fixed and deployed in this ticket: https://phabricator.wikimedia.org/T403423.
Even without disabling autoscaling, we can see that the metrics are more aligned; please check the images below.

Home > Dashboards > Kubernetes > Istio-sidecar

image.png (2×3 px, 684 KB)

image.png (2×3 px, 682 KB)

Home > Dashboards > Kubernetes > Istio > View panel

image.png (2×3 px, 695 KB)

Home > Dashboards > Kubernetes > KServe Inference Services

image.png (2×3 px, 655 KB)

@gkyziridis I checked the last few days of P90 latency for edit-check (Istio gateway perspective) and, after you added the batcher, it doesn't seem to have improved much (high latencies are still registered). Shall we also try the autoscaling path, to see if setting a baseline of 2-3 fixed pods makes any difference?

Hey @elukey, yes, let's go down that path.
What I also see in Knative Serving is that we still have many short-lived pods, as you had already mentioned on IRC.
The current deployment has:

maxReplicas: 3
autoscaling.knative.dev/metric: "rps"
autoscaling.knative.dev/target: "15"

So we completely disable autoscaling and set a fixed 3 pods?
I can open a patch for that ASAP.

We could simply add minReplicas: 3; that would effectively disable autoscaling but leave the current configs in place. What do you think? Anyway, +1 :)

Change #1184498 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[operations/deployment-charts@master] ml-services: Disable autoscaling on edit-check model.

https://gerrit.wikimedia.org/r/1184498

I would also suggest experimenting with autoscaling in staging by tweaking different parts of the config, to understand whether things work properly.
For example: with autoscaling enabled, run a load test that incrementally adds users/clients and check whether the autoscaling targets are respected, basically answering the question: does a new pod get scheduled when we reach 11-12 rps, or much earlier? In production we don't see that many rps, so in theory the deployment shouldn't need to scale up.
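
A sketch of what such an incremental ramp could look like using Locust's LoadTestShape (the stage durations and user counts are illustrative, not an actual test config, and it assumes the existing EditCheckPeacock user class in the locustfile):

# Sketch of a staged ramp for probing when Knative schedules a new pod.
# Stage durations and user counts are illustrative, not a real test config.
from locust import LoadTestShape


class IncrementalRamp(LoadTestShape):
    # Each stage holds a user count long enough for the resulting rps plateau
    # to be compared against the autoscaling target (does a new pod appear at
    # ~11 rps, or much earlier?).
    stages = [
        {"end": 120, "users": 5, "spawn_rate": 1},
        {"end": 240, "users": 10, "spawn_rate": 1},
        {"end": 360, "users": 20, "spawn_rate": 2},
        {"end": 480, "users": 40, "spawn_rate": 2},
    ]

    def tick(self):
        run_time = self.get_run_time()
        for stage in self.stages:
            if run_time < stage["end"]:
                return stage["users"], stage["spawn_rate"]
        return None  # stop the test after the last stage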
+1 on setting a fixed number of replicas. However, while a fixed number of replicas will help us understand whether latency is added by the time it takes for pods to start and the kserve container to become ready, it doesn't help us understand why autoscaling is behaving this way.

To rephrase the question I had above:

Why do we see the number of pods going up and down when the autoscaling target isn't met?
In this example the maximum rps over the last 2 days didn't exceed 2-3, while our target is set to 15, meaning autoscaling should only kick in at roughly 11 rps (presumably 15 rps scaled by Knative's default ~70% target utilization, 0.7 * 15 = 10.5). Is it misconfigured, or is there a sudden spike that isn't captured in the graphs but still triggers it?

Screenshot 2025-09-03 at 3.18.11 PM.png (258×1 px, 39 KB)

Screenshot 2025-09-03 at 3.18.23 PM.png (1×2 px, 98 KB)

Change #1184498 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Disable autoscaling on edit-check model.

https://gerrit.wikimedia.org/r/1184498

Autoscaling disabled and deployed on staging/prod for edit-check.
We can see 3 pods running:

$ kube_env edit-check ml-serve-eqiad
$ kubectl get pods
NAME                                                    READY   STATUS        RESTARTS   AGE
edit-check-predictor-00008-deployment-876d7f68b-hz5n4   4/4     Terminating   0          25h
edit-check-predictor-00009-deployment-95d98477-7c944    4/4     Running       0          40s
edit-check-predictor-00009-deployment-95d98477-rj5fn    4/4     Running       0          40s
edit-check-predictor-00009-deployment-95d98477-sbpcm    4/4     Running       0          41s

$ kube_env edit-check ml-serve-codfw
$ kubectl get pods
NAME                                                     READY   STATUS        RESTARTS   AGE
edit-check-predictor-00008-deployment-7679ff79c4-jsfnn   2/4     Terminating   0          25h
edit-check-predictor-00009-deployment-7f69fb4d9f-7456x   4/4     Running       0          57s
edit-check-predictor-00009-deployment-7f69fb4d9f-hsbhh   4/4     Running       0          57s
edit-check-predictor-00009-deployment-7f69fb4d9f-vv4cg   4/4     Running       0          57s

$ kube_env edit-check ml-staging-codfw
$ kubectl get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
edit-check-predictor-00009-deployment-7bcf95f695-8bn5d   4/4     Running   0          4m58s
edit-check-predictor-00009-deployment-7bcf95f695-k5dxz   4/4     Running   0          4m57s
edit-check-predictor-00009-deployment-7bcf95f695-w5wtk   4/4     Running   0          4m57s

I got the kserve logs for the 3 eqiad pods and filtered them, this is what I found:

root@deploy1003:~# kubectl logs edit-check-predictor-00009-deployment-95d98477-sbpcm -n edit-check  | grep predict_ms | cut -d " " -f 10,11 | sort -n -k 2 | tail -n 10
predict_ms: 911.769390106,
predict_ms: 913.497447968,
predict_ms: 1106.354475021,
predict_ms: 1213.174104691,
predict_ms: 1772.389888763,
predict_ms: 1899.28150177,
predict_ms: 2007.984161377,
predict_ms: 2204.708099365,
predict_ms: 3988.733530045,
predict_ms: 5674.947500229,

root@deploy1003:~# kubectl logs edit-check-predictor-00009-deployment-95d98477-rj5fn -n edit-check  | grep predict_ms | cut -d " " -f 10,11 | sort -n -k 2 | tail -n 10
predict_ms: 605.425834656,
predict_ms: 608.915090561,
predict_ms: 642.387866974,
predict_ms: 727.770328522,
predict_ms: 730.818986893,
predict_ms: 878.833770752,
predict_ms: 1349.320411682,
predict_ms: 1506.88457489,
predict_ms: 2758.105039597,
predict_ms: 3494.561195374,

root@deploy1003:~# kubectl logs edit-check-predictor-00009-deployment-95d98477-7c944 -n edit-check  | grep predict_ms | cut -d " " -f 10,11 | sort -n -k2 | tail -n 10
predict_ms: 326.092004776,
predict_ms: 326.75409317,
predict_ms: 457.978725433,
predict_ms: 529.308319092,
predict_ms: 545.562744141,
predict_ms: 545.922756195,
predict_ms: 600.206375122,
predict_ms: 792.585611343,
predict_ms: 1744.035959244,
predict_ms: 1849.771261215,

Our theory of Istio being the only culprit doesn't seem right, because I can see kserve's predict_ms values of up to 5 seconds in some cases. That is consistent with what Istio reports as P90 for edit-check, so there may also be an issue with the performance of the model server itself? If it is so different from staging, is there anything different in their configs that could explain it?
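
For reference, a small parsing sketch (assuming the predict_ms log format shown above) to turn those filtered kubectl logs into percentiles comparable with the SLI threshold:

# Sketch: read `kubectl logs <pod> -n edit-check | grep predict_ms` output from
# stdin and print latency percentiles, assuming the "predict_ms: <value>,"
# format shown above.
import re
import sys


def percentile(sorted_values, pct):
    # Nearest-rank percentile on an already sorted list.
    idx = max(0, int(round(pct / 100 * len(sorted_values))) - 1)
    return sorted_values[idx]


pattern = re.compile(r"predict_ms:\s*([0-9.]+)")
values = sorted(float(m.group(1)) for line in sys.stdin for m in pattern.finditer(line))
if not values:
    sys.exit("no predict_ms values found on stdin")
print(f"n={len(values)} "
      f"p50={percentile(values, 50):.0f}ms "
      f"p90={percentile(values, 90):.0f}ms "
      f"p99={percentile(values, 99):.0f}ms")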

Our theory of Istio being the only culprit doesn't seem right, because I can see kserve's predict_ms values of up to 5 seconds in some cases

@elukey After George deployed the batcher properly in T403423: Kserve batcher doesn't seem to be properly configured for edit-check, the kserve and Istio latencies are now aligned.
We were discussing yesterday that we need to figure out why the latencies are so far from what the load tests previously reported when we ran them against the deployment in the experimental namespace.
@gkyziridis I think it would make sense to rerun the load tests in staging and then identify any differences.
An additional thing we discussed was that we should start logging (and perhaps exporting as a Prometheus metric) the length of the text submitted with each request, either in characters or tokens. This would help us debug, better understand production traffic, and in turn create load tests that use similarly sized inputs.
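
A minimal sketch of what that logging/metric could look like inside the predictor (the metric name, buckets, and hook point are assumptions, not the actual edit-check code):

# Hypothetical sketch of logging/exporting request input size from the
# predictor. The Histogram name, buckets, and where it gets called are
# assumptions, not the actual edit-check implementation.
import logging

from prometheus_client import Histogram

INPUT_CHARS = Histogram(
    "edit_check_input_characters",
    "Combined length of original_text and modified_text per instance",
    buckets=(100, 250, 500, 1000, 1500, 2000, 4000),
)
logger = logging.getLogger("edit-check")


def record_input_size(instance: dict) -> int:
    # Sum of the two text fields in characters; a token count could be
    # substituted here if running the tokenizer twice is cheap enough.
    size = len(instance.get("original_text", "")) + len(instance.get("modified_text", ""))
    INPUT_CHARS.observe(size)
    logger.info("input_characters: %d", size)
    return size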

Apart from configuration differences we'd also need to check what else has changed in the service -- if anything at all -- that would justify these numbers.

Our theory of Istio being the only culprit doesn't seem right, because I can see kserve's predict_ms values of up to 5 seconds in some cases

@elukey After George deployed the batcher properly in T403423: Kserve batcher doesn't seem to be properly configured for edit-check, the kserve and Istio latencies are now aligned.
We were discussing yesterday that we need to figure out why the latencies are so far from what the load tests previously reported when we ran them against the deployment in the experimental namespace.

Yep, the values I've reported are from after the batcher deployment; my point was that the still-high p90 latency is not only Istio-related but also kserve-related :)

Yes yes. Thanks for pasting the kserve logs. It is now clear that Istio has nothing to do with this; quite the opposite, the Istio dashboards report the real latencies.
While we debug this it is important to keep in mind the flow of requests, which is the following:
istio sidecar -> agent sidecar -> kserve container.
That said, the high latencies reported by the container indicate that we should start by focusing on the actual service (the kserve container) rather than any other part.

A simple and effective debug strategy could be to add logging about the payload received from the client, so that coupling a high-latency predict_ms with its client request becomes easier. Maybe we are getting something unexpected from live traffic that wasn't covered by the load tests.

A simple and effective debug strategy could be to add logging about the payload received from the client, so that coupling a high-latency predict_ms with its client request becomes easier. Maybe we are getting something unexpected from live traffic that wasn't covered by the load tests.

What I found out is that the tests we were running initially were pretty light, sending very small request payloads with only a few tokens. In reality, it seems that we get longer paragraphs in the requests.

Another finding is that we had set the autoscaling target to 15 RPS, which was too high for our current traffic, meaning autoscaling was never triggered correctly because we never reached 15 RPS.

image.png (532×1 px, 98 KB)

We had set the AutoScaling under experimental like this:

metadata:
  annotations:
    autoscaling.knative.dev/metric: rps
    autoscaling.knative.dev/target: "3"

  predictor:
    maxReplicas: 3
    minReplicas: 1
    batcher:
      maxBatchSize: 32
      maxLatency: 100

It seems to work running heavy load tests targeting experimental ns:

image.png (1×2 px, 257 KB)

users = 50
spawn-rate = 15
run-time = 300s
number_of_tokens = 30-180
wait_time = between(0.5, 0.5)

[2025-09-05 07:22:02,855] stat1010/INFO/locust.main: Starting Locust 2.33.2
[2025-09-05 07:22:02,856] stat1010/INFO/locust.main: Run time limit set to 300 seconds
[2025-09-05 07:22:02,856] stat1010/INFO/locust.runners: Ramping to 50 users at a rate of 15.00 per second
[2025-09-05 07:22:05,860] stat1010/INFO/locust.runners: All users spawned: {"EditCheckPeacock": 50} (50 total users)
[2025-09-05 07:27:02,307] stat1010/INFO/locust.main: --run-time limit reached, shutting down

Load test results are within the threshold
[2025-09-05 07:27:02,400] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/edit-check-staging:predict                                           2605     0(0.00%) |   5180     206   15242   4300 |    8.74        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      2605     0(0.00%) |   5180     206   15242   4300 |    8.74        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/edit-check-staging:predict                                                4300   5800   7300   8300  11000  13000  13000  14000  15000  15000  15000   2605
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                           4300   5800   7300   8300  11000  13000  13000  14000  15000  15000  15000   2605

Latency - P90

image.png (752×938 px, 75 KB)

kubectl logs edit-check-predictor-00046-deployment-55d976ccd6-8mljd -n experimental | grep predict_ms | cut -d " " -f 10,11 | sort -n -k2 | tail -n 10
predict_ms: 4622.619152069,
predict_ms: 4663.602828979,
predict_ms: 4686.727046967,
predict_ms: 4688.564777374,
predict_ms: 4942.443847656,
predict_ms: 6137.283086777,
predict_ms: 7969.329118729,
predict_ms: 8913.5658741,
predict_ms: 9004.33588028,
predict_ms: 9060.306072235,

Traffic

image.png (674×930 px, 77 KB)

I feel it is time to experiment with the GPUs as well and see if we still have high latencies.
We can be sure of the following:

  1. The batcher is correctly configured.
  2. Autoscaling is working, but 15 RPS was too high a value for our current traffic.

I feel it is time to experiment with the GPUs as well and see if we still have high latencies.
We can be sure of the following:

  1. The batcher is correctly configured.
  2. Autoscaling is working, but 15 RPS was too high a value for our current traffic.

My 2c: we should concentrate on figuring out the performance of a single pod, and then test/configure autoscaling. The current issue seems to be that some requests take a huge amount of time compared to the others, but we don't know why. Is it something related to specific requests? Or does latency simply grow proportionally with the number of tokens in the request's payload?
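
Once input sizes are logged, one quick way to answer that would be to correlate payload size with predict_ms. An illustrative sketch with made-up samples (numpy only, hypothetical data):

# Illustrative only: check whether predict_ms grows with input size or whether
# a handful of specific requests are outliers. The sample pairs are made up.
import numpy as np

# (combined input characters, predict_ms) pairs, e.g. joined from future logs.
samples = np.array([
    (80, 120), (150, 180), (900, 600), (1200, 950),
    (1800, 1750), (1900, 2100), (200, 5600),  # the last pair: an outlier?
])
chars, predict_ms = samples[:, 0], samples[:, 1]
r = np.corrcoef(chars, predict_ms)[0, 1]
slope, intercept = np.polyfit(chars, predict_ms, 1)
print(f"correlation={r:.2f}, fit: predict_ms ~= {slope:.2f} * chars + {intercept:.0f}")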

@gkyziridis thanks for providing the updated load tests and the graphs! The issue we experienced with autoscaling in production was that pods scaled up although we never reached more than 2-3 rps.
I agree with @elukey that we should focus on the performance of a single pod and we can revisit autoscaling afterwards.
Enabling a GPU will give us reduced latencies, but it doesn't help us better understand production traffic so that we can create load tests that let us make informed decisions w.r.t. throughput, batching, autoscaling, resources, etc. It is straightforward to run a load test in staging using a GPU, so you can go ahead and do that using the same load test as above.
Expanding on what Luca mentioned above, and regardless of GPU usage, I would suggest to:

  1. Log the payload size: the number of tokens for modified_text and original_text would be the best thing to monitor, as that is the actual input to the model. Perhaps the sum would be enough. Alternatively, we can log the length of the strings (again, just the sum), which is more straightforward to do and could be a good proxy. Our goal here is to understand production traffic so that we can create meaningful load tests.
  2. What do the pod resources (CPU, memory) look like in the load test you shared? Was there high usage? Was it constant? Our goal here is to understand whether there is a specific combination of input size and throughput where response latency becomes much slower.

I think the above things will help us understand better the current service's limitations and help us move forward. Additional suggestions are also welcome.

@isarantopoulos I totally agree with this plan.

  1. Log input string lengths (sum)
  2. Monitor the resources and find correlations between input size and resource usage.
  3. Enable GPUs on experimental staging and use the same tests

Some simple, initial indications from a single pod on experimental staging:

# Big Request
# preprocess_ms: 0.045061111, explain_ms: 0, predict_ms: 533.566951752, postprocess_ms: 0.023126602
{
  "instances": [
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "In recent years, the management of the coastal preservation program has come under intense scrutiny from environmental advocates, researchers, and local communities. The authorities responsible for maintaining and protecting the fragile shoreline ecosystems have repeatedly failed to implement timely and effective measures to counteract the accelerating effects of erosion and climate change. Reports from multiple watchdog organizations revealed instances of mismanagement, insufficient budget allocations, and inadequate planning that led to the degradation of several protected zones. Local communities, many of whom rely on the coastal regions for their livelihoods, expressed growing frustration at the government's inability to follow through on previously promised restoration initiatives. Internal audits pointed to significant delays in project execution and a lack of communication among the various agencies involved. Critics argue that if immediate corrective actions are not taken, the damage could become irreversible, leading to the collapse of marine biodiversity and the displacement of coastal populations.",
      "modified_text": "Over the past several years, the coastal preservation program has attracted significant attention from environmental experts, civic leaders, and local residents. While challenges have emerged in addressing shoreline degradation and responding to climate-related threats, the program continues to evolve through policy refinement and stakeholder engagement. Recent findings by environmental groups have highlighted areas for improvement, including the need for more strategic funding allocation and enhanced inter-agency coordination. Recognizing these concerns, regional authorities have initiated dialogue with affected communities and environmental scientists to co-develop actionable plans aimed at sustainable coastal restoration. Although earlier phases of the initiative experienced some delays and resource limitations, the growing collaboration between public officials and environmental groups signals a shift toward more responsive governance. By fostering open communication, embracing transparency, and committing to science-based solutions, the program aims to safeguard biodiversity while supporting resilient coastal communities for generations to come.",
      "page_title": "LLM Generated Text around 1150 characters"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "Researchers, and local communities. The authorities responsible for maintaining and protecting the fragile shoreline ecosystems have repeatedly failed to implement timely and effective measures to counteract the accelerating effects of erosion and climate change. Reports from multiple watchdog organizations revealed instances of mismanagement, insufficient budget allocations, and inadequate planning that led to the degradation of several protected zones. Local communities, many of whom rely on the coastal regions for their livelihoods, expressed growing frustration at the government's inability to follow through on previously promised restoration initiatives. Internal audits pointed to significant delays in project execution and a lack of communication among the various agencies involved. Critics argue that if immediate corrective actions are not taken, the damage could become irreversible, leading to the collapse of marine biodiversity and the displacement of coastal populations.",
      "modified_text": "Attention from environmental experts, civic leaders, and local residents. While challenges have emerged in addressing shoreline degradation and responding to climate-related threats, the program continues to evolve through policy refinement and stakeholder engagement. Recent findings by environmental groups have highlighted areas for improvement, including the need for more strategic funding allocation and enhanced inter-agency coordination. Recognizing these concerns, regional authorities have initiated dialogue with affected communities and environmental scientists to co-develop actionable plans aimed at sustainable coastal restoration. Although earlier phases of the initiative experienced some delays and resource limitations, the growing collaboration between public officials and environmental groups signals a shift toward more responsive governance. By fostering open communication, embracing transparency, and committing to science-based solutions, the program aims to safeguard biodiversity while supporting resilient coastal communities for generations to come. Over the past several years.",
      "page_title": "LLM Generated Text around 2000 characters"
    }
  ]
}

------------------------------

# Small Request
# preprocess_ms: 0.052690506, explain_ms: 0, predict_ms: 121.881723404, postprocess_ms: 0.026702881
{
  "instances": [
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This band was originated at 1992",
      "modified_text": "This band was formed in 1992 in Athens",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This band was originated at 1992",
      "modified_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "modified_text": "This band was originated at 1992",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "modified_text": "This is the most amazing band in all over the world. Definitely the best band!",
      "page_title": "Beatallica"
    }
  ]
}

------------------------------

# Mixed Request with error
# preprocess_ms: 0.080347061, explain_ms: 0, predict_ms: 1296.497821808, postprocess_ms: 0.032901764
{
  "instances": [
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This band was originated at 1992",
      "modified_text": "This band was formed in 1992 in Athens",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This band was originated at 1992",
      "modified_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "modified_text": "This band was originated at 1992",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "This is the greatest band in the world. No one has ever done anything better than this!!",
      "modified_text": "This is the most amazing band in all over the world. Definitely the best band!",
      "page_title": "Beatallica"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "In recent years, the management of the coastal preservation program has come under intense scrutiny from environmental advocates, researchers, and local communities. The authorities responsible for maintaining and protecting the fragile shoreline ecosystems have repeatedly failed to implement timely and effective measures to counteract the accelerating effects of erosion and climate change. Reports from multiple watchdog organizations revealed instances of mismanagement, insufficient budget allocations, and inadequate planning that led to the degradation of several protected zones. Local communities, many of whom rely on the coastal regions for their livelihoods, expressed growing frustration at the government's inability to follow through on previously promised restoration initiatives. Internal audits pointed to significant delays in project execution and a lack of communication among the various agencies involved. Critics argue that if immediate corrective actions are not taken, the damage could become irreversible, leading to the collapse of marine biodiversity and the displacement of coastal populations.",
      "modified_text": "Over the past several years, the coastal preservation program has attracted significant attention from environmental experts, civic leaders, and local residents. While challenges have emerged in addressing shoreline degradation and responding to climate-related threats, the program continues to evolve through policy refinement and stakeholder engagement. Recent findings by environmental groups have highlighted areas for improvement, including the need for more strategic funding allocation and enhanced inter-agency coordination. Recognizing these concerns, regional authorities have initiated dialogue with affected communities and environmental scientists to co-develop actionable plans aimed at sustainable coastal restoration. Although earlier phases of the initiative experienced some delays and resource limitations, the growing collaboration between public officials and environmental groups signals a shift toward more responsive governance. By fostering open communication, embracing transparency, and committing to science-based solutions, the program aims to safeguard biodiversity while supporting resilient coastal communities for generations to come.",
      "page_title": "LLM Generated Text around 1150 characters"
    },
    {
      "lang": "en",
      "check_type": "tone",
      "original_text": "In recent years, the management of the coastal preservation program has come under intense scrutiny from environmental advocates, researchers, and local communities. The authorities responsible for maintaining and protecting the fragile shoreline ecosystems have repeatedly failed to implement timely and effective measures to counteract the accelerating effects of erosion and climate change. Reports from multiple watchdog organizations revealed instances of mismanagement, insufficient budget allocations, and inadequate planning that led to the degradation of several protected zones. Local communities, many of whom rely on the coastal regions for their livelihoods, expressed growing frustration at the government's inability to follow through on previously promised restoration initiatives. Internal audits pointed to significant delays in project execution and a lack of communication among the various agencies involved. Critics argue that if immediate corrective actions are not taken, the damage could become irreversible, leading to the collapse of marine biodiversity and the displacement of coastal populations.",
      "modified_text": "Over the past several years, the coastal preservation program has attracted significant attention from environmental experts, civic leaders, and local residents. While challenges have emerged in addressing shoreline degradation and responding to climate-related threats, the program continues to evolve through policy refinement and stakeholder engagement. Recent findings by environmental groups have highlighted areas for improvement, including the need for more strategic funding allocation and enhanced inter-agency coordination. Recognizing these concerns, regional authorities have initiated dialogue with affected communities and environmental scientists to co-develop actionable plans aimed at sustainable coastal restoration. Although earlier phases of the initiative experienced some delays and resource limitations, the growing collaboration between public officials and environmental groups signals a shift toward more responsive governance. By fostering open communication, embracing transparency, and committing to science-based solutions, the program aims to safeguard biodiversity while supporting resilient coastal communities for generations to come. Over the past several years, the coastal preservation program has attracted significant attention from environmental experts, civic leaders, and local residents. While challenges have emerged in addressing shoreline degradation and responding to climate-related threats, the program continues to evolve through policy refinement and stakeholder engagement. Recent findings by environmental groups have highlighted areas for improvement, including the need for more strategic funding allocation and enhanced inter-agency coordination. Recognizing these concerns, regional authorities have initiated dialogue with affected communities and environmental scientists to co-develop actionable plans aimed at sustainable coastal restoration. Although earlier phases of the initiative experienced some delays and resource limitations, the growing collaboration between public officials and environmental groups signals a shift toward more responsive governance. By fostering open communication, embracing transparency, and committing to science-based solutions, the program aims to safeguard biodiversity while supporting resilient coastal communities for generations to come.",
      "page_title": "LLM Generated Text around 2200 characters"
    }
  ]
}

We can observe that the last request (the mixed one at the bottom of the paste) contains an instance that will return a 400 (more than 2000 characters), and it has the highest latency.

The load tests should focus on single-instance requests instead of multi-instance ones, as this is how requests are sent in prod as well. Batching still happens via the batcher.
We should already be able to report on pod resources. Is this the time window when you ran the load tests https://grafana.wikimedia.org/goto/bTPrpo9Hg?orgId=1? It shows some increased CPU usage and a bit of throttling.
In the new tests we can start with 1-2 rps (or users, as it would be in Locust) and then gradually increase to see where CPU usage climbs. During this investigation we can also allocate more CPU resources to understand performance.

re: logging input. Coming back to this, although we do need to log the input, we can indeed proceed with load tests that are close to 1-2k characters, so we don't have to wait.

We should already be able to report on pod resources. Is this the time window when you ran the load tests https://grafana.wikimedia.org/goto/bTPrpo9Hg?orgId=1? It shows some increased CPU usage and a bit of throttling.

Yep, that was the time window.
I will rerun the Locust tests using that strategy, and I will disable autoscaling in experimental.

I synced with @BWojtowicz-WMF on the current status of this investigation.
I am pasting the latest findings from last week.

  • Tests:
    • The latest tests are configured to be close to reality.
    • Texts between 1k-2k characters: num_words = random.randint(40, 56).
    • Slowly spawn users: spawn-rate = 1-2, users = 30, wait_time = between(0.0, 0.1) or (1, 1) # 100ms or 1s.
  • Infra:
    • Experimental ns.
    • Resources: cpu: 6, memory: 8Gi, maxReplicas: 1
  • Findings:

Thoughts

  • In reality (in prod) we are not experiencing high traffic, but we are experiencing high latencies while receiving only 1-2 RPS.
  • The model is huge, but we were testing it with very small input instances, so the model was just projecting a small input vector into a higher-dimensional space and calculating similarities between small vectors, which is fairly fast on CPU, as we saw.
  • Now the input text is big, close to 2k characters, meaning the model input lives in a much larger space. The model needs to calculate similarities between big vectors (tensors), which, as it seems, is not fast on CPU. So, the model on CPU has to deal with two big vectors (original_text, modified_text) as a pair, which does not seem well suited to a CPU-only operation (a rough timing sketch follows this list).
  • Experiment with parallelism: a single pod with many resources, to see if we still have throttling (not sure if this is going to work or if it is a good idea).
  • Experiment with scaling: fewer resources but more pods:
    • Experiment with different RPS thresholds for autoscaling.
  • Run the same heavy tests on GPU.
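
To make the point about input size concrete, here is a rough, self-contained timing illustration (pure NumPy, not the actual model; the embedding dimension and token counts are arbitrary placeholders) of how the cost of computing token-level similarities on CPU grows with input length:

# Rough illustration (pure NumPy, not the actual model): the cost of computing
# token-level similarities on CPU grows quickly with input length.
import time

import numpy as np


def time_similarity(num_tokens, dim=768, repeats=20):
    rng = np.random.default_rng(0)
    original = rng.standard_normal((num_tokens, dim)).astype(np.float32)
    modified = rng.standard_normal((num_tokens, dim)).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        _ = original @ modified.T  # token-by-token similarity matrix
    return (time.perf_counter() - start) / repeats * 1000


for n in (8, 64, 256, 512):
    print(f"{n:4d} tokens: {time_similarity(n):7.2f} ms per similarity pass")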

After syncing with the Editing team on the latency SLO, we decided to enable a GPU while we investigate the issue. I'm going to open a patch to deploy and test on staging, and we can also deploy to prod after that.
I'll open the patch for the edit-check namespace and I'll also disable autoscaling there.
@BWojtowicz-WMF we can continue the work required for the CPU investigation in the experimental namespace.

Change #1186431 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: enable GPU for tone check in staging

https://gerrit.wikimedia.org/r/1186431

The latest tests are configured to be close to reality.

Texts between 1k-2k characters: num_words = random.randint(40, 56)

@gkyziridis I still see num_words = random.randint(5, 20) in the repo. If you have a patch ready, could you upload it?
Perhaps instead of word counts we can have load tests that use random lengths between 1000-2000 characters, which better reflects the paragraphs we are scoring in production.
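
A hypothetical variant of the locustfile helper along those lines (the helper name, the pseudo-word generation, and the 2000-character cap are assumptions based on this thread, not the actual repo code):

# Hypothetical variant of the locustfile helper: build texts with a random
# length between 1000 and 2000 characters (instead of a word count), staying
# under the service's apparent 2000-character limit mentioned above.
import random
import string


def get_random_input_params(min_chars=1000, max_chars=1999):
    def random_text(length):
        # Space-separated pseudo-words, so the payload tokenizes more like a
        # real paragraph than a single unbroken run of characters would.
        words = []
        while sum(len(w) + 1 for w in words) < length:
            words.append("".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10))))
        return " ".join(words)[:length]

    original = random_text(random.randint(min_chars, max_chars))
    modified = random_text(random.randint(min_chars, max_chars))
    return original, modified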

Change #1186431 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable GPU for tone check in staging

https://gerrit.wikimedia.org/r/1186431

Change #1186447 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: enable GPU for edit-check in prod

https://gerrit.wikimedia.org/r/1186447

I enabled the GPU in staging and used the following load test config to have texts between 1k-2k characters:

import random

def get_random_input_params():
    # Note: despite the name, num_words is effectively a character count here;
    # each text is a single unbroken run of 1000-1998 identical characters.
    num_words = random.randint(1000, 1998)
    original = "".join(["W"] * num_words)
    modified = "".join(["a"] * num_words)
    return original, modified

and got these results:

[2025-09-09 09:18:20,755] stat1008/INFO/locust.runners: All users spawned: {"EditCheckPeacock": 20} (20 total users)
[2025-09-09 09:20:19,212] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-09-09 09:20:19,311] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                   # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|-------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/edit-check:predict           13303     0(0.00%) |    127      57     479    120 |  111.40        0.00
--------|-------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                              13303     0(0.00%) |    127      57     479    120 |  111.40        0.00

Response time percentiles (approximated)
Type     Name                                           50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|-----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/edit-check:predict                  120    160    170    170    180    190    200    210    340    450    480  13303
--------|-----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                     120    160    170    170    180    190    200    210    340    450    480  13303

Proceeding to enable it in production as well.

Change #1186447 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable GPU for edit-check in prod

https://gerrit.wikimedia.org/r/1186447

Change #1186482 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[machinelearning/liftwing/inference-services@main] edit-check: Update locust tests

https://gerrit.wikimedia.org/r/1186482

Tests on experimental staging (CPU).
The patch is ready.

users = 60
spawn-rate = 1
run-time = 200s

num_words = random.randint(40, 56)
wait_time = between(0.0, 0.1)

[2025-09-09 11:34:01,509] stat1010/INFO/locust.main: Starting Locust 2.33.2
[2025-09-09 11:34:01,509] stat1010/WARNING/locust.main: Python 3.9 support is deprecated and will be removed soon
[2025-09-09 11:34:01,509] stat1010/INFO/locust.main: Run time limit set to 200 seconds
[2025-09-09 11:34:01,510] stat1010/INFO/locust.runners: Ramping to 60 users at a rate of 1.00 per second
[2025-09-09 11:35:00,581] stat1010/INFO/locust.runners: All users spawned: {"EditCheckPeacock": 60} (60 total users)
[2025-09-09 11:37:18,405] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-09-09 11:37:18,484] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/edit-check-staging:predict                                            641     0(0.00%) |  14360     366   18836  18000 |    3.41        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       641     0(0.00%) |  14360     366   18836  18000 |    3.41        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/edit-check-staging:predict                                               18000  18000  19000  19000  19000  19000  19000  19000  19000  19000  19000    641
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                          18000  18000  19000  19000  19000  19000  19000  19000  19000  19000  19000    641


After enabling the GPU in the above patch, latencies have been stable and the SLO target is now met: we are close to 99% vs the 90% target. No further action needs to be taken at this point.
https://grafana.wikimedia.org/goto/BsgK3OgDR?orgId=1
https://slo.wikimedia.org/objectives?expr={__name__=%22tonecheck-latency-v1%22,%20revision=%221%22,%20service=%22tonecheck%22,%20team=%22ml%22}&grouping={}&from=now-1h&to=now