
Increased latencies in reference-quality models (ref-need)
Open, Needs TriagePublicBUG REPORT

Description

The reference-quality models (reference-risk and reference-need) are bundled in one service and deployed in the revision-models namespace.

We have observed increased latencies in the preprocess phase of the reference-need model, specifically p75 latency spikes. These can be verified on the KServe inference services dashboard in Grafana.
Additionally, we have detected CPU throttling, which may be contributing to these delays; this is easily observed on the Kubernetes pod details dashboard in Grafana.
The service receives steady traffic in the range of 20-35 req/s.

To tackle the latencies we have:

  1. increased the maxReplicas of the service, initially from 3->5 and then from 5->8, and
  2. increased the CPU requests/limits for the kserve-container from 12->16.

Although the above steps mitigate the issue for a while, they don't fix it: latencies go up again from time to time.

Looking into the code of the service, the following takes place in preprocessing:

  1. A request is made to mwapi to extract information on the current revision (e.g. page_id, revision_id, user_id).
  2. Using the information from the first request, 3 additional requests are made to mwapi concurrently via asyncio.gather.

So in total we have 4 requests to mwapi that are being made for every request to reference-need.
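The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the actual service code: the fetch function and its parameters are placeholders standing in for the real mwapi calls.

```python
import asyncio

async def fetch(session, params):
    # Placeholder for an aiohttp/mwapi call; here we just echo the params.
    await asyncio.sleep(0)
    return params

async def preprocess(session, title):
    # 1st request: resolve the current revision metadata.
    rev = await fetch(session, {"action": "query", "titles": title})
    # 3 follow-up requests issued concurrently with asyncio.gather,
    # so their latencies overlap instead of adding up.
    results = await asyncio.gather(
        fetch(session, {"rev": rev, "prop": "content"}),
        fetch(session, {"rev": rev, "prop": "templates"}),
        fetch(session, {"rev": rev, "prop": "userinfo"}),
    )
    return rev, results
```

Even with the gather, the first request is on the critical path, so preprocess latency is at least two sequential round trips to mwapi.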

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +3 -3
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +3 -3
operations/deployment-charts | master | +6 -6
operations/deployment-charts | master | +19 -9
machinelearning/liftwing/inference-services | main | +19 -6
machinelearning/liftwing/inference-services | main | +95 -27
machinelearning/liftwing/inference-services | main | +26 -3
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +6 -2
machinelearning/liftwing/inference-services | main | +1 -1
machinelearning/liftwing/inference-services | main | +12 -4
operations/deployment-charts | master | +12 -31
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +91 -8
machinelearning/liftwing/inference-services | main | +25 -12
machinelearning/liftwing/inference-services | main | +25 -12
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +2 -1
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +7 -8
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
feat(reference-need): allow batch_size to be set in classify function | repos/research/knowledge_integrity!53 | isaranto | isaranto/batch-ref-need | main

Event Timeline


Change #1125422 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: increase resource quota for revision models ns

https://gerrit.wikimedia.org/r/1125422

Change #1126061 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: revert multiple workers

https://gerrit.wikimedia.org/r/1126061

Change #1126070 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: set reference-need batch size through env var

https://gerrit.wikimedia.org/r/1126070

Change #1126094 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: allow to deploy models separately

https://gerrit.wikimedia.org/r/1126094

Change #1126061 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: revert multiple workers

https://gerrit.wikimedia.org/r/1126061

Change #1126122 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: revert uvicorn multiple workers

https://gerrit.wikimedia.org/r/1126122

Change #1126122 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: revert uvicorn multiple workers

https://gerrit.wikimedia.org/r/1126122

Change #1126094 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] reference-quality: allow to deploy models separately

Reason:

duplicate

https://gerrit.wikimedia.org/r/1126094

We can explore the following options to make the service more reliable:

  1. Separate the 2 model server deployments: reference-risk and reference-need will be deployed in different pods. We can deploy the new services first and then change the header in the API GW so that external users will not need to make any changes. (inf services patch)
  2. Use batching while running inference in reference-need (knowledge integrity patch, inf services patch). At the moment each sentence is scored sequentially, leading to increased predict latencies.
  3. Use multiprocessing in preprocess: this is a bit tricky as we are using async sessions which can't be pickled, so using a process pool is not straightforward. A more delicate approach would be to pass a process pool to knowledge_integrity so that it uses async for the mwapi requests and multiprocessing for everything else.

At the moment we are exploring options 1 and 2, which are more straightforward; we can revisit option 3 if the first two don't bring the service to acceptable performance.
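Option 3 above could look roughly like the sketch below: keep the event loop for the mwapi requests and ship the picklable CPU-bound work to a process pool via run_in_executor. The function names are illustrative, not the knowledge_integrity API.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound_transform(text: str) -> int:
    # Stand-in for the non-mwapi feature extraction work; must be
    # picklable (top-level function, picklable arguments).
    return sum(ord(c) for c in text)

async def preprocess(pool: ProcessPoolExecutor, texts: list) -> list:
    loop = asyncio.get_running_loop()
    # The mwapi calls would stay async here; the CPU-heavy part is
    # shipped to worker processes so it does not block the event loop.
    return await asyncio.gather(
        *(loop.run_in_executor(pool, cpu_bound_transform, t) for t in texts)
    )

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(asyncio.run(preprocess(pool, ["foo", "bar"])))
```

The catch, as noted above, is exactly the pickling constraint: anything crossing the process boundary (arguments and return values) must serialize cleanly, which async sessions do not.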

Change #1126499 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: allow to deploy models separately

https://gerrit.wikimedia.org/r/1126499

Change #1126523 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] (WIP) api-gateway: change hosts for reference-risk/need

https://gerrit.wikimedia.org/r/1126523

re: model deployment separation
I want to propose the following process:

  1. Merge the patch in inference-services that allows the model servers to be deployed separately.
  2. Test it with deployments in ml-staging.
  3. Deploy the new services in the revision-models namespace in production. These will start with only 1 replica (and maxReplicas set to 8).
  4. Make the host-header change in the API gateway.
  5. External traffic coming from the API gateway will start being routed to the new pods, and new replicas will spawn according to the incoming traffic.
  6. Remove the old deployments once they have zero traffic.

Change #1126499 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: allow to deploy models separately

https://gerrit.wikimedia.org/r/1126499

Change #1126545 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: separate deployment for reference quality models

https://gerrit.wikimedia.org/r/1126545

Change #1126545 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: separate deployment for reference quality models

https://gerrit.wikimedia.org/r/1126545

Change #1126523 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: change hosts for reference-risk/need

https://gerrit.wikimedia.org/r/1126523

Change #1126592 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: remove old ref-quality deployment and increase resources

https://gerrit.wikimedia.org/r/1126592

Change #1126592 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: remove old ref-quality deployment and increase resources

https://gerrit.wikimedia.org/r/1126592

We have separated the services, and this solved the problem for reference-risk, which is now served at low latencies: https://grafana.wikimedia.org/goto/vkYvln2NR?orgId=1
Reference-need, however, which was the original problem, still experiences high throttling. I understand that this is because of the heavy lifting done in predict, so I intend to try batch inference for it (related patch).
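Inference batching here means scoring sentences in fixed-size chunks instead of one at a time, so the model runs one forward pass per chunk. A generic sketch (the actual change lives in the knowledge_integrity patch; the scoring function below is a dummy stand-in for the model call):

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

def score_sentences(sentences: list, batch_size: int = 16) -> list:
    scores = []
    for batch in batched(sentences, batch_size):
        # model(batch) would run one forward pass over the whole batch;
        # here a dummy per-sentence score stands in for the model call.
        scores.extend(float(len(s)) for s in batch)
    return scores
```

Per-request latency then scales with the number of batches rather than the number of sentences, at the cost of higher peak memory per forward pass.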

Change #1126070 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: set reference-need batch size through env var

https://gerrit.wikimedia.org/r/1126070

Change #1126945 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: update knowledge integrity

https://gerrit.wikimedia.org/r/1126945

Change #1126945 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: update knowledge integrity

https://gerrit.wikimedia.org/r/1126945

> Reference-need however which was the original problem still experiences high throttling. I understand that this is because of the heavy lifting done in predict

Not sure if this is a solution we'd want to consider at this point, but I wanted to mention that we ran some experiments around optimizing transformer models in T368614: Essential work - model quantization. On re-running them for the reference-need model, we found that converting the model to ONNX and applying all graph optimizations plus dynamic quantization yields a model that is 2x faster than the PyTorch INT8-quantized model being used on Lift Wing, with almost no loss in performance on a test dataset of 15K samples:

image.png (958×1 px, 319 KB)

Change #1127052 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: multiprocessing with process pool for inference

https://gerrit.wikimedia.org/r/1127052

Update:
We have also applied batching and haven't seen any improvement. We have verified that the issue is caused by the blocking inference code, so we're focusing on that.

@MunizaA that sounds like a great approach! thanks for sharing. I'll explore ONNX, let me know if there is any code snippet for the specific model.

On top of that, we will also explore multiprocessing (starting from the above patch), but for now it yields the following serialization error:

RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).

I tried calling .detach() on the tensor before returning it across the process boundary, with no luck. I am using a local copy of the knowledge-integrity repo so I can change things.

I'll look into ONNX and then go back to multiprocessing, this time using torch.multiprocessing. Our last resort will be utilizing a GPU.

> @MunizaA that sounds like a great approach! thanks for sharing. I'll explore ONNX, let me know if there is any code snippet for the specific model.

Hi, that would be under Comparison with ONNX (optimized and quantized) model here. The required dependencies and their specific versions are listed under Setup.

Thanks @MunizaA I will take a look and try it

I started looking into where time is spent during the predict function, as there are feature transformations taking place that could benefit from multiprocessing.
I tested various batch sizes (batch sizes >32 didn't bring any improvements for the samples I tried). All tests used an enwiki revision_id, 1278579810, which is quite big (133KB).
I tried pickling the features so that with use_pickled_features=True the features were loaded from a pickled file and only the predict code was run.
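The use_pickled_features toggle can be approximated with a small cache helper like the one below. This is a hypothetical sketch, not the service code; extract_features here is a dummy stand-in for the slow wikitext parsing path being measured.

```python
import pickle
from pathlib import Path

def extract_features(rev_id: int) -> dict:
    # Stand-in for the expensive wikitext parsing + sentence extraction.
    return {"rev_id": rev_id, "sentences": ["..."]}

def get_features(rev_id: int, cache_path: Path,
                 use_pickled_features: bool = False) -> dict:
    if use_pickled_features and cache_path.exists():
        # Skip extraction entirely so the benchmark times only predict.
        return pickle.loads(cache_path.read_bytes())
    features = extract_features(rev_id)  # the slow path being measured
    cache_path.write_bytes(pickle.dumps(features))
    return features
```

Comparing timings with and without the cached features isolates the predict cost from the feature-extraction cost, which is exactly the split the table below shows.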

batch_size | use_pickled_features | best_time_sec
1          | True                 | 1.11
1          | False                | 1.5
4          | True                 | 0.83
4          | False                | 1.22
8          | True                 | 0.81
8          | False                | 1.21
16         | True                 | 0.81
16         | False                | 1.2
32         | True                 | 0.92
32         | False                | 1.34

It seems like a significant amount of time is spent on extracting the features.

To dig a bit deeper I profiled the classify function

%prun -s cumulative -l 10 classify(ref_obj, rev, batch_size=16)

   3159475 function calls (2815335 primitive calls) in 2.244 seconds

   Ordered by: cumulative time
   List reduced from 577 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.244    2.244 {built-in method builtins.exec}
        1    0.000    0.000    2.244    2.244 <string>:1(<module>)
        1    0.000    0.000    2.244    2.244 reference_need.py:73(classify)
        1    0.000    0.000    1.406    1.406 reference_need.py:49(extract_features)
        1    0.000    0.000    1.406    1.406 featureset.py:301(get_features)
        1    0.004    0.004    1.406    1.406 reference_need.py:37(_transformed_ref_need_features)
        1    0.003    0.003    1.401    1.401 utils.py:125(extract_sentences)
37670/160    0.132    0.000    1.080    0.007 utils.py:37(parse_anything)
      160    0.000    0.000    1.078    0.007 __init__.py:68(parse)
        1    0.000    0.000    0.838    0.838 bert.py:8(predict_sentences_scores)
%prun -s cumulative -l 10 classify(ref_obj, rev, batch_size=16, use_pickled_features=True)

        29338 function calls (28206 primitive calls) in 0.835 seconds

   Ordered by: cumulative time
   List reduced from 287 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.835    0.835 {built-in method builtins.exec}
        1    0.000    0.000    0.835    0.835 <string>:1(<module>)
        1    0.000    0.000    0.835    0.835 reference_need.py:73(classify)
        1    0.000    0.000    0.834    0.834 bert.py:8(predict_sentences_scores)
        1    0.000    0.000    0.834    0.834 text_classification.py:121(__call__)
        1    0.000    0.000    0.834    0.834 base.py:1019(__call__)
        1    0.000    0.000    0.834    0.834 base.py:1063(<listcomp>)
  230/115    0.000    0.000    0.834    0.007 pt_utils.py:117(__next__)
  149/115    0.000    0.000    0.829    0.007 {built-in method builtins.next}
        8    0.000    0.000    0.786    0.098 base.py:981(forward)

In this specific example, anything related to predict takes ~800ms and feature extraction ~1.4 seconds.
I ran this on ml-lab, which is a beefy machine, so results on localhost or on a pod would be significantly worse (running inference for this sample locally takes ~5s).

I will look into ONNX for now but I would conclude that multiprocessing would help here.
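For reference, the %prun profiles above can be reproduced outside IPython with the stdlib cProfile/pstats modules; this sketch uses a dummy workload in place of the real classify call.

```python
import cProfile
import io
import pstats

def classify(n: int = 1000) -> int:
    # Placeholder workload standing in for the real classify() call.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
classify()
profiler.disable()

buf = io.StringIO()
# Equivalent of `%prun -s cumulative -l 10`: sort by cumulative time
# and print the top 10 entries.
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```

Sorting by cumulative time is what surfaces extract_sentences/parse_anything as the dominant cost in the traces above, since it attributes callee time to the callers on the hot path.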

Change #1127494 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: apply inference batching on reference-need

https://gerrit.wikimedia.org/r/1127494

Change #1127494 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: apply inference batching on reference-need

https://gerrit.wikimedia.org/r/1127494

Change #1127530 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: increase cpu resource_quota for revision-models

https://gerrit.wikimedia.org/r/1127530

Change #1127541 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase ref-risk autoscaling

https://gerrit.wikimedia.org/r/1127541

Change #1127541 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase ref-risk autoscaling

https://gerrit.wikimedia.org/r/1127541

Change #1127530 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: increase cpu resource_quota for revision-models

https://gerrit.wikimedia.org/r/1127530

Change #1127052 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] reference-quality: multiprocessing with process pool for inference

Reason:

reimplemented in a different way

https://gerrit.wikimedia.org/r/1127052

Change #1128414 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-need: multiprocessing in predict

https://gerrit.wikimedia.org/r/1128414

Change #1128414 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-need: multiprocessing in predict

https://gerrit.wikimedia.org/r/1128414

Pasting some raw load test results; sorry for the awful format. I'm running some more tests on ml-staging and will report back.

Specific rev id

lang:en
revision_id: 1278579810
size: 133464

before

Type     Name                                  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/reference-need:predict         48     3(6.25%) |  28576     309   54514  28000 |    0.82        0.05
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                48     3(6.25%) |  28576     309   54514  28000 |    0.82        0.05

Response time percentiles (approximated)
Type     Name                                          50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/reference-need:predict           28000  38000  42000  44000  50000  53000  55000  55000  55000  55000  55000     48
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                  28000  38000  42000  44000  50000  53000  55000  55000  55000  55000  55000     48

After

[2025-03-17 15:27:25,159] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-03-17 15:27:25,212] wmf3251/INFO/locust.main: Shutting down (exit code 1)
Type     Name                                  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/reference-need:predict         76     2(2.63%) |  23856     338   35125  27000 |    1.28        0.03
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                76     2(2.63%) |  23856     338   35125  27000 |    1.28        0.03

Response time percentiles (approximated)
Type     Name                                          50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/reference-need:predict           27000  30000  32000  32000  33000  34000  34000  35000  35000  35000  35000     76
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                  27000  30000  32000  32000  33000  34000  34000  35000  35000  35000  35000     76

standard data - sample_all.tsv

Before

[2025-03-17 15:20:20,941] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/reference-need:predict         43     0(0.00%) |   2307     445    6491   1500 |    0.36        0.00
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                43     0(0.00%) |   2307     445    6491   1500 |    0.36        0.00

Response time percentiles (approximated)
Type     Name                                          50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/reference-need:predict            1500   2700   4000   4100   4800   4900   6500   6500   6500   6500   6500     43
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                   1500   2700   4000   4100   4800   4900   6500   6500   6500   6500   6500     43

After

[2025-03-17 15:15:25,906] wmf3251/INFO/locust.runners: All users spawned: {"ReferenceNeed": 2} (2 total users)
[2025-03-17 15:17:25,563] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-03-17 15:17:25,609] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/reference-need:predict         55     0(0.00%) |   1394     426    4940   1000 |    0.47        0.00
--------|------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                55     0(0.00%) |   1394     426    4940   1000 |    0.47        0.00

Response time percentiles (approximated)
Type     Name                                          50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/reference-need:predict            1000   1500   1900   2100   2500   3300   3900   4900   4900   4900   4900     55
--------|----------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                   1000   1500   1900   2100   2500   3300   3900   4900   4900   4900   4900     55

Change #1128824 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] reference-quality: multiprocessing - do not use process pool for workers=1

https://gerrit.wikimedia.org/r/1128824

Change #1128824 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: multiprocessing - do not use process pool for workers=1

https://gerrit.wikimedia.org/r/1128824

I tried running some load tests on ml-staging experimental using 2 workers and multiprocessing, and I saw really high CPU throttling.
I assume that each worker is trying to use all available CPU cores (16).
torch uses ONM_NUM_THREADS, which would automatically be set to 16. I tried setting this to 7 but didn't get any better results.
Will try again and report back.

Screenshot 2025-03-18 at 8.11.58 PM.png (530×1 px, 64 KB)

@isarantopoulos the CPU throttling can be a bit misleading sometimes (see this for more info); in a lot of cases a high number of threads may cause high throttling, since the overall "cpu time across multiple cpus" (pardon the phrasing, it is not 100% correct) crosses the allowed threshold. Basically in this case you have a limited time window in which threads are allowed to run on multiple CPUs, and the more threads there are, the easier it is for that window to be consumed in a short burst (without the threads being able to make much progress on their work). It is as if each thread gets a little "slice" of time to run, burning through the allowed overall time very quickly.

One thing that I'd do is repeat the test and inspect what threads/processes are running when you see high CPU throttling. My fear is that it may not be only ONM_NUM_THREADS. I am available to check when you load test, lemme know!

@elukey thanks for chiming in, this is very useful!
I noticed 2 things I got wrong:

  1. above I mentioned ONM_NUM_THREADS instead of OMP_NUM_THREADS, which is the correct variable name, so I went back to retest, BUT
  2. even if we set this through the environment, the value will be overwritten at runtime due to this

I'll issue a workaround patch to retest it. Is there any way I can check the number of threads running, given that I don't have top/htop etc. available in the container? Otherwise let's sync over IRC; I can leave a long-running load test on that will melt the pod.
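On the thread-count question: without top/htop in the container, one stdlib option is reading /proc/self/status (Linux only) for the OS-level thread count, or threading.active_count() for Python-level threads. A sketch:

```python
import threading
from pathlib import Path

def os_thread_count() -> int:
    """Number of OS threads of this process, per /proc (Linux only)."""
    for line in Path("/proc/self/status").read_text().splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise RuntimeError("Threads: line not found in /proc/self/status")

print("python threads:", threading.active_count())
print("os threads:", os_thread_count())
```

The /proc count includes threads spawned by native libraries (e.g. torch's OpenMP pool), which threading.active_count() does not see, so the gap between the two numbers is itself informative here.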

Change #1130089 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: udpate ml-staging ref-need deployment

https://gerrit.wikimedia.org/r/1130089

I tried multiprocessing and ran some load tests under heavy load (50 concurrent users), a scenario under which we currently see CPU throttling.
In a pod with 16 CPU cores, setting NUM_THREADS to 7 for each of the 2 workers is the sweet spot at which we see no throttling.
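The sweet spot above follows from splitting the CPU budget between the workers and leaving a little headroom per worker. A hypothetical helper showing the arithmetic (the deployment just sets the env var directly):

```python
def threads_per_worker(cpu_limit: int, workers: int, headroom: int = 1) -> int:
    """Split the pod's CPU limit across workers, reserving `headroom`
    cores per worker for the event loop / IO, never going below 1."""
    return max(1, cpu_limit // workers - headroom)

# 16 CPUs shared by 2 workers -> 7 threads each, matching the load test.
print(threads_per_worker(16, 2))
```

Without the headroom, 2 workers x 8 threads would fully subscribe the 16-core limit and the cfs quota would be consumed in bursts, which is the throttling pattern described earlier.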

Load Test Comparison Summary

All the failures in the multiprocessing runs are 400 error responses; we didn't get a single 500.

Workers | File | Requests | Fails (%) | Avg (ms) | Min (ms) | Max (ms) | Median (ms) | RPS | Fail/s | P90 (ms) | P99 (ms)
1 | ref_need_errors | 301 | 114 (37.9%) | 45325 | 635 | 60897 | 53000 | 1.01 | 0.38 | 60000 | 60000
2 | ref_need_errors | 493 | 7 (1.4%) | 28070 | 120 | 60001 | 29000 | 1.65 | 0.02 | 39000 | 52000
1 | sample_all | 2565 | 54 (2.1%) | 5594 | 367 | 21472 | 5300 | 8.56 | 0.18 | 6700 | 17000
2 | sample_all | 3638 | 82 (2.3%) | 3909 | 96 | 6656 | 4000 | 12.15 | 0.27 | 4800 | 5600
1 | test_en | 234 | 190 (81.2%) | 54899 | 2455 | 60159 | 60000 | 0.81 | 0.66 | 60000 | 60000
2 | test_en | 251 | 0 (0.0%) | 53114 | 2850 | 59717 | 59000 | 0.84 | 0.00 | 59000 | 60000

On the left we can see a single process throttling while on the right we have 0 throttling even under heavy load.

Screenshot 2025-03-21 at 12.44.30 PM.png (254×580 px, 35 KB)
Screenshot 2025-03-21 at 1.15.04 PM.png (273×635 px, 32 KB)

Extended results:

1# Load test results
2
350 users, 5 minutes, 16CPUs
4
5# workers = 1 | ref_need_errors.csv
6
7```markdown
8Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
9--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
10POST /v1/models/reference-need:predict 301 114(37.87%) | 45325 635 60897 53000 | 1.01 0.38
11--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
12 Aggregated 301 114(37.87%) | 45325 635 60897 53000 | 1.01 0.38
13
14Response time percentiles (approximated)
15Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
16--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
17POST /v1/models/reference-need:predict 53000 60000 60000 60000 60000 60000 60000 60000 61000 61000 61000 301
18--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
19 Aggregated 53000 60000 60000 60000 60000 60000 60000 60000 61000 61000 61000 301
20
21Error report
22# occurrences Error
23------------------|-------------------------------------------------------------------------------------------------------------------------------------------
24114 POST /v1/models/reference-need:predict: RetriesExceeded('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', 0, original=The read operation timed out)
25------------------|--------------------
26```
27
28
29# workers = 1 | sample_all.csv
30
31```markdown
32Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
33--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
34POST /v1/models/reference-need:predict 2565 54(2.11%) | 5594 367 21472 5300 | 8.56 0.18
35--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
36 Aggregated 2565 54(2.11%) | 5594 367 21472 5300 | 8.56 0.18
37
38Response time percentiles (approximated)
39Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
40--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
41POST /v1/models/reference-need:predict 5300 5700 6000 6200 6700 7500 15000 17000 21000 21000 21000 2565
42--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
43 Aggregated 5300 5700 6000 6200 6700 7500 15000 17000 21000 21000 21000 2565
44
45Error report
46# occurrences Error
47------------------|-------------------------------------------------------------------------------------------------------------------------------------------
4854 POST /v1/models/reference-need:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', code=400)
49------------------|-------------------------------------------------------------------------------------------------------------------------------------------
50
51```
52
# workers = 1 | test_en.csv

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 234 190(81.20%) | 54899 2455 60159 60000 | 0.81 0.66
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 234 190(81.20%) | 54899 2455 60159 60000 | 0.81 0.66

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 234
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 234

Error report
# occurrences Error
------------------|-------------------------------------------------------------------------------------------------------------------------------------------
190 POST /v1/models/reference-need:predict: RetriesExceeded('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', 0, original=The read operation timed out)
------------------|-------------------------------------------------------------------------------------------------------------------------------------------

```

# workers = 2 | ref_need_errors.csv | NUM_THREADS=7

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 493 7(1.42%) | 28070 120 60001 29000 | 1.65 0.02
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 493 7(1.42%) | 28070 120 60001 29000 | 1.65 0.02

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 29000 32000 34000 35000 39000 42000 52000 52000 60000 60000 60000 493
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 29000 32000 34000 35000 39000 42000 52000 52000 60000 60000 60000 493

Error report
# occurrences Error
------------------|-------------------------------------------------------------------------------------------------------------------------------------------
6 POST /v1/models/reference-need:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', code=400)
1 POST /v1/models/reference-need:predict: RetriesExceeded('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', 0, original=The read operation timed out)
------------------|-------------------------------------------------------------------------------------------------------------------------------------------

```

# workers = 2 | sample_all.csv | NUM_THREADS=7

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 3638 82(2.25%) | 3909 96 6656 4000 | 12.15 0.27
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 3638 82(2.25%) | 3909 96 6656 4000 | 12.15 0.27

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 4000 4300 4400 4500 4800 5200 5400 5600 6200 6700 6700 3638
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 4000 4300 4400 4500 4800 5200 5400 5600 6200 6700 6700 3638

Error report
# occurrences Error
------------------|-------------------------------------------------------------------------------------------------------------------------------------------
82 POST /v1/models/reference-need:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', code=400)
------------------|-------------------------------------------------------------------------------------------------------------------------------------------

```

# workers = 2 | test_en.csv | NUM_THREADS=7

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 251 0(0.00%) | 53114 2850 59717 59000 | 0.84 0.00
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 251 0(0.00%) | 53114 2850 59717 59000 | 0.84 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 59000 59000 59000 59000 59000 60000 60000 60000 60000 60000 60000 251
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 59000 59000 59000 59000 59000 60000 60000 60000 60000 60000 60000 251

```

# workers = 2 | test_en.csv | NUM_THREADS=7 | timeout=20seconds

```markdown
[2025-03-21 11:38:57,645] stat1008/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 708 697(98.45%) | 19950 6679 20309 20000 | 2.37 2.33
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 708 697(98.45%) | 19950 6679 20309 20000 | 2.37 2.33

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 708
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 708

Error report
# occurrences Error
------------------|-------------------------------------------------------------------------------------------------------------------------------------------
697 POST /v1/models/reference-need:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', code=504)
------------------|-------------------------------------------------------------------------------------------------------------------------------------------

```

# 1 user | 1 worker | ref_need_errors.tsv

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 227 0(0.00%) | 1152 140 18422 410 | 0.77 0.00
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 227 0(0.00%) | 1152 140 18422 410 | 0.77 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 410 650 850 1300 3300 4700 7600 9200 18000 18000 18000 227
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 410 650 850 1300 3300 4700 7600 9200 18000 18000 18000 227

```

# 1 user | 2 workers | ref_need_errors.tsv

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 162 1(0.62%) | 1643 129 29529 460 | 0.56 0.00
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 162 1(0.62%) | 1643 129 29529 460 | 0.56 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 470 850 1200 1700 4600 6700 13000 14000 30000 30000 30000 162
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 470 850 1200 1700 4600 6700 13000 14000 30000 30000 30000 162

Error report
# occurrences Error
------------------|-------------------------------------------------------------------------------------------------------------------------------------------
1 POST /v1/models/reference-need:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-need:predict', code=400)
------------------|-------------------------------------------------------------------------------------------------------------------------------------------

make: *** [Makefile:19: run-locust-test] Error 1
```

# 1 user | 1 worker | test_en.tsv

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 126 0(0.00%) | 2214 1959 2801 2200 | 0.42 0.00
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 126 0(0.00%) | 2214 1959 2801 2200 | 0.42 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 2200 2300 2300 2300 2400 2400 2600 2600 2800 2800 2800 126
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 2200 2300 2300 2300 2400 2400 2600 2600 2800 2800 2800 126

```

# 1 user | 2 workers | test_en.tsv

```markdown
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/reference-need:predict 109 0(0.00%) | 2597 2406 8669 2600 | 0.36 0.00
--------|--------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
 Aggregated 109 0(0.00%) | 2597 2406 8669 2600 | 0.36 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/reference-need:predict 2600 2600 2600 2600 2600 2600 2600 2700 8700 8700 8700 109
--------|------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
 Aggregated 2600 2600 2600 2600 2600 2600 2600 2700 8700 8700 8700 109
```

What is expected after this deployment:

  1. The service should be operational again: with no CPU throttling, preprocess latencies should drop drastically, into the ms range.
  2. All alerts related to 500s should disappear.
  3. Response latency may be somewhat slower in some cases, since each worker will be using half the CPU it uses now, but subsequently increasing the CPU count on the pod will bring an improvement.

Change #1130089 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable multiprocessing for reference-need

https://gerrit.wikimedia.org/r/1130089

Change #1131327 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: increase pod/container limitranges fo revision models

https://gerrit.wikimedia.org/r/1131327

Change #1131327 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: increase pod/container limitranges for revision models

https://gerrit.wikimedia.org/r/1131327

Change #1131697 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: reduce num of cpu cores in reference-need

https://gerrit.wikimedia.org/r/1131697

Change #1131697 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: reduce num of cpu cores in reference-need

https://gerrit.wikimedia.org/r/1131697

Change #1132111 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase ref-need memory limits/requests

https://gerrit.wikimedia.org/r/1132111

Change #1132111 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase ref-need memory limits/requests

https://gerrit.wikimedia.org/r/1132111

We are no longer getting 500s as before, so stability has improved, BUT the overall latency of the service is still high. Looking at the latency percentiles in the Istio dashboards we see increased latencies: https://grafana.wikimedia.org/goto/1LfIu8THg?orgId=1

The kserve inference latency graphs show that preprocess latency has decreased but predict latency has increased: https://grafana.wikimedia.org/goto/l155uUoHR?orgId=1
This is a result of using fewer CPU resources per process. Increasing the CPU count per process back to 16 (as in the initial single-process deployment) is not possible due to resource limitations on the nodes (we'd need 34 CPU cores per pod, which isn't feasible at the moment). We have increased it to 10 CPU cores per process, but latencies are still high.

Memory consumption keeps increasing, which ends with pods being killed because they run out of memory (OOMKilled).
Evidence of this can be seen in Grafana:

Screenshot 2025-03-31 at 3.37.57 PM.png (270×475 px, 20 KB)

This increasing memory usage pattern likely indicates that old processes still occupy memory. The model is loaded in each process, which would explain the pattern.
Our initial implementation of loading the model in each process is not ideal, but it was done to overcome serialization issues with the model while also avoiding the overhead of passing it between processes.
The growth coincides with the increase in CPU cores per process (from 7 to 10, while still keeping 2 for the main event loop).

The next step would be to make sure the process pool is properly refreshed and resources are freed when the pool shuts down.

I have verified the above by looking at a specific pod:

  1. Found some BrokenProcessPool exceptions in the logs, which means the process pool is being shut down and restarted.
  2. Saw the corresponding bump in memory in Grafana: the pod goes past its memory limit, so a pod restart happens.
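The refresh step could look roughly like this (a hedged sketch with assumed names, not the actual service code): on BrokenProcessPool, shut the broken pool down explicitly so its workers, and the model copy each one holds, are freed before a replacement pool is created.

```python
import concurrent.futures
from concurrent.futures.process import BrokenProcessPool

def make_pool():
    # The real service would also pass initializer/initargs here so
    # each fresh worker loads the model again.
    return concurrent.futures.ProcessPoolExecutor(max_workers=2)

pool = make_pool()

def submit_with_refresh(fn, *args):
    """Submit work, replacing the pool if it has broken."""
    global pool
    try:
        return pool.submit(fn, *args)
    except BrokenProcessPool:
        # Shut the broken pool down explicitly so its worker processes
        # are reaped and their memory returned; simply abandoning the
        # pool can leave that memory occupied and push the pod over
        # its limit (the OOMKilled pattern described above).
        pool.shutdown(wait=False, cancel_futures=True)
        pool = make_pool()
        return pool.submit(fn, *args)

if __name__ == "__main__":
    print(submit_with_refresh(pow, 2, 5).result())  # → 32
```

The point of the sketch is the explicit `shutdown()` on the broken pool: recreating the pool alone is not enough if the old workers are left to linger.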

Change #1155655 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase workers in viwiki-reverted

https://gerrit.wikimedia.org/r/1155655

Change #1155655 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase workers in viwiki-reverted

https://gerrit.wikimedia.org/r/1155655