
Load test the Lift Wing cluster
Closed, ResolvedPublic

Description

We should start running some load tests (even basic ones) to answer some basic questions:

  1. Are our metrics sufficient and reliable? Do the dashboards depict a complete picture of how the traffic flows?
  2. How much traffic can a single pod take?
  3. We should test how Knative scales pods up/down, and what settings are needed. Initially we could be generous and slightly oversize the minimum set of pods for the wikis, allowing them to scale up to absorb traffic peaks.
  4. If rate limiting is implemented on the API Gateway, test it. Otherwise we should implement it and verify that it works.

Event Timeline

I am running a very simple and basic load test from deploy1002:

elukey@deploy1002:~$ cat input.json 
{"rev_id":123456}
elukey@deploy1002:~$ siege -c 50 "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict POST `cat input.json`" -H "Host: enwiki-goodfaith-predictor-default.revscoring-editquality.wikimedia.org"

The limits for our current revscoring pods are:

Limits:
  cpu:     100m
  memory:  100Mi
Requests:
  cpu:      25m
  memory:   100Mi

The kubernetes-pod-details dashboard is useful for observing how a pod is doing while serving requests.

KServe seems to be using Tornado with asyncio, and https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c makes a good case for migrating to KServe 0.7 asap (we are currently running KServer from kfserving==0.3).

Tornado on each KServer pod (in Kserve 0.7) has two parameters to tune:

  1. workers
  2. async_io_workers

https://www.tornadoweb.org/en/stable/guide/running.html#processes-and-ports

IIUC the workers are the ones passed to the start method of Tornado's httpserver, which spawns a process with a dedicated IOLoop for each worker requested (they will all share the same socket to accept requests from). The IOLoop is a wrapper around asyncio's event loop, which gets a number of executor threads equal to the async_io_workers parameter.

We have nproc=72 on the ml-serve100X nodes, so that seems a compelling reason to pick up https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c.

I'm not sure about the httpserver's workers: at the moment we have one, but it seems enough for a single pod. We should figure out if the pod limits are ok or if we want to change them too.
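
To make the above more concrete, here is a minimal sketch of the pattern: N forked Tornado processes sharing one listening socket, each with its own IOLoop whose asyncio default executor plays the role of async_io_workers. This is just an illustration of the Tornado docs linked above, not kserve's actual code; the handler and variable names are made up.

# Minimal illustration of the worker model described above, NOT kserve's code.
# N_WORKERS / N_ASYNCIO_WORKERS mirror the `workers` / `async_io_workers`
# parameters discussed earlier; the handler is a placeholder.
import asyncio
from concurrent.futures import ThreadPoolExecutor

import tornado.httpserver
import tornado.ioloop
import tornado.web

N_WORKERS = 1          # processes spawned by HTTPServer.start() (no fork when 1)
N_ASYNCIO_WORKERS = 5  # threads in the asyncio default executor


class PredictHandler(tornado.web.RequestHandler):
    async def post(self):
        # CPU-bound work here blocks this process's event loop; only awaited
        # coroutines (or work pushed to the executor) run concurrently.
        self.write({"status": "ok"})


def main():
    app = tornado.web.Application([(r"/v1/models/.*:predict", PredictHandler)])
    server = tornado.httpserver.HTTPServer(app)
    server.bind(8080)
    # Forks N_WORKERS child processes; they all accept from the same socket.
    server.start(N_WORKERS)
    # Each process gets its own IOLoop wrapping an asyncio event loop, whose
    # default executor is sized like async_io_workers.
    asyncio.get_event_loop().set_default_executor(
        ThreadPoolExecutor(max_workers=N_ASYNCIO_WORKERS)
    )
    tornado.ioloop.IOLoop.current().start()


if __name__ == "__main__":
    main()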

There is definitely something weird happening for our pods:

elukey@ml-serve1002:~$ ps -eLf | grep python | awk '{ print $2" "$10" "$11}' | uniq -c
      2 1295 python3 /usr/local/bin/prometheus-nic-saturation-exporter
    127 7645 python3 model-server/model.py
    127 7706 python3 model-server/model.py

Too many threads; it may be the issue addressed by https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c. From the Grafana dashboard I also see a lot of throttling, even when the pod is not serving requests.

elukey@ml-serve1004:~$ ps -eLf | grep python | awk '{ print $2" "$10" "$11}' | uniq -c
      2 1404 python3 /usr/local/bin/prometheus-nic-saturation-exporter
      1 4931 python3 model-server/model.py
      1 10488 grep python

After moving to KServe 0.7 the situation seems better! The logs confirm it:

[I 211217 15:34:52 kfserver:150] Registering model: enwiki-goodfaith
[I 211217 15:34:52 kfserver:120] Setting asyncio max_workers as 5
[I 211217 15:34:52 kfserver:127] Listening on port 8080
[I 211217 15:34:52 kfserver:129] Will fork 1 workers

Change 753996 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: add params to configure requests/limits of queue-proxy

https://gerrit.wikimedia.org/r/753996

Change 753996 merged by Elukey:

[operations/deployment-charts@master] knative-serving: add params to configure requests/limits of queue-proxy

https://gerrit.wikimedia.org/r/753996

I also left some questions in KServe's upstream Slack channel about how Tornado/asyncio/etc. work and whether we should follow specific guidelines to make our code as parallelizable/non-blocking as possible. Once we have a solid understanding of how to write our code, we'll also be able to tune the cgroup memory/CPU sizes of the KServe predictor/transformer containers.

From IRC:

the kserve upstream folks gave me this link 
https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/
so the TL;DR seems to be:
- the async workers are used only if we mark code as coroutines etc., and they seem to mostly help in the transformer, when calling external http services etc.
- the tornado workers have an async loop, and code runs in it, so blocking/cpu-bound code reduces parallelism a lot
- we could deploy ray workers (still not super sure how) in order to have a separate pool for the models, so the tornado loop would only dispatch calls to them; it may be as simple as adding the annotation to the code (famous last words)
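
To illustrate the second point (blocking code on the event loop kills parallelism), here is a tiny self-contained asyncio example; the sleeps stand in for a synchronous vs. awaited MW API call, and the timings are only indicative.

# Tiny illustration of "blocking code reduces parallelism a lot": five tasks
# run serially when they block the loop, concurrently when they await.
import asyncio
import time


async def blocking_preprocess():
    time.sleep(1)           # stands in for a synchronous MW API call: blocks the whole loop


async def async_preprocess():
    await asyncio.sleep(1)  # stands in for an awaited HTTP call: the loop stays free


async def main():
    start = time.monotonic()
    await asyncio.gather(*[blocking_preprocess() for _ in range(5)])
    print(f"blocking: {time.monotonic() - start:.1f}s")  # ~5s, fully serialized

    start = time.monotonic()
    await asyncio.gather(*[async_preprocess() for _ in range(5)])
    print(f"async:    {time.monotonic() - start:.1f}s")  # ~1s, runs concurrently


asyncio.run(main())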

@ACraze let's discuss how to proceed during the next meeting or on IRC. We could simply increase the number of CPUs available for each pod (using the tornado workers) or try the Ray workers/pool way.

@elukey I've been reading the kserve docs (https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference) and I think we should tune the 'frontend-endpoint' and try the tornado workers first. Definitely interested in the ray workers too, but I think that might be a bit more complex?

After a chat with the team, we decided to keep the tornado workers setting to 1 (default), and try the auto-scaling features offered by Knative (min/max replicas etc..).

Interesting discovery: it seems that my previous tests with ab and siege used HTTP/1.0, not 1.1, and the responses from Istio were all 426 Upgrade Required (so not really representative of HTTP traffic handled by a single pod). The ab tool does not seem to support HTTP/1.1 yet; siege should support it, but I am using the version on deploy1002, which could be outdated.

It is very weird: siege supports HTTP/1.1, but I see the following:

elukey@deploy1002:~$ siege "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict POST `cat input.json`" -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -g
POST /v1/models/enwiki-goodfaith:predict HTTP/1.0
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (pc-x86_64-linux-gnu) Siege/4.0.4
Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org
Connection: close
Content-type: application/x-www-form-urlencoded
Content-length: 18

{"rev_id":1343456}

HTTP/1.1 426 Upgrade Required
date: Thu, 19 May 2022 09:46:29 GMT
server: istio-envoy
connection: close
content-length: 0


Transactions:		           1 hits
Availability:		      100.00 %
Elapsed time:		        1.02 secs
Data transferred:	        0.00 MB
Response time:		        0.01 secs
Transaction rate:	        0.98 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        0.01
Successful transactions:           0
Failed transactions:	           0
Longest transaction:	        0.01
Shortest transaction:	        0.00

So POST /v1/models/enwiki-goodfaith:predict HTTP/1.0 is clearly not right, but the siege config suggests otherwise:

elukey@deploy1002:~$ siege --config | grep protocol
protocol:                       HTTP/1.1
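
As a sanity check independent of siege, a client that speaks HTTP/1.1 (e.g. python-requests) should not trigger the 426 from Istio. The snippet below is only a hypothetical one-off check using the same URL and Host header as the commands above, not part of the load test itself.

# Hypothetical HTTP/1.1 sanity check with python-requests (not a load test).
import requests

resp = requests.post(
    "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict",
    json={"rev_id": 1343456},
    headers={"Host": "enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"},
    verify=False,  # or point `verify` at the appropriate internal CA bundle
    timeout=30,
)
print(resp.status_code)
print(resp.text)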

I was able to use wrk, a very interesting tool installed on deploy1002. We can use Lua scripts like the following:

elukey@deploy1002:~$ cat inference.lua 

wrk.method = "POST"
file = io.open("input.json", "rb")
wrk.body = file:read("*a")
wrk.headers["Content-Type"] = "application/json"
wrk.headers["Host"] = "enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"

logfile = io.open("wrk.log", "w");
local cnt = 0;

response = function(status, header, body)
     logfile:write("status:" .. status .. "\n" .. body .. "\n-------------------------------------------------\n");
end

done = function(summary, latency, requests)
     logfile:close();
end

and then:

elukey@deploy1002:~$ wrk --timeout 5s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.06s     1.02s    4.12s    78.26%
    Req/Sec     0.96      0.55     2.00     70.83%
  Latency Distribution
     50%    3.56s 
     75%    3.67s 
     90%    3.71s 
     99%    4.12s 
  24 requests in 10.02s, 6.89KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:      2.39
Transfer/sec:     703.92B

Note: I had to increase the DestinationRule's max allowed connections to the MW API from 10 to 100, otherwise the sidecar proxy throttled traffic too aggressively, causing kserve to return 500s.

Note 2: with 2 threads and 10 connections directed to a single pod, latency climbs to seconds very easily. There are probably some bottlenecks that we can solve; hopefully we'll be able to improve these numbers.

With 1 connection and 1 worker, latencies are more in line with what we are used to:

elukey@deploy1002:~$ wrk -c 1 -t 1 --timeout 2s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   402.47ms   21.46ms 444.42ms   66.67%
    Req/Sec     2.00      0.29     3.00     91.67%
  Latency Distribution
     50%  397.57ms
     75%  420.17ms
     90%  438.79ms
     99%  444.42ms
  24 requests in 10.02s, 6.87KB read
Requests/sec:      2.40
Transfer/sec:     701.89B

Note 3: we don't currently have any caching of the generated scores, so adding it will surely improve the overall performance of the revscoring models.

Things to do (in my opinion):

  • work on T302232 to add score caching
  • figure out if we can make the connections to the MW API async (so they don't block the main tornado worker, but use the async loop threads instead); see the sketch below.
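
On the second point, a hedged sketch of one option that would not require replacing mwapi: keep the blocking mwapi call but run it on the asyncio default executor threads, so the main tornado worker keeps serving requests. The session setup and the fetch_revision helper are hypothetical placeholders, not our actual code.

# Hypothetical sketch: offload the blocking mwapi call to the asyncio default
# executor so it does not block the main tornado worker.
import asyncio

import mwapi

session = mwapi.Session("https://en.wikipedia.org", user_agent="lift-wing-load-test")  # placeholder UA


def fetch_revision(rev_id: int) -> dict:
    # Synchronous, blocking call into the MW API via mwapi.
    return session.get(action="query", prop="revisions", revids=rev_id)


async def preprocess(request: dict) -> dict:
    loop = asyncio.get_running_loop()
    # Runs on a thread of the loop's default executor (the "async loop threads"
    # mentioned above), so the event loop stays free while we wait.
    payload = await loop.run_in_executor(None, fetch_revision, request["rev_id"])
    return {"mw_payload": payload}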

@elukey It seems we don't use async right now. We can try using coroutines in preprocess and see if that improves performance. Also, writing test jobs for the revscoring models (like the test_server.py you pasted) would be a good thing to do.

I have some questions --
If I make code changes, can I test them the same way on the ML-Sandbox? Or do I have to push the code to production before we can test?

And where did you increase the DestinationRule's max allowed connections to the MW API to 100 (was 10)? I need to do this as well for testing.

@elukey It seems we don't use async right now. We can try using coroutines in preprocess and see if that improves performance. Also, writing test jobs for the revscoring models (like the test_server.py you pasted) would be a good thing to do.

My only doubt is whether our preprocess can become a coroutine, since we use a dedicated library to call the MW API. For example, in the predict code it seems that kserve uses the async and await keywords with a dedicated http_client. What do you think?

I have some questions --
If I make code changes, can I test them the same way on the ML-Sandbox? Or do I have to push the code to production before we can test?

You should be able to build the Docker images with the new code on the ml-sandbox, and then use the new images in the InferenceService k8s resources. There is definitely a way to do it; I've done something similar locally once. If you don't manage to do it, let's sync and hack :)

And where did you increase the DestinationRule's max allowed connections to the MW API to 100 (was 10)? I need to do this as well for testing.

This was done manually in production; there is a dedicated setting in deployment-charts, but it shouldn't be a problem on the ml-sandbox.

I read the kserve docs: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference

There are two ways to run parallel inference:

  • tune the workers parameter for the Tornado's httpserver
  • use RayServe to deploy ray workers

The first option does not work with our current kserve version (0.7.0) because of a bug, which was fixed in kserve 0.8.0.

If we upgrade to kserve 0.8.0, we will have dependency conflicts on numpy:

#18 208.8     kserve 0.8.0 depends on numpy~=1.19.2
#18 208.8     revscoring 2.8.2 depends on numpy<1.18.999 and >=1.18.4

Not sure if we can upgrade numpy for revscoring.

Therefore I tried the second option, based on kserve's example code. I added a ray decorator to the model class with 4 replicas and made the predict method a coroutine (just added the async keyword). I solved some dependency issues and built a new Docker image.
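
For reference, this kind of change looks roughly like the sketch below, based on the parallel-inference example in the kserve docs linked earlier. The decorator arguments and the server class name vary between kserve/ray versions, and the model class here is only illustrative, so treat it as a sketch rather than the exact patch that was tested.

# Illustrative sketch of the "ray decorator + async predict" approach from the
# kserve parallel-inference docs; not the exact patch that was tested.
import kserve
from ray import serve


@serve.deployment(name="enwiki-articlequality", num_replicas=4)
class ArticlequalityModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # load the revscoring model artifact here
        self.ready = True

    async def predict(self, request: dict) -> dict:
        # marked as a coroutine; the scoring work is spread across the
        # 4 Ray replicas instead of a single model server process
        return {"predictions": []}


if __name__ == "__main__":
    kserve.ModelServer().start({"enwiki-articlequality": ArticlequalityModel})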

I tested locally using wrk and compared it with the single-process model server.

With 2 threads and 4 connections:

# 4 ray workers
➜  articlequality wrk -c 4 -t 2 --timeout 2s -s inference.lua http://localhost:8080/v1/models/enwiki-articlequality:predict --latency
Running 10s test @ http://localhost:8080/v1/models/enwiki-articlequality:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.61s   171.55ms   2.00s    76.19%
    Req/Sec     2.60      3.56     9.00     80.00%
  Latency Distribution
     50%    1.54s 
     75%    1.71s 
     90%    1.81s 
     99%    2.00s 
  23 requests in 10.08s, 8.29KB read
  Socket errors: connect 0, read 0, write 0, timeout 2
Requests/sec:      2.28
Transfer/sec:     842.05B

# single process
➜  articlequality wrk -c 4 -t 2 --timeout 2s -s inference.lua http://localhost:8080/v1/models/enwiki-articlequality:predict --latency
Running 10s test @ http://localhost:8080/v1/models/enwiki-articlequality:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.44s   651.97ms   1.91s   100.00%
    Req/Sec     0.30      0.48     1.00     70.00%
  Latency Distribution
     50%    1.91s 
     75%    1.91s 
     90%    1.91s 
     99%    1.91s 
  10 requests in 10.11s, 3.60KB read
  Socket errors: connect 0, read 0, write 0, timeout 8
Requests/sec:      0.99
Transfer/sec:     365.09B

With 2 threads and 8 connections:

# 4 ray workers
➜  articlequality wrk -c 8 -t 2 --timeout 5s -s inference.lua http://localhost:8080/v1/models/enwiki-articlequality:predict --latency
Running 10s test @ http://localhost:8080/v1/models/enwiki-articlequality:predict
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.53s   722.12ms   3.89s    60.71%
    Req/Sec     3.42      4.65    19.00     79.17%
  Latency Distribution
     50%    2.67s 
     75%    2.91s 
     90%    3.64s 
     99%    3.89s 
  28 requests in 10.04s, 10.09KB read
Requests/sec:      2.79
Transfer/sec:      1.00KB

# single process
➜  articlequality wrk -c 8 -t 2 --timeout 5s -s inference.lua http://localhost:8080/v1/models/enwiki-articlequality:predict --latency
Running 10s test @ http://localhost:8080/v1/models/enwiki-articlequality:predict
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.83s     1.45s    4.67s    60.00%
    Req/Sec     0.40      0.52     1.00     60.00%
  Latency Distribution
     50%    2.84s 
     75%    3.75s 
     90%    4.67s 
     99%    4.67s 
  10 requests in 10.07s, 3.60KB read
  Socket errors: connect 0, read 0, write 0, timeout 5
Requests/sec:      0.99
Transfer/sec:     366.47B

My local env has limited resources, so latency reaches 1s even with only 1 connection:

➜  articlequality wrk -c 1 -t 1 --timeout 2s -s inference.lua http://localhost:8080/v1/models/enwiki-articlequality:predict --latency
Running 10s test @ http://localhost:8080/v1/models/enwiki-articlequality:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.02s   153.38ms   1.34s    88.89%
    Req/Sec     0.56      0.53     1.00     55.56%
  Latency Distribution
     50%  923.94ms
     75%    1.08s 
     90%    1.34s 
     99%    1.34s 
  9 requests in 10.10s, 3.24KB read
Requests/sec:      0.89
Transfer/sec:     328.88B

I think the next step is to test on the ml-sandbox.

My only doubt is whether our preprocess can become a coroutine, since we use a dedicated library to call the MW API. For example, in the predict code it seems that kserve uses the async and await keywords with a dedicated http_client. What do you think?

@elukey Yeah, I looked at the code. I think things become tricky when we use mwapi and the revscoring extractor. But I managed to use ray workers and changed predict to a coroutine (see comment above :)).

I read the kserve docs: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference

There are two ways to run parallel inference:

  • tune the workers parameter for the Tornado's httpserver
  • use RayServe to deploy ray workers

The first option does not work with our current kserve version (0.7.0) because of a bug, which was fixed in kserve 0.8.0.

The workers parameter should be related to the number of model server Python processes to fork; 1 should be enough for our use case, so we shouldn't be affected by the bug. There is also max_asyncio_workers, which should be related to the maximum number of coroutines/async tasks to handle (see the Arguments section). IIUC, if we increase workers we just get more model server processes that can run predict in parallel, whereas in this task we are trying to make preprocess async. We can try a combination of the two of course, no problem :)

If we upgrade to kserve 0.8.0, we will have dependency conflicts on numpy:

#18 208.8     kserve 0.8.0 depends on numpy~=1.19.2
#18 208.8     revscoring 2.8.2 depends on numpy<1.18.999 and >=1.18.4

Not sure if we can upgrade numpy for revscoring.

It should be possible if we upgrade revscoring in the Docker images to the latest version, but we may need an extra patch/release of revscoring too.

Going to review the wrk numbers in a bit! Thanks for the work!

@achou I am curious about the running processes inside the pod after we use kserve - what do you see if you run ps -aux | grep python and ps -eLf | grep python? My understanding is that every Ray worker should be a Python process; if that's the case it would be very interesting. We currently have some restrictions on memory/cpu for every pod in production, so we'll probably have to tune settings for this use case. For example, with 2 ray workers, I'd expect to see:

  1. one Python process for the Tornado async loop, which handles the HTTP requests toward the pod for kserve
  2. two Python processes for the Ray workers, each of them running a model (and possibly coroutines).

The main issue that remains, in my opinion, is that the preprocess call is blocking and it is probably also executed on the Ray workers, so it limits our scalability. For example, if a Ray worker is handling a call to the MW API, it will not be able to handle anything else (like a predict running the model's code, etc.). We may need to drop mwapi at this point in favor of something custom and async; let me know your thoughts.
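
For the record, a "custom and async" MW API call could be as small as the hedged aiohttp sketch below; the parameters and User-Agent are placeholders, and error handling/retries are left out.

# Hedged sketch of a custom async MW API call with aiohttp, as an alternative
# to the synchronous mwapi session; parameters are illustrative.
import aiohttp


async def get_revision(rev_id: int) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    headers = {"User-Agent": "lift-wing-load-test (placeholder)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get("https://en.wikipedia.org/w/api.php", params=params) as resp:
            resp.raise_for_status()
            return await resp.json()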

We paused this task for a long time due to T320374, and we have opened new subtasks to track the work. Closing this one in favor of more specific ones (like T323624).