Upgrade kserve on the following model servers from 0.11.1 to 0.11.2:
- revscoring
- langid
- llm
- revertrisk-language-agnostic
- revertrisk-wikidata
Change 975814 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2
Just wanted to provide a reference for the revertrisk-wikidata model, which is currently under evaluation and improvement. T343419
Change 975814 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2
Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Change 976748 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update docker images to latest versions
All revscoring models have been deployed on ml-staging.
While running some load tests on ml-staging for enwiki-goodfaith, I noticed increased latencies compared to older load tests. This happens when the number of connections is >1.
I will investigate further and check whether the issue is with the inputs by running the same load test on production; until then I'm holding the production deployment.
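For context, each request in these load tests is a plain JSON POST to the model's :predict endpoint carrying a revision ID taken from the input file. A minimal sketch of a single equivalent request in Python (the rev_id, timeout and TLS handling are illustrative and not taken from revscoring.lua):

import requests

# Single request roughly equivalent to what the wrk load test sends.
# Assumes the usual Lift Wing revscoring payload {"rev_id": <int>};
# the rev_id below is only an example value.
url = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict"
headers = {"Host": "enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"}
payload = {"rev_id": 123456}  # illustrative revision ID

# verify=False only for this sketch against the internal endpoint
response = requests.post(url, json=payload, headers=headers, timeout=2, verify=False)
print(response.status_code, response.elapsed.total_seconds())
print(response.json())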
Load tests with 1 thread and 1 connection had the same results. The example below is a load test with increased latencies:
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   729.71ms  330.75ms   1.91s    68.14%
    Req/Sec     3.46      3.34     10.00     64.79%
  Latency Distribution
     50%  664.28ms
     75%  913.57ms
     90%    1.22s
     99%    1.55s
  145 requests in 1.00m, 50.98KB read
  Socket errors: connect 0, read 0, write 0, timeout 32
Requests/sec:      2.41
Transfer/sec:    868.76B
And an older one that was significantly better:
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60 --header enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org -- enwiki.input
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   433.57ms   77.57ms 797.58ms   71.92%
    Req/Sec     5.44      2.80     10.00     67.64%
  Latency Distribution
     50%  410.98ms
     75%  474.18ms
     90%  546.55ms
     99%  697.93ms
  552 requests in 1.00m, 207.54KB read
Requests/sec:      9.19
Transfer/sec:      3.45KB
Running the same test on production gave the following results, so the difference doesn't seem to be that big (although there is some):
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   524.99ms  486.69ms   2.00s    81.21%
    Req/Sec     3.97      3.49     10.00     59.57%
  Latency Distribution
     50%  294.92ms
     75%  730.76ms
     90%    1.27s
     99%    1.96s
  191 requests in 1.00m, 67.15KB read
  Socket errors: connect 0, read 0, write 0, timeout 42
Requests/sec:      3.18
Transfer/sec:      1.12KB
thread 1 made 95 requests and got 92 responses
thread 2 made 101 requests and got 99 responses
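Since these comparisons are done by eyeballing several wrk dumps, here is a small illustrative helper (not part of the inference-services repo; names and usage are hypothetical) that pulls the latency percentiles out of saved wrk output so two runs can be compared side by side:

import re
import sys

# Parse the "Latency Distribution" percentiles (50%/75%/90%/99%)
# out of a saved wrk output file.
PERCENTILE_RE = re.compile(r"^\s*(50|75|90|99)%\s+([\d.]+)(us|ms|s)\s*$")

def to_ms(value: float, unit: str) -> float:
    # Normalise wrk's us/ms/s units to milliseconds.
    return value * {"us": 0.001, "ms": 1.0, "s": 1000.0}[unit]

def parse_percentiles(path: str) -> dict:
    result = {}
    with open(path) as f:
        for line in f:
            m = PERCENTILE_RE.match(line)
            if m:
                pct, value, unit = m.groups()
                result[pct + "%"] = to_ms(float(value), unit)
    return result

if __name__ == "__main__":
    # Usage (illustrative): python compare_wrk.py staging.txt production.txt
    staging, production = (parse_percentiles(p) for p in sys.argv[1:3])
    for pct in ("50%", "75%", "90%", "99%"):
        print(f"{pct}: staging {staging[pct]:.0f}ms vs production {production[pct]:.0f}ms")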
I see a drop in performance here as well (not as big as the one we experienced with other servers). This is for articlequality. On top is the new version in staging and at the bottom the current production one:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.40s    40.59ms   1.51s    76.19%
    Req/Sec     0.00      0.00      0.00    100.00%
  Latency Distribution
     50%    1.39s
     75%    1.41s
     90%    1.47s
     99%    1.51s
  42 requests in 1.00m, 19.38KB read
Requests/sec:      0.70
Transfer/sec:    330.22B
thread 1 made 44 requests and got 42 responses
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.09s    89.55ms   1.30s    69.81%
    Req/Sec     0.08      0.27      1.00     92.45%
  Latency Distribution
     50%    1.09s
     75%    1.17s
     90%    1.19s
     99%    1.30s
  53 requests in 1.00m, 24.45KB read
Requests/sec:      0.88
Transfer/sec:    416.61B
thread 1 made 55 requests and got 53 responses
There is a difference in the deployments, as production is using multiprocessing, but the results are also not in line with the previous [[ https://phabricator.wikimedia.org/T348265#9249921 | load tests run for articlequality ]] (same input).
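For reference on the multiprocessing difference mentioned above: kserve's model server can be started with more than one worker process, which is roughly the knob the production articlequality deployment uses. A minimal hedged sketch (the model class and worker count are illustrative, not the actual inference-services code):

from kserve import Model, ModelServer

class DummyModel(Model):
    # Illustrative stand-in for a revscoring model server.
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, payload, headers=None):
        return {"predictions": []}

if __name__ == "__main__":
    # workers > 1 makes kserve fork multiple model-server processes,
    # which is the multiprocessing setup referred to above (value illustrative).
    ModelServer(workers=2).start([DummyModel("enwiki-articlequality")])

With more than one worker, requests are spread over forked processes, so the latency profile can differ from the single-worker setup in staging.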
Results for drafttopic with kserve 0.11.2 on staging:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   201.34ms  197.60ms 996.26ms   87.99%
    Req/Sec     7.03      3.13     10.00     86.13%
  Latency Distribution
     50%  120.68ms
     75%  195.33ms
     90%  410.24ms
     99%  979.44ms
  310 requests in 1.00m, 1.10MB read
Requests/sec:      5.16
Transfer/sec:     18.68KB
thread 1 made 312 requests and got 310 responses
And kserve 0.11.1 on production:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   244.10ms  240.95ms   1.20s    86.51%
    Req/Sec     6.53      3.31     10.00     35.50%
  Latency Distribution
     50%  133.80ms
     75%  287.33ms
     90%  592.72ms
     99%    1.06s
  262 requests in 1.00m, 0.93MB read
Requests/sec:      4.36
Transfer/sec:     15.79KB
thread 1 made 264 requests and got 262 responses
Results are similar (the new version is slightly better, e.g. p50 120.68ms vs 133.80ms and p99 979.44ms vs 1.06s), so we can proceed with upgrading this one.
Change 977605 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2
Change 977605 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2
I ran load testing for all revscoring model servers comparing staging (version 0.11.2) with production (0.11.1).
All servers gave similar results, with the exception of articlequality, which was worse, as recorded in a previous comment. My assumption was that this is because we have enabled multiprocessing on that server, and this assumption was validated: the load testing results after deploying kserve 0.11.2 to production are the same.
There is something weird going on with damaging and goodfaith, though: they seem to be performing worse in production than in staging, and this could be related to the alerts we are getting every once in a while (https://phabricator.wikimedia.org/T351735).
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.53s     3.21s    12.67s    82.09%
    Req/Sec     2.82      2.43     10.00     72.73%
  Latency Distribution
     50%  768.89ms
     75%    3.79s
     90%    7.88s
     99%   12.67s
  44 requests in 1.00m, 15.41KB read
Requests/sec:      0.73
Transfer/sec:    263.03B
thread 1 made 46 requests and got 44 responses
isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   884.27ms    1.05s     3.99s    82.99%
    Req/Sec     3.47      2.29     10.00     77.57%
  Latency Distribution
     50%  317.54ms
     75%    1.19s
     90%    2.84s
     99%    3.98s
  107 requests in 1.00m, 37.46KB read
Requests/sec:      1.78
Transfer/sec:    639.24B
thread 1 made 109 requests and got 107 responses
----------------------------------------------------------------------------------------------------------------------
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12s     1.33s     6.58s    84.82%
    Req/Sec     3.01      2.40     10.00     77.78%
  Latency Distribution
     50%  427.58ms
     75%    1.60s
     90%    3.05s
     99%    5.84s
  81 requests in 1.00m, 28.46KB read
Requests/sec:      1.35
Transfer/sec:    485.71B
thread 1 made 83 requests and got 81 responses
isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   886.40ms    1.06s     4.04s    82.99%
    Req/Sec     3.56      2.22     10.00     79.44%
  Latency Distribution
     50%  323.90ms
     75%    1.22s
     90%    2.85s
     99%    4.00s
  107 requests in 1.00m, 37.59KB read
Requests/sec:      1.78
Transfer/sec:    640.51B
thread 1 made 109 requests and got 107 responses
The above makes sense (answering myself 😛 ): codfw in production is not idle but constantly gets traffic from enwiki.
Running a load test on eqiad verified this:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   922.10ms    1.07s     4.44s    82.44%
    Req/Sec     2.73      1.71     10.00     52.53%
  Latency Distribution
     50%  379.49ms
     75%    1.09s
     90%    2.83s
     99%    4.13s
  99 requests in 1.00m, 34.66KB read
Requests/sec:      1.65
Transfer/sec:    590.62B
thread 1 made 101 requests and got 99 responses
The following model servers have been upgraded to kserve 0.11.2
Also: