We're going to upgrade the Revert Risk Multilingual model server Docker images to KServe 0.11.
Description
Details
Title | Reference | Author | Source Branch | Dest Branch
--- | --- | --- | --- | ---
feat: allow to set number of threads in catboost models | repos/research/knowledge_integrity!31 | isaranto | add-threads-arg | main
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | elukey | T337213 Update to KServe 0.11
Resolved | | isarantopoulos | T347551 Upgrade Revert Risk Multilingual docker images to KServe 0.11.2
Open | BUG REPORT | None | T349844 Increased latencies with Kserve 0.11.1 (cgroups v2)
Resolved | | MunizaA | T350389 Upgrade xgboost in knowledge_integrity
Open | | isarantopoulos | T353461 Allow to set Catboost's threads in readability-liftwing
Event Timeline
Change 964559 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1
Change 964559 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1
Change 965057 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk
Change 965057 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk
Change 967442 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc
Change 967442 merged by Elukey:
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc
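Setting `OMP_NUM_THREADS` in the isvc spec works because the variable is read when an OpenMP-backed library first initializes its thread pool. A minimal Python sketch of the same idea (illustrative only, not the actual inference-services code):

```python
import os

# OMP_NUM_THREADS must be in the environment before any OpenMP-backed
# library (numpy, xgboost, catboost) spins up its worker pool, which is
# why the isvc sets it at the container level rather than in app code.
os.environ.setdefault("OMP_NUM_THREADS", "1")

print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```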
Ran some load tests
```
isaranto@deploy2002:~/pycharm/test/wrk$ wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict -d 60 --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.35s     1.20s    4.56s    81.82%
    Req/Sec     1.25      1.46      5.00     81.25%
  Latency Distribution
     50%  765.36ms
     75%     2.26s
     90%     2.88s
     99%     4.56s
  32 requests in 1.00m, 11.64KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
  Non-2xx or 3xx responses: 1
Requests/sec:      0.53
Transfer/sec:    198.40B
thread 1 made 34 requests and got 32 responses

wrk -c 2 -t 2 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict -d 60 --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.52s   888.01ms   3.78s    65.91%
    Req/Sec     0.32      0.68      2.00     88.00%
  Latency Distribution
     50%     1.36s
     75%     2.06s
     90%     2.76s
     99%     3.78s
  50 requests in 1.00m, 18.22KB read
  Socket errors: connect 0, read 0, write 0, timeout 6
  Non-2xx or 3xx responses: 2
Requests/sec:      0.83
Transfer/sec:    310.39B
thread 1 made 27 requests and got 25 responses
thread 2 made 26 requests and got 25 responses

wrk -c 4 -t 4 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict -d 60 --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
thread 3 created
logfile wrk_3.log created
thread 4 created
logfile wrk_4.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.07s   984.02ms   4.39s    63.41%
    Req/Sec     0.27      1.42     10.00     98.04%
  Latency Distribution
     50%     2.05s
     75%     2.86s
     90%     3.16s
     99%     4.39s
  51 requests in 1.00m, 18.65KB read
  Socket errors: connect 0, read 0, write 0, timeout 10
  Non-2xx or 3xx responses: 4
Requests/sec:      0.85
Transfer/sec:    317.72B
thread 1 made 15 requests and got 13 responses
thread 2 made 14 requests and got 13 responses
thread 3 made 14 requests and got 13 responses
thread 4 made 13 requests and got 12 responses
```
Results are comparable to the old load-tests done by Aiko.
Actually, there are differences compared to the old load tests, even when I run them for 10s like the ones @achou ran in the link above.
```
wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   486.39ms  409.13ms   1.25s    70.00%
    Req/Sec     2.22      1.56      5.00     66.67%
  Latency Distribution
     50%  359.72ms
     75%  795.97ms
     90%     1.25s
     99%     1.25s
  10 requests in 10.01s, 3.65KB read
  Non-2xx or 3xx responses: 1
Requests/sec:      1.00
Transfer/sec:    373.66B
thread 1 made 12 requests and got 10 responses
```
```
wrk -c 2 -t 2 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.07s   717.63ms   2.79s    72.22%
    Req/Sec     1.65      3.22     10.00     88.24%
  Latency Distribution
     50%     1.11s
     75%     1.56s
     90%     1.97s
     99%     2.79s
  17 requests in 10.02s, 6.22KB read
  Non-2xx or 3xx responses: 2
Requests/sec:      1.70
Transfer/sec:    636.13B
thread 1 made 10 requests and got 8 responses
thread 2 made 10 requests and got 9 responses
```
I'm investigating whether this is due to different input or something else.
Change 968715 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual
Change 968715 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual
Here are some load test results after setting the OMP_NUM_THREADS and OMP_THREAD_LIMIT env vars. We found out that revertrisk-multilingual also uses CatBoost, which for some reason created a large number of additional threads when using KServe 0.11.
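One way to confirm the extra threads (a diagnostic sketch, Linux-only, not part of the model server) is to count the entries under `/proc/self/task`, which includes native worker threads that Python's `threading` module cannot see:

```python
import os

def native_thread_count() -> int:
    # On Linux, every thread of the current process appears as a
    # directory under /proc/self/task, including native OpenMP/CatBoost
    # workers that are invisible to Python's threading module.
    return len(os.listdir("/proc/self/task"))

print(native_thread_count())
```

Running this inside the pod before and after a prediction would show the pool CatBoost spins up.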
- run 10s
```
aikochou@deploy2002:~/wrk/rr$ wrk -c 1 -t 1 --timeout 5s -s script.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- input
thread 1 created
logfile wrk_1.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   345.43ms  164.91ms 698.15ms   83.33%
    Req/Sec     2.69      1.65      5.00     53.85%
  Latency Distribution
     50%  374.78ms
     75%  484.13ms
     90%  499.09ms
     99%  698.15ms
  13 requests in 10.02s, 4.71KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:      1.30
Transfer/sec:    481.80B
thread 1 made 15 requests and got 13 responses
```
```
aikochou@deploy2002:~/wrk/rr$ wrk -c 2 -t 2 --timeout 5s -s script.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   508.37ms  164.28ms 791.71ms   60.00%
    Req/Sec     1.41      0.94      3.00     70.59%
  Latency Distribution
     50%  507.46ms
     75%  654.71ms
     90%  709.17ms
     99%  791.71ms
  17 requests in 10.02s, 6.16KB read
  Socket errors: connect 0, read 0, write 0, timeout 2
Requests/sec:      1.70
Transfer/sec:    629.79B
thread 1 made 11 requests and got 9 responses
thread 2 made 9 requests and got 8 responses
```
```
aikochou@deploy2002:~/wrk/rr$ wrk -c 4 -t 4 --timeout 5s -s script.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
thread 3 created
logfile wrk_3.log created
thread 4 created
logfile wrk_4.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   884.78ms  288.42ms   1.65s    71.43%
    Req/Sec     0.75      0.65      2.00     53.57%
  Latency Distribution
     50%  872.10ms
     75%     1.02s
     90%     1.29s
     99%     1.65s
  28 requests in 10.02s, 10.15KB read
Requests/sec:      2.79
Transfer/sec:      1.01KB
thread 1 made 9 requests and got 7 responses
thread 2 made 8 requests and got 7 responses
thread 3 made 8 requests and got 7 responses
thread 4 made 8 requests and got 7 responses
```
- run 1 min
```
aikochou@deploy2002:~/wrk/rr$ wrk -c 1 -t 1 -d 1m --timeout 5s -s script.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- input
thread 1 created
logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   807.87ms  812.49ms   4.27s    86.17%
    Req/Sec     2.07      1.56      5.00     69.51%
  Latency Distribution
     50%  503.23ms
     75%  940.67ms
     90%     2.07s
     99%     4.27s
  82 requests in 1.00m, 29.73KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:      1.36
Transfer/sec:    506.64B
thread 1 made 84 requests and got 82 responses
```
Both avg latency and 99th-percentile latency show better performance than the previous results from @isarantopoulos.
Compared with the old load tests, the first run (1 thread, 1 connection) and second run (2 threads, 2 connections) performed slightly worse, with more timeouts and fewer total requests. However, the third run (4 threads, 4 connections) showed comparable performance.
Change 968998 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: remove old rr multilingual form staging
Change 968998 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: remove unused deployments from staging
Change 975008 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] revert kserve upgrades
Change 975008 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert kserve upgrades
Change 975205 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10
We can proceed with this after https://github.com/catboost/catboost/pull/2519 (support for cgroups v2) has been included in a new catboost release.
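For context, the cgroups v2 issue stems from sizing thread pools from the host CPU count instead of the container quota. A hedged sketch of how the quota can be read under cgroups v2 (the parsing below is mine, not the catboost PR's code):

```python
def parse_cpu_max(content: str):
    # cgroups v2 exposes the CPU quota in /sys/fs/cgroup/cpu.max as
    # "<quota> <period>" (microseconds), or "max <period>" if unlimited.
    quota, period = content.split()
    if quota == "max":
        return None  # no limit: a library would fall back to os.cpu_count()
    return int(quota) / int(period)

# e.g. a pod limited to 2 CPUs: quota of 200000us per 100000us period
print(parse_cpu_max("200000 100000"))  # -> 2.0
```

A container runtime that only mounts cgroups v2 leaves the old v1 files (`cpu.cfs_quota_us`, `cpu.cfs_period_us`) absent, which is why libraries that read only v1 see the full host CPU count.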
Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Change 975205 abandoned by Ilias Sarantopoulos:
[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10
Reason:
abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/976748 which has the latest updates
Change 976748 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Deployed the model server so that it has the latest image. It is still running kserve 0.10. Status is the same and we are waiting for a new catboost release.
isaranto opened https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/31
feat: allow to set number of threads in catboost models
mnz merged https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/31
feat: allow to set number of threads in catboost models
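The MR above exposes a threads argument on the CatBoost-backed models. A rough sketch of that plumbing with a stub predictor (the real knowledge_integrity API and names differ; `thread_count` is the actual CatBoost predict parameter):

```python
import os

class RevertRiskModel:
    # Hypothetical wrapper: expose the CatBoost thread count as a
    # constructor argument, defaulting to the container's OMP budget.
    def __init__(self, n_threads=None):
        self.n_threads = n_threads or int(os.environ.get("OMP_NUM_THREADS", "1"))

    def predict(self, features):
        # A real implementation would call
        # model.predict(features, thread_count=self.n_threads).
        return {"thread_count": self.n_threads, "n_rows": len(features)}

m = RevertRiskModel(n_threads=2)
print(m.predict([[0.1, 0.2]]))
```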
Change 998946 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] rrml: upgrade kserve to 0.11.2
Since the latest catboost release is still pending, we decided to proceed without it for now by manually limiting the number of threads. I also asked on GitHub whether a new release will be out soon.
Change 998946 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] rrml: upgrade kserve to 0.11.2
Change 999570 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update multilingual revertrisk image
Change 999570 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update multilingual revertrisk image
Some load test results with wrk on staging; I ran the same tests we previously did.
- Run for 10s
```
isaranto@deploy2002:~/inference-services/test/wrk/revertrisk$ wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   399.34ms  215.46ms 802.59ms   70.00%
    Req/Sec     2.55      2.70     10.00     90.91%
  Latency Distribution
     50%  315.68ms
     75%  534.73ms
     90%  802.59ms
     99%  802.59ms
  11 requests in 10.02s, 4.01KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
  Non-2xx or 3xx responses: 1
Requests/sec:      1.10
Transfer/sec:    410.38B
thread 1 made 13 requests and got 11 responses

isaranto@deploy2002:~/inference-services/test/wrk/revertrisk$ wrk -c 2 -t 2 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   509.16ms  244.04ms   1.01s    80.00%
    Req/Sec     2.15      2.16     10.00     90.00%
  Latency Distribution
     50%  587.16ms
     75%  667.84ms
     90%  763.25ms
     99%     1.01s
  20 requests in 10.02s, 7.30KB read
  Non-2xx or 3xx responses: 2
Requests/sec:      2.00
Transfer/sec:    746.58B
thread 1 made 12 requests and got 10 responses
thread 2 made 11 requests and got 10 responses

isaranto@deploy2002:~/inference-services/test/wrk/revertrisk$ wrk -c 4 -t 4 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created
logfile wrk_1.log created
thread 2 created
logfile wrk_2.log created
thread 3 created
logfile wrk_3.log created
thread 4 created
logfile wrk_4.log created
Running 10s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   884.85ms  392.62ms   1.71s    75.00%
    Req/Sec     1.08      1.94     10.00     90.00%
  Latency Distribution
     50%  938.09ms
     75%     1.18s
     90%     1.42s
     99%     1.71s
  40 requests in 10.02s, 14.58KB read
  Non-2xx or 3xx responses: 5
Requests/sec:      3.99
Transfer/sec:      1.46KB
thread 1 made 12 requests and got 10 responses
thread 2 made 11 requests and got 10 responses
thread 3 made 10 requests and got 9 responses
thread 4 made 12 requests and got 11 responses
```
- Run for 1m
```
isaranto@deploy2002:~/inference-services/test/wrk/revertrisk$ wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict --header "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --latency -d 1m -- revertrisk.input
thread 1 created
logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   783.85ms  820.04ms   4.55s    88.64%
    Req/Sec     2.08      1.71     10.00     73.42%
  Latency Distribution
     50%  492.36ms
     75%  941.72ms
     90%     1.87s
     99%     4.55s
  79 requests in 1.00m, 28.67KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
  Non-2xx or 3xx responses: 1
Requests/sec:      1.31
Transfer/sec:    488.45B
thread 1 made 81 requests and got 79 responses
```
There are no significant differences from the old test results, so I consider it safe to proceed with the KServe upgrade.