
Increased latencies with Kserve 0.11.1 (cgroups v2)
Open, Medium, Public, BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
While upgrading model servers to Kserve 0.11.1, we noticed during load testing that some model servers show increased latencies. The affected servers use gradient boosting models (GBM): xgboost (revertrisk-language-agnostic) and catboost (revertrisk-multilingual and readability).
The problem is that too many threads are created (100+), which causes model inference to be throttled under increased load. The probable cause is that the underlying libraries do not support cgroups v2.
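
To illustrate the suspected cause, here is a minimal sketch of how a CPU limit could be derived from cgroups v2. It assumes the container limit is exposed in the unified hierarchy at /sys/fs/cgroup/cpu.max as "<quota> <period>" (or "max <period>" when unlimited). The function names are hypothetical and not taken from xgboost, catboost or Kserve.

```python
import math
import os
from pathlib import Path


def parse_cpu_max(text, default):
    """Turn the contents of a cgroups v2 cpu.max file into a CPU count.

    `default` is returned when no quota is set ("max") or the file is empty.
    """
    parts = text.split()
    if not parts or parts[0] == "max":
        return default
    quota = int(parts[0])
    # Assumption: fall back to the kernel default period of 100000 us
    # if the period field is missing.
    period = int(parts[1]) if len(parts) > 1 else 100000
    # Round up so that a fractional limit (e.g. 0.5 CPU) still yields 1 thread.
    return max(1, math.ceil(quota / period))


def effective_cpu_count(path="/sys/fs/cgroup/cpu.max"):
    """Return the smaller of the host CPU count and the cgroups v2 limit."""
    host = os.cpu_count() or 1
    try:
        text = Path(path).read_text()
    except OSError:
        # No cgroups v2 file available: only the host count is known.
        return host
    return min(host, parse_cpu_max(text, host))
```

A library that only looks at the host CPU count would size its thread pool from os.cpu_count(), which on a large node can be far above the pod's CPU limit.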

At the moment we have limited the number of threads by setting the OMP_NUM_THREADS and OMP_THREAD_LIMIT environment variables on the pods, but we still see more threads created. Even if we manage to limit this to a single thread, it is only a temporary solution because it takes a toll on model performance.

The issue in xgboost seems to be resolved, and support is available in release 2.0.1.
For catboost we have opened an issue on GitHub; on our side we can start by specifying the number of threads after we load the model, if possible.

Event Timeline

isarantopoulos renamed this task from Increased latencies with Kserve 0.11.1 to Increased latencies with Kserve 0.11.1 (cgroups v2).Oct 26 2023, 4:31 PM

I started doing some prep-work on knowledge_integrity to update xgboost
https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/tree/update-xgboost

One thing that needs to happen is updating the base Python version used in the project and in CI (from a minimum of 3.7 to 3.8), as the xgboost 2.0.1 releases require Python 3.8 as a minimum.

Next steps:

  • Work with Research on T350389 to move KI to xgboost 2.0.1
  • Keep working with Yandex upstream on https://github.com/catboost/catboost/issues/2518, and hopefully get a new release of Catboost. It is likely going to take some time.
  • While waiting for a permanent fix in catboost, we should probably use the thread_count parameter wherever catboost is used, to cap the maximum number of threads.
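
A rough sketch of the thread_count workaround from the last item, assuming the model server holds an already loaded CatBoost model and calls its predict_proba method, which accepts a thread_count keyword. The helper names are hypothetical, and reusing OMP_NUM_THREADS as the source of the limit is only an illustrative choice, not something the model servers are known to do.

```python
import os


def resolve_thread_count(env=None, default=1):
    """Read a thread budget from OMP_NUM_THREADS, falling back to `default`.

    Assumption: the pod already sets OMP_NUM_THREADS, so the same value is
    reused here instead of introducing a new setting.
    """
    env = os.environ if env is None else env
    try:
        value = int(env.get("OMP_NUM_THREADS", ""))
    except ValueError:
        # Unset, empty or non-integer values fall back to the default.
        return default
    return value if value > 0 else default


def predict_with_thread_limit(model, features, thread_count):
    """Run inference with an explicit cap on the number of threads.

    `model` is expected to be a loaded CatBoost model.
    """
    return model.predict_proba(features, thread_count=thread_count)
```

In a model server this could be called as predict_with_thread_limit(model, features, resolve_thread_count()), so the cap follows the pod configuration rather than the CPU count of the host.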

Change 975274 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revert-risk: update knowledge integrity to 0.5.0

https://gerrit.wikimedia.org/r/975274

Change 975274 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0

https://gerrit.wikimedia.org/r/975274

Change 975304 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary

https://gerrit.wikimedia.org/r/975304

Change 975304 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary

https://gerrit.wikimedia.org/r/975304

Opened T353461 to track the efforts to fix catboost in Readability.