Steps to replicate the issue (include links if applicable):
While upgrading model servers to KServe 0.11.1, we noticed during load testing that some model servers show increased latencies. The affected servers host gradient boosting models (GBMs): XGBoost (revertrisk-language-agnostic) and CatBoost (revertrisk-multilingual and readability).
The problem is that too many threads are created (100+), which throttles model inference under increased load. This is most likely because cgroups v2 is not supported by the underlying libraries, so they size their thread pools from the host's CPU count rather than the container's CPU limit.
For now we have limited the number of threads by setting the OMP_NUM_THREADS and OMP_THREAD_LIMIT environment variables on the pods, but we still see additional threads being created. Even if we succeed in limiting each server to a single thread, this is only a temporary solution, as it takes a toll on model performance.
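For reference, a minimal sketch of the current workaround. OpenMP reads these variables when the runtime is loaded, so they must take effect before the serving process imports xgboost or catboost (the values shown are illustrative, not a recommendation):

```python
import os

# OpenMP reads these at library load time, so they must be set before
# xgboost/catboost are imported; on KServe this is done via pod env vars.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OMP_THREAD_LIMIT"] = "1"
```

Note that any thread pools created outside OpenMP (e.g. the library's own workers) are not covered by these variables, which may explain the extra threads we still observe.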
The issue in XGBoost appears to be resolved upstream, with support available in release 2.0.1.
For CatBoost we have opened an issue on GitHub, but on our side we can start by specifying the number of threads after loading the model, if possible.