
Automatically set libomp's number of threads when using PyTorch
Open, Needs Triage · Public · 3 Estimated Story Points

Description

Problem: When we use PyTorch, libomp is usually used to parallelize some inference steps. The library is not container-aware: it reads the number of CPUs of the underlying k8s worker (via sysfs) and uses that to size its thread pool. Since a container typically runs with far fewer CPUs than the worker node, the resulting high thread count triggers CPU throttling (causing high latency, etc.).

The workaround for the moment is to set the OMP_NUM_THREADS env variable, which tells libomp how many threads to use, but this is not really flexible since the value has to stay coupled to the container's CPU settings. Sooner or later we will increase the number of CPUs assigned to a container (in a pod) in k8s without adjusting OMP_NUM_THREADS as well, and then waste a lot of time figuring out the source of the inconsistency.

We should try to find a way in the Python code to automatically set OMP_NUM_THREADS when needed.
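For illustration, a minimal sketch of what "automatically" could look like: derive the CPU count from the container's cgroup CPU quota (the limit k8s actually enforces) instead of the host CPU count that libomp sees. The file paths and fallback logic below are assumptions (cgroup v1 and v2 differ), not the actual resource_utils.py implementation:

```python
import math
import os


def get_container_cpu_count() -> int:
    """Hypothetical helper: derive the CPU count from the cgroup CPU
    quota applied to the container, not the host's CPU count."""
    try:
        # cgroup v2: cpu.max contains "<quota> <period>" or "max <period>"
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            return max(1, math.ceil(int(quota) / int(period)))
    except OSError:
        pass
    try:
        # cgroup v1: quota is -1 when no limit is set
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota_us = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period_us = int(f.read())
        if quota_us > 0:
            return max(1, math.ceil(quota_us / period_us))
    except OSError:
        pass
    # No quota found: fall back to the host CPU count, like libomp does.
    return os.cpu_count() or 1


def set_omp_num_threads() -> None:
    # Must run before torch (or numpy) initializes OpenMP.
    os.environ.setdefault("OMP_NUM_THREADS", str(get_container_cpu_count()))
```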

Event Timeline

Change 1011130 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] resource_utils.py: add a function to automatically set OMP_NUM_THREADS

https://gerrit.wikimedia.org/r/1011130

Change 1011131 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] readability: set automatically OMP_NUM_THREADS

https://gerrit.wikimedia.org/r/1011131

Change 1011130 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] resource_utils.py: add a function to automatically set OMP_NUM_THREADS

https://gerrit.wikimedia.org/r/1011130

The main issue with the approach of setting the OMP_NUM_THREADS variable is that, IIUC, it needs to be set before torch is imported/initialized, which is not straightforward with the current code. For example, for readability's model server we would have to restructure the code to call the set_omp_num_threads() helper function before anything else.
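The ordering constraint, roughly (a sketch; the thread count here is a placeholder):

```python
import os

# OpenMP reads OMP_NUM_THREADS when the runtime is initialized, so the
# variable must already be in the environment when torch is imported.
# Setting it after "import torch" has no effect on the thread pool.
os.environ.setdefault("OMP_NUM_THREADS", "4")  # placeholder value

import torch  # noqa: E402  (deliberately imported after the env tweak)
```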

After reading https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html we decided to try setting torch's threads programmatically instead, a much easier approach. The downside is that other libs, like numpy, do use OMP_NUM_THREADS, so even if the torch solution works it won't be a completely generic fix.
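The programmatic variant is roughly the following (torch.set_num_threads caps torch's intra-op thread pool; the import path for the get_cpu_count helper is an assumption):

```python
import torch

from resource_utils import get_cpu_count  # module path is an assumption

# Cap torch's intra-op thread pool at runtime, after import, instead of
# relying on OMP_NUM_THREADS being set before the process started.
torch.set_num_threads(get_cpu_count())
```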

Change 1011131 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] readability: set torch threads using get_cpu_count

https://gerrit.wikimedia.org/r/1011131

Change rOMWC1012398c1b09 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] readability: add entrypoint to set environment variables

https://gerrit.wikimedia.org/r/1012398

Tried setting the number of threads via torch's API directly in the code, but unfortunately it didn't work (at least with 1.13.0). Trying a different road with https://gerrit.wikimedia.org/r/1012398, which in theory is more generic and future-proof.

calbon set the point value for this task to 3.

Change rOMWC1012398c1b09 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] readability: add entrypoint to set environment variables

https://gerrit.wikimedia.org/r/1012398

Change 1012700 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update Docker image for Readability

https://gerrit.wikimedia.org/r/1012700

Change 1012701 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] Rename entrypoint.sh to ci_entrypoint.sh

https://gerrit.wikimedia.org/r/1012701

Change 1012711 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] Set most of the model servers to run a specific entrypoint.sh

https://gerrit.wikimedia.org/r/1012711

Change 1012700 merged by Elukey:

[operations/deployment-charts@master] ml-services: update Docker image for Readability

https://gerrit.wikimedia.org/r/1012700

Change 1012701 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] Rename entrypoint.sh to ci_entrypoint.sh

https://gerrit.wikimedia.org/r/1012701

We decided to go for https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1012711

The idea is to source common env variables from a bash file, and add another general-purpose script to run the model server (set as entrypoint in the blubber configs). This should keep us future-proof: by default we'll use the generic entrypoint.sh, and if a new model server requires a more custom/specific one, we'll just source the common bash file in there as well.
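Roughly, the shape of the generic entrypoint (a sketch, not the merged script; the file paths and the name of the shared bash file are assumptions):

```bash
#!/bin/bash
# Generic entrypoint.sh used by the blubber configs.

# Shared env setup (OMP_NUM_THREADS and friends). A model server that
# needs a custom entrypoint would source this same file.
source /srv/app/common_settings.sh  # hypothetical path/name

# Replace the shell with the model server process.
exec python3 model_server/model.py "$@"  # hypothetical path
```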

Next steps:

  • Deploy the new images to staging and verify that everything works as expected.
  • Rollout to prod.

Change 1012711 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] Set most of the model servers to run a specific entrypoint.sh

https://gerrit.wikimedia.org/r/1012711

Change #1017292 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update RR ML/Wikidata's Docker images

https://gerrit.wikimedia.org/r/1017292

Change #1017292 merged by Elukey:

[operations/deployment-charts@master] ml-services: update RR ML/Wikidata's Docker images

https://gerrit.wikimedia.org/r/1017292

Thanks to Aiko, who fixed some issues with RR Wikidata and ML, the new code is now deployed to all the model servers that used to have OMP_NUM_THREADS explicitly set in deployment-charts. The model servers work fine and their performance is good.

In the team meeting we agreed not to roll out the new entrypoint.sh change everywhere now, but to let it go out gradually as future deployments happen.

The new entrypoint has been rolled out as part of the migration to the mw-int-ro endpoint. Task done!