
Test revertrisk-multilingual with GPU
Open, Needs Triage | Public | 5 Estimated Story Points

Description

The aim of this task is to enable GPU inference for the RevertRisk-Multilingual model. The objective is to run load tests and compare inference time on GPU against CPU-only inference.

In task T355656, we are implementing batch inference for revertrisk-multilingual. Once both tasks are completed, we plan to test batch inference with a GPU.

Event Timeline

achou set the point value for this task to 5. Jan 30 2024, 3:24 PM
achou moved this task from Unsorted to Backlog WikiGPT on the Machine-Learning-Team board.
achou moved this task from Backlog WikiGPT to In Progress on the Machine-Learning-Team board.

Change 995214 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: use GPU for revertrisk-multilingual

https://gerrit.wikimedia.org/r/995214

Change 995214 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: use GPU for revertrisk-multilingual

https://gerrit.wikimedia.org/r/995214

Change 1005486 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: reorder requirements.txt

https://gerrit.wikimedia.org/r/1005486

Change 1005486 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: reorder requirements.txt

https://gerrit.wikimedia.org/r/1005486

The latest changes to requirements.txt still resulted in a failed Docker image build. Therefore, the torch version conflict between the knowledge integrity and inference services repos was not the cause of the failure.

The build process lasted over 30 minutes, similar to previous attempts, leading to a timeout failure. See https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/69/console

We need to investigate why the build process is taking such a long time in this case.

Change 1006909 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: bump torch and transormers version

https://gerrit.wikimedia.org/r/1006909

Change 1006909 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: bump torch and transormers version

https://gerrit.wikimedia.org/r/1006909

Change 1008858 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: add env var USE_GPU

https://gerrit.wikimedia.org/r/1008858

To keep archives happy: we are discussing the broader problem in T359067.

TL;DR: when we pip install torch for ROCm, a large number of .so libraries ship with the package, adding up to several GB (the total varies between releases, but it is always 6+ GB). When we run pip install in a Docker image, a corresponding layer is created, and at the moment there is a limitation (see T359067#9602091) that imposes a maximum layer size of 2 GB when uploading to the Docker Registry (the last step after the build).

@achou something interesting! I checked the CI output of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1006909:

https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual/332/consoleFull

I don't see the ROCm build of torch being installed, only the vanilla one with the NVIDIA packages. This matches what I get from the command that I run locally (and that CI uses, IIUC):

DOCKER_BUILDKIT=1 docker build --target production -f .pipeline/revertrisk/multilingual.yaml --platform=linux/amd64  . -t rr-ml

I recall that you said that docker buildx installs the right torch variant, so maybe buildx does something extra nice with poetry (in knowledge integrity) that doesn't work using DOCKER_BUILDKIT=1? What do you think?

As far as I can see, torch is being downloaded from PyPI. I don't know exactly why this happens, but it seems that the extra index (a source, in pyproject.toml terms) isn't respected, so pip just sees the dependency and fetches it from PyPI.
To work around this, we can add an extra index and torch to the requirements file before knowledge integrity is installed. That way torch will already be present, downloaded from the correct index.
Example:

--extra-index-url https://download.pytorch.org/whl/rocm5.4.2
torch==2.0.1
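
One way to confirm which torch variant actually ended up in an environment is to look at the installed version string. The following is a minimal sketch, not part of any patch in this task. It assumes that wheels from the PyTorch extra index carry a local version label such as `+rocm5.4.2`, `+cu118` or `+cpu`, while the default PyPI wheel has no label. The function names `torch_variant` and `installed_torch_variant` are hypothetical.

```python
from importlib import metadata


def torch_variant(version: str) -> str:
    """Classify a torch version string by its local version label (assumed convention)."""
    # A local version label is whatever follows "+", e.g. "2.0.1+rocm5.4.2".
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cpu"):
        return "cpu"
    if local.startswith("cu"):
        return "cuda"
    # No recognised label: most likely the default wheel fetched from PyPI.
    return "pypi-default"


def installed_torch_variant() -> str:
    """Report the variant of the torch package installed in this environment."""
    try:
        return torch_variant(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return "not-installed"


if __name__ == "__main__":
    print(installed_torch_variant())
```

Running this inside the built image should report `rocm` if the extra index was respected, and `pypi-default` if pip fell back to PyPI.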

Change 1009312 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: add extra index for torch rocm

https://gerrit.wikimedia.org/r/1009312

Change 1009312 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: add extra index for torch rocm

https://gerrit.wikimedia.org/r/1009312

Change 1008858 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-ml: add a RevertRiskMultilingualGPU object

https://gerrit.wikimedia.org/r/1008858

I built an RRML image locally using the PyTorch 2.2.x base image from T360638.

The image size is 13.6GB. Here are the layers:

% docker history rrml-gpu:1
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
62d649c981e7   7 minutes ago    LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      7 minutes ago    ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY common_settings.sh common_settings.sh #…   1.16kB    buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY model_server_entrypoint.sh entrypoint.s…   294B      buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY /opt/lib/python/site-packages /opt/lib/…   1.22GB    buildkit.dockerfile.v0
<missing>      12 minutes ago   COPY python python/ # buildkit                  43.8kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   COPY revert_risk_model/model_server model_se…   51kB      buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV PYTHONPATH=/srv/revert_risk_model           0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   WORKDIR /srv/revert_risk_model                  0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN /bin/sh -c apt-get update && apt-get ins…   7.44kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   USER 0                                          0B        buildkit.dockerfile.v0
<missing>      4 days ago       |0 /bin/sh -c /usr/bin/pip3 install --target…   12.2GB    
<missing>      4 days ago       /bin/sh -c #(nop)  USER 65533                   0B        
<missing>      4 days ago       |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…   1.54MB    
<missing>      9 days ago       |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…   69.8MB    
<missing>      9 days ago       /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      9 days ago       /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      9 days ago       /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…   122MB

The size of the layer COPY /opt/lib/python/site-packages ... is 1.22GB.
There is another layer |0 /bin/sh -c /usr/bin/pip3 install --target… with a size of 12.2GB, which originates from the base image.
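
To check the layer sizes above against the 2 GB per-layer registry limit mentioned earlier in this thread, the SIZE column of `docker history` can be parsed and compared. This is only a rough sketch: it assumes the sizes are printed in decimal units (kB, MB, GB) and that the limit is exactly 2 * 10^9 bytes, which may not match how the registry counts. The helper names `parse_size` and `oversized_layers` are hypothetical.

```python
import re

# Multipliers for the decimal units assumed in the SIZE column.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

# Assumed interpretation of the 2 GB per-layer upload limit.
LAYER_LIMIT_BYTES = 2 * 10**9


def parse_size(text: str) -> int:
    """Convert a human-readable size such as '12.2GB' into bytes."""
    match = re.fullmatch(r"([0-9.]+)\s*([kMGT]?B)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit])


def oversized_layers(layers, limit=LAYER_LIMIT_BYTES):
    """Return the (description, size) pairs whose size exceeds the limit."""
    return [(desc, size) for desc, size in layers if parse_size(size) > limit]


if __name__ == "__main__":
    # The two largest layers from the listing above.
    layers = [
        ("COPY /opt/lib/python/site-packages", "1.22GB"),
        ("/usr/bin/pip3 install --target", "12.2GB"),
    ]
    for desc, size in oversized_layers(layers):
        print(f"{size:>8}  {desc}")
```

Under these assumptions, only the 12.2GB layer inherited from the base image is flagged. The 1.22GB site-packages layer stays below the limit.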

Change #1018240 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: use the Pytorch base image for RRML GPU inference

https://gerrit.wikimedia.org/r/1018240

Change #1018240 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: use the Pytorch base image for RRML GPU inference

https://gerrit.wikimedia.org/r/1018240