Test revertrisk-multilingual with GPU
Open, Needs Triage · Public · 5 Estimated Story Points
Assigned To

Authored By

	achou
	Jan 29 2024, 12:44 PM

Description

This task aims to enable GPU inference for the RevertRisk-Multilingual model, then run load tests to compare inference time against CPU-only inference.

Batch inference for revertrisk-multilingual is being implemented in T355656. Once both tasks are complete, we plan to test batch inference on a GPU.
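
A minimal sketch of how the CPU vs GPU comparison could be measured, assuming each variant is wrapped in a zero-argument callable. The names `measure_latency` and `speedup` are hypothetical and not part of inference-services. For a GPU run the callable should block until the result is back on the host, otherwise asynchronous kernel launches make the timings look better than they are.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(predict: Callable[[], object], runs: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Time repeated calls to `predict` and summarise latency in milliseconds."""
    for _ in range(warmup):
        predict()  # warm-up calls are not timed (lazy init, caches)
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = max(0, int(round(0.95 * len(samples))) - 1)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def speedup(cpu_stats: Dict[str, float], gpu_stats: Dict[str, float]) -> float:
    """Ratio of median CPU latency to median GPU latency (above 1 means GPU is faster)."""
    if gpu_stats["median_ms"] <= 0:
        raise ValueError("GPU median latency must be positive")
    return cpu_stats["median_ms"] / gpu_stats["median_ms"]
```

Running `measure_latency` once per variant with the same inputs and batch size, then passing both results to `speedup`, gives a single number to report alongside the load test results.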

Details

Subject	Repo	Branch	Lines +/-
revertrisk: use the Pytorch base image for RRML GPU inference	machinelearning/liftwing/inference-services	main	+16 -2
revertrisk-ml: add a RevertRiskMultilingualGPU object	machinelearning/liftwing/inference-services	main	+67 -28
revertrisk-multilingual: add extra index for torch rocm	machinelearning/liftwing/inference-services	main	+2 -0
revertrisk-multilingual: bump torch and transormers version	machinelearning/liftwing/inference-services	main	+3 -4
revertrisk-multilingual: reorder requirements.txt	machinelearning/liftwing/inference-services	main	+4 -2
revertrisk: use GPU for revertrisk-multilingual	machinelearning/liftwing/inference-services	main	+26 -1

Related Objects

Mentioned In: rMLIS901a1b20990b: revertrisk: use the Pytorch base image for RRML GPU inference
rMLIS764d93c97325: revertrisk-ml: add a RevertRiskMultilingualGPU object
rMLISe5f33d0ee8c4: revertrisk-multilingual: add extra index for torch rocm
rMLIS33bd6fc3af60: revertrisk-multilingual: bump torch and transormers version
rMLISa787e43f587c: revertrisk-multilingual: reorder requirements.txt
rMLIS31b4d86059c2: revertrisk: use GPU for revertrisk-multilingual
Mentioned Here: T360638: Create a Pytorch base image
rECOL1009312305af
T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images
T355656: Investigate how to implement batch inference for revertrisk-multilingual

Event Timeline

achou created this task. Jan 29 2024, 12:44 PM

Restricted Application added a subscriber: Aklapper. Jan 29 2024, 12:44 PM

calbon assigned this task to achou. Jan 30 2024, 3:21 PM

achou set the point value for this task to 5. Jan 30 2024, 3:24 PM

achou moved this task from Unsorted to Backlog WikiGPT on the Machine-Learning-Team board.

achou moved this task from Backlog WikiGPT to In Progress on the Machine-Learning-Team board.

Change 995214 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: use GPU for revertrisk-multilingual

https://gerrit.wikimedia.org/r/995214

gerritbot added a project: Patch-For-Review. Feb 2 2024, 2:36 PM

Change 995214 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: use GPU for revertrisk-multilingual

https://gerrit.wikimedia.org/r/995214

achou mentioned this in rMLIS31b4d86059c2: revertrisk: use GPU for revertrisk-multilingual. Feb 14 2024, 10:22 AM

Maintenance_bot removed a project: Patch-For-Review. Feb 14 2024, 10:30 AM

Change 1005486 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: reorder requirements.txt

https://gerrit.wikimedia.org/r/1005486

gerritbot added a project: Patch-For-Review. Feb 21 2024, 11:53 AM

Change 1005486 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: reorder requirements.txt

https://gerrit.wikimedia.org/r/1005486

achou mentioned this in rMLISa787e43f587c: revertrisk-multilingual: reorder requirements.txt. Feb 21 2024, 12:49 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 21 2024, 1:30 PM

The Docker image build still fails after the latest changes to requirements.txt, so the torch version conflict between the knowledge integrity and inference-services repos was not the cause of the failure.

The build ran for over 30 minutes, as in previous attempts, and then failed with a timeout. See https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/69/console

We need to investigate why the build process is taking such a long time in this case.

achou moved this task from In Progress to Blocked on the Machine-Learning-Team board. Feb 23 2024, 12:35 PM

Change 1006909 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: bump torch and transormers version

https://gerrit.wikimedia.org/r/1006909

gerritbot added a project: Patch-For-Review. Feb 27 2024, 11:49 AM

Change 1006909 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: bump torch and transormers version

https://gerrit.wikimedia.org/r/1006909

Maintenance_bot removed a project: Patch-For-Review. Feb 27 2024, 1:30 PM

achou mentioned this in rMLIS33bd6fc3af60: revertrisk-multilingual: bump torch and transormers version. Feb 27 2024, 1:31 PM

Change 1008858 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: add env var USE_GPU

https://gerrit.wikimedia.org/r/1008858

gerritbot added a project: Patch-For-Review. Mar 5 2024, 1:12 PM

To keep archives happy: we are discussing the macro problem in T359067.

TL;DR: when we pip install torch for ROCm, the package ships a large number of .so libraries totalling several GB (the size varies between releases, but it is always 6+ GB). A pip install in a Docker image creates a corresponding layer, and at the moment a limitation (see T359067#9602091) imposes a maximum layer size of 2 GB when uploading to the Docker Registry (the last step after the build).

@achou something interesting! I checked the CI output of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1006909:

https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual/332/consoleFull

I don't see the ROCm build of torch installed, only the vanilla one with the nvidia packages. This matches what I get from the command that I ran locally (and that CI uses, IIUC):

DOCKER_BUILDKIT=1 docker build --target production -f .pipeline/revertrisk/multilingual.yaml --platform=linux/amd64  . -t rr-ml

I recall that you said that docker buildx installs the right torch variant, so maybe buildx does something extra nice with poetry (in knowledge integrity) that doesn't work using DOCKER_BUILDKIT=1? What do you think?

As far as I can see, torch is being downloaded from PyPI. I don't know exactly why this happens, but it seems that the extra index (a source, in pyproject.toml terms) isn't respected, so pip just sees the dependency and fetches it from PyPI.
To work around this, we can add an extra index and torch to the requirements file before knowledge integrity is installed. That way torch will already be present, downloaded from the correct index.
Example:

--extra-index-url https://download.pytorch.org/whl/rocm5.4.2
torch==2.0.1
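
To confirm which variant actually ended up in the image, one option is to look at the local version suffix of the installed torch. Wheels from the download.pytorch.org indexes carry a suffix such as `+rocm5.4.2`, `+cu118` or `+cpu`, while the default PyPI wheel has none. A minimal sketch follows; the helper name `torch_build_variant` is hypothetical.

```python
def torch_build_variant(version: str) -> str:
    """Classify a torch version string by its local version suffix."""
    # "2.0.1+rocm5.4.2" -> local part "rocm5.4.2"; "2.0.1" -> local part ""
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    if local == "cpu":
        return "cpu"
    # No recognised suffix, e.g. the default wheel fetched from PyPI.
    return "unknown"


if __name__ == "__main__":
    for candidate in ("2.0.1+rocm5.4.2", "2.0.1+cu118", "2.0.1"):
        print(candidate, "->", torch_build_variant(candidate))
```

Inside the container this could be fed with `torch.__version__`. On ROCm builds `torch.version.hip` is also set (it is `None` otherwise), which gives a second check.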

Change rECOL1009312305af had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: add extra index for torch rocm

https://gerrit.wikimedia.org/r/1009312

Change rECOL1009312305af merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-multilingual: add extra index for torch rocm

https://gerrit.wikimedia.org/r/1009312

isarantopoulos mentioned this in rMLISe5f33d0ee8c4: revertrisk-multilingual: add extra index for torch rocm. Mar 7 2024, 2:26 PM

Change 1008858 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-ml: add a RevertRiskMultilingualGPU object

https://gerrit.wikimedia.org/r/1008858

achou mentioned this in rMLIS764d93c97325: revertrisk-ml: add a RevertRiskMultilingualGPU object. Mar 14 2024, 12:08 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 14 2024, 12:30 PM

achou moved this task from Blocked to Ready To Go on the Machine-Learning-Team board. Apr 4 2024, 7:54 PM

achou moved this task from Ready To Go to In Progress on the Machine-Learning-Team board. Apr 9 2024, 8:24 AM

I built a RRML image locally using the Pytorch 2.2.x base image from T360638.

The image size is 13.6GB. Here are the layers:

% docker history rrml-gpu:1
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
62d649c981e7   7 minutes ago    LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      7 minutes ago    ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY common_settings.sh common_settings.sh #…   1.16kB    buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY model_server_entrypoint.sh entrypoint.s…   294B      buildkit.dockerfile.v0
<missing>      7 minutes ago    COPY /opt/lib/python/site-packages /opt/lib/…   1.22GB    buildkit.dockerfile.v0
<missing>      12 minutes ago   COPY python python/ # buildkit                  43.8kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   COPY revert_risk_model/model_server model_se…   51kB      buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV PYTHONPATH=/srv/revert_risk_model           0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   WORKDIR /srv/revert_risk_model                  0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   RUN /bin/sh -c apt-get update && apt-get ins…   7.44kB    buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      12 minutes ago   USER 0                                          0B        buildkit.dockerfile.v0
<missing>      4 days ago       |0 /bin/sh -c /usr/bin/pip3 install --target…   12.2GB    
<missing>      4 days ago       /bin/sh -c #(nop)  USER 65533                   0B        
<missing>      4 days ago       |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…   1.54MB    
<missing>      9 days ago       |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…   69.8MB    
<missing>      9 days ago       /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      9 days ago       /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      9 days ago       /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…   122MB

The size of the layer COPY /opt/lib/python/site-packages ... is 1.22GB.
There is another layer |0 /bin/sh -c /usr/bin/pip3 install --target… with a size of 12.2GB, which originates from the base image.
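
A rough sketch for spotting such layers automatically: it parses the SIZE column of `docker history` output and flags anything above a limit. The 2 GB default mirrors the registry limit mentioned in T359067, treated here as decimal gigabytes (an assumption). `docker history` reports uncompressed sizes, so this is only an approximation of what gets uploaded. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# docker history prints decimal (base 1000) human-readable sizes.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
# A size token must stand alone between whitespace, e.g. "12.2GB" or "0B".
_SIZE_RE = re.compile(r"(?<!\S)(\d+(?:\.\d+)?)(B|kB|MB|GB|TB)(?!\S)")


def parse_size(text: str) -> int:
    """Convert a size such as '1.22GB' to a number of bytes."""
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return round(float(value) * _UNITS[unit])


def oversized_layers(history_output: str, limit_bytes: int = 2 * 10**9) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, line) for every layer larger than limit_bytes."""
    flagged = []
    for line in history_output.splitlines():
        sizes = _SIZE_RE.findall(line)
        if not sizes:
            continue  # header line or a row without a size token
        value, unit = sizes[-1]  # SIZE is the last size-like token on the row
        size = round(float(value) * _UNITS[unit])
        if size > limit_bytes:
            flagged.append((size, line.strip()))
    return flagged
```

Applied to the output above, only the 12.2GB `pip3 install --target…` layer from the base image would be flagged; the 1.22GB site-packages layer stays under the limit.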

Change #1018240 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: use the Pytorch base image for RRML GPU inference

https://gerrit.wikimedia.org/r/1018240

gerritbot added a project: Patch-For-Review. Apr 9 2024, 11:44 AM

Change #1018240 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: use the Pytorch base image for RRML GPU inference

https://gerrit.wikimedia.org/r/1018240

achou mentioned this in rMLIS901a1b20990b: revertrisk: use the Pytorch base image for RRML GPU inference. Apr 11 2024, 3:35 PM

achou moved this task from In Progress to Ready To Go on the Machine-Learning-Team board. Apr 23 2024, 12:03 PM

Maintenance_bot removed a project: Patch-For-Review. Apr 26 2024, 6:14 PM

achou moved this task from Ready To Go to Blocked on the Machine-Learning-Team board. Wed, May 15, 12:21 PM

isarantopoulos moved this task from Blocked to Ready To Go on the Machine-Learning-Team board. Tue, Jun 4, 2:38 PM
