
Create a PyTorch base image
Closed, Resolved · Public · 3 Estimated Story Points

Description

In T359067 it was decided to create a base image with PyTorch ROCm preinstalled, to be used in various Blubber files and to avoid duplicating huge Docker layers.

High level idea:

  1. The base image will probably live in the production-images repository (a rough sketch of what it could look like follows this list).
  2. The image can probably target Debian Bookworm directly: in theory our stack supports Python 3.11, and this base image will be used by relatively new services (RR multi-lingual, Hugging Face, etc.).
  3. We should decide on a ROCm version to use; maybe the 5.x series is ok for the moment?
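A rough sketch of what the base image could look like, assuming the Wikimedia Bookworm base image and the upstream ROCm wheel index; the image name, index URL, and package pinning below are assumptions, not the merged code:

  FROM docker-registry.wikimedia.org/bookworm:latest

  RUN apt-get update \
      && apt-get install -y python3 python3-pip \
      && rm -rf /var/lib/apt/lists/*

  # Install the ROCm build of torch once, so every downstream Blubber image
  # can reuse this ~10G layer instead of re-installing torch itself.
  RUN pip3 install --break-system-packages \
      --index-url https://download.pytorch.org/whl/rocm5.5 \
      torch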

Event Timeline

Change #1013335 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add the amd-pytorch base image for ML workloads

https://gerrit.wikimedia.org/r/1013335

calbon set the point value for this task to 3. (Mar 26 2024, 2:23 PM)

Change #1013335 merged by Elukey:

[operations/docker-images/production-images@master] Add the amd-pytorch base image for ML workloads

https://gerrit.wikimedia.org/r/1013335

To keep archives happy:

  • Aiko and I tested the Revert Risk ML Docker image using the PyTorch base image and ran it locally; it worked fine!
  • The new image was pushed to the registry, so we can start using it.

Use cases to test:

  • A Blubber model server using the PyTorch base image.
  • torch listed in one of the model server's requirements.txt files (once with the same version as the base image, once with a different one).

For the same-version case, we hope that pip will not install torch again (if it does, we have a problem). For the different-version case, pip should go ahead and install the other torch version, and that copy should take priority over the base image's one (since it is deployed under /opt).
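A hedged sketch of the expected pip behavior in the two cases; the base image name, tag, and torch versions are illustrative assumptions:

  # Hypothetical Blubber build stage starting from the torch base image.
  FROM docker-registry.wikimedia.org/amd-pytorch21:latest

  # Same version as the base image: pip should report
  # "Requirement already satisfied" and create no new multi-GB layer.
  RUN pip3 install "torch==2.1.2"

  # Different version: pip installs the requested one; in the real Blubber
  # setup it lands under /opt/lib/python/site-packages, which comes first on
  # sys.path and therefore shadows the base image's copy.
  RUN pip3 install "torch==2.1.1"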

There is an obstacle with the current approach that I didn't think about. In the current setup, this happens:

  • We pip install pytorch-rocm in the base image, so a ~10G layer is created. If we want to do things properly and follow what Blubber does, we'd need to pip install the torch package under /opt/lib/python/site-packages as the somebody user.
  • Blubber's build image is then created, and it pip installs all the packages required in requirements.txt. If torch is listed explicitly, or pulled in as a transitive dependency, it is not re-installed, since a version is already present (the pytorch-rocm one from the base image).
  • Then we COPY /opt/lib/python/site-packages from the build image into the final production image, which is based on the base image. This creates another layer of ~10G+ containing torch and all the other packages.

So we end up with an image roughly double the desired size (~20G), because two layers both contain pytorch-rocm; the sketch below illustrates the mechanism.
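A minimal sketch of how the duplication arises, with assumed image names, tags, and paths:

  # Stage 1: Blubber's build stage starts from the torch base image, so torch
  # already sits in /opt/lib/python/site-packages and pip skips it.
  FROM docker-registry.wikimedia.org/amd-pytorch21:latest AS build
  COPY requirements.txt .
  RUN pip3 install --target /opt/lib/python/site-packages -r requirements.txt

  # Stage 2: the production stage is also based on the torch image (one ~10G
  # layer). The COPY below materializes the whole site-packages directory,
  # torch included, as a second ~10G layer: the files are byte-identical, but
  # Docker cannot deduplicate between a RUN layer and a COPY layer.
  FROM docker-registry.wikimedia.org/amd-pytorch21:latest
  COPY --from=build /opt/lib/python/site-packages /opt/lib/python/site-packages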

I noticed something odd in the base image.
When I import torch inside the image I get a warning about numpy missing:

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

I verified that numpy doesn't exist in the image, which is odd because it is listed as a dependency of torch in its pyproject.toml. I saw that the same happens for requests and pyyaml (and it may be the case for the other dependencies as well).
The only way this can happen is if a package is installed without its dependencies (pip's --no-deps flag), which is not what we are doing.
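A quick way to reproduce the check inside the image; the image name and tag are assumptions:

  FROM docker-registry.wikimedia.org/amd-pytorch22:latest

  # Importing torch triggers the UserWarning quoted above.
  RUN python3 -c "import torch"
  # numpy is indeed absent; don't fail the build, just log it.
  RUN python3 -c "import numpy" || echo "numpy is missing"
  # Show what pip recorded as torch's dependencies (the Requires: field).
  RUN pip3 show torch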

Change #1015530 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Rework the amd-pytorch22's image

https://gerrit.wikimedia.org/r/1015530

The above "issue" with numpy seems that it is not an issue after all. Numpy was removed as a requirement after torch 1.9 but they do maintain an aggressive warning as I read in an issue.

I've built a base image with pytorch 2.1.2 and rocm5.5, which is 10.7GB, and tested it with the huggingface image, which ends up at 13.9GB (biggest layer: 3.63GB), compared to 15.7GB without the base-packages -> site-packages symlink trick (biggest layer: 5.68GB). I do end up with some nvidia stuff in there, but the improvement is obvious. Nice work @elukey!
I'm testing some modifications to the requirements of the huggingface image to see if I can make the size even smaller.
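For the archives, one plausible shape of the base-packages -> site-packages symlink trick mentioned above; the actual implementation lives in the production-images repo, and the paths below are assumptions:

  FROM docker-registry.wikimedia.org/bookworm:latest

  # torch's files physically live in base-packages, which only exists in this
  # base image; site-packages (the directory Blubber later COPYies into the
  # production stage) holds only symlinks to them. COPY preserves symlinks
  # rather than the ~10G of files, so the copied layer stays small, and the
  # link targets resolve because the production stage shares this base image.
  RUN pip3 install --break-system-packages \
          --target /opt/lib/python/base-packages \
          --index-url https://download.pytorch.org/whl/rocm5.5 torch \
      && mkdir -p /opt/lib/python/site-packages \
      && ln -s /opt/lib/python/base-packages/* /opt/lib/python/site-packages/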

Change #1016798 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::builder: add the somebody user's UID

https://gerrit.wikimedia.org/r/1016798

Change #1016798 merged by Elukey:

[operations/puppet@production] role::builder: add the somebody user's UID

https://gerrit.wikimedia.org/r/1016798

Change #1015530 merged by Elukey:

[operations/docker-images/production-images@master] Rework the amd-pytorch22's image

https://gerrit.wikimedia.org/r/1015530

Change #1016807 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-pytorch22: move comments to a README file

https://gerrit.wikimedia.org/r/1016807

Change #1016807 merged by Elukey:

[operations/docker-images/production-images@master] amd-pytorch22: move comments to a README file

https://gerrit.wikimedia.org/r/1016807

We have created two base images, one for PyTorch 2.2.x and one for 2.1.x; they will be tested and used with the Revert Risk ML and Hugging Face model servers.
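A consumer image would then simply start from one of the new bases; the registry path is Wikimedia's standard one, but the exact image name and tag are assumptions based on the change titles above:

  FROM docker-registry.wikimedia.org/amd-pytorch22:latest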