Update Pytorch base image to 2.3.0
Closed, Resolved · Public · 1 Estimated Story Points

Description

I want to create a new PyTorch base image in production-images so that I can use the latest Huggingface server, which lists PyTorch 2.3.0 as a requirement. This will also allow us to use the latest ROCm version, as there is a build of torch 2.3.0 for ROCm 6.0 at https://download.pytorch.org/whl/rocm6.0/torch/

Event Timeline

Change #1032725 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0)

https://gerrit.wikimedia.org/r/1032725

Change #1032777 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0

https://gerrit.wikimedia.org/r/1032777

Unfortunately, the PyTorch package seems to get bigger with each release, and the same goes for ROCm.

PyTorch version   ROCm version   Raw image size (GB)   Compressed image size (GB)
2.1.2             5.7            10.2                  3.28
2.3.0             5.7            13.9                  4.29
2.3.0             6.0            15.9                  4.86
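
To put the growth in the table above into numbers, here is a minimal Python sketch. The sizes are copied from the table; the helper name `growth_percent` and the choice of the 2.1.2 image as the baseline are assumptions made for illustration.

```python
# Image sizes in GB, copied from the table above:
# (PyTorch version, ROCm version, raw size, compressed size)
images = [
    ("2.1.2", "5.7", 10.2, 3.28),
    ("2.3.0", "5.7", 13.9, 4.29),
    ("2.3.0", "6.0", 15.9, 4.86),
]


def growth_percent(old: float, new: float) -> float:
    """Relative growth from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100


# Assumption: the torch 2.1.2 / ROCm 5.7 image is the baseline.
_, _, base_raw, base_compressed = images[0]

for torch_version, rocm_version, raw, compressed in images:
    print(
        f"torch {torch_version} / rocm {rocm_version}: "
        f"raw +{growth_percent(base_raw, raw):.0f}%, "
        f"compressed +{growth_percent(base_compressed, compressed):.0f}%, "
        f"compression ratio {raw / compressed:.2f}x"
    )
```

Relative to the 2.1.2 image, the 2.3.0 / ROCm 6.0 image is about 56% larger raw and about 48% larger compressed.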

We need a PyTorch version of at least 2.3.0 for huggingfaceserver, which leaves us with ROCm 5.7 and 6.0 as the only options for now among the pre-built PyTorch ROCm binaries available.

Images seem to become more bloated, so I am exploring the option of installing pytorch-rocm with the --no-dependencies option and handling dependencies manually, either in the production-images repo or on the inference-services side. It is a long shot, but I think it is worth trying, at least so we can rule it out if it can't be done.
Whether this approach is feasible or not will depend on:

  • the need to include all pytorch dependencies: perhaps some of the dependencies in the list are not needed.
  • the upgrade process: whether upgrading the requirements manually is too much of a burden in terms of complexity
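
One way to approach the first point is to look at what the installed package itself declares as requirements. The following is a rough sketch using the standard library's `importlib.metadata`; the function name `declared_dependencies` is hypothetical, and the sketch assumes it runs in an environment where torch is already installed.

```python
from importlib.metadata import PackageNotFoundError, requires


def declared_dependencies(package: str) -> list[str]:
    """Return the requirement strings a package declares, or [] if none.

    Hypothetical helper: an empty list is also returned when the package
    is not installed in the current environment.
    """
    try:
        return requires(package) or []
    except PackageNotFoundError:
        return []


if __name__ == "__main__":
    # Assumption: torch is installed where this is run.
    for requirement in declared_dependencies("torch"):
        print(requirement)
```

Entries carrying an `extra == ...` marker are optional extras; the rest are what a manual install would have to provide when pip is told to skip dependencies.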

As it turns out the above approach won't cut it. Even without the dependencies the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/python/site-packages:

functorch  
torch  
torch-2.3.0+rocm6.0.dist-info  
torchgen

It also seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger:

somebody@2b71fb785583:/opt/lib/python$ du -hs /opt/lib/python/site-packages/torch/lib/* | sort -h | tail
240M	/opt/lib/python/site-packages/torch/lib/librccl.so
466M	/opt/lib/python/site-packages/torch/lib/libtorch_cpu.so
643M	/opt/lib/python/site-packages/torch/lib/libmagma.so
806M	/opt/lib/python/site-packages/torch/lib/librocblas.so
892M	/opt/lib/python/site-packages/torch/lib/libMIOpen.so
1.2G	/opt/lib/python/site-packages/torch/lib/librocsparse.so
1.3G	/opt/lib/python/site-packages/torch/lib/libtorch_hip.so
1.5G	/opt/lib/python/site-packages/torch/lib/hipblaslt
1.5G	/opt/lib/python/site-packages/torch/lib/librocsolver.so
2.5G	/opt/lib/python/site-packages/torch/lib/rocblas

The entry torch-2.3.0+rocm6.0.dist-info doesn't take much space (<10MB) and holds metadata.
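
As a sanity check on the ~12GB figure, the sizes in the `du` listing above can be added up. This is a small sketch: the listing is pasted in verbatim, while `parse_size` and `total_bytes` are hypothetical helpers that assume `du -h` style suffixes in powers of 1024.

```python
# The ten largest entries, copied from the du output above.
DU_OUTPUT = """\
240M  /opt/lib/python/site-packages/torch/lib/librccl.so
466M  /opt/lib/python/site-packages/torch/lib/libtorch_cpu.so
643M  /opt/lib/python/site-packages/torch/lib/libmagma.so
806M  /opt/lib/python/site-packages/torch/lib/librocblas.so
892M  /opt/lib/python/site-packages/torch/lib/libMIOpen.so
1.2G  /opt/lib/python/site-packages/torch/lib/librocsparse.so
1.3G  /opt/lib/python/site-packages/torch/lib/libtorch_hip.so
1.5G  /opt/lib/python/site-packages/torch/lib/hipblaslt
1.5G  /opt/lib/python/site-packages/torch/lib/librocsolver.so
2.5G  /opt/lib/python/site-packages/torch/lib/rocblas
"""

# Assumption: du -h suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> float:
    """Convert a du -h style size such as '1.2G' into bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def total_bytes(du_output: str) -> float:
    """Sum the size column of du output, one 'SIZE PATH' pair per line."""
    total = 0.0
    for line in du_output.splitlines():
        if not line.strip():
            continue
        size, _path = line.split(maxsplit=1)
        total += parse_size(size)
    return total


print(f"{total_bytes(DU_OUTPUT) / 1024**3:.1f} GiB")
```

The ten largest entries alone add up to roughly 11 GiB, which is consistent with the ~12GB estimate for the whole package.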

calbon set the point value for this task to 1. (May 21 2024, 2:33 PM)
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Change #1032725 merged by Klausman:

[operations/docker-images/production-images@master] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0)

https://gerrit.wikimedia.org/r/1032725

Change #1034975 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] fix: remove --cache-dir from pytorch image

https://gerrit.wikimedia.org/r/1034975

We had left the .pip directory inside the Docker image, which increased its size by more than 2GB (the size of the downloaded packages, since torch is really big even when compressed).
The new image is now 13.5GB, or 2.5GB when compressed, which allows us to publish it in our Docker registry.
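
A leftover directory like this is easy to miss because it is hidden. As a rough sketch of how one might spot such directories from inside a container, the following ranks the immediate subdirectories of a path by size, hidden ones included. The function names are hypothetical, and sizes are apparent file sizes rather than disk blocks, so the numbers can differ slightly from `du`.

```python
import os
from pathlib import Path


def directory_size(path: Path) -> int:
    """Total apparent size in bytes of regular files under path.

    Symlinks are skipped so that nothing is counted twice.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def largest_children(parent: Path, limit: int = 10) -> list[tuple[int, Path]]:
    """Return (size, path) for the biggest immediate subdirectories."""
    sizes = [
        (directory_size(child), child)
        for child in parent.iterdir()
        if child.is_dir() and not child.is_symlink()
    ]
    # Largest first; hidden directories such as a pip cache are included.
    return sorted(sizes, key=lambda item: item[0], reverse=True)[:limit]
```

Calling `largest_children(Path.home())` inside the image would, under these assumptions, list a multi-gigabyte cache directory near the top.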

Change #1034975 merged by Klausman:

[operations/docker-images/production-images@master] fix: remove --cache-dir from pytorch image

https://gerrit.wikimedia.org/r/1034975

# build-production-images --select '*pytorch23*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
#

and:

$ docker pull docker-registry.wikimedia.org/amd-pytorch23
Using default tag: latest
latest: Pulling from amd-pytorch23
9e94c62ce5a2: Already exists 
7bd1fb5b4955: Already exists 
253ad1301e1a: Pull complete 
e32d8a205d5b: Pull complete 
Digest: sha256:cff85430a98674eae970e9f0a30531388b1deb5c229d77c2f7711a8f3b4b89df
Status: Downloaded newer image for docker-registry.wikimedia.org/amd-pytorch23:latest
docker-registry.wikimedia.org/amd-pytorch23:latest
$ docker images
REPOSITORY                                               TAG              IMAGE ID       CREATED          SIZE
docker-registry.wikimedia.org/amd-pytorch23              latest           54fa55e17951   28 minutes ago   13.5GB
[...]
$

Change #1032777 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0

https://gerrit.wikimedia.org/r/1032777