
Investigate if it is possible to reduce torch's package size
Closed, Resolved · Public · 10 Estimated Story Points

Description

The main issue is that pip installing torch with ROCm support brings in ~8.4GB of libraries with ROCm version 5.4.0.
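
For reference, this is the kind of install that produces the layout shown below (a sketch; the exact ROCm wheel index URL is an assumption and depends on the target ROCm version):

# Install the ROCm build of torch and measure what lands in site-packages.
pip install torch --index-url https://download.pytorch.org/whl/rocm5.4.2
du -hs "$(python -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib"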

This is how the space is used:

somebody@4b8595f6b014:/srv/revert_risk_model$ du -hs /opt/lib/python/site-packages/torch/lib/* | sort -h | tail
390M	/opt/lib/python/site-packages/torch/lib/libMIOpen.so
497M	/opt/lib/python/site-packages/torch/lib/libtorch_cpu.so
596M	/opt/lib/python/site-packages/torch/lib/librocsparse.so
693M	/opt/lib/python/site-packages/torch/lib/librocfft-device-0.so
696M	/opt/lib/python/site-packages/torch/lib/librocfft-device-1.so
715M	/opt/lib/python/site-packages/torch/lib/libtorch_hip.so
721M	/opt/lib/python/site-packages/torch/lib/librocfft-device-2.so
764M	/opt/lib/python/site-packages/torch/lib/librocfft-device-3.so
1.1G	/opt/lib/python/site-packages/torch/lib/librocsolver.so
1.4G	/opt/lib/python/site-packages/torch/lib/rocblas

In particular, the rocblas directory contains libs to support all available AMD GPU cards, something that we don't really need.
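
As a sketch (assuming the usual rocblas/library layout with per-architecture Tensile files), the gfx targets that the bundled kernels cover can be listed like this:

# Count the bundled rocBLAS kernel files per gfx target; the file naming is
# an assumption based on how rocBLAS typically lays out its Tensile library.
ls /opt/lib/python/site-packages/torch/lib/rocblas/library/ | grep -oE 'gfx[0-9a-f]+' | sort | uniq -c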

This task should investigate whether it is possible to reduce the size of the package, possibly by building it on our own infrastructure.

Useful links:

Event Timeline

Ran a little test to see if stripping symbols from ROCm libraries could give us some space benefit:

elukey@ml-staging2001:~$ du -hs /opt/rocm-5.4.0/lib/* | sort -h | tail
283M	/opt/rocm-5.4.0/lib/librocblas.so.0.1.50400
384M	/opt/rocm-5.4.0/lib/libMIOpen.so.1.0.50400
594M	/opt/rocm-5.4.0/lib/librocsparse.so.0.1.50400
689M	/opt/rocm-5.4.0/lib/librocfft-device-0.so.0.1.50400
692M	/opt/rocm-5.4.0/lib/librocfft-device-1.so.0.1.50400
717M	/opt/rocm-5.4.0/lib/librocfft-device-2.so.0.1.50400
759M	/opt/rocm-5.4.0/lib/librocfft-device-3.so.0.1.50400
768M	/opt/rocm-5.4.0/lib/rocfft_kernel_cache.db
1.1G	/opt/rocm-5.4.0/lib/librocsolver.so.0.1.50400
1.5G	/opt/rocm-5.4.0/lib/rocblas

elukey@ml-staging2001:~$ nm /opt/rocm-5.4.0/lib/librocfft-device-3.so.0.1.50400 | wc -l
26761
elukey@ml-staging2001:~$ nm /opt/rocm-5.4.0/lib/librocsolver.so.0.1.50400 | wc -l
14634

elukey@ml-staging2001:~$ strip /opt/rocm-5.4.0/lib/librocfft-device-3.so.0.1.50400 -o /tmp/librocfft-device-3.so.0.1.50400-stripped
elukey@ml-staging2001:~$ nm /tmp/librocfft-device-3.so.0.1.50400-stripped
nm: /tmp/librocfft-device-3.so.0.1.50400-stripped: no symbols

elukey@ml-staging2001:~$ strip /opt/rocm-5.4.0/lib/librocsolver.so.0.1.50400 -o /tmp/librocsolver.so.0.1.50400-stripped 
elukey@ml-staging2001:~$ nm /tmp/librocsolver.so.0.1.50400-stripped
nm: /tmp/librocsolver.so.0.1.50400-stripped: no symbols

elukey@ml-staging2001:~$ du -hs /tmp/librocsolver.so.0.1.50400-stripped
1.1G	/tmp/librocsolver.so.0.1.50400-stripped

elukey@ml-staging2001:~$ du -hs /tmp/librocfft-device-3.so.0.1.50400-stripped
753M	/tmp/librocfft-device-3.so.0.1.50400-stripped

Overall there seems to be only a few MBs of gain in one library :(
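
This result makes sense if most of the bytes are embedded GPU device code rather than symbol or debug information; a quick way to confirm (a sketch, and the interpretation of the section contents is an assumption) is to look at the ELF section sizes:

# Show the largest ELF sections; if the space is dominated by sections that
# hold embedded device code rather than symbols, strip cannot help much.
size -A /opt/rocm-5.4.0/lib/librocfft-device-3.so.0.1.50400 | sort -k2 -n | tail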

Debian Trixie (testing) offers ROCm 5.5 packages, and so far their sizes look better than the vanilla upstream ones:

root@e055b7f3f246:/# du -hs /usr/lib/x86_64-linux-gnu/librocblas.so.0.1 
911M	/usr/lib/x86_64-linux-gnu/librocblas.so.0.1

root@e055b7f3f246:/# du -hs /usr/lib/x86_64-linux-gnu/librocsolver.so.0.1 
232M	/usr/lib/x86_64-linux-gnu/librocsolver.so.0.1

Using Trixie may not be viable for the moment, and it would lock us to a specific ROCm version, but it is worth keeping in mind for the future.
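
For anyone wanting to reproduce the comparison, something along these lines in a throwaway container should work (the Debian binary package names are an assumption and may change before Trixie is released):

# Pull the ROCm 5.5 libraries from Debian testing and compare their sizes.
docker run --rm debian:testing bash -c '
  apt-get update -qq &&
  apt-get install -y --no-install-recommends librocblas0 librocsolver0 &&
  du -hs /usr/lib/x86_64-linux-gnu/librocblas.so* /usr/lib/x86_64-linux-gnu/librocsolver.so*
'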

elukey updated the task description.

Followed some guides, and as far as I can see most manual builds end up with a Python wheel that doesn't contain the extra ROCm libs, unlike the upstream pytorch ones. I guess upstream does the bundling in a special build process when they release, so the next step would be to test a manually built package on a WMF host with a GPU to verify what I wrote above.
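
For the record, the from-source route would look roughly like this (untested sketch; the gfx target is just an example, and PYTORCH_ROCM_ARCH should list the archs we actually have):

# Hipify the CUDA sources and build against the system ROCm install,
# so the resulting wheel should not bundle the ROCm libraries.
cd pytorch
python3 tools/amd_build/build_amd.py
export PYTORCH_ROCM_ARCH="gfx90a"        # example target, adjust to our GPUs
USE_ROCM=1 python3 setup.py bdist_wheel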

I tried with stat1008 (the only one with a GPU, see https://phabricator.wikimedia.org/T358763), but it runs Debian 10, and even with pyenv + Python 3.11 installed separately, other build deps (like cmake) are too old.

isarantopoulos set the point value for this task to 10. (Mar 12 2024, 2:39 PM)

I found this build script:

https://github.com/pytorch/builder/blob/main/manywheel/build_rocm.sh

This seems to be how upstream packages the giant wheel files, shipping the ROCm libs inside the pytorch wheel. IIUC the build instructions that upstream provides (like the AMD ones) should allow the creation of a small package that refers to system libs already installed (but I haven't verified it yet).
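
A couple of quick checks (a sketch; the paths and wheel names are illustrative) can show whether a given build bundles the ROCm libraries or expects them from the system:

# Does the wheel ship the ROCm libs inside torch/lib/?
unzip -l dist/torch-*.whl | grep 'torch/lib/.*\.so' | sort -k1 -n | tail

# After installing it, does libtorch_hip.so resolve ROCm from bundled copies
# or from a system install under /opt/rocm?
ldd /opt/lib/python/site-packages/torch/lib/libtorch_hip.so | grep -i roc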

While experimenting with various approaches to generating the Docker images differently and stripping out unneeded content, I tried the following things:

  • Build the images using the manywheel approach mentioned in T359569#9627061
    • Reducing the number of targeted GPU archs (from half a dozen to two) does make a difference in size, but not a big one, maybe 5-10% at best.
    • The build takes about as long as the other image approaches (i.e. well over an hour, even on a fast machine).
    • These manywheel build scripts are somewhat brittle, and it is hard to say whether we could adapt them to newer/different versions of PyTorch and ROCm when (not if) we need them.
  • Try to reduce the size of the binaries after building them, using strip (already covered by Luca above) and upx (a tool for transparent binary compression, at the cost of startup time); see the sketch after this list.
    • This gains even less, since the biggest chunks of the libraries (the ones that are not .so files) can't be handled by upx, and the .so files, while yielding a non-trivial reduction, only save maybe 100-200M of image size.
  • Use a CUDA-specific tool (nvprune) to reduce the number of supported GPU archs without having to rebuild everything. Unsurprisingly, this tool does not work with ROCm binaries.
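
For reference, the upx attempt mentioned above looked roughly like this (a sketch; the exact flags and output handling are assumptions):

# --best/--lzma maximize compression, -o writes to a new file so the original
# library stays untouched. upx transparently decompresses the library at load
# time, which is where the startup-time cost comes from.
upx --best --lzma -o /tmp/libtorch_cpu.so.upx /opt/lib/python/site-packages/torch/lib/libtorch_cpu.so
du -hs /opt/lib/python/site-packages/torch/lib/libtorch_cpu.so /tmp/libtorch_cpu.so.upx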

Given that we have a way to unblock ourselves on the effort that prompted these reduction attempts (changing the Docker registry), I think we have spent enough time on this. At least we now know what doesn't work for us.