The ML team is going to invest heavily in PyTorch on ROCm GPUs in the near future. The main side effect of this choice is that the related Docker images (used to run our model servers) end up being very large (10+ GB), which poses a challenge for CI and the Docker Registry.
Some high level details:
- The PyPI package for torch has a variant for ROCm: it ships with all the ROCm libraries (.so files etc.) bundled at a specific version. This is very handy, but it generates a huge layer, ~4-5 GB in size (the layer created by the pip install step, for example).
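For reference, a minimal Dockerfile sketch of the step that produces the huge layer (the base image, torch version, and ROCm index URL are examples, not our actual config):

```dockerfile
# Example base image, not our real one.
FROM python:3.11-bookworm
# The ROCm variant of torch is served from a dedicated PyTorch package index;
# the wheel bundles all the ROCm shared libraries, so this single pip step
# produces one image layer of ~4-5 GB.
RUN pip install torch --index-url https://download.pytorch.org/whl/rocm6.0
```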
- Kubernetes hostPath volumes (or similar) could be used to install the ROCm libs on the worker node and expose them to the containers. Besides posing some compatibility challenges (worker OS vs container OS, etc.), hostPath is explicitly forbidden by Knative Serving for security reasons.
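For context, this is roughly what the (forbidden) hostPath approach would look like on a plain Kubernetes pod; names and paths below are hypothetical:

```yaml
# Hypothetical pod spec: mount the worker node's ROCm install into the container.
# Knative Serving rejects hostPath volumes like this one for security reasons.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
    - name: server
      image: model-server:latest
      volumeMounts:
        - name: rocm-libs
          mountPath: /opt/rocm
          readOnly: true
  volumes:
    - name: rocm-libs
      hostPath:
        path: /opt/rocm
        type: Directory
```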
In our current tests via Blubber and CI we often hit limits; the most noticeable ones are:
- CI nodes end up exhausting disk space due to the big Docker images being built (partially solved, but it may get worse over time).
- When CI tries to push to the Docker registry we hit the 2 GB size limit of nginx's tmpfs, and the whole operation fails with a 500. We could increase the tmpfs size, but we would likely hit the new limit again over time.
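The push failure condition can be sketched as follows (the 2 GiB cap and the layer size are assumptions for illustration, matching the figures above):

```shell
# The registry's nginx buffers each uploaded blob in a tmpfs capped at ~2 GiB
# (assumed), so any single layer above that cap makes the push fail with a 500.
TMPFS_CAP=$((2 * 1024 * 1024 * 1024))    # 2 GiB tmpfs limit
TORCH_LAYER=$((4 * 1024 * 1024 * 1024))  # hypothetical ~4 GiB pip/torch layer
if [ "$TORCH_LAYER" -gt "$TMPFS_CAP" ]; then
  echo "push fails: layer larger than tmpfs cap"
else
  echo "push succeeds"
fi
```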
Creating base images could help with this problem; some ideas to discuss:
- If torch upstream offers a way to use the OS-provided ROCm libraries, we could create a base image with those libs and "only" install vanilla PyTorch via pip, which would certainly result in smaller layers. We may need a custom-built torch though, which is not ideal.
- We could create a base image containing the torch package installed under a system path (as if it had been installed via a deb package on Bookworm, for example), so that PYTHONPATH would pick it up. Then we'd use that image in our Blubber file and install all the packages other than torch (open question: what happens if a package lists torch in requirements.txt; hopefully pip handles it fine).
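The PYTHONPATH pickup mechanism behind the second idea can be sketched with a dummy module standing in for torch (the temp directory is a stand-in for a real system path like /usr/lib/python3/dist-packages; everything here is illustrative):

```shell
# Simulate a package pre-installed under a system path (as a base image would
# provide it), and show that Python resolves it via PYTHONPATH before any
# pip-installed copy would be considered.
SYSDIR=$(mktemp -d)   # stand-in for a system dist-packages dir
mkdir -p "$SYSDIR/torch"
printf '__version__ = "0.0-demo"\n' > "$SYSDIR/torch/__init__.py"
PYTHONPATH="$SYSDIR" python3 -c 'import torch; print(torch.__version__)'
# prints "0.0-demo": the system-path copy wins, since PYTHONPATH entries
# come before site-packages in sys.path
```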
Even with the base image ideas above, the nginx tmpfs issue (capping the layer size at ~2 GB) still seems to be something we need to solve.
Finally, ServiceOps should be aware of this mess and we should get their sign-off, since there is a risk of adding too much load to the Docker Registry.