
Find an efficient strategy to add Pytorch and ROCm packages to our Docker images
Closed, Resolved (Public)

Description

The ML team is going to invest heavily in the usage of Pytorch on ROCm GPUs in the near future. The main side effect of this choice is that the related Docker images (which use both to run our model servers) end up being really huge (10GB+ in size), and this poses a challenge for CI and the Docker Registry.

Some high level details:

  • The PyPI package for torch has a variant for ROCm, namely it ships with all the ROCm libraries (.so files etc.) bundled at a specific version. This is very handy, but it ends up generating a huge layer, ~4-5 GB in size (the layer produced by the pip install step, for example; see the sketch after this list).
  • The hostPath feature (or similar) offered by Kubernetes could be used to install the ROCm libs on the worker node and expose them to the containers. Besides posing some compatibility challenges (worker OS vs container OS, etc.), it is also explicitly forbidden by Knative Serving, which prohibits hostPath for security reasons.
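For illustration, this is roughly how the ROCm variant gets pulled in and where the size goes (a minimal sketch; the ROCm version in the index URL is just an example, not necessarily the one we use):

pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/rocm6.0
# The bundled ROCm shared objects live under torch/lib and dominate the layer size:
du -hs "$(python3 -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib"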

In our current tests via Blubber and CI we often hit limits; the most noticeable ones are:

  • CI nodes end up exhausting disk space due to the big Docker images being built (partially solved, but it may get worse over time).
  • When CI tries to push to the Docker registry we hit the limit of the nginx tmpfs file size (2GB), and the whole operation eventually fails with a 500. We could increase the tmpfs size, but the problem is likely to recur over time.

Creating base images could help with this problem; some ideas to discuss:

  • If torch upstream offers a way to use the OS ROCm libraries, we may be able to create a base image with the libs and "only" install vanilla Pytorch via pip, ending up with smaller layer sizes for sure. We may need a custom-built torch though, which is not ideal.
  • We could create a base image containing the torch package installed under a system path (as if it had been installed via a deb package on Bookworm, for example) so that PYTHONPATH would pick it up. Then we'd use that image in our Blubber file and install all the packages other than torch (question mark about what happens if a package requires torch in requirements.txt; hopefully pip handles it fine). See the sketch after this list.
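A minimal sketch of the second idea, assuming a Bookworm-based image and a venv under /opt (image name, paths and ROCm version are placeholders, not our actual setup):

cat > Dockerfile.torch-base <<'EOF'
FROM docker-registry.wikimedia.org/bookworm:latest
RUN apt-get update && apt-get install -y python3-venv && rm -rf /var/lib/apt/lists/*
# Install the ROCm torch wheel once, in a shared location (index URL / version are illustrative).
RUN python3 -m venv /opt/torch-venv && \
    /opt/torch-venv/bin/pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/rocm6.0
# Downstream (Blubber-built) images would pick it up via PYTHONPATH (python3.11 assumed on Bookworm).
ENV PYTHONPATH=/opt/torch-venv/lib/python3.11/site-packages
EOF
docker build -f Dockerfile.torch-base -t torch-rocm-base:test .

Model-server images built on top of this would then install everything except torch, so the huge layer is built once and shared.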

Even with the base image ideas above, the nginx tmpfs issue (which caps the file size at ~2GB) still seems to be something to solve.

Finally, ServiceOps should be aware of this mess and we should get their sign-off, since there is a risk of adding too much load to the Docker Registry.

Event Timeline

Previous discussion with Service Ops on IRC:

15:32  <aiko> o/ hi from ml-team, I need some help with a 500 error when CI is pushing the model image. here is the log: https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/69/execution/node/59/log/
15:32  <aiko> the new image size increased ~2G, wondering if the error is due to hitting a layer limit
15:44  <jayme> aiko: do you know the ~ max layer size?
16:06  <aiko> jayme: ~4.5G
16:06  <jayme> ouch :|
16:07  <aiko> what is the layer limit?
16:08  <jayme> there is a limit to the "filesize" that can be uploaded to the nginx in front of the registry, basically enforced by the size of a tmpfs, which is 2GB
16:14  <aiko> ok I see. can it be adjusted?
16:14  <aiko> I'm also looking if i can reduce the layer size
16:16  <jayme> for reference, this was the issue we had for this https://phabricator.wikimedia.org/T288198
16:17  <jayme> yes we could increase, but not without effort (as it's a tmpfs - we'd need more ram)
16:17  <jayme> alternatively we could move that cache to disk (slow) which would need more testing (and a bigger disk :))
16:39  <jayme> aiko: are you in a rush with this or can it wait until next week?
16:43  <aiko> no I'm not in a rush. thanks for the info!
16:44  <aiko> I'll see if I can reduce the layer size first. will let you know if we really need a bigger size (open a ticket for further discussion etc)
16:44  <jayme> ok, cool
16:45  <aiko> thank you :)

Hi,

So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space, of course, but the main issue here is that this is probably going to cause issues down the road anyway. So it is probably unsustainable long term.

While bumping tmpfs for /var/lib/nginx on the registry hosts to 10GB is quite possibly out of the question (I hope it is obvious why we aren't going to spend 40GB RAM and have to deal with the operational aftermath of such a decision) there is a bigger question as to what we want and can support here.

The registry has been designed from the start on the idea that images need to be lean and small, both to allow for fast deployments and to avoid timeouts, saturation of various resources, etc. To the point that when we met a similar issue in WikiKube (images there are up to 5GB, however layers are smaller; the limit is effectively on the layer size, not the image size), we spent a considerable amount of time investing in a peer-to-peer docker image distribution system. We did this out of necessity, and we do want to revisit the strategy for MediaWiki images once we are done with mw-on-k8s and we can support multiversion in the infrastructure, relieving MediaWiki from the need to do so.

A quick solution to the layer size problem would be to change the way the image is created. Different RUN directives create different layers, so in theory, if there are multiple pip packages to be installed, they can be split across many RUN directives. Normally one doesn't want many RUN directives, as the creation of extra layers causes inefficiencies, especially if there are many small layers. It could be a solution here, however, as splitting a 10GB image into a couple of 2GB to 3GB ones (with maybe a minor configuration change on the registries) would allow you to avoid the registry hurdle.
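For illustration only, a hypothetical split where the heavy wheel gets its own RUN (and hence its own layer); base image, ROCm version and requirements.txt are placeholders, and as noted further down this doesn't help when a single wheel already exceeds the limit:

cat > Dockerfile.split <<'EOF'
FROM docker-registry.wikimedia.org/bookworm:latest
RUN apt-get update && apt-get install -y python3-venv && rm -rf /var/lib/apt/lists/*
RUN python3 -m venv /opt/venv
# Layer with only torch (still several GB for the ROCm variant).
RUN /opt/venv/bin/pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/rocm6.0
# Separate, much smaller layer with everything else.
COPY requirements.txt .
RUN /opt/venv/bin/pip install --no-cache-dir -r requirements.txt
EOF
docker build -f Dockerfile.split -t split-layers:test .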

However:

  • It wouldn't solve the CI issues.
  • If Blubber is being used then this isn't an option, as one of the design goals of Blubber was to avoid exactly that (we did not anticipate a single layer reaching several GB, of course).
  • It's admittedly an ugly hack. It could work for a while, but it's not really sustainable long term either.
  • It's unpredictable and unclear when it breaks. We don't control the size of the images or the layers anyway; we can expect that we are one pytorch upgrade away from it breaking some carefully crafted image layer.

It might, however, help kick the ball down the road for a bit.

A more permanent solution would be to revisit the architectural decisions in the design of the registry. We've been planning to revisit our decisions past July 2024, depending on how the annual planning process ends up, because we need to see what other use cases we'll have to support for developers. @elukey what kind of timeline do you have in mind for seeing this addressed?

Hi! Thanks a lot for the response, replying inline :)

Hi,

So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space, of course, but the main issue here is that this is probably going to cause issues down the road anyway. So it is probably unsustainable long term.

Definitely it is a weird use case. Pytorch decided to ship their version for AMD GPUs with all the .so libs from ROCm that are needed for the most common use cases to run. This is, IIUC, fixed: every new release of Pytorch is bundled with a specific version of ROCm (see example). As mentioned in the task's description, trying to share those binaries from the k8s worker node is, atm, close to impossible (modulo a ton of patches to upstream etc., and even then the compatibility concerns between worker node OS and container OS would remain).

This is an example of what gets deployed:

somebody@d80a6ff978a9:/$ du -hs /opt/lib/python/site-packages/torch/lib/* | sort -h
8.0K	/opt/lib/python/site-packages/torch/lib/libroctx64.so
20K	/opt/lib/python/site-packages/torch/lib/libtorch_global_deps.so
[..cut..]
596M	/opt/lib/python/site-packages/torch/lib/librocsparse.so
693M	/opt/lib/python/site-packages/torch/lib/librocfft-device-0.so
696M	/opt/lib/python/site-packages/torch/lib/librocfft-device-1.so
715M	/opt/lib/python/site-packages/torch/lib/libtorch_hip.so
721M	/opt/lib/python/site-packages/torch/lib/librocfft-device-2.so
764M	/opt/lib/python/site-packages/torch/lib/librocfft-device-3.so
1.1G	/opt/lib/python/site-packages/torch/lib/librocsolver.so
1.4G	/opt/lib/python/site-packages/torch/lib/rocblas

I didn't find a way from Pytorch to deploy subsets of those libs, and honestly I am not 100% sure upstream will ever be open to it. The newest versions trimmed ~2-3GB from the libs, maybe by removing some duplicates; I need to investigate more (but the size is around 6-7GB anyway).

Pytorch will probably be one of the tools (if not the tool) that will power most of our newer models (like LLMs, for example), so finding a solution to this problem is critical for ML :(

While bumping tmpfs for /var/lib/nginx on the registry hosts to 10GB is quite possibly out of the question (I hope it is obvious why we aren't going to spend 40GB RAM and have to deal with the operational aftermath of such a decision) there is a bigger question as to what we want and can support here.

Definitely, I agree 100%. I am wondering if we could experiment with a different nginx instance (pointing to the same registry backend) that uses disk partitions for /var/lib/nginx, for use cases like this one. It will be much slower for sure and it doesn't resolve all your concerns, but I am raising the idea just as a thought.

The registry has been designed from the start on the idea that images need to be lean and small, both to allow for fast deployments and to avoid timeouts, saturation of various resources, etc. To the point that when we met a similar issue in WikiKube (images there are up to 5GB, however layers are smaller; the limit is effectively on the layer size, not the image size), we spent a considerable amount of time investing in a peer-to-peer docker image distribution system. We did this out of necessity, and we do want to revisit the strategy for MediaWiki images once we are done with mw-on-k8s and we can support multiversion in the infrastructure, relieving MediaWiki from the need to do so.

Are Dragonfly's supernodes shareable between clusters? We are interested in adding Dragonfly (it has been in our backlog for a long time), but I can't tell from the docs whether we should create a p2p network shared between clusters or not.

A quick solution to the layer size problem would be to change the way the image is created. Different RUN directives create different layers, so in theory, if there are multiple pip packages to be installed, they can be split across many RUN directives. Normally one doesn't want many RUN directives, as the creation of extra layers causes inefficiencies, especially if there are many small layers. It could be a solution here, however, as splitting a 10GB image into a couple of 2GB to 3GB ones (with maybe a minor configuration change on the registries) would allow you to avoid the registry hurdle.

However:

  • It wouldn't solve the CI issues.
  • If Blubber is being used then this isn't an option, as one of the design goals of Blubber was to avoid exactly that (we did not anticipate a single layer reaching several GB, of course).
  • It's admittedly an ugly hack. It could work for a while, but it's not really sustainable long term either.
  • It's unpredictable and unclear when it breaks. We don't control the size of the images or the layers anyway; we can expect that we are one pytorch upgrade away from it breaking some carefully crafted image layer.

Agree with all points :( the use case highlighted above wouldn't work even with different RUNs :(

A more permanent solution would be to revisit the architectural decisions in the design of the registry. We've been planning to revisit our decisions past July 2024, depending on how the annual planning process ends up, because we need to see what other use cases we'll have to support for developers. @elukey what kind of timeline do you have in mind for seeing this addressed?

We are currently in the process of buying hardware with GPUs for Lift Wing; in theory, in the next couple of quarters it would be nice to run GPUs in production for some models etc. We do have some GPUs on k8s now though, and we are currently doing tests to set up a good baseline (like implementing batching on GPUs, LLMs, etc.), and without Pytorch on ROCm we are really blocked :(

Definitely it is a weird use case. Pytorch decided to ship their version for AMD GPUs with all the .so libs from ROCm that are needed for the most common use cases to run. This is, IIUC, fixed: every new release of Pytorch is bundled with a specific version of ROCm (see example). As mentioned in the task's description, trying to share those binaries from the k8s worker node is, atm, close to impossible (modulo a ton of patches to upstream etc., and even then the compatibility concerns between worker node OS and container OS would remain).

It wouldn't be a very good architecture anyway, tightly coupling whatever gets deployed on the nodes to the code running in the containers. We've grudgingly done hostPath in MediaWiki for GeoIP, but admittedly that was data, not code (like in this case). Once we get a proper service that can be utilized instead, we'd happily move away from that approach.

This is an example of what gets deployed:

somebody@d80a6ff978a9:/$ du -hs /opt/lib/python/site-packages/torch/lib/* | sort -h
8.0K	/opt/lib/python/site-packages/torch/lib/libroctx64.so
20K	/opt/lib/python/site-packages/torch/lib/libtorch_global_deps.so
[..cut..]
596M	/opt/lib/python/site-packages/torch/lib/librocsparse.so

I am sorry, what? A .so that is 500MB? (and apparently it gets worse, with a .so that is 1.1G?) What are they even bundling in such shared object files? Are they stripped, btw? If they are not, stripping them might reduce the size.
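For reference, a quick way to check and, if applicable, strip them (paths taken from the listing above; worth verifying that torch still imports afterwards):

file /opt/lib/python/site-packages/torch/lib/librocsparse.so    # "not stripped" means symbol data is still there
strip --strip-unneeded /opt/lib/python/site-packages/torch/lib/*.so
du -hs /opt/lib/python/site-packages/torch/lib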

I didn't find a way from Pytorch to deploy subsets of those libs, and honestly I am not 100% sure upstream will ever be open to it. The newest versions trimmed ~2-3GB from the libs, maybe by removing some duplicates; I need to investigate more (but the size is around 6-7GB anyway).

Pytorch will probably be one of the tools (if not the tool) that will power most of our newer models (like LLMs, for example), so finding a solution to this problem is critical for ML :(

Unfortunately you will probably find down the road the registry failing you in other ways too.

While bumping tmpfs for /var/lib/nginx on the registry hosts to 10GB is quite possibly out of the question (I hope it is obvious why we aren't going to spend 40GB RAM and have to deal with the operational aftermath of such a decision) there is a bigger question as to what we want and can support here.

Definitely, I agree 100%. I am wondering if we could experiment with a different nginx instance (pointing to the same registry backend) that uses disk partitions for /var/lib/nginx, for use cases like this one. It will be much slower for sure and it doesn't resolve all your concerns, but I am raising the idea just as a thought.

This isn't very difficult to do, and we could also experiment with doing it on one of the existing registries. But the fact that this option exists and we haven't bothered raising it on purpose is indicative of the overall situation with the current registry implementation. There are a number of issues in the backlog regarding the registry that you will stumble upon:

  • We don't do any kind of periodic cleaning in the backend, as it is essentially non-functional: T242604, T335333, T354786, T242775
  • It's not redundant if the caching Redis fails: T215809
  • Redundancy is just across 2 AZs per DC.
  • Throughput isn't particularly high.
  • It's surprisingly easy to saturate all the backing infrastructure: T264209, https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest
  • The architecture has been in question for some time now: T209271
  • It has no clear organization of resources. Whatever users were added and whatever authn/authz exists was done as an afterthought and via minor "hacks" (as in smart ideas) that are organically accumulating. We make do each time a new need shows up, but the lack of "users" as a first-class concept is evident each time.
  • The homepage is a home-grown one-off: T179696
  • There are some interesting cross-DC sync issues which I can't find the task for.

Jokingly, in a video game this would be the first of the "gates" and each successive level is more difficult to handle.

Are Dragonfly's supernodes shareable between clusters? We are interested in adding Dragonfly (it has been in our backlog for a long time), but I can't tell from the docs whether we should create a p2p network shared between clusters or not.

@JMeybohm Care to answer this one? ^. My impression is no, but I am not sure.

Agree with all points :( the use case highlighted above wouldn't work even with different RUNs :(

I am not surprised :-(

A more permanent solution would be to revisit the architectural decisions in the design of the registry. We've been planning to revisit our decisions past July 2024, depending on how the annual planning process ends up, because we need to see what other use cases we'll have to support for developers. @elukey what kind of timeline do you have in mind for seeing this addressed?

We are currently in the process of buying hardware with GPUs for Lift Wing; in theory, in the next couple of quarters it would be nice to run GPUs in production for some models etc. We do have some GPUs on k8s now though, and we are currently doing tests to set up a good baseline (like implementing batching on GPUs, LLMs, etc.), and without Pytorch on ROCm we are really blocked :(

It probably makes sense then that we collaborate over the next quarters to make the situation with the registry better overall. Even if we do solve your immediate problem, that is, being unable to push, at those image sizes you'll probably meet issues when trying to download images, leading to stalled deployments, and we'll meet problems due to the layer sizes themselves down the line.

Are Dragonfly's supernodes shareable between clusters? We are interested in adding Dragonfly (it has been in our backlog for a long time), but I can't tell from the docs whether we should create a p2p network shared between clusters or not.

@JMeybohm Care to answer this one? ^. My impression is no, but I am not sure.

Yep, that's already on the move: T359416: Add Dragonfly to the ML k8s clusters

@akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :)

I have a proposal to unblock my team, let me know what you think about it. On the ML side, we are doing the following:

  • Try to reduce pytorch's size, figuring out whether we can drop something (for example, support fewer GPUs etc.). We are logging work in T359569, but I am not super confident that we'll be able to get a significant reduction without coming up with a very complicated and difficult-to-maintain custom build process (like a custom Python wheel to store somewhere, long build times to recreate pytorch when needed in CI, etc.); see the rough sketch after this list.
  • Deploy Dragonfly to ml clusters, basically already done (see subtask).
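For context, the kind of custom build process mentioned in the first bullet would look roughly like this. The steps approximate upstream's ROCm build flow and change between releases, gfx90a is only an example target architecture, and the ROCm toolchain, cmake and the Python build dependencies are assumed to be already installed:

git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
# Hipify the CUDA sources for ROCm (script shipped in the pytorch repo).
python3 tools/amd_build/build_amd.py
# Build a wheel targeting a single GPU architecture instead of every supported one.
USE_ROCM=1 PYTORCH_ROCM_ARCH=gfx90a python3 setup.py bdist_wheel
ls dist/    # resulting custom wheel, to be stored somewhere and reused in CI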

After some reading I understood that the main problem is the wire size of the Docker image layers, not the size that docker history or similar commands report locally. So I did a little test, using docker save to create a .tar file containing one of the "problematic" ML Docker images, and I got a ~10GB file. Then I gzipped it, and the file became ~2.1GB. Of course this test is not great since we lose the layer granularity, but I wanted to get a ballpark measure of the total compressed size (also, in the aforementioned Docker image there is a layer of ~10GB that basically accounts for the whole image size, so the approximation is not totally off). IIUC, and if the above is correct, we are not able to push that image because the registry's tmpfs is 2GB and we want to upload a file/layer that, *compressed*, is a little more than 2GB. One thing that could work in the short to medium term might be:

  • We add +2GB of RAM to all the registry VMs, plus we increase their /var/lib/nginx tmpfs size to 4G (see the sketch after this list).
  • ML works on a base image (in production-images, for example) that contains the ROCm Pytorch package deployed, which we'll use in Blubber for all our Pytorch use cases.
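For clarity, the first bullet boils down to something like this on a registry host (illustrative only; in reality the tmpfs size would be managed via Puppet):

mount -o remount,size=4G /var/lib/nginx
df -h /var/lib/nginx    # confirm the new tmpfs size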

The latter, together with Dragonfly, should in theory significantly reduce the risk of hammering the Registry, and it would allow ML to avoid duplicating pytorch deployments across various layers/images (not shared etc.). I am aware that the +2GB bump is only a band-aid, because you may object that ML could come back in 3 months' time with another +2GB request, and this is totally true. My point is that there seems to be no easy solution, and this compromise would allow ML to keep testing their images on GPUs etc., and in the meantime we could work together to come up with a long-term strategy for the Registry. Let me know if the above makes some sense; I am not very proud of the proposed solution but I can't think of a better idea :(

@akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :)

I have a proposal to unblock my team, let me know what you think about it. On the ML side, we are doing the following:

  • Try to reduce pytorch's size, figuring out whether we can drop something (for example, support fewer GPUs etc.). We are logging work in T359569, but I am not super confident that we'll be able to get a significant reduction without coming up with a very complicated and difficult-to-maintain custom build process (like a custom Python wheel to store somewhere, long build times to recreate pytorch when needed in CI, etc.).

Thanks for filing that task. It impressed me that somehow there is a set of stripped .so files that are hundreds of MB in size.

  • Deploy Dragonfly to ml clusters, basically already done (see subtask).

Cool.

After some reading I understood that the main problem is the wire size of the Docker image layers, not the size that docker history or similar commands report locally. So I did a little test, using docker save to create a .tar file containing one of the "problematic" ML Docker images, and I got a ~10GB file. Then I gzipped it, and the file became ~2.1GB. Of course this test is not great since we lose the layer granularity, but I wanted to get a ballpark measure of the total compressed size (also, in the aforementioned Docker image there is a layer of ~10GB that basically accounts for the whole image size, so the approximation is not totally off).

Your understanding is mostly correct. Indeed, what the registry sees is the compressed blobs, of which, as you point out, there is no easy/clear way to get the size (the registry does have it, but a local docker daemon doesn't).
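For reference, two ways to look at the compressed sizes (image name and tag below are placeholders; the first needs curl and jq):

# Compressed layer sizes as stored by the registry (Docker Registry HTTP API v2).
curl -s -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
  https://docker-registry.wikimedia.org/v2/SOME/IMAGE/manifests/SOME_TAG | jq '.layers[].size'
# Ballpark total for a local image, as in the test above.
docker save SOME/IMAGE:SOME_TAG | gzip | wc -c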

There is a complication that you'll probably avoid thanks to dragonfly+SSDs. It might manifest like this:

  1. Deployment starts. The timeout is set to the default of 600 seconds
  2. Nodes rush to fetch the images from the registry
  3. The registry nodes are bottlenecked and can only send traffic to the nodes at a low rate
  4. All nodes finally get the entire image, at varying times, but by then we are into the 500+ seconds area.
  5. Nodes try to decompress the layers, but this is a process that is single-threaded and can't be rushed. It happens as fast as 1 CPU and the IO layer are able to sustain.
  6. The 600 seconds threshold is reached, deployment is marked as failed and rolled back
  7. You try again, it succeeds this time around. You start scratching your head, but shake it off
  8. The next deployer with a deployment that necessitates a new fetch of that layer meets the exact same issue and opens a task.

Again, chances are that dragonfly+SSDs will make things fast enough to avoid the above pitfall.
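If the 600 seconds in step 1 refers to the Deployment's progressDeadlineSeconds (600 is the Kubernetes default; this is my assumption about which timeout is meant), it can be inspected like this:

# Namespace and deployment names are placeholders.
kubectl -n SOME-NAMESPACE get deployment SOME-DEPLOYMENT \
  -o jsonpath='{.spec.progressDeadlineSeconds}'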

IIUC, and if the above is correct, we are not able to push that image because the registry's tmpfs is 2GB and we want to upload a file/layer that, *compressed*, is a little more than 2GB. One thing that could work in the short to medium term might be:

  • We add +2GB of RAM to all the registry VMs, plus we increase their /var/lib/nginx tmpfs size to 4G.
  • ML works on a base image (in production-images, for example) that contains the ROCm Pytorch package deployed, which we'll use in Blubber for all our Pytorch use cases.

That would work.

The latter, together with Dragonfly, should in theory significantly reduce the risk of hammering the Registry, and it would allow ML to avoid duplicating pytorch deployments across various layers/images (not shared etc.). I am aware that the +2GB bump is only a band-aid, because you may object that ML could come back in 3 months' time with another +2GB request, and this is totally true.

I promise I will object to another increase, but this should unblock you for now.

My point is that there seems to be no easy solution, and this compromise would allow ML to keep testing their images on GPUs etc., and in the meantime we could work together to come up with a long-term strategy for the Registry.

There is no easy solution, for sure. Overall, the registry requires some more work from all of us, as it is becoming increasingly important to our workloads, and we should find the time to figure out what we can do about it.

Let me know if the above makes some sense; I am not very proud of the proposed solution but I can't think of a better idea :(

It's good enough to unblock the team; we'll have to make do for now.

All subtasks completed, wrapping up the task. Thanks to all for the feedback/help/support! <3