Python torch fills disk of CI Jenkins instances
Open, Needs Triage, Public

Description

The Python torch project is a dependency of some of the PipelineLib jobs. Its disk usage is large enough that it ends up filling the disk of a WMCS CI instance, causing the instance to be put offline:

13:45:31 <wmf-insecte> maintenance-disconnect-full-disks build 497950 integration-agent-docker-1036 (/: 29%, /srv: 10%, /var/lib/docker: 95%): still OFFLINE due to disk space

Checking on integration-agent-docker-1036:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
...
Build Cache     85        0         20.92GB   20.92GB

Running it with -v and grepping for GB, I found a single large entry responsible for most of the disk usage:

yzaiei2cl172   regular        13.9GB    3 hours ago         3 hours ago         1         false
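
For reference, that lookup is roughly:

$ sudo docker system df -v | grep GB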

Event Timeline

Looking at the diff overlay, which is at /var/lib/docker/overlay2/yzaiei2cl172qsj37gazfomm2/diff, turns up something interesting:

1466 ./diff/home/somebody/.cache/pip/http/2/2/3/0/c

Which is some dependency fetched by pip. The start of the file has the PK (zip) signature and mentions functorch/_C.cpython-39-x86_64-linux-gnu.so. Not sure why it is 1466 MBytes though.
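
For illustration, the figures above presumably come from a du over the layer's diff directory, along these lines (the exact cached file name is elided here):

$ cd /var/lib/docker/overlay2/yzaiei2cl172qsj37gazfomm2
$ sudo du -m ./diff/home/somebody/.cache/pip/http | sort -n | tail -n 5
$ sudo head -c4 ./diff/home/somebody/.cache/pip/http/2/2/3/0/c/&lt;cached file&gt; | od -c   # "P K" confirms a zip/wheel payload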

And once expanded into the application installation (sizes in MBytes):

...
218	opt/lib/python/site-packages/triton/_C
254	opt/lib/python/site-packages/triton/third_party
472	opt/lib/python/site-packages/triton
8465	opt/lib/python/site-packages/torch/lib
8563	opt/lib/python/site-packages/torch

Deeper inspection of that torch/lib:

-rwxr-xr-x  1 65533 65533 1.1G Jun  7 08:56 librocsolver.so
-rwxr-xr-x  1 65533 65533 764M Jun  7 08:56 librocfft-device-3.so
-rwxr-xr-x  1 65533 65533 721M Jun  7 08:56 librocfft-device-2.so
-rwxr-xr-x  1 65533 65533 715M Jun  7 08:56 libtorch_hip.so
-rwxr-xr-x  1 65533 65533 696M Jun  7 08:56 librocfft-device-1.so
-rwxr-xr-x  1 65533 65533 693M Jun  7 08:56 librocfft-device-0.so
-rwxr-xr-x  1 65533 65533 596M Jun  7 08:56 librocsparse.so
-rwxr-xr-x  1 65533 65533 497M Jun  7 08:56 libtorch_cpu.so
-rwxr-xr-x  1 65533 65533 390M Jun  7 08:56 libMIOpen.so
-rwxr-xr-x  1 65533 65533 284M Jun  7 08:56 librocblas.so
-rwxr-xr-x  1 65533 65533 241M Jun  7 08:56 libmagma.so
-rwxr-xr-x  1 65533 65533 159M Jun  7 08:56 librccl.so
-rwxr-xr-x  1 65533 65533 127M Jun  7 08:56 libamd_comgr.so

Plus the rocblas/library directory, which is another 1.4 GBytes...

That looks like something related to machine learning.

The Build Cache belongs to BuildKit, which is "hidden" from the regular docker commands but can be acted on via docker buildx.

From docker buildx du --verbose:

ID:             yzaiei2cl172qsj37gazfomm2
Created at:     2023-06-07 08:53:08.069562681 +0000 UTC
Mutable:        true
Reclaimable:    true
Shared:         false
Size:           13.89GB
Description:    mount / from exec /bin/sh -c python3 "-m" "pip" "wheel" "-r" "bloom/model-server/requirements.txt" && python3 "-m" "pip" "install" "--target" "/opt/lib/python/site-packages" "-r" "bloom/model-server/requirements.txt"
Usage count:    1
Last used:      4 hours ago
Type:           regular

So the huge torch comes from some bloom/model-server/requirements.txt which comes from https://gerrit.wikimedia.org/g/machinelearning/liftwing/inference-services and:

bloom/model-server/requirements.txt
accelerate==0.19.0
einops==0.6.1
kserve==0.10.0
--extra-index-url https://download.pytorch.org/whl/rocm5.4.2
torch==2.0.1
transformers==4.28.1

Which probably leads to https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/927733 (which adds the --extra-index-url).

I guess we might need the Release Pipeline (Blubber) to clear out the pip cache, and a solution to be able to hold such a large build layer (it is larger than a fully installed MediaWiki integration test job).
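
For instance, a variant of the install command from the build record above would skip pip's HTTP cache entirely (a sketch of the idea, not what the pipeline runs today; the PIP_NO_CACHE_DIR environment variable achieves the same thing):

$ python3 -m pip install --no-cache-dir --target /opt/lib/python/site-packages -r bloom/model-server/requirements.txt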

Mentioned in SAL (#wikimedia-releng) [2023-06-07T12:43:52Z] <hashar> integration-agent-docker-1036: docker buildx prune to reclaim 21G of disk space # T338317

Some info related to the above change: We have switched to this specific pytorch build (the one defined via the extra-index-url) in order to add GPU support to some models.

hashar claimed this task.
hashar added a subscriber: elukey.

I do not know how large the layer was before that change to install pytorch from https://download.pytorch.org/whl/rocm5.4.2. Regardless, the layer is 13GB while the WMCS instance partition for Docker is 24G.

We have another 36G partition for the Jenkins jobs workspace, which is where the repositories are cloned and the results of installations are written. It is that size in order to hold 3 concurrent builds of a full MediaWiki + extensions installation.

I merely filed this task to take notes on my findings and inform the Machine Learning team about it. If there are ways to somehow shrink the torch install that would be great (and would result in a smaller image). But the real solution is probably to throw more disk space at the problem, since we have other cases leading to the disks filling up and they would be addressed this way as well.

There is no disk space available in LVM so we need to build larger instances (which I want to do eventually for other reasons).
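
That can be double checked with something like:

$ sudo vgs    # a VFree of (almost) zero means no extents are left to grow the Docker volume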

I am marking this resolved since I chatted a bit with @elukey about it and @isarantopoulos has read the task which was my original intention :]

Reopening since the disks keep filling up; I have also reopened the task to resize the instances (T340070).

Not sure when I can resize the instances; regardless, we can dig a bit and find a way to reduce the layer size.

On a filled instance I go with docker buildx du | head, which lists the largest layers (they are ordered by size). That gives the ID of the layer:

ID						RECLAIMABLE	SIZE		LAST ACCESSED
s4gy9daxeees4v9df6t8yosz1*              	true 		14.09GB   	6 hours ago

Then I find the diff overlay for that ID with find /var/lib/docker -maxdepth 3 | grep s4gy9daxeees4v9df6t8yosz1, e.g. /var/lib/docker/overlay2/s4gy9daxeees4v9df6t8yosz1/diff.

After some du:

pip has a 1.5GB HTTP cache file: ./home/somebody/.cache/pip/http/2/2/3/0/c/2230c39dc3629ed3e84c4f13f9b51c5b78ae7710dc9485b3a39ea266 . Thus I suspect we could ask Blubber to delete the cache after installation.

9864 ./opt/lib/python/site-packages

Most of that from:

472 ./opt/lib/python/site-packages/triton
1433 ./opt/lib/python/site-packages/torch/lib/rocblas
8465 ./opt/lib/python/site-packages/torch/lib
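
Condensed, the whole triage sequence is roughly (adjust the layer ID and strip the trailing asterisk shown by buildx):

# 1. list the BuildKit cache records, largest first
$ sudo docker buildx du | head
# 2. locate the overlay directory backing a given record
$ sudo find /var/lib/docker -maxdepth 3 | grep s4gy9daxeees4v9df6t8yosz1
# 3. see what fills it, sizes in MBytes
$ sudo du -m /var/lib/docker/overlay2/s4gy9daxeees4v9df6t8yosz1/diff | sort -n | tail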

Then I guess you need everything from torch / rocblas? :-\

Mentioned in SAL (#wikimedia-releng) [2023-06-27T20:38:53Z] <hashar> integration-agent-docker-1035: docker buildx prune T338317

The intended use of this image is with GPUs, so torch/lib/rocblas is definitely needed for operations on AMD GPUs. The rocm pytorch (torch) package itself is 1.4 GB, which is why we start with such a big image.
Also +1 from me for deleting the cache. I'll look into it.

There isn't any cache in the image since only the specific files are copied into the production variant. Inspecting with du inside the image I see this:

root@e87fcadcaac0:/opt/lib/python/site-packages/torch/lib# du -sh -- * | sort -h -r
1.4G	rocblas
1.1G	librocsolver.so
764M	librocfft-device-3.so
721M	librocfft-device-2.so
715M	libtorch_hip.so
696M	librocfft-device-1.so
693M	librocfft-device-0.so
596M	librocsparse.so
497M	libtorch_cpu.so

with the torch directory taking 8.4 GB of space. I don't see us being able to remove anything from the image at the moment. In the future we could think of alternative ways if we stick to a stable torch + rocm version (at least for a while).

@hashar is there a clean up command that we (ML SREs) can run to help with the cleanup while we wait for bigger partitions? (to avoid pinging you every time, I know it is annoying, sorry :( )

I more or less fire fought some of them by heading to the instance and issuing sudo docker buildx prune --force.

Yesterday I cleaned them all using the local cumin, which amounts to:

ssh integration-cumin.integration.eqiad1.wikimedia.cloud sudo cumin --force 'name:docker' 'docker buildx prune --force'

Thanks for the investigation, at least it confirms we can't really shrink that huge layer :] I am wondering how the 1.4G downloaded pytorch ends up occupying so much disk space; I am guessing those .so files are easily compressible.
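
A quick way to check that hunch against one of the large libraries listed above (a sketch):

$ stat --format=%s librocsolver.so            # size on disk, in bytes
$ gzip -9 --stdout librocsolver.so | wc -c    # size once gzip-compressed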

I reopened this merely to investigate whether the pytorch layer could be shrunk somehow, but that does not seem to be easily possible (if at all). Thus I am marking this task resolved again; the real resolution would be to grow the instance disk space, which is T340070: Rebuild WMCS integration instances to larger flavor.

As an update, I rebuilt all the Jenkins agent instances last week. The disk space allocated to Docker went from 24G to 45G and we have not had any alarms since then. So I guess that is solved for now :-]

And that broke multiple CI Jenkins agents again:

hashar@integration-agent-docker-1047:~$ sudo docker buildx du
ID						RECLAIMABLE	SIZE		LAST ACCESSED
yjbqaqxsadir0u8xlycebw4f1*              	true 		14.9GB    	About an hour ago
qnzfpjgzzierjqd48iqwrepqa*              	true 		13.91GB   	3 hours ago
yapuknx0q5aq06vn1k8bepbp2               	true 		10.11GB   	3 hours ago
s1g5p8yj0ombh9w9uail4ncsb*              	true 		5.304GB   	About an hour ago
xjlgpqvh2skmjjizml1r6ch7w               	true 		364.3MB   	About an hour ago

Those builds are too large; we need a better solution. Maybe move those jobs to a dedicated instance, or find a way for PipelineLib to clear the build cache upon completion.
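
For example, a post-build cleanup step on the agents could cap the cache rather than wipe it (hypothetical, not something PipelineLib does today; --keep-storage requires a reasonably recent Docker/buildx):

$ sudo docker buildx prune --force --keep-storage 10GB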

Mentioned in SAL (#wikimedia-releng) [2024-03-12T15:00:18Z] <hashar> integration: clearing Docker build cache on all Jenkins agents due to T338317 | sudo cumin --force 'name:docker' 'docker buildx prune --force'

From what I remember about PipelineLib, the idea was to keep some kind of layer cache to speed up future builds. Although the cache differs between instances, it gives a bit of a speed boost when a build happens on an agent that has already built that specific image/repo. I guess that was fine until pytorch came into play with a large layer, which is sometimes duplicated because a parent layer is slightly different.

On top of the build cache cleanup, we might be able to move the jobs for that repository to a dedicated Jenkins agent, possibly a VM with a bit more disk.