
Use rocm/vllm image on Lift Wing
Open, Needs Triage, Public

Description

As an engineer,
I'd like to use the optimized rocm/vllm image on Lift Wing, so that I can utilize the required software without having to build and maintain everything on our own.

After some work done in T370149: [LLM] Use vllm for ROCm in huggingface image, it seems that building and maintaining our own set of packages poses significant challenges and workload for us. At the same time it doesn't seem like an ideal choice for production, as the frequent changes in this ecosystem of packages make us prone to errors and incompatibilities among dependencies.

The latest image (rocm/vllm:v710inference_rocm6.3-release_ubuntu22.04_py3.10_pytorch_release-2.6) is a great improvement in size, as it is only 7.7GB compressed while the previous ones were >20GB. It is important to note here that our base pytorch images are ~18GB, so if the resulting image size doesn't go above 10GB it would offer a significant reduction in image sizes.
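As a side note, the compressed size can be checked without pulling the image by summing the layer sizes in the registry manifest; a minimal sketch, assuming a docker CLI with manifest support and a single-architecture image (the jq expression is mine):

  docker manifest inspect rocm/vllm:v710inference_rocm6.3-release_ubuntu22.04_py3.10_pytorch_release-2.6 \
      | jq '[.layers[].size] | add / 1e9'   # total compressed size in GB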

The image has the following software versions: (full pip freeze result)

software   version
ROCm       6.3
torch      2.6.0 (built from source for ROCm)
vllm       0.6.5 (built from source for ROCm)
triton     3.0.0 (built from source for ROCm)

The dockerfiles for this image are defined in the ROCm/vllm repo: the rocm-vllm dockerfile and the rocm/vllm-dev dockerfile. The rocm/vllm image uses rocm/vllm-dev as its base image.

Related Objects

Event Timeline

The image rocm/vllm:v710inference_rocm6.3-release_ubuntu22.04_py3.10_pytorch_release-2.6 I mentioned above has disappeared from dockerhub and the link results in a 404 error.

There is a new image available: rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6. It is 7.73GB compressed on dockerhub. It is based on Ubuntu 22.04 and uses python 3.12. Although the tag name mentions mi300, all build instructions use PYTORCH_ROCM_ARCH=gfx90a;gfx942, indicating that both MI210 & MI300 are supported (gfx90a is the architecture for our current MI210).
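As a sanity check, the gfx target of the GPUs on a host can be confirmed with rocminfo (a sketch; the rocminfo path may differ per installation):

  /opt/rocm/bin/rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u   # expect gfx90a on our MI210 hosts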
The image has vllm & flash attention installed, as shown in the pip freeze output below.

pip freeze
DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/HipMarker-1.0-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/rpdTracer-1.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
accelerate==1.3.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
airportsdata==20241001
amdsmi @ file:///install/amdsmi-24.7.1%2B0012a68-py3-none-any.whl#sha256=ff16869c4a9b1e21ef3ba487f185779562133391e2acc43e523b2df2a8c45e47
annotated-types==0.7.0
anyio==4.8.0
astor==0.8.1
attrs==24.3.0
awscli==1.37.5
blake3==1.0.2
blinker==1.4
boto3==1.36.5
botocore==1.36.5
certifi==2024.12.14
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
cmake==3.31.4
colorama==0.4.6
compressed-tensors==0.8.1
cryptography==3.4.8
Cython==3.0.11
datasets==3.2.0
dbus-python==1.2.18
depyf==0.18.0
dill==0.3.8
diskcache==5.6.3
distro==1.7.0
distro-info==1.1+ubuntu0.2
docutils==0.16
einops==0.8.0
fastapi==0.115.7
filelock==3.17.0
flash_attn @ file:///install/flash_attn-2.7.2-cp312-cp312-linux_x86_64.whl#sha256=a89d56e6d554adf7a537d47a7ef38787ebe00fe5462f0adcd2abd4393cadffbd
frozenlist==1.5.0
fsspec==2024.9.0
gguf==0.10.0
h11==0.14.0
HipMarker==1.0
hiredis==3.1.0
httpcore==1.0.7
httplib2==0.20.2
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
iniconfig==2.0.0
inquirerpy==0.3.4
interegular==0.3.3
jeepney==0.7.1
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
keyring==23.5.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libnacl==2.1.0
lm-format-enforcer==0.10.9
MarkupSafe==3.0.2
mistral_common==1.5.2
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
ninja==1.11.1.3
numpy==1.26.4
oauthlib==3.2.0
openai==1.60.0
opencv-python-headless==4.11.0.86
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
partial-json-parser==0.2.1.1.post5
peft==0.14.0
pfzy==0.3.4
pillow==10.4.0
pluggy==1.5.0
prettytable==3.12.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pyarrow==19.0.0
pyasn1==0.6.1
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.5
pydantic_core==2.27.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
pytest==8.3.4
pytest-asyncio==0.25.2
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.2
PyYAML @ file:///install/PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=80bab7bfc629882493af4aa31a4cfa43a4c57c83813253626916b8c7ada83476
pyzmq==26.2.0
ray==2.41.0
redis==5.2.1
referencing==0.36.1
regex==2024.11.6
requests==2.32.3
rocpd @ file:///app/rocmProfileData/rocpd_python
rpds-py==0.22.3
rpdTracer==1.0
rsa==4.7.2
s3transfer==0.11.2
safetensors==0.5.2
SecretStorage==3.3.1
sentencepiece==0.2.0
setuptools==75.8.0
setuptools-scm==8.1.0
six==1.17.0
sniffio==1.3.1
starlette==0.45.2
sympy==1.13.1
tensorizer==2.9.1
tiktoken==0.7.0
tokenizers==0.21.0
torch @ file:///install/torch-2.7.0a0%2Bgit3a58512-cp312-cp312-linux_x86_64.whl#sha256=5cb93ea595414542f86d4f2e70fb9c26fbac8b46db0f356cfbbd2255ba2a8e4c
torchvision @ file:///install/torchvision-0.19.1a0%2B6194369-cp312-cp312-linux_x86_64.whl#sha256=3e3aecaca2933ed8ff817ba2ef872c22aa3f4491124bc07f64ef7101de6a909e
tqdm==4.67.1
transformers==4.48.1
triton @ file:///install/triton-3.2.0%2Bgite5be006a-cp312-cp312-linux_x86_64.whl#sha256=ba318f1621450495b74353cf0c544ae5ac9488624f6304823a2a908045029088
typing_extensions==4.12.2
tzdata==2025.1
unattended-upgrades==0.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm @ file:///install/vllm-0.6.7.dev220%2Bg84f5d47b.rocm631-cp312-cp312-linux_x86_64.whl#sha256=8da3ad2847178bcfba58ad12e7a2fcf4ca2dcfc522c9cf291a7dfec2a439a204
# Editable install with no version control (vllm_test_utils==0.1)
-e /app/vllm/tests/vllm_test_utils
wadllib==1.3.6
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.2
wheel==0.45.1
xgrammar==0.1.10
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0

Looking at the compressed layer sizes on Dockerhub, the largest layer is 5.5GB, which would not allow us to push the image to the registry under the current limitations.
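The per-layer compressed sizes can also be read from the manifest, e.g. to list the layers that blow past the limit; a sketch with a jq filter of my own:

  docker manifest inspect rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 \
      | jq '.layers[].size | select(. > 4e9)'   # layer sizes in bytes above the 4GB cap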

At the moment I can think of 4 options to unblock this work:

  1. Use the upstream-provided docker image as is
  2. Build our own image based on Debian bookworm. Building a modified image from the original instructions poses a challenge, as it requires significant resources: I had some failures running out of memory (using 40GB) while trying to build the vllm-dev image (rocm_base dockerfile).
  3. Use a pytorch base image as we do now and use a different CI process to build the additional wheels for our GPUs. The wheels we need are: flash-attention, vllm, triton.
  4. Build the upstream docker images (which are based on Ubuntu) on one of our hosts (e.g. ml-lab), since the build process requires a lot of memory, and push them to the registry (base dockerfile, vllm dockerfile). A rough sketch follows after this list.
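To make Option 4 concrete, the sketch below shows roughly what the two builds could look like on ml-lab. The dockerfile names come from the upstream repo and the proxy build args mirror the webproxy settings visible in the layer history further down; everything else (tags, clone path) is an assumption:

  git clone https://github.com/ROCm/vllm.git && cd vllm
  # base image: ROCm + torch + triton + flash-attention (this is the memory-hungry part)
  docker build -f Dockerfile.rocm_base \
      --build-arg http_proxy=http://webproxy:8080 \
      --build-arg https_proxy=http://webproxy:8080 \
      --build-arg PYTORCH_ROCM_ARCH="gfx90a;gfx942" \
      -t vllm:rocm-base .
  # final image: vllm built on top of the base
  docker build -f Dockerfile.rocm --build-arg BASE_IMAGE=vllm:rocm-base -t vllm:rocm .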

We want to proceed with Option 1, as Options 2 and 3 will require significant effort to build and update every time a new release of any of the underlying components is needed (ROCm or any of the python packages).
In all of the options we face the same limitation: the docker registry's compressed layer size limit (4GB). A recent update to the base torch image failed (T385531#10521237) and the patch for it was reverted.

@akosiaris

  • What is your opinion on us hosting the upstream docker image?
  • Is there any way we can overcome the registry layer size limitation for this specific image? I see two potential ways in which the image could be used:
    1. use the upstream image as a base image and build our own image on top of it. This would allow us to install kserve and anything else we want, but it means building more big images.
    2. use the upstream image as is. This would allow us to maintain only one image variant/tag, which seems friendlier to the docker registry. The downside is that it doesn't allow us to customize the code, which means we'll need to figure out what to do in our charts to serve models using plain vllm (and not through kserve); a rough sketch of what that could look like follows below.
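A minimal sketch of the second variant, running vllm's own OpenAI-compatible server straight from the image (the model name and port are placeholders, and the device mappings are the usual ones for ROCm containers):

  docker run --rm -p 8000:8000 \
      --device /dev/kfd --device /dev/dri \
      rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 \
      python3 -m vllm.entrypoints.openai.api_server --model <org/model-name> --port 8000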

cc: @elukey

@isarantopoulos thanks for the summary, I have to say that using the upstream docker image directly (even if importing it to our registry) is not a great option: we wouldn't really be able to vet what's running on it and it could represent a big security issue. I totally get the pain that you are going through now, but something like option 2/3 is surely better from the SRE point of view. I glanced at the paste with the failures and it seems related to clang being killed while compiling, so maybe we could find a place with a lot of RAM to build it instead? For example, if it builds on ml-lab1001 we could think about allowing it to push images to the registry (we only allow gitlab and build2001 atm, but this is a completely new use case). If you want we can have a quick meeting when I am back from the summit and explore new options, lemme know.

For the registry limitation - the limit is on nginx and it is a generic one, so we cannot bypass it for specific images in my opinion. There is always the option to raise it, but it is a never-ending game and it could put some strain on the registry's capabilities (affecting other workloads like mediawiki etc.). I'll have a chat with Alex about it!
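For context, a layer push is one large HTTP upload, so the cap is presumably enforced by something like nginx's client_max_body_size directive on the registry hosts; a hypothetical way to locate it (directive name assumed):

  grep -rn client_max_body_size /etc/nginx/   # hypothetical: find the upload size cap on a registry host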

@elukey Thank you for the response!

I understand your concern about Option 1 and hosting the prebuilt image directly from dockerhub.
I think that Options 2 and 3 will require a significant maintenance cost that our team will not be able to keep up with.
Based on your suggestion to build the image on one of our hosts (ml-lab, for example), I added a 4th option to the list above: building the image ourselves based on the dockerfiles provided by AMD.

I don't know what we can do with the registry limitation which seems to be a bottleneck in all of the solutions. Let's have a meeting when you are back and figure out a solution together!

I've managed to build the base vllm image on ml-lab and it is 34GB.
Compressing the image brings it down to 7.0GB.
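One way to approximate the compressed size locally (the registry compresses per layer, so this is only an estimate):

  docker save vllm:rocm-base | gzip | wc -c   # bytes of the gzipped image tarball, ~7.0GB here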

I'm proceeding to also build the final image which is based on the vllm-base one and will report back here.

I've replicated building both images locally on ml-lab using the Dockerfiles. The final image is 35.7GB uncompressed and 7.6GB compressed (as expected, since this is the same as the upstream image we are discussing).

Below we can see the uncompressed layer sizes for both images. I'm going to get the full created_by column as well, but I have to modify the output a bit to paste it here because the --no-trunc argument produces messy output.
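A possible way to capture the full column in one go is docker history's --format flag:

  docker history --no-trunc --format 'table {{.Size}}\t{{.CreatedBy}}' vllm:rocm-base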

docker history vllm:rocm-base
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
1f3d11408597   19 hours ago   RUN |13 BASE_IMAGE=rocm/dev-ubuntu-22.04:6.3…   504B      buildkit.dockerfile.v0
<missing>      19 hours ago   ARG FA_REPO                                     0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG FA_BRANCH                                   0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_VISION_REPO                         0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_REPO                                0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_VISION_BRANCH                       0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_BRANCH                              0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG TRITON_REPO                                 0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG TRITON_BRANCH                               0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG RCCL_REPO                                   0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG RCCL_BRANCH                                 0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG LEGACY_HIPBLASLT_OPTION                     0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG HIPBLASLT_BRANCH                            0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG BASE_IMAGE                                  0B        buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   1.53GB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   5.25MB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   660MB     buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c dpkg -i /install/*deb     && …   70.7MB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c dpkg -i /install/*deb     && …   1.68GB    buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTPS_PROXY=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTP_PROXY=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV https_proxy=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV http_proxy=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   129MB     buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   258MB     buildkit.dockerfile.v0
<missing>      21 hours ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      21 hours ago   WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTHON_VERSION=3.12                         0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV PYTORCH_ROCM_ARCH=gfx90a;gfx942             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV ROCM_PATH=/opt/rocm                         0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV PATH=/opt/rocm/llvm/bin:/usr/local/sbin:…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTPS_PROXY=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTP_PROXY=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV https_proxy=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV http_proxy=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   1.67kB    buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   29.6GB    buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   60B       buildkit.dockerfile.v0
<missing>      2 months ago   ARG APT_PREF                                    0B        buildkit.dockerfile.v0
<missing>      2 months ago   ARG AMDGPU_VERSION=5.3                          0B        buildkit.dockerfile.v0
<missing>      2 months ago   ARG ROCM_VERSION=5.3                            0B        buildkit.dockerfile.v0
<missing>      2 months ago   LABEL maintainer=dl.mlsedevops@amd.com          0B        buildkit.dockerfile.v0
<missing>      5 months ago   /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B
<missing>      5 months ago   /bin/sh -c #(nop) ADD file:ebe009f86035c175b…   77.9MB
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B
<missing>      5 months ago   /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH     0B
<missing>      5 months ago   /bin/sh -c #(nop)  ARG RELEASE                  0B
docker history vllm:rocm
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
89065af75ff7   2 hours ago    CMD ["/bin/bash"]                               0B        buildkit.dockerfile.v0
<missing>      2 hours ago    ENV HIP_FORCE_DEV_KERNARG=1                     0B        buildkit.dockerfile.v0
<missing>      2 hours ago    ENV TOKENIZERS_PARALLELISM=false                0B        buildkit.dockerfile.v0
<missing>      2 hours ago    ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVI…   0B        buildkit.dockerfile.v0
<missing>      2 hours ago    COPY /examples /app/vllm/examples # buildkit    371kB     buildkit.dockerfile.v0
<missing>      2 hours ago    COPY /benchmarks /app/vllm/benchmarks # buil…   394kB     buildkit.dockerfile.v0
<missing>      2 hours ago    ARG COMMON_WORKDIR                              0B        buildkit.dockerfile.v0
<missing>      2 hours ago    RUN |1 BUILD_RPD=1 /bin/sh -c cd /install   …   1.32GB    buildkit.dockerfile.v0
<missing>      2 hours ago    RUN |1 BUILD_RPD=1 /bin/sh -c if [ ${BUILD_R…   17.5MB    buildkit.dockerfile.v0
<missing>      2 hours ago    ARG BUILD_RPD                                   0B        buildkit.dockerfile.v0
<missing>      2 hours ago    RUN /bin/sh -c python3 -m pip install --upgr…   13.3MB    buildkit.dockerfile.v0
<missing>      2 hours ago    RUN /bin/sh -c case "$(which python3)" in   …   0B        buildkit.dockerfile.v0
<missing>      2 hours ago    RUN /bin/sh -c python3 -m pip install --upgr…   1.78kB    buildkit.dockerfile.v0
<missing>      2 hours ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      2 hours ago    ARG COMMON_WORKDIR                              0B        buildkit.dockerfile.v0
<missing>      2 hours ago    RUN |1 ARG_PYTORCH_ROCM_ARCH= /bin/sh -c apt…   0B        buildkit.dockerfile.v0
<missing>      2 hours ago    RUN |1 ARG_PYTORCH_ROCM_ARCH= /bin/sh -c pyt…   346kB     buildkit.dockerfile.v0
<missing>      2 hours ago    RUN |1 ARG_PYTORCH_ROCM_ARCH= /bin/sh -c apt…   362MB     buildkit.dockerfile.v0
<missing>      2 hours ago    ENV PYTORCH_ROCM_ARCH=gfx90a;gfx942             0B        buildkit.dockerfile.v0
<missing>      2 hours ago    ARG ARG_PYTORCH_ROCM_ARCH                       0B        buildkit.dockerfile.v0
<missing>      19 hours ago   RUN |13 BASE_IMAGE=rocm/dev-ubuntu-22.04:6.3…   504B      buildkit.dockerfile.v0
<missing>      19 hours ago   ARG FA_REPO                                     0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG FA_BRANCH                                   0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_VISION_REPO                         0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_REPO                                0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_VISION_BRANCH                       0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG PYTORCH_BRANCH                              0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG TRITON_REPO                                 0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG TRITON_BRANCH                               0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG RCCL_REPO                                   0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG RCCL_BRANCH                                 0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG LEGACY_HIPBLASLT_OPTION                     0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG HIPBLASLT_BRANCH                            0B        buildkit.dockerfile.v0
<missing>      19 hours ago   ARG BASE_IMAGE                                  0B        buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   1.53GB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   5.25MB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c pip install /install/*.whl # …   660MB     buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c dpkg -i /install/*deb     && …   70.7MB    buildkit.dockerfile.v0
<missing>      19 hours ago   RUN /bin/sh -c dpkg -i /install/*deb     && …   1.68GB    buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTPS_PROXY=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTP_PROXY=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV https_proxy=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV http_proxy=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   129MB     buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   258MB     buildkit.dockerfile.v0
<missing>      21 hours ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      21 hours ago   WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      21 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHO…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTHON_VERSION=3.12                         0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV PYTORCH_ROCM_ARCH=gfx90a;gfx942             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV ROCM_PATH=/opt/rocm                         0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV PATH=/opt/rocm/llvm/bin:/usr/local/sbin:…   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTPS_PROXY=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV HTTP_PROXY=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV https_proxy=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ENV http_proxy=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   1.67kB    buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   29.6GB    buildkit.dockerfile.v0
<missing>      2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3…   60B       buildkit.dockerfile.v0
<missing>      2 months ago   ARG APT_PREF                                    0B        buildkit.dockerfile.v0
<missing>      2 months ago   ARG AMDGPU_VERSION=5.3                          0B        buildkit.dockerfile.v0
<missing>      2 months ago   ARG ROCM_VERSION=5.3                            0B        buildkit.dockerfile.v0
<missing>      2 months ago   LABEL maintainer=dl.mlsedevops@amd.com          0B        buildkit.dockerfile.v0
<missing>      5 months ago   /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B
<missing>      5 months ago   /bin/sh -c #(nop) ADD file:ebe009f86035c175b…   77.9MB
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B
<missing>      5 months ago   /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH     0B
<missing>      5 months ago   /bin/sh -c #(nop)  ARG RELEASE                  0B

These are the base image layers with the full instructions for each one. It seems the 29.6GB layer is the one we could break into smaller pieces (a possible split is sketched after the table below).

IMAGE          CREATED        CREATED BY   SIZE      COMMENT
sha256:1f3d114085976ba42ff625c93e9699bff33b2614e5e17c9846a02b72000f4b6a   21 hours ago   RUN |13 BASE_IMAGE=rocm/dev-ubuntu-22.04:6.3.1-complete HIPBLASLT_BRANCH=4d40e36 LEGACY_HIPBLASLT_OPTION= RCCL_BRANCH=648a58d RCCL_REPO=https://github.com/ROCm/rccl TRITON_BRANCH=e5be006 TRITON_REPO=https://github.com/triton-lang/triton.git PYTORCH_BRANCH=3a585126 PYTORCH_VISION_BRANCH=v0.19.1 PYTORCH_REPO=https://github.com/pytorch/pytorch.git PYTORCH_VISION_REPO=https://github.com/pytorch/vision.git FA_BRANCH=b7d29fb FA_REPO=https://github.com/ROCm/flash-attention.git /bin/sh -c echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt && echo "HIPBLAS_COMMON_BRANCH: ${HIPBLAS_COMMON_BRANCH}" >> /app/versions.txt && echo "HIPBLASLT_BRANCH: ${HIPBLASLT_BRANCH}" >> /app/versions.txt && echo "LEGACY_HIPBLASLT_OPTION: ${LEGACY_HIPBLASLT_OPTION}" >> /app/versions.txt && echo "RCCL_BRANCH: ${RCCL_BRANCH}" >> /app/versions.txt && echo "RCCL_REPO: ${RCCL_REPO}" >> /app/versions.txt && echo "TRITON_BRANCH: ${TRITON_BRANCH}" >> /app/versions.txt && echo "TRITON_REPO: ${TRITON_REPO}" >> /app/versions.txt && echo "PYTORCH_BRANCH: ${PYTORCH_BRANCH}" >> /app/versions.txt && echo "PYTORCH_VISION_BRANCH: ${PYTORCH_VISION_BRANCH}" >> /app/versions.txt && echo "PYTORCH_REPO: ${PYTORCH_REPO}" >> /app/versions.txt && echo "PYTORCH_VISION_REPO: ${PYTORCH_VISION_REPO}" >> /app/versions.txt && echo "FA_BRANCH: ${FA_BRANCH}" >> /app/versions.txt && echo "FA_REPO: ${FA_REPO}" >> /app/versions.txt # buildkit   504B      buildkit.dockerfile.v0
<missing>      21 hours ago   ARG FA_REPO                                     0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG FA_BRANCH                                   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_VISION_REPO                         0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_REPO                                0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_VISION_BRANCH                       0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG PYTORCH_BRANCH                              0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG TRITON_REPO                                 0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG TRITON_BRANCH                               0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG RCCL_REPO                                   0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG RCCL_BRANCH                                 0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG LEGACY_HIPBLASLT_OPTION                     0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG HIPBLASLT_BRANCH                            0B        buildkit.dockerfile.v0
<missing>      21 hours ago   ARG BASE_IMAGE                                  0B        buildkit.dockerfile.v0
<missing>      21 hours ago   RUN /bin/sh -c pip install /install/*.whl # buildkit   1.53GB    buildkit.dockerfile.v0
<missing>      21 hours ago   RUN /bin/sh -c pip install /install/*.whl # buildkit   5.25MB    buildkit.dockerfile.v0
<missing>      21 hours ago   RUN /bin/sh -c pip install /install/*.whl # buildkit   660MB     buildkit.dockerfile.v0
<missing>      21 hours ago   RUN /bin/sh -c dpkg -i /install/*deb && sed -i 's/, rccl-dev \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status && sed -i 's/, rccl \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status # buildkit   70.7MB    buildkit.dockerfile.v0
<missing>      21 hours ago   RUN /bin/sh -c dpkg -i /install/*deb && sed -i 's/, hipblaslt-dev \(.*\), hipcub-dev/, hipcub-dev/g' /var/lib/dpkg/status && sed -i 's/, hipblaslt \(.*\), hipfft/, hipfft/g' /var/lib/dpkg/status # buildkit   1.68GB    buildkit.dockerfile.v0
<missing>      23 hours ago   ENV HTTPS_PROXY=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>      23 hours ago   ENV HTTP_PROXY=http://webproxy:8080             0B        buildkit.dockerfile.v0
<missing>      23 hours ago   ENV https_proxy=http://webproxy:8080            0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV http_proxy=http://webproxy:8080                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHON_VERSION=3.12 /bin/sh -c pip install -U packaging cmake ninja wheel setuptools pybind11 Cython # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           129MB     buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHON_VERSION=3.12 /bin/sh -c apt-get update -y     && apt-get install -y software-properties-common git curl sudo vim less     && add-apt-repository ppa:deadsnakes/ppa     && apt-get update -y     && apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv        python${PYTHON_VERSION}-lib2to3 python-is-python3      && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1     && update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION}     && ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config     && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION}     && python3 --version && python3 -m pip --version # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           258MB     buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV DEBIAN_FRONTEND=noninteractive                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   WORKDIR /app                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   RUN |2 PYTORCH_ROCM_ARCH=gfx90a;gfx942 PYTHON_VERSION=3.12 /bin/sh -c mkdir -p /app # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ARG PYTHON_VERSION=3.12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV PYTORCH_ROCM_ARCH=gfx90a;gfx942                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV ROCM_PATH=/opt/rocm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV PATH=/opt/rocm/llvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV HTTPS_PROXY=http://webproxy:8080                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV HTTP_PROXY=http://webproxy:8080                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV https_proxy=http://webproxy:8080                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             0B        buildkit.dockerfile.v0
<missing>                                                                 23 hours ago   ENV http_proxy=http://webproxy:8080                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              0B        buildkit.dockerfile.v0
<missing>                                                                 2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3.1 APT_PREF=Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600 /bin/sh -c groupadd -g 109 render # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               1.67kB    buildkit.dockerfile.v0
<missing>                                                                 2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3.1 APT_PREF=Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600 /bin/sh -c apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates curl libnuma-dev gnupg   && curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -   && printf "deb [arch=amd64] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" | tee /etc/apt/sources.list.d/rocm.list   && printf "deb [arch=amd64] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" | tee /etc/apt/sources.list.d/amdgpu.list   && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends   sudo   libelf1   kmod   file   python3-dev   python3-pip   rocm-dev   rocm-libs   build-essential &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      29.6GB    buildkit.dockerfile.v0
<missing>                                                                 2 months ago   RUN |3 ROCM_VERSION=6.3.1 AMDGPU_VERSION=6.3.1 APT_PREF=Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600 /bin/sh -c echo "$APT_PREF" > /etc/apt/preferences.d/rocm-pin-600 # buildkit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               60B       buildkit.dockerfile.v0
<missing>                                                                 2 months ago   ARG APT_PREF                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     0B        buildkit.dockerfile.v0
<missing>                                                                 2 months ago   ARG AMDGPU_VERSION=5.3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           0B        buildkit.dockerfile.v0
<missing>                                                                 2 months ago   ARG ROCM_VERSION=5.3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             0B        buildkit.dockerfile.v0
<missing>                                                                 2 months ago   LABEL maintainer=dl.mlsedevops@amd.com                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           0B        buildkit.dockerfile.v0
<missing>                                                                 5 months ago   /bin/sh -c #(nop)  CMD ["/bin/bash"]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             0B
<missing>                                                                 5 months ago   /bin/sh -c #(nop) ADD file:ebe009f86035c175ba244badd298a2582914415cf62783d510eab3a311a5d4e1 in /                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 77.9MB
<missing>                                                                 5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.image.version=22.04                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  0B
<missing>                                                                 5 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.image.ref.name=ubuntu                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                0B
<missing>                                                                 5 months ago   /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0B
<missing>                                                                 5 months ago   /bin/sh -c #(nop)  ARG RELEASE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   0B
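
(For reference, a layer listing like the above comes from docker history; e.g. for the locally built image from the steps further below -- exact tag assumed:)

docker history --no-trunc vllm:rocm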

Looking at the large layer and the packages it installs, we focus on rocm-dev and rocm-libs, which are meta packages and contain the following:

apt-cache depends rocm-dev
rocm-dev
  Depends: amd-smi-lib
  Depends: comgr
  Depends: hip-doc
  Depends: hipify-clang
  Depends: hip-runtime-amd
  Depends: hip-samples
  Depends: hipcc
  Depends: hsa-rocr
  Depends: hsa-amd-aqlprofile
  Depends: rocm-llvm
  Depends: openmp-extras-runtime
  Depends: rocm-cmake
  Depends: rocm-dbgapi
  Depends: rocm-debug-agent
  Depends: rocm-device-libs
  Depends: rocm-gdb
  Depends: rocm-smi-lib
  Depends: rocm-utils
  Depends: rocm-core
  Depends: rocm-opencl
  Depends: rocprofiler-register
  Depends: rocprofiler
  Depends: rocprofiler-plugins
  Depends: roctracer
  Depends: rocprofiler-sdk
  Depends: rocprofiler-sdk-roctx
  Depends: hip-dev
  Depends: hsa-rocr-dev
  Depends: rocprofiler-register
  Depends: rocprofiler-dev
  Depends: roctracer-dev
  Depends: openmp-extras-dev
  Depends: rocm-opencl-dev


apt-cache depends rocm-libs
rocm-libs
  Depends: hipblas
  Depends: hipfft-dev
  Depends: hipsolver-dev
  Depends: hipsparse-dev
  Depends: hiptensor-dev
  Depends: miopen-hip-dev
  Depends: rocalution-dev
  Depends: rocblas-dev
  Depends: rocfft-dev
  Depends: rocprim-dev
  Depends: rocrand-dev
  Depends: hiprand-dev
  Depends: rocsolver-dev
  Depends: rocsparse-dev
  Depends: rocthrust-dev
  Depends: rocwmma-dev
  Depends: hipsparselt-dev
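
To get a rough sense of how much these meta packages pull in before looking at individual packages, we can sum the Installed-Size of their direct dependencies -- a minimal sketch (direct dependencies only; errors for virtual packages are suppressed):

# Installed-Size is reported in KiB, hence the double division
apt-cache depends rocm-dev rocm-libs | awk '/Depends:/ {print $2}' | sort -u \
    | xargs -r dpkg-query -Wf '${Package}\t${Installed-Size}\n' 2>/dev/null \
    | awk '{s+=$2} END {printf "total: %.2f GiB\n", s/1024/1024}'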

Then we look at the top 20 installed packages by size:

dpkg-query -Wf '${Package}\t${Installed-Size}\n' | grep roc | sort -k2 -nr | awk '{printf "%-40s %.2f MB\n", $1, $2/1024}' | head -n 20
rocblas                                  3843.87 MB
rocsolver                                3299.95 MB
rocsparse                                2721.65 MB
rocfft                                   2615.09 MB
rocrand                                  338.07 MB
rocalution                               191.44 MB
rocprofiler-sdk                          32.16 MB
hsa-rocr                                 10.63 MB
rocm-smi-lib                             8.17 MB
rocm-dbgapi                              7.23 MB
rocthrust-dev                            5.56 MB
rocm-opencl                              4.94 MB
rocprim-dev                              4.17 MB
rocprofiler-plugins                      4.14 MB
rocprofiler                              4.13 MB
rocrand-dev                              3.71 MB
rocm-device-libs                         3.27 MB
rocblas-dev                              2.82 MB
rocsparse-dev                            1.92 MB
rocprofiler-register                     1.67 MB

After a brief discussion with @elukey about it, we agreed that if we can narrow down which of these packages are not needed at runtime, we can exclude them from the final image (using the same multistage build as the original dockerfile).
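
A minimal sketch for that narrowing-down, assuming the torch layout under python3.12 inside the image: resolve which ROCm shared libraries the installed extensions actually link against at load time, then map them back to their owning packages with dpkg -S:

# list the ROCm libs torch's native extensions resolve, then map the
# (symlink-resolved) paths back to the packages that ship them
for so in /usr/local/lib/python3.12/dist-packages/torch/lib/*.so; do ldd "$so"; done 2>/dev/null \
    | awk '$3 ~ /^\/opt\/rocm/ {print $3}' | sort -u \
    | xargs -r readlink -f \
    | xargs -r dpkg -S 2>/dev/null | cut -d: -f1 | sort -u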

@isarantopoulos if you have time, could you add here how you build/test the image on ml-lab? I can try to create another version of the image, playing with the Dockerfile while you are afk, to see if I can get a lighter image :)

Note: I haven't managed to test it successfully yet -- error logs are attached at the bottom of this message -- but the procedure on ml-lab1002 is the following:

I cloned the repo https://github.com/vllm-project/vllm
and built the two images, modifying the dockerfiles to include the proxy env args and, in the second image, to use the image:tag I created instead of the standard one (I could have just given the first image the standard image:tag, though).

1. Building the images

DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm_base --platform=linux/amd64 -t vllm:rocm-base . 2>&1 | ts '%Y-%m-%d %H:%M:%S' > build_base_2.log
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm --platform=linux/amd64 -t vllm:rocm . 2>&1 | ts '%Y-%m-%d %H:%M:%S' > build_base_final.log
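
As an alternative to editing the Dockerfiles, http_proxy/https_proxy are among Docker's predefined build args, so the proxy could likely be passed at build time instead -- a sketch for the base image:

# predefined proxy build args need no ARG declaration in the Dockerfile
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm_base --platform=linux/amd64 \
    --build-arg http_proxy=http://webproxy:8080 \
    --build-arg https_proxy=http://webproxy:8080 \
    -t vllm:rocm-base .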

2. Attach a shell to the container

Here docker compose would be useful (see the sketch after the run commands below).

docker run -p 8000:8000 --device=/dev/kfd --device=/dev/dri -e https_proxy=http://webproxy:8080 -e http_proxy=http://webproxy:8080 -e HSA_VISIBLE_DEVICES=0 -it --entrypoint /bin/bash vllm:rocm

Alternatively, we can also try to run the upstream image:

docker run -p 8000:8000 --device=/dev/kfd --device=/dev/dri -e https_proxy=http://webproxy:8080 -e http_proxy=http://webproxy:8080 -e HSA_VISIBLE_DEVICES=0 -it --entrypoint /bin/bash rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
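
On the docker compose note above, a minimal sketch of what that could look like (untested; the service name and inline file are my own):

# write a compose file equivalent to the docker run invocation above
cat > docker-compose.yml <<'EOF'
services:
  vllm:
    image: vllm:rocm
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd
      - /dev/dri
    environment:
      http_proxy: http://webproxy:8080
      https_proxy: http://webproxy:8080
      HSA_VISIBLE_DEVICES: "0"
    entrypoint: /bin/bash
    stdin_open: true
    tty: true
EOF
# --service-ports publishes the ports defined on the service
docker compose run --service-ports vllm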

3. Download and serve a model in the container (I just chose a model in the MB size range that is supported by vllm). Official docs

vllm serve facebook/opt-125m
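
If the server needs to bind explicitly, vllm serve also accepts host/port flags (both appear in the server args dump later in this task); e.g.:

# explicit bind address and port (8000 is already the default)
vllm serve facebook/opt-125m --host 0.0.0.0 --port 8000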

4. Make a request:

You should be able to see the deployed model(s) with:

curl http://localhost:8000/v1/models

Make a request with the following:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
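
To pull out just the generated text (in the OpenAI-style response the completion lands in choices[0].text), the same request can be piped through python3:

curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}' \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'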

At the moment I haven't managed to run it successfully (I tried more than one model). I get different types of errors when testing the upstream image and the image I built; the full stack traces for both are available in a paste.
logs for the upstream image ------ logs for the custom-built image

I was able to run vLLM using the docker image rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 by customizing the instructions from this ROCm blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html as shown below:

Confirm ml-lab1002 meets requirements

I checked ml-lab1002 against the minimum requirements listed in the ROCm vLLM blog linked above; it meets 3 of the 4, as shown below.

1. ROCm: required version is 6.3.0 or later, but we have 6.1.0.60100-82~22.04 (the one requirement not met)

$ apt show rocm-libs -a

Package: rocm-libs
Version: 6.1.0.60100-82~22.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.1.0.60100-82~22.04), hipblaslt (= 0.7.0.60100-82~22.04), hipfft (= 1.0.14.60100-82~22.04), hipsolver (= 2.1.0.60100-82~22.04), hipsparse (= 3.0.1.60100-82~22.04), hiptensor (= 1.2.0.60100-82~22.04), miopen-hip (= 3.1.0.60100-82~22.04), half (= 1.12.0.60100-82~22.04), rccl (= 2.18.6.60100-82~22.04), rocalution (= 3.1.1.60100-82~22.04), rocblas (= 4.1.0.60100-82~22.04), rocfft (= 1.0.27.60100-82~22.04), rocrand (= 3.0.1.60100-82~22.04), hiprand (= 2.10.16.60100-82~22.04), rocsolver (= 3.25.0.60100-82~22.04), rocsparse (= 3.1.2.60100-82~22.04), rocm-core (= 6.1.0.60100-82~22.04), hipsparselt (= 0.1.0.60100-82~22.04), composablekernel-dev (= 1.1.0.60100-82~22.04), hipblas-dev (= 2.1.0.60100-82~22.04), hipblaslt-dev (= 0.7.0.60100-82~22.04), hipcub-dev (= 3.1.0.60100-82~22.04), hipfft-dev (= 1.0.14.60100-82~22.04), hipsolver-dev (= 2.1.0.60100-82~22.04), hipsparse-dev (= 3.0.1.60100-82~22.04), hiptensor-dev (= 1.2.0.60100-82~22.04), miopen-hip-dev (= 3.1.0.60100-82~22.04), rccl-dev (= 2.18.6.60100-82~22.04), rocalution-dev (= 3.1.1.60100-82~22.04), rocblas-dev (= 4.1.0.60100-82~22.04), rocfft-dev (= 1.0.27.60100-82~22.04), rocprim-dev (= 3.1.0.60100-82~22.04), rocrand-dev (= 3.0.1.60100-82~22.04), hiprand-dev (= 2.10.16.60100-82~22.04), rocsolver-dev (= 3.25.0.60100-82~22.04), rocsparse-dev (= 3.1.2.60100-82~22.04), rocthrust-dev (= 3.0.1.60100-82~22.04), rocwmma-dev (= 1.4.0.60100-82~22.04), hipsparselt-dev (= 0.1.0.60100-82~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1,052 B
APT-Manual-Installed: yes
APT-Sources: http://apt.wikimedia.org/wikimedia bookworm-wikimedia/thirdparty/amd-rocm61 amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
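
Note: despite the host's older ROCm, the container presumably still works because the image bundles its own ROCm 6.3 userspace; only the host kernel driver (amdgpu/KFD, exposed through /dev/kfd and /dev/dri) is shared with the container.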

2. GPU: the required model is MI300X or another ROCm-supported GPU, and we have MI200 (met)

$ rocm-smi

====================================== ROCm System Management Interface ======================================
================================================ Concise Info ================================================
Device  [Model : Revision]    Temp    Power  Partitions      SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                   
==============================================================================================================
0       [0x0c34 : 0x02]       43.0°C  38.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Aldebaran/MI200 [Ins                                                                                  
1       [0x0c34 : 0x02]       48.0°C  40.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Aldebaran/MI200 [Ins                                                                                  
==============================================================================================================
============================================ End of ROCm SMI Log =============================================

3. Docker: the required version is 20.10 or later with buildx support, and we have 20.10.24+dfsg1 (met)

$ docker --version

Docker version 20.10.24+dfsg1, build 297e128

4. Python: the required version is 3.8 or later, and we have 3.11.2 (met)

$ python3 --version

Python 3.11.2

Run vLLM in a container hosted on ml-lab1002

1. Pull the ROCm vLLM docker image

$ docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

2. Run the container interactively with GPU access, the WMF proxy set, and Triton FlashAttention disabled to avoid the compatibility issues I hit in https://phabricator.wikimedia.org/P74816$326. The command also maps the required groups, shared memory, and the HuggingFace cache volume, as shown in https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html#creating-a-docker-run-alias (a small wrapper sketch follows the command).

$ docker run --network=host -it \
-e http_proxy=http://webproxy.eqiad.wmnet:8080 \
-e https_proxy=http://webproxy.eqiad.wmnet:8080 \
-e VLLM_USE_TRITON_FLASH_ATTN=0 \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
-v /srv/hf-cache:/home/vllm/.cache/huggingface \
--entrypoint=/bin/bash \
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
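
To avoid retyping this, the blog's "docker run alias" idea can be captured in a small shell function. A sketch mirroring the flags above (the function name is mine, untested):

vllm_rocm() {
    # Interactive shell in the ROCm vLLM image with GPUs, proxy, and HF cache wired up
    docker run --network=host -it \
        -e http_proxy=http://webproxy.eqiad.wmnet:8080 \
        -e https_proxy=http://webproxy.eqiad.wmnet:8080 \
        -e VLLM_USE_TRITON_FLASH_ATTN=0 \
        --device=/dev/kfd --device=/dev/dri \
        --group-add="$(getent group video | cut -d: -f3)" \
        --group-add="$(getent group render | cut -d: -f3)" \
        --ipc=host \
        --security-opt seccomp=unconfined \
        -v /srv/hf-cache:/home/vllm/.cache/huggingface \
        --entrypoint=/bin/bash \
        rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
}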

3. Start the vLLM server running the facebook/opt-125m model

$ vllm serve facebook/opt-125m

INFO 04-09 13:20:05 __init__.py:179] Automatically detected platform rocm.
WARNING 04-09 13:20:05 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 04-09 13:20:06 api_server.py:768] vLLM API server version 0.6.7.dev220+g84f5d47b
INFO 04-09 13:20:06 api_server.py:769] args: Namespace(subparser='serve', model_tag='facebook/opt-125m', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7fe214684a40>)
INFO 04-09 13:20:06 api_server.py:195] Started engine process with PID 84
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 4.05MB/s]
INFO 04-09 13:20:09 __init__.py:179] Automatically detected platform rocm.
INFO 04-09 13:20:18 config.py:513] This model supports multiple tasks: {'score', 'classify', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 04-09 13:20:21 config.py:513] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 5.91MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 34.4MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 26.1MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 3.51MB/s]
INFO 04-09 13:20:26 engine.py:72] Initializing an LLM engine (v0.6.7.dev220+g84f5d47b) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.85MB/s]
INFO 04-09 13:20:27 rocm.py:124] None is not supported in AMD GPUs.
INFO 04-09 13:20:27 rocm.py:125] Using ROCmFlashAttention backend.
INFO 04-09 13:20:27 model_runner.py:1095] Starting to load model facebook/opt-125m...
INFO 04-09 13:20:28 weight_utils.py:251] Using model weights format ['*.bin']
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:00<00:00, 533MB/s]
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.83it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.82it/s]

INFO 04-09 13:20:29 model_runner.py:1100] Loading model weights took 0.2389 GB
INFO 04-09 13:20:29 worker.py:266] Memory profiling takes 0.67 seconds
INFO 04-09 13:20:29 worker.py:266] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 04-09 13:20:29 worker.py:266] model weights take 0.24GiB; non_torch_memory takes 0.59GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 56.30GiB.
INFO 04-09 13:20:29 executor_base.py:107] # CUDA blocks: 102483, # CPU blocks: 7281
INFO 04-09 13:20:29 executor_base.py:112] Maximum concurrency for 2048 tokens per request: 800.65x
INFO 04-09 13:20:30 model_runner.py:1409] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00,  3.46it/s]
INFO 04-09 13:20:40 model_runner.py:1536] Graph capturing finished in 10 secs, took 0.17 GiB
INFO 04-09 13:20:40 engine.py:72] init engine (profile, create kv cache, warmup model) took 11.96 seconds
INFO 04-09 13:20:42 api_server.py:692] Using supplied chat template:
INFO 04-09 13:20:42 api_server.py:692] None
INFO 04-09 13:20:42 launcher.py:19] Available routes are:
INFO 04-09 13:20:42 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 04-09 13:20:42 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 04-09 13:20:42 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-09 13:20:42 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 04-09 13:20:42 launcher.py:27] Route: /health, Methods: GET
INFO 04-09 13:20:42 launcher.py:27] Route: /ping, Methods: GET, POST
INFO 04-09 13:20:42 launcher.py:27] Route: /tokenize, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /detokenize, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /v1/models, Methods: GET
INFO 04-09 13:20:42 launcher.py:27] Route: /version, Methods: GET
INFO 04-09 13:20:42 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /pooling, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /score, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /v1/score, Methods: POST
INFO 04-09 13:20:42 launcher.py:27] Route: /invocations, Methods: POST
INFO:     Started server process [13]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
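
Aside: the reported "Maximum concurrency ... 800.65x" follows from the KV-cache figures above, assuming vLLM's default 16-token block size (block_size was left unset): 102483 GPU blocks × 16 tokens per block ÷ 2048 max tokens per request ≈ 800.65. (The 39.64x reported later for the 8192-token aya-expanse-8b run follows the same arithmetic: 20295 × 16 ÷ 8192 ≈ 39.64.)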

4. In a second terminal, open a shell in the running container

$ docker exec -it 40966eb1a92e /bin/bash

5. List the loaded models to verify the server is up and the required model exists.

$ curl --noproxy '*' http://localhost:8000/v1/models

{"object":"list","data":[{"id":"facebook/opt-125m","object":"model","created":1744204998,"owned_by":"vllm","root":"facebook/opt-125m","parent":null,"max_model_len":2048,"permission":[{"id":"modelperm-f1d4c50a93b84d2985946daeb9a4fe83","object":"model_permission","created":1744204998,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

6. Send a test completion request to confirm inference works. As in step 5 above, this command bypasses (via --noproxy '*') the https_proxy=http://webproxy.eqiad.wmnet:8080 we set earlier, since requests to the local inference server should not go through the proxy.

$ time curl --noproxy '*' http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
      "model": "facebook/opt-125m",
      "prompt": "Once upon a time,",
      "max_tokens": 50,
      "temperature": 0.7
    }'

{"id":"cmpl-cb15233f2ff440499f7419196411c9dd","object":"text_completion","created":1744209803,"model":"facebook/opt-125m","choices":[{"index":0,"text":" the US government was in the wrong place at the wrong time. It is now. The president needs to get his act together and take responsibility.\nThe US is an authoritarian state. Why does the US have to be a totalitarian state to be a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":56,"completion_tokens":50,"prompt_tokens_details":null}}
real	0m0.269s
user	0m0.003s
sys	0m0.006s

Using the same ROCm vLLM container from T385173#10726495, I attempted to run inference for both the aya-expanse-8b and aya-expanse-32b models, as shown below:

aya-expanse-8b

1. Start the vLLM server hosting the aya-expanse-8b model

$ vllm serve CohereForAI/aya-expanse-8b

INFO 04-10 06:19:47 __init__.py:179] Automatically detected platform rocm.
WARNING 04-10 06:19:47 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 04-10 06:19:48 api_server.py:768] vLLM API server version 0.6.7.dev220+g84f5d47b
INFO 04-10 06:19:48 api_server.py:769] args: Namespace(subparser='serve', model_tag='CohereForAI/aya-expanse-8b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='CohereForAI/aya-expanse-8b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f287b884a40>)
INFO 04-10 06:19:48 api_server.py:195] Started engine process with PID 678
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 8.58MB/s]
INFO 04-10 06:19:51 __init__.py:179] Automatically detected platform rocm.
INFO 04-10 06:19:59 config.py:513] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-10 06:20:02 config.py:513] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 86.7MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 205MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.53MB/s]
INFO 04-10 06:20:07 engine.py:72] Initializing an LLM engine (v0.6.7.dev220+g84f5d47b) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.93MB/s]
INFO 04-10 06:20:08 rocm.py:124] None is not supported in AMD GPUs.
INFO 04-10 06:20:08 rocm.py:125] Using ROCmFlashAttention backend.
INFO 04-10 06:20:09 model_runner.py:1095] Starting to load model CohereForAI/aya-expanse-8b...
INFO 04-10 06:20:09 weight_utils.py:251] Using model weights format ['*.safetensors']
model-00004-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:19<00:00, 63.7MB/s]
model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:19<00:00, 251MB/s]
model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:19<00:00, 253MB/s]
model-00002-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [02:18<00:00, 35.5MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 98.5kB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.31s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.19s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:02,  2.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.33s/it]

INFO 04-10 06:22:38 model_runner.py:1100] Loading model weights took 14.9554 GB
INFO 04-10 06:22:42 worker.py:266] Memory profiling takes 4.49 seconds
INFO 04-10 06:22:42 worker.py:266] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 04-10 06:22:42 worker.py:266] model weights take 14.96GiB; non_torch_memory takes 0.61GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 39.64GiB.
INFO 04-10 06:22:42 executor_base.py:107] # CUDA blocks: 20295, # CPU blocks: 2048
INFO 04-10 06:22:42 executor_base.py:112] Maximum concurrency for 8192 tokens per request: 39.64x
INFO 04-10 06:22:43 model_runner.py:1409] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:12<00:00,  2.70it/s]
INFO 04-10 06:22:56 model_runner.py:1536] Graph capturing finished in 13 secs, took 0.29 GiB
INFO 04-10 06:22:56 engine.py:72] init engine (profile, create kv cache, warmup model) took 18.37 seconds
INFO 04-10 06:22:56 api_server.py:692] Using supplied chat template:
INFO 04-10 06:22:56 api_server.py:692] None
INFO 04-10 06:22:56 launcher.py:19] Available routes are:
INFO 04-10 06:22:56 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 04-10 06:22:56 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 04-10 06:22:56 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-10 06:22:56 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 04-10 06:22:56 launcher.py:27] Route: /health, Methods: GET
INFO 04-10 06:22:56 launcher.py:27] Route: /ping, Methods: GET, POST
INFO 04-10 06:22:56 launcher.py:27] Route: /tokenize, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /detokenize, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /v1/models, Methods: GET
INFO 04-10 06:22:56 launcher.py:27] Route: /version, Methods: GET
INFO 04-10 06:22:56 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /pooling, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /score, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /v1/score, Methods: POST
INFO 04-10 06:22:56 launcher.py:27] Route: /invocations, Methods: POST
INFO:     Started server process [607]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

2. Query the server with a test completion request to confirm inference works.

$ time curl --noproxy '*' http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
      "model": "CohereForAI/aya-expanse-8b",
      "prompt": "Once upon a time,",
      "max_tokens": 50,
      "temperature": 0.7
    }'


{"id":"cmpl-40190b4fc38c40829bcd754e10df7612","object":"text_completion","created":1744266366,"model":"CohereForAI/aya-expanse-8b","choices":[{"index":0,"text":" in a land far, far away, there was a prince who loved to study. He was very studious, diligent, and always striving to gain knowledge. He was also a very fair prince, known for his justice and kindness. However, the","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":56,"completion_tokens":50,"prompt_tokens_details":null}}
real	0m0.868s
user	0m0.007s
sys	0m0.004s
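
The server also exposes /v1/chat/completions (see the route list above), so the chat endpoint can be exercised in the same way. An untested sketch, assuming the model's tokenizer ships its own chat template:

$ curl --noproxy '*' http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "CohereForAI/aya-expanse-8b",
        "messages": [{"role": "user", "content": "Say hello in Swahili."}],
        "max_tokens": 50
    }'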
aya-expanse-32b

I tried serving the aya-expanse-32b model using the same settings as aya-expanse-8b, and ran into VRAM resource constraints, as shown in the logs below:

$ vllm serve CohereForAI/aya-expanse-32b

INFO 04-10 06:31:19 __init__.py:179] Automatically detected platform rocm.
WARNING 04-10 06:31:19 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 04-10 06:31:19 api_server.py:768] vLLM API server version 0.6.7.dev220+g84f5d47b
INFO 04-10 06:31:19 api_server.py:769] args: Namespace(subparser='serve', model_tag='CohereForAI/aya-expanse-32b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='CohereForAI/aya-expanse-32b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7fb7ef788a40>)
INFO 04-10 06:31:19 api_server.py:195] Started engine process with PID 229
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 637/637 [00:00<00:00, 9.03MB/s]
INFO 04-10 06:31:23 __init__.py:179] Automatically detected platform rocm.
INFO 04-10 06:31:31 config.py:513] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 49.3MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 209MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.40MB/s]
INFO 04-10 06:31:39 config.py:513] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 04-10 06:31:44 engine.py:72] Initializing an LLM engine (v0.6.7.dev220+g84f5d47b) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.90MB/s]
INFO 04-10 06:31:45 rocm.py:124] None is not supported in AMD GPUs.
INFO 04-10 06:31:45 rocm.py:125] Using ROCmFlashAttention backend.
INFO 04-10 06:31:45 model_runner.py:1095] Starting to load model CohereForAI/aya-expanse-32b...
INFO 04-10 06:31:46 weight_utils.py:251] Using model weights format ['*.safetensors']
model-00002-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:35<00:00, 137MB/s]
model-00007-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:35<00:00, 138MB/s]
model-00006-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:03<00:00, 77.9MB/s]
model-00010-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:35<00:00, 138MB/s]
model-00004-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [01:17<00:00, 62.7MB/s]
model-00001-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.90G/4.90G [01:17<00:00, 63.0MB/s]
model-00005-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:23<00:00, 59.0MB/s]
model-00008-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [01:26<00:00, 56.0MB/s]
model-00003-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:28<00:00, 56.1MB/s]
model-00014-of-00014.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 805M/805M [00:04<00:00, 190MB/s]
model-00009-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:01<00:00, 80.7MB/s]
model-00012-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:29<00:00, 164MB/s]
model-00013-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:30<00:00, 162MB/s]
model-00011-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:32<00:00, 154MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 93.5MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [01:31<19:55, 91.93s/it]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [01:34<07:51, 39.27s/it]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [01:36<04:06, 22.43s/it]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [01:39<02:25, 14.50s/it]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [01:41<01:31, 10.12s/it]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [01:43<00:59, 7.49s/it]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [01:46<00:40, 5.83s/it]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [01:48<00:28, 4.73s/it]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [01:50<00:19, 3.99s/it]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [01:51<00:11, 2.98s/it]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [01:53<00:08, 2.72s/it]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [01:56<00:05, 2.61s/it]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [01:58<00:02, 2.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:00<00:00, 2.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:00<00:00, 8.64s/it]

INFO 04-10 06:35:52 model_runner.py:1100] Loading model weights took 60.1590 GB
INFO 04-10 06:35:57 worker.py:266] Memory profiling takes 5.15 seconds
INFO 04-10 06:35:57 worker.py:266] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 04-10 06:35:57 worker.py:266] model weights take 60.16GiB; non_torch_memory takes 0.71GiB; PyTorch activation peak memory takes 2.44GiB; the rest of the memory reserved for KV Cache is -5.73GiB.
INFO 04-10 06:35:57 executor_base.py:107] # CUDA blocks: 0, # CPU blocks: 1638
INFO 04-10 06:35:57 executor_base.py:112] Maximum concurrency for 8192 tokens per request: 0.00x
ERROR 04-10 06:35:57 engine.py:381] No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 04-10 06:35:57 engine.py:381] Traceback (most recent call last):
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
ERROR 04-10 06:35:57 engine.py:381]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 04-10 06:35:57 engine.py:381]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 04-10 06:35:57 engine.py:381]     return cls(ipc_path=ipc_path,
ERROR 04-10 06:35:57 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 04-10 06:35:57 engine.py:381]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-10 06:35:57 engine.py:381]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 06:35:57 engine.py:381]   File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
ERROR 04-10 06:35:57 engine.py:381]   File "vllm/engine/llm_engine.py", line 427, in vllm.engine.llm_engine.LLMEngine._initialize_kv_caches
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 118, in initialize_cache
ERROR 04-10 06:35:57 engine.py:381]     self.collective_rpc("initialize_cache",
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 04-10 06:35:57 engine.py:381]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-10 06:35:57 engine.py:381]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 06:35:57 engine.py:381]   File "vllm/utils.py", line 2379, in vllm.utils.run_method
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 293, in initialize_cache
ERROR 04-10 06:35:57 engine.py:381]     raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 04-10 06:35:57 engine.py:381]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 522, in raise_if_cache_size_invalid
ERROR 04-10 06:35:57 engine.py:381]     raise ValueError("No available memory for the cache blocks. "
ERROR 04-10 06:35:57 engine.py:381] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 383, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 72, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
  File "vllm/engine/llm_engine.py", line 427, in vllm.engine.llm_engine.LLMEngine._initialize_kv_caches
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 118, in initialize_cache
    self.collective_rpc("initialize_cache",
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/utils.py", line 2379, in vllm.utils.run_method
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 293, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 522, in raise_if_cache_size_invalid
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
[rank0]:[W410 06:35:58.249320909 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:180> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 186, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
(The same "Task exception was never retrieved" ZMQError traceback repeats for Task-3 through Task-24.)
315 while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
316 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
317 File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
318 raise _zmq.ZMQError(_zmq.ENOTSUP)
319zmq.error.ZMQError: Operation not supported
320Task exception was never retrieved
321future: <Task finished name='Task-25' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:180> exception=ZMQError('Operation not supported')>
322Traceback (most recent call last):
323 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 186, in run_output_handler_loop
324 while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
325 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
326 File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
327 raise _zmq.ZMQError(_zmq.ENOTSUP)
328zmq.error.ZMQError: Operation not supported
329Task exception was never retrieved
330future: <Task finished name='Task-26' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:180> exception=ZMQError('Operation not supported')>
331Traceback (most recent call last):
332 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 186, in run_output_handler_loop
333 while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
334 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
335 File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
336 raise _zmq.ZMQError(_zmq.ENOTSUP)
337zmq.error.ZMQError: Operation not supported
338Task exception was never retrieved
339future: <Task finished name='Task-27' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:180> exception=ZMQError('Operation not supported')>
340Traceback (most recent call last):
341 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 186, in run_output_handler_loop
342 while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
343 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
344 File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
345 raise _zmq.ZMQError(_zmq.ENOTSUP)
346zmq.error.ZMQError: Operation not supported
347Traceback (most recent call last):
348 File "/usr/local/bin/vllm", line 8, in <module>
349 sys.exit(main())
350 ^^^^^^
351 File "/usr/local/lib/python3.12/dist-packages/vllm/scripts.py", line 201, in main
352 args.dispatch_function(args)
353 File "/usr/local/lib/python3.12/dist-packages/vllm/scripts.py", line 42, in serve
354 uvloop.run(run_server(args))
355 File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
356 return __asyncio.run(
357 ^^^^^^^^^^^^^^
358 File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
359 return runner.run(main)
360 ^^^^^^^^^^^^^^^^
361 File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
362 return self._loop.run_until_complete(task)
363 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
364 File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
365 File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
366 return await main
367 ^^^^^^^^^^
368 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 796, in run_server
369 async with build_async_engine_client(args) as engine_client:
370 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
371 File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
372 return await anext(self.gen)
373 ^^^^^^^^^^^^^^^^^^^^^
374 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client
375 async with build_async_engine_client_from_engine_args(
376 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
377 File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
378 return await anext(self.gen)
379 ^^^^^^^^^^^^^^^^^^^^^
380 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 219, in build_async_engine_client_from_engine_args
381 raise RuntimeError(
382RuntimeError: Engine process failed to start. See stack trace for the root cause.

Following the vLLM tuning guidelines (https://docs.vllm.ai/en/latest/performance/optimization.html), I experimented with both the gpu_memory_utilization and max_model_len settings, as detailed in P74825#300829 and P74825#300830. I eventually found settings that let vLLM serve aya-expanse-32b, as shown below:

1. Start the vLLM server hosting the aya-expanse-32b model with the tuned settings:

$ vllm serve CohereForAI/aya-expanse-32b \
--gpu_memory_utilization=1 \
--max_model_len=5296
INFO 04-10 11:20:12 __init__.py:179] Automatically detected platform rocm.
WARNING 04-10 11:20:12 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 04-10 11:20:12 api_server.py:768] vLLM API server version 0.6.7.dev220+g84f5d47b
INFO 04-10 11:20:12 api_server.py:769] args: Namespace(subparser='serve', model_tag='CohereForAI/aya-expanse-32b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='CohereForAI/aya-expanse-32b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=5296, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=1.0, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f019f080a40>)
INFO 04-10 11:20:12 api_server.py:195] Started engine process with PID 5889
INFO 04-10 11:20:15 __init__.py:179] Automatically detected platform rocm.
INFO 04-10 11:20:24 config.py:513] This model supports multiple tasks: {'classify', 'reward', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-10 11:20:28 config.py:513] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 04-10 11:20:33 engine.py:72] Initializing an LLM engine (v0.6.7.dev220+g84f5d47b) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 04-10 11:20:34 rocm.py:124] None is not supported in AMD GPUs.
INFO 04-10 11:20:34 rocm.py:125] Using ROCmFlashAttention backend.
INFO 04-10 11:20:34 model_runner.py:1095] Starting to load model CohereForAI/aya-expanse-32b...
INFO 04-10 11:20:34 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:02<00:26,  2.08s/it]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:04<00:27,  2.26s/it]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:06<00:25,  2.32s/it]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:09<00:23,  2.35s/it]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:11<00:21,  2.36s/it]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:14<00:18,  2.37s/it]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:14<00:12,  1.81s/it]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:16<00:11,  1.90s/it]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:19<00:10,  2.04s/it]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:21<00:08,  2.14s/it]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:23<00:06,  2.21s/it]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:26<00:04,  2.25s/it]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:28<00:02,  2.28s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:30<00:00,  2.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:30<00:00,  2.21s/it]

INFO 04-10 11:21:06 model_runner.py:1100] Loading model weights took 60.1590 GB
INFO 04-10 11:21:10 worker.py:266] Memory profiling takes 3.46 seconds
INFO 04-10 11:21:10 worker.py:266] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
INFO 04-10 11:21:10 worker.py:266] model weights take 60.16GiB; non_torch_memory takes 0.61GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.82GiB.
INFO 04-10 11:21:10 executor_base.py:107] # CUDA blocks: 334, # CPU blocks: 1638
INFO 04-10 11:21:10 executor_base.py:112] Maximum concurrency for 5296 tokens per request: 1.01x
INFO 04-10 11:21:11 model_runner.py:1409] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:18<00:00,  1.86it/s]
INFO 04-10 11:21:30 model_runner.py:1536] Graph capturing finished in 19 secs, took 0.32 GiB
INFO 04-10 11:21:30 engine.py:72] init engine (profile, create kv cache, warmup model) took 23.37 seconds
INFO 04-10 11:21:30 api_server.py:692] Using supplied chat template:
INFO 04-10 11:21:30 api_server.py:692] None
INFO 04-10 11:21:30 launcher.py:19] Available routes are:
INFO 04-10 11:21:30 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 04-10 11:21:30 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 04-10 11:21:30 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-10 11:21:30 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 04-10 11:21:30 launcher.py:27] Route: /health, Methods: GET
INFO 04-10 11:21:30 launcher.py:27] Route: /ping, Methods: GET, POST
INFO 04-10 11:21:30 launcher.py:27] Route: /tokenize, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /detokenize, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /v1/models, Methods: GET
INFO 04-10 11:21:30 launcher.py:27] Route: /version, Methods: GET
INFO 04-10 11:21:30 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /pooling, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /score, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /v1/score, Methods: POST
INFO 04-10 11:21:30 launcher.py:27] Route: /invocations, Methods: POST
INFO:     Started server process [5818]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
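
The memory profile above also shows why max_model_len needed such careful tuning: with gpu_memory_utilization=1, the 60.16GiB of model weights leave only ~0.82GiB of the 63.98GiB of VRAM for the KV cache. A back-of-the-envelope check of the numbers in the log (a sketch; the 16-token block size is vLLM's default and an assumption here, since the log doesn't print it):

# Rough sanity check of vLLM's memory-profiling output above.
total_gpu_gib = 63.98            # total_gpu_memory x gpu_memory_utilization (1.00)
weights_gib = 60.16
non_torch_gib = 0.61
activation_peak_gib = 2.40

kv_cache_gib = total_gpu_gib - weights_gib - non_torch_gib - activation_peak_gib
print(f"KV cache budget: {kv_cache_gib:.2f} GiB")  # ~0.81GiB (0.82 in the log after rounding)

# 334 GPU blocks x 16 tokens/block (assumed default block size) bound the
# number of tokens the cache can hold, and hence max_model_len.
token_capacity = 334 * 16
print(token_capacity)                   # 5344 tokens
print(f"{token_capacity / 5296:.2f}x")  # 1.01x maximum concurrency, as logged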

2. Query the isvc with a test completion request to confirm inference works:

$ time curl --noproxy '*' http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
      "model": "CohereForAI/aya-expanse-32b",
      "prompt": "Once upon a time,",
      "max_tokens": 50,
      "temperature": 0.7
    }'


{"id":"cmpl-a5dae6b3d2ec408cafe39aef1b99cbbe","object":"text_completion","created":1744284787,"model":"CohereForAI/aya-expanse-32b","choices":[{"index":0,"text":" in a galaxy far, far away, there was a dog named Star. Star was a very special dog; he was a rescue dog. He was also just a puppy when he was rescued. He was so young that he had to stay with his","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":56,"completion_tokens":50,"prompt_tokens_details":null}}
real	0m3.080s
user	0m0.000s
sys	0m0.010s
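
That is 50 completion tokens in ~3.1s, i.e. roughly 16 output tokens/s on the single MI210. Since vLLM exposes an OpenAI-compatible API, the same request can also be issued with the openai Python client (a sketch; assumes openai>=1.0 is installed and the server from step 1 is running):

from openai import OpenAI

# The api_key is required by the client but ignored by the vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="CohereForAI/aya-expanse-32b",
    prompt="Once upon a time,",
    max_tokens=50,
    temperature=0.7,
)
print(resp.choices[0].text)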

Given the challenges around latency and VRAM constraints while serving aya-expanse-32b on a single GPU, the next step will be to explore a multi-GPU configuration with vLLM to further improve performance and resource utilization.

Following up on the aya-expanse-32b inference speeds in T385173#10729616, we wanted to understand how this isvc performs with different input and output token sizes. We decided to use ROCm's Model Automation and Dashboarding (MAD) framework, since it offers benchmarking tools that target AMD GPUs. Below are the steps I followed to run these benchmarks in the same container from T385173#10726495.

1. First, confirm that this environment can serve aya-expanse-32b before we run benchmarks. This also saves the model in the cache so that the benchmark tool doesn't have to re-download it.

$ vllm serve CohereForAI/aya-expanse-32b \
--gpu_memory_utilization=1 \
--max_model_len=5296 \
--dtype float16

2. Clone and set up MAD, specifically targeting the vLLM benchmarking tool to run standalone benchmarks.

$ git clone https://github.com/ROCm/MAD.git
$ cd MAD
$ pip install -r requirements.txt
$ cd scripts/vllm

3. Configure the performance settings that will be used to run the benchmark. These settings must be runnable in this environment, as confirmed in step 1.

$ vi vllm_benchmark_report.sh

replace:

OPTION_LATENCY=" --gpu-memory-utilization 0.9 "

# latency conditions
Bat="1 2 4 8 16 32 64 128 256"
InLatency="128 2048"
OutLatency="1 128"

with:

OPTION_LATENCY=" --gpu-memory-utilization 1 --max-model-len 5296 "

# latency conditions
Bat="1"
InLatency="1 64 128 256 512 1024 2048"
OutLatency="1 64 128 256 512 1024 2048"

4. Run the benchmark for the aya-expanse-32b model on 1 GPU with the float16 data type.

$ ./vllm_benchmark_report.sh -s latency -m CohereForAI/aya-expanse-32b -g 1 -d float16

5. After the benchmarking tool completes (a run takes about 4 hours), it creates a reports directory as shown below:

$ tree reports_float16_vllm_rocm6.3.1/
reports_float16_vllm_rocm6.3.1/
├── aya-expanse-32b_latency_decoding_bs1_in1024_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1024_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in128_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in1_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in2048_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in256_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in512_out64_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out1024_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out128_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out1_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out2048_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out256_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out512_float16.json
├── aya-expanse-32b_latency_decoding_bs1_in64_out64_float16.json
└── summary
    └── aya-expanse-32b_latency_report.csv

6. The reports_float16_vllm_rocm6.3.1/summary/aya-expanse-32b_latency_report.csv file is what we are interested in; we analyze it using the custom visualize_latency.py script below:

1"""
2This script analyzes an LLM latency bechmarking report produced by the ROCm MAD framework:
3then generates a bar chart (latency_bar_chart.png) that shows the relationship between latency (in ms)
4and token sizes (both input_len and output_len) for a batch size of 1.
5"""
6
7import pandas as pd
8import matplotlib.pyplot as plt
9import seaborn as sns
10import os
11import matplotlib.cm as cm
12
13# Load the CSV into a DataFrame.
14csv_filename = "aya-expanse-32b_latency_report.csv"
15df = pd.read_csv(csv_filename)
16
17# Ensure output directory exists
18output_dir = "charts"
19os.makedirs(output_dir, exist_ok=True)
20
21# Filter the DataFrame for only one batch (batch_size == 1).
22# May not be required if we set Bat="1" in: vllm_benchmark_report.sh
23df_single_batch = df[df["batch_size"] == 1].copy()
24
25# Create a label that combines input_len and output_len.
26df_single_batch["scenario"] = df_single_batch.apply(
27 lambda row: f"{row['input_len']} / {row['output_len']}", axis=1
28)
29
30# Get the number of unique scenarios to determine the number of colors needed
31num_scenarios = df_single_batch["scenario"].nunique()
32
33# Update to use the new colormap API
34cmap = plt.colormaps['YlGnBu']
35colors = [cmap(i / (num_scenarios - 1)) for i in range(num_scenarios)]
36
37# Option 1: Bar Plot
38fig, ax = plt.subplots(figsize=(10, 6))
39ax = sns.barplot(x="scenario", y="latency (ms)", data=df_single_batch, hue="scenario", dodge=False, palette=colors, legend=False)
40plt.xlabel("Input / Output Length")
41plt.ylabel("Latency (ms)")
42
43# Main title slightly above the graph
44plt.title("LLM Inference Latency vs Input/Output Length", fontsize=12, fontweight='bold', pad=20)
45
46# Centered subtitle below the main title
47fig.text(0.55, 0.93, "(model: aya-expanse-32b, batch: 1, dtype: float16, unquantized, gpu: mi200 x 1)", fontsize=10, ha='center', va='center')
48
49plt.grid(axis="y", linestyle="--", alpha=0.7)
50plt.xticks(rotation=90, ha="center")
51
52# Increase the y-axis limit to accommodate annotations
53max_latency = df_single_batch["latency (ms)"].max()
54plt.ylim(0, max_latency * 1.2) # Add 20% padding above the highest bar
55
56# Rotate latency numbers to 90 degrees
57for bar in ax.patches:
58 height = bar.get_height()
59 ax.annotate(f'{height:.0f}',
60 xy=(bar.get_x() + bar.get_width() / 2, height),
61 xytext=(0, 3), # 3 points vertical offset
62 textcoords="offset points",
63 ha='center', va='bottom', fontsize=8, rotation=90)
64
65plt.tight_layout()
66plt.savefig(f"{output_dir}/latency_bar_chart.png", dpi=300)
67# plt.show()
68print(f"Saved chart: {output_dir}/latency_bar_chart.png")
69plt.close()

To produce the bar chart, run:

$ pip install pandas matplotlib seaborn
$ python visualize_latency.py

7. LLM inference latency bar chart

latency_bar_chart.png (1×3 px, 357 KB)

This chart shows the inference latency (ms) of the aya-expanse-32b model running within the ROCm vLLM docker image on a single ml-lab1002 AMD GPU. It highlights that both input and output token lengths significantly influence inference latency, and we will have to tune these parameters to achieve efficient inference speeds based on the specific needs and use cases of product teams.

Nice work Kevin!
@kevinbazira @klausman could we run the same benchmark on the MI300X?

I think so, yes. Will give that a shot today.

Here is the full chart from running the benchmark as described by Kevin above, on the SMC-provided MI300X test machine (using one of its 8 GPUs).

latency_bar_chart.png (1×3 px, 321 KB)

And the underlying CSV data for the above graph:

I have been working on porting the ROCm vLLM image (rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6) to use WMF's Debian Bookworm base image (docker-registry.wikimedia.org/bookworm:20250413) instead of Ubuntu.

1. Building ROCm, PyTorch, and vLLM on WMF Debian Bookworm

After @klausman helped resolve proxy issues on ml-lab (P75284#302759), I ported the ROCm vLLM image using the Dockerfile below, which uses WMF's Debian Bookworm as a base image:

########################################
# wmf-debian-vllm: ROCm, PyTorch, vLLM #
########################################
ARG BASE_IMAGE=docker-registry.wikimedia.org/bookworm:20250413
FROM ${BASE_IMAGE} AS builder

# — Set proxy env vars required on ml-lab1008 (see: https://phabricator.wikimedia.org/P75284#302759)
ENV http_proxy=http://208.80.154.74:8080
ENV https_proxy=http://208.80.154.74:8080
ENV HTTP_PROXY=$http_proxy
ENV HTTPS_PROXY=$https_proxy

COPY apt.conf /etc/apt/apt.conf

# — Mirror upstream: pin ROCm packages and create 'render' group
ARG ROCM_VERSION=6.3.1
ARG AMDGPU_VERSION=6.3.1
ARG APT_PREF="Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600"
RUN groupadd -g 109 render \
    && printf "$APT_PREF" > /etc/apt/preferences.d/rocm-pin-600

# — Add AMD ROCm & AMDGPU repositories and keys
RUN mkdir -p /etc/apt/keyrings \
    && apt-get update -q \
    && apt-get install -q -y --no-install-recommends wget gnupg ca-certificates apt-transport-https \
    && wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/rocm.gpg \
    && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/${AMDGPU_VERSION}/ubuntu jammy main" > /etc/apt/sources.list.d/amdgpu.list \
    && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/${ROCM_VERSION} jammy main" > /etc/apt/sources.list.d/rocm.list \
    && apt-get update -q \
    # Clean up APT lists and packages used only for adding repos in this layer
    && apt-get purge --auto-remove -y wget gnupg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
RUN mkdir -p /app

# — Install ROCm libs & Python tooling
RUN apt-get update -q \
    && apt-get install -q -y \
        rocm \
        cmake build-essential \
        python3 python3-pip python3-dev python3-venv \
        git curl sudo vim \
        sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \
    && rm -rf /var/lib/apt/lists/*

# — Set environment for ROCm and vLLM
ENV ROCM_PATH=/opt/rocm \
    VLLM_TARGET_DEVICE=rocm \
    PYTORCH_ROCM_ARCH=gfx90a \
    PATH=/opt/rocm/llvm/bin:/opt/rocm/bin:/app/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:

# — Create a Python virtual environment
RUN python3 -m venv /app/venv
ENV PATH="/app/venv/bin:${PATH}"

# — Clone vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_BRANCH=main
# We used --depth 1 to avoid cloning the full history. Caveat is that the develop install might need tags.
RUN git clone --depth 1 --branch ${VLLM_BRANCH} ${VLLM_REPO} /app/vllm
WORKDIR /app/vllm
# Optional: remove .git if not needed after checkout to save a number of MBs, especially if the full history was cloned
# RUN rm -rf .git

# — Create a custom temp directory
RUN mkdir -p /opt/tmp
ENV TMPDIR=/opt/tmp

# — Install Python dependencies and ROCm-enabled PyTorch (into the venv)
RUN pip install --no-cache-dir -r requirements/rocm.txt
RUN pip install --no-cache-dir --pre torch==2.7.0.dev20250309+rocm6.3 \
    --index-url https://download.pytorch.org/whl/nightly/rocm6.3

# — Install the AMD SMI Python interface
RUN pip install --no-cache-dir /opt/rocm/share/amd_smi

# — Build vLLM in-place (using the venv)
RUN pip install --no-cache-dir setuptools_scm packaging \
    "cmake<4" ninja wheel setuptools pybind11 Cython
RUN python3 setup.py develop

2. Testing vLLM in WMF Debian container

To confirm the successful port and baseline functionality of ROCm, PyTorch, and vLLM, I ran inference with the lightweight facebook/opt-125m model, and it works in the wmf-debian-vllm container:

$ docker run --rm --network=host -it \
-e VLLM_USE_TRITON_FLASH_ATTN=0 \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
-v /srv/hf-cache:/home/vllm/.cache/huggingface \
wmf-debian-vllm /app/venv/bin/python -c "
from vllm import LLM, SamplingParams; \
llm = LLM('facebook/opt-125m'); \
print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
INFO 04-28 05:48:26 [__init__.py:239] Automatically detected platform rocm.
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 9.19MB/s]
INFO 04-28 05:48:40 [config.py:716] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
INFO 04-28 05:48:40 [arg_utils.py:1691] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 04-28 05:48:45 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 04-28 05:48:45 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev1+g9420a1f) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 8.43MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 40.1MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 103MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 3.37MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 2.10MB/s]
INFO 04-28 05:48:47 [rocm.py:186] None is not supported in AMD GPUs.
INFO 04-28 05:48:47 [rocm.py:187] Using ROCmFlashAttention backend.
INFO 04-28 05:48:47 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-28 05:48:47 [model_runner.py:1120] Starting to load model facebook/opt-125m...
INFO 04-28 05:48:48 [weight_utils.py:265] Using model weights format ['*.bin']
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:00<00:00, 528MB/s]
INFO 04-28 05:48:48 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 0.573199 seconds
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.52it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.51it/s]

INFO 04-28 05:48:48 [loader.py:458] Loading weights took 0.15 seconds
INFO 04-28 05:48:48 [model_runner.py:1156] Model loading took 0.3965 GiB and 1.113671 seconds
INFO 04-28 05:48:50 [worker.py:287] Memory profiling takes 1.90 seconds
INFO 04-28 05:48:50 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 04-28 05:48:50 [worker.py:287] model weights take 0.40GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 56.44GiB.
INFO 04-28 05:48:51 [executor_base.py:112] # rocm blocks: 102737, # CPU blocks: 7281
INFO 04-28 05:48:51 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 802.63x
INFO 04-28 05:48:52 [model_runner.py:1466] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:11<00:00, 2.94it/s]
INFO 04-28 05:49:03 [model_runner.py:1608] Graph capturing finished in 12 secs, took 0.12 GiB
INFO 04-28 05:49:03 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 15.01 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 32.50it/s, est. speed input: 162.78 toks/s, output: 162.72 toks/s]
 Where I live, the
[rank0]:[W428 05:49:05.715310270 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
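
The inline python -c one-liner above is awkward to iterate on, so the same smoke test can live in a small standalone script (a sketch; run it with /app/venv/bin/python inside the container, with VLLM_USE_TRITON_FLASH_ATTN=0 exported):

# smoke_test.py - the same vLLM/ROCm sanity check as the inline `python -c` above.
import torch
from vllm import LLM, SamplingParams

# Confirm the ROCm-enabled PyTorch build sees the GPU (torch.cuda maps to HIP on ROCm).
print("TORCH BUILD:", torch.__version__)
print("ROCm/HIP available:", torch.cuda.is_available(), torch.version.hip)

# Load a tiny model and generate a few tokens end to end.
llm = LLM("facebook/opt-125m")
out = llm.generate("Hello, world!", SamplingParams(max_tokens=5))
print(out[0].outputs[0].text)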

3. Identifying the source of the large image size

After porting, this image grew from the original ~35GB (upstream Ubuntu image) to a whopping ~61GB (WMF Debian image):

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED          SIZE
wmf-debian-vllm                           latest                                          5dc2e54d7438   3 days ago       61.9GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago     35.9GB

I identified the heaviest directories in the wmf-debian-vllm container: /app/venv, which holds both the PyTorch and vLLM build dependencies, at ~27GB; and /opt/rocm-6.3.1 at ~28GB:

root@ml-lab1002:/app# du -sh venv
27G	venv
root@ml-lab1002:/app# du -sh /opt/rocm-6.3.1
28G	/opt/rocm-6.3.1
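
To drill further into which subdirectories dominate, a throwaway Python helper (hypothetical, not part of the image) can rank the heaviest entries much like du -s does:

#!/usr/bin/env python3
# du_top.py - rank the heaviest entries under a directory (a sketch;
# approximates `du -s <dir>/*` by summing apparent file sizes).
import os
import sys

def tree_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total

base = sys.argv[1] if len(sys.argv) > 1 else "/app"
entries = []
for e in os.scandir(base):
    size = os.lstat(e.path).st_size if e.is_file(follow_symlinks=False) else tree_size(e.path)
    entries.append((size, e.path))
for size, path in sorted(entries, reverse=True)[:10]:
    print(f"{size / 2**30:6.1f} GiB  {path}")
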
4. Slimming down the large image by identifying essential runtime dependencies

To slim down this image, I wanted to use the docker-slim toolkit. But first, I had to figure out which runtime dependencies must be kept in the slim image, so I used the script below:

#!/bin/bash
set -euo pipefail # Exit on error, unset var, pipe fail

# --- Configuration ---
TIMEOUT_SECONDS=60
# Define the command to run. Using $'' syntax for easier quote handling.
# NOTE: Commented out 'hipcc --version' as hipcc is a compiler and likely
# not needed for runtime inference. Include if your specific runtime
# process actually invokes hipcc.
COMMAND_TO_RUN=$(cat <<'EOF'
rocminfo && \
rocm-smi && \
# hipcc --version && \
/app/venv/bin/python -c "
import sys
print(f'--- Python Info ---', file=sys.stderr) # Debug output to stderr
print(f'Python Exec: {sys.executable}', file=sys.stderr)
print(f'Sys Path: {sys.path}', file=sys.stderr)
import torch;
print('TORCH BUILD:', torch.__version__, torch.version.git_version);
print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip);
x = torch.randn(2, 2, device='cuda'); y = x @ x; print('Matrix mul result:', y);
from vllm import LLM, SamplingParams;
print('Imported vLLM OK', file=sys.stderr)
# Make sure model cache exists or is writable if needed!
llm = LLM('facebook/opt-125m');
print('LLM Loaded OK', file=sys.stderr)
result = llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text;
print('Generated Text:', result)
print('--- Python Script End ---', file=sys.stderr)"
EOF
)

# --- Script Logic ---
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 <directory_to_scan>"
    echo "       Scans the specified directory, renaming items to <item>.disabled"
    echo "       if they are not required for the test command to succeed."
    echo
    echo "Example: $0 /opt/rocm-6.3.1/lib"
    echo "Example: $0 /app/venv/lib/python3.11/site-packages"
    exit 1
fi

BASE_DIR="$1"

if [[ ! -d "$BASE_DIR" ]]; then
    echo "Error: Base directory '$BASE_DIR' not found."
    exit 1
fi

echo "Scanning directory: $BASE_DIR"
# Navigate to the directory to simplify mv commands
cd "$BASE_DIR"

# Use find to handle potential special characters in filenames safely
# -maxdepth 1: only look in the current directory, not subdirs
# -mindepth 1: don't process '.' itself
# -print0 / read -d $'\0': null-delimit filenames for safety
find . -maxdepth 1 -mindepth 1 -print0 | while IFS= read -r -d $'\0' item_path; do
    # item_path will be like './libfoo.so' or './torch'
    item=$(basename "$item_path") # Get name like 'libfoo.so' or 'torch'

    # Skip if it already ends with .disabled or isn't a file/dir we can move
    if [[ "$item" == *.disabled ]] || [[ ! -e "$item" ]]; then
        # echo "Skipping '$item' (already disabled or invalid type)"
        continue
    fi

    disabled="${item}.disabled"

    # Safety check: skip if the .disabled version somehow already exists
    if [[ -e "$disabled" ]]; then
        echo "Warning: Target '$disabled' already exists. Skipping '$item'."
        continue
    fi

    echo "=== Testing item: $item ==="

    # Disable the item (file or directory)
    echo "Disabling '$item' -> '$disabled'"
    mv "$item" "$disabled"

    echo "Running test command (timeout ${TIMEOUT_SECONDS}s)..."
    command_output=""
    exit_code=0

    # Run the command with timeout, capture output and exit code.
    # Use 'bash -c' to execute the complex command string correctly.
    # Redirect stderr to stdout (2>&1) to capture all output.
    # Use '|| exit_code=$?' to capture the exit code even if timeout itself fails (though less likely).
    command_output=$( timeout "$TIMEOUT_SECONDS" bash -c "$COMMAND_TO_RUN" 2>&1 ) || exit_code=$?

    echo "--- Command Output Start ---"
    # Only print output if it's not empty
    if [[ -n "$command_output" ]]; then
        echo "$command_output"
    else
        echo "(No command output)"
    fi
    echo "--- Command Output End ---"
    echo "Exit code: $exit_code"


    # --- Decision Logic ---
    # The primary check is the exit code of the command.
    # A non-zero exit code indicates failure.
    # Exit code 124 specifically means the 'timeout' command killed the process,
    # which we also treat as a failure caused by the missing component.
    if [[ $exit_code -ne 0 ]]; then
        echo "❌ Test failed (Exit Code: $exit_code). Restoring '$item'..."
        # Ensure the disabled item still exists before trying to move it back
        if [[ -e "$disabled" ]]; then
            mv "$disabled" "$item"
            echo "Restored '$item'."
        else
            # This shouldn't happen unless something external interfered
            echo "Error: '$disabled' not found. Cannot restore '$item'. Manual check needed."
            # Consider exiting here if this is critical: exit 1
        fi
    else
        echo "✅ Test succeeded (Exit Code: 0). Leaving '$item' disabled as '$disabled'."
        # Item remains named $disabled
    fi
    echo "============================"
    echo # Add a blank line for readability
done

echo "Script finished scanning $BASE_DIR."
echo "Items ending in '.disabled' are potentially unnecessary."

This script automates the manual runtime dependency identification process by:

  • temporarily disabling individual packages/libraries
  • running a ROCm + PyTorch + vLLM inference test
  • determining whether the package/library was essential (by checking the test command's exit code; a timeout also counts as a failure)

I ran it within the two largest directories:

$ ./test_packages.sh /opt/rocm-6.3.1/lib
$ ./test_packages.sh /app/venv/lib/python3.11/site-packages
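
When a scan finishes, everything the test command could live without is left renamed with a .disabled suffix, so collecting the non-essential items is just a matter of listing those names (a hypothetical helper):

# list_disabled.py - collect everything test_packages.sh left disabled
# (a sketch; run inside the container after the scans).
import pathlib

for base in ("/opt/rocm-6.3.1/lib",
             "/app/venv/lib/python3.11/site-packages"):
    for p in sorted(pathlib.Path(base).glob("*.disabled")):
        print(p)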

Once the scans completed, I had a definitive list of the essential packages and paths, which I then added to an includes.txt:

/opt/amdgpu
/opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
/opt/rocm-6.3.1/lib/libhsa-runtime64.so.1.14.60301
/opt/rocm-6.3.1/lib/librocm_smi64.so
/opt/rocm-6.3.1/lib/librocm_smi64.so.7
/opt/rocm-6.3.1/lib/librocm_smi64.so.7.4.60301
/opt/rocm-6.3.1/lib/librocprofiler-register.so.0
/opt/rocm-6.3.1/lib/librocprofiler-register.so.0.4.0
/opt/rocm-6.3.1/lib/librocprofiler64v2.so
/opt/tmp
/usr
/lib
/lib64
/etc
/dev
/bin
/app/venv/lib/python3.11/site-packages/PIL
/app/venv/lib/python3.11/site-packages/PyYAML-6.0.2.dist-info
/app/venv/lib/python3.11/site-packages/__pycache__
/app/venv/lib/python3.11/site-packages/_distutils_hack
/app/venv/lib/python3.11/site-packages/aiohappyeyeballs
/app/venv/lib/python3.11/site-packages/aiohttp
/app/venv/lib/python3.11/site-packages/aiosignal
/app/venv/lib/python3.11/site-packages/amdsmi
/app/venv/lib/python3.11/site-packages/annotated_types
/app/venv/lib/python3.11/site-packages/anyio
/app/venv/lib/python3.11/site-packages/attr
/app/venv/lib/python3.11/site-packages/blake3
/app/venv/lib/python3.11/site-packages/cachetools
/app/venv/lib/python3.11/site-packages/certifi
/app/venv/lib/python3.11/site-packages/cloudpickle
/app/venv/lib/python3.11/site-packages/cpuinfo
/app/venv/lib/python3.11/site-packages/distro
/app/venv/lib/python3.11/site-packages/easy-install.pth
/app/venv/lib/python3.11/site-packages/fastapi
/app/venv/lib/python3.11/site-packages/filelock
/app/venv/lib/python3.11/site-packages/filelock-3.18.0.dist-info
/app/venv/lib/python3.11/site-packages/frozenlist
/app/venv/lib/python3.11/site-packages/fsspec
/app/venv/lib/python3.11/site-packages/functorch
/app/venv/lib/python3.11/site-packages/gguf
/app/venv/lib/python3.11/site-packages/httpx
/app/venv/lib/python3.11/site-packages/huggingface_hub
/app/venv/lib/python3.11/site-packages/huggingface_hub-0.30.2.dist-info
/app/venv/lib/python3.11/site-packages/idna
/app/venv/lib/python3.11/site-packages/jinja2
/app/venv/lib/python3.11/site-packages/jiter
/app/venv/lib/python3.11/site-packages/markupsafe
/app/venv/lib/python3.11/site-packages/mpmath
/app/venv/lib/python3.11/site-packages/msgspec
/app/venv/lib/python3.11/site-packages/multidict
/app/venv/lib/python3.11/site-packages/networkx
/app/venv/lib/python3.11/site-packages/numpy
/app/venv/lib/python3.11/site-packages/numpy-2.2.5.dist-info
/app/venv/lib/python3.11/site-packages/numpy.libs
/app/venv/lib/python3.11/site-packages/openai
/app/venv/lib/python3.11/site-packages/packaging
/app/venv/lib/python3.11/site-packages/packaging-25.0.dist-info
/app/venv/lib/python3.11/site-packages/pillow.libs
/app/venv/lib/python3.11/site-packages/pkg_resources
/app/venv/lib/python3.11/site-packages/propcache
/app/venv/lib/python3.11/site-packages/psutil
/app/venv/lib/python3.11/site-packages/pydantic
/app/venv/lib/python3.11/site-packages/pydantic_core
/app/venv/lib/python3.11/site-packages/pyzmq.libs
/app/venv/lib/python3.11/site-packages/regex
/app/venv/lib/python3.11/site-packages/regex-2024.11.6.dist-info
/app/venv/lib/python3.11/site-packages/requests
/app/venv/lib/python3.11/site-packages/requests-2.32.3.dist-info
/app/venv/lib/python3.11/site-packages/safetensors
/app/venv/lib/python3.11/site-packages/safetensors-0.5.3.dist-info
/app/venv/lib/python3.11/site-packages/sentencepiece
/app/venv/lib/python3.11/site-packages/setuptools
/app/venv/lib/python3.11/site-packages/sniffio
/app/venv/lib/python3.11/site-packages/starlette
/app/venv/lib/python3.11/site-packages/sympy
/app/venv/lib/python3.11/site-packages/tokenizers
/app/venv/lib/python3.11/site-packages/tokenizers-0.21.1.dist-info
/app/venv/lib/python3.11/site-packages/torch
/app/venv/lib/python3.11/site-packages/torch-2.7.0.dev20250309+rocm6.3.dist-info
/app/venv/lib/python3.11/site-packages/torchgen
/app/venv/lib/python3.11/site-packages/tqdm
/app/venv/lib/python3.11/site-packages/tqdm-4.67.1.dist-info
/app/venv/lib/python3.11/site-packages/transformers
/app/venv/lib/python3.11/site-packages/triton
/app/venv/lib/python3.11/site-packages/typing_extensions.py
/app/venv/lib/python3.11/site-packages/typing_inspection
/app/venv/lib/python3.11/site-packages/urllib3
/app/venv/lib/python3.11/site-packages/yaml
/app/venv/lib/python3.11/site-packages/yarl
/app/venv/lib/python3.11/site-packages/zmq
/app/vllm
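
Before feeding this list to docker-slim, it is worth sanity-checking that every path in includes.txt actually exists in the fat image (a small hypothetical sketch, run inside the wmf-debian-vllm container):

# check_includes.py - verify that every path listed in includes.txt exists.
import os
import sys

missing = []
with open("includes.txt") as f:
    for line in f:
        path = line.strip()
        if path and not os.path.exists(path):
            missing.append(path)

if missing:
    print("Missing paths:")
    print("\n".join(missing))
    sys.exit(1)
print("All include paths exist.")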

5. Resulting slim docker image

The slimmed docker image was built using docker-slim with the above includes.txt as shown below:

$ slim build --network host \
--target wmf-debian-vllm:latest \
--tag wmf-debian-vllm:slim \
--http-probe=false \
--continue-after=exec \
--env VLLM_USE_TRITON_FLASH_ATTN=0 \
--exec="rocminfo && \
rocm-smi && \
/app/venv/bin/python -c \"
import torch; \
print('TORCH BUILD:', torch.__version__, torch.version.git_version); \
print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); \
x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); \
from vllm import LLM, SamplingParams; \
llm = LLM('facebook/opt-125m'); \
print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)\"" \
--include-shell=true \
--include-path-file=includes.txt
cmd=slim info=params include.path='/dev' message='ignoring'
cmd=slim state=started
cmd=slim info=cmd.input.params target.type='image' target.image='wmf-debian-vllm:latest' continue.mode='exec' rt.as.user='true' keep.perms='true' tags='wmf-debian-vllm:slim' image-build-engine='internal'
cmd=slim state=image.inspection.start
cmd=slim info=image id='sha256:5dc2e54d7438f2c3f2f78f5b4b67138bae2d6af90407dda504905a95e6f3b98d' size.bytes='61902042673' size.human='62 GB'
cmd=slim info=image.stack index='0' name='docker-registry.wikimedia.org/bookworm:20250413' id='sha256:76769c10bf7aa98746670cbbb9747f0940c8af78491ef6ab1e44df0761e88586'
cmd=slim info=image.stack index='1' name='wmf-debian-vllm:latest' id='sha256:5dc2e54d7438f2c3f2f78f5b4b67138bae2d6af90407dda504905a95e6f3b98d'
cmd=slim state=image.inspection.done
cmd=slim state=container.inspection.start
cmd=slim info=sensor location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/mint-sensor' filemode='-rwxr-xr-x' version='linux/amd64|ALP|x.1.42.2|29e62e7836de7b1004607c51c502537ffe1969f0|2025-01-16_07:48:54AM|x' volume='mint-sensor.x.1.42.2'
cmd=slim info=container status='created' name='mintk_3858669_20250428050351' id='92e5fbc56048fa7f98272ad7c871c464557f551edde557a475ebc96ca2f10b4f'
cmd=slim info=container name='mintk_3858669_20250428050351' id='92e5fbc56048fa7f98272ad7c871c464557f551edde557a475ebc96ca2f10b4f' status='running'
cmd=slim info=container message='obtained IP address' ip='127.0.0.1'
cmd=slim info=cmd.startmonitor status='sent'
cmd=slim info=event.startmonitor.done status='received'
cmd=slim info=container id='92e5fbc56048fa7f98272ad7c871c464557f551edde557a475ebc96ca2f10b4f' target.port.list='' target.port.info='' message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER' name='mintk_3858669_20250428050351'
cmd=slim info=continue.after mode='exec' message='provide the expected input to allow the container inspector to continue its execution'
cmd=slim info=continue.after mode='exec' shell='rocminfo && rocm-smi && /app/venv/bin/python -c "
import torch; print('TORCH BUILD:', torch.__version__, torch.version.git_version); print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); from vllm import LLM, SamplingParams; llm = LLM('facebook/opt-125m'); print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"'
mint[slim][exec]: output: ROCk module is loaded
mint[slim][exec]: output: =====================
mint[slim][exec]: output: HSA System Attributes
mint[slim][exec]: output: =====================
mint[slim][exec]: output: Runtime Version: 1.14
mint[slim][exec]: output: Runtime Ext Version: 1.6
mint[slim][exec]: output: System Timestamp Freq.: 1000.000000MHz
mint[slim][exec]: output: Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
mint[slim][exec]: output: Machine Model: LARGE
mint[slim][exec]: output: System Endianness: LITTLE
mint[slim][exec]: output: Mwaitx: DISABLED
mint[slim][exec]: output: DMAbuf Support: NO
mint[slim][exec]: output: ==========
mint[slim][exec]: output: HSA Agents
mint[slim][exec]: output: ==========
mint[slim][exec]: output: *******
mint[slim][exec]: output: Agent 1
mint[slim][exec]: output: *******
mint[slim][exec]: output: Name: AMD EPYC 7643P 48-Core Processor
mint[slim][exec]: output: Uuid: CPU-XX
mint[slim][exec]: output: Marketing Name: AMD EPYC 7643P 48-Core Processor
mint[slim][exec]: output: Vendor Name: CPU
mint[slim][exec]: output: Feature: None specified
mint[slim][exec]: output: Profile: FULL_PROFILE
mint[slim][exec]: output: Float Round Mode: NEAR
mint[slim][exec]: output: Max Queue Number: 0(0x0)
mint[slim][exec]: output: Queue Min Size: 0(0x0)
mint[slim][exec]: output: Queue Max Size: 0(0x0)
mint[slim][exec]: output: Queue Type: MULTI
mint[slim][exec]: output: Node: 0
mint[slim][exec]: output: Device Type: CPU
mint[slim][exec]: output: Cache Info:
mint[slim][exec]: output: L1: 32768(0x8000) KB
mint[slim][exec]: output: Chip ID: 0(0x0)
mint[slim][exec]: output: ASIC Revision: 0(0x0)
mint[slim][exec]: output: Cacheline Size: 64(0x40)
mint[slim][exec]: output: Max Clock Freq. (MHz): 2300
mint[slim][exec]: output: BDFID: 0
mint[slim][exec]: output: Internal Node ID: 0
mint[slim][exec]: output: Compute Unit: 96
mint[slim][exec]: output: SIMDs per CU: 0
mint[slim][exec]: output: Shader Engines: 0
mint[slim][exec]: output: Shader Arrs. per Eng.: 0
mint[slim][exec]: output: WatchPts on Addr. Ranges:1
mint[slim][exec]: output: Memory Properties:
mint[slim][exec]: output: Features: None
mint[slim][exec]: output: Pool Info:
mint[slim][exec]: output: Pool 1
mint[slim][exec]: output: Segment: GLOBAL; FLAGS: FINE GRAINED
mint[slim][exec]: output: Size: 395878876(0x1798a1dc) KB
mint[slim][exec]: output: Allocatable: TRUE
mint[slim][exec]: output: Alloc Granule: 4KB
mint[slim][exec]: output: Alloc Recommended Granule:4KB
mint[slim][exec]: output: Alloc Alignment: 4KB
mint[slim][exec]: output: Accessible by all: TRUE
mint[slim][exec]: output: Pool 2
mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
mint[slim][exec]: output: Size: 395878876(0x1798a1dc) KB
mint[slim][exec]: output: Allocatable: TRUE
mint[slim][exec]: output: Alloc Granule: 4KB
mint[slim][exec]: output: Alloc Recommended Granule:4KB
mint[slim][exec]: output: Alloc Alignment: 4KB
mint[slim][exec]: output: Accessible by all: TRUE
mint[slim][exec]: output: Pool 3
mint[slim][exec]: output: Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
mint[slim][exec]: output: Size: 395878876(0x1798a1dc) KB
mint[slim][exec]: output: Allocatable: TRUE
mint[slim][exec]: output: Alloc Granule: 4KB
mint[slim][exec]: output: Alloc Recommended Granule:4KB
mint[slim][exec]: output: Alloc Alignment: 4KB
mint[slim][exec]: output: Accessible by all: TRUE
mint[slim][exec]: output: Pool 4
mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
mint[slim][exec]: output: Size: 395878876(0x1798a1dc) KB
mint[slim][exec]: output: Allocatable: TRUE
mint[slim][exec]: output: Alloc Granule: 4KB
mint[slim][exec]: output: Alloc Recommended Granule:4KB
mint[slim][exec]: output: Alloc Alignment: 4KB
mint[slim][exec]: output: Accessible by all: TRUE
mint[slim][exec]: output: ISA Info:
mint[slim][exec]: output: *******
mint[slim][exec]: output: Agent 2
mint[slim][exec]: output: *******
mint[slim][exec]: output: Name: gfx90a
mint[slim][exec]: output: Uuid: GPU-a2c14903fb923a3f
mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
mint[slim][exec]: output: Vendor Name: AMD
mint[slim][exec]: output: Feature: KERNEL_DISPATCH
mint[slim][exec]: output: Profile: BASE_PROFILE
mint[slim][exec]: output: Float Round Mode: NEAR
mint[slim][exec]: output: Max Queue Number: 128(0x80)
mint[slim][exec]: output: Queue Min Size: 64(0x40)
mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
mint[slim][exec]: output: Queue Type: MULTI
mint[slim][exec]: output: Node: 1
mint[slim][exec]: output: Device Type: GPU
mint[slim][exec]: output: Cache Info:
mint[slim][exec]: output: L1: 16(0x10) KB
mint[slim][exec]: output: L2: 8192(0x2000) KB
mint[slim][exec]: output: Chip ID: 29711(0x740f)
mint[slim][exec]: output: ASIC Revision: 1(0x1)
mint[slim][exec]: output: Cacheline Size: 64(0x40)
mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
mint[slim][exec]: output: BDFID: 49920
mint[slim][exec]: output: Internal Node ID: 1
mint[slim][exec]: output: Compute Unit: 104
mint[slim][exec]: output: SIMDs per CU: 4
mint[slim][exec]: output: Shader Engines: 8
mint[slim][exec]: output: Shader Arrs. per Eng.: 1
mint[slim][exec]: output: WatchPts on Addr. Ranges:4
mint[slim][exec]: output: Coherent Host Access: FALSE
mint[slim][exec]: output: Memory Properties:
mint[slim][exec]: output: Features: KERNEL_DISPATCH
mint[slim][exec]: output: Fast F16 Operation: TRUE
mint[slim][exec]: output: Wavefront Size: 64(0x40)
mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
mint[slim][exec]: output: Workgroup Max Size per Dimension:
mint[slim][exec]: output: x 1024(0x400)
mint[slim][exec]: output: y 1024(0x400)
mint[slim][exec]: output: z 1024(0x400)
mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
mint[slim][exec]: output: Grid Max Size per Dimension:
mint[slim][exec]: output: x 4294967295(0xffffffff)
mint[slim][exec]: output: y 4294967295(0xffffffff)
mint[slim][exec]: output: z 4294967295(0xffffffff)
165mint[slim][exec]: output: Max fbarriers/Workgrp: 32
166mint[slim][exec]: output: Packet Processor uCode:: 71
167mint[slim][exec]: output: SDMA engine uCode:: 8
168mint[slim][exec]: output: IOMMU Support:: None
169mint[slim][exec]: output: Pool Info:
170mint[slim][exec]: output: Pool 1
171mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
172mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
173mint[slim][exec]: output: Allocatable: TRUE
174mint[slim][exec]: output: Alloc Granule: 4KB
175mint[slim][exec]: output: Alloc Recommended Granule:2048KB
176mint[slim][exec]: output: Alloc Alignment: 4KB
177mint[slim][exec]: output: Accessible by all: FALSE
178mint[slim][exec]: output: Pool 2
179mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
180mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
181mint[slim][exec]: output: Allocatable: TRUE
182mint[slim][exec]: output: Alloc Granule: 4KB
183mint[slim][exec]: output: Alloc Recommended Granule:2048KB
184mint[slim][exec]: output: Alloc Alignment: 4KB
185mint[slim][exec]: output: Accessible by all: FALSE
186mint[slim][exec]: output: Pool 3
187mint[slim][exec]: output: Segment: GROUP
188mint[slim][exec]: output: Size: 64(0x40) KB
189mint[slim][exec]: output: Allocatable: FALSE
190mint[slim][exec]: output: Alloc Granule: 0KB
191mint[slim][exec]: output: Alloc Recommended Granule:0KB
192mint[slim][exec]: output: Alloc Alignment: 0KB
193mint[slim][exec]: output: Accessible by all: FALSE
194mint[slim][exec]: output: ISA Info:
195mint[slim][exec]: output: ISA 1
196mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
197mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
198mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
199mint[slim][exec]: output: Default Rounding Mode: NEAR
200mint[slim][exec]: output: Default Rounding Mode: NEAR
201mint[slim][exec]: output: Fast f16: TRUE
202mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
203mint[slim][exec]: output: Workgroup Max Size per Dimension:
204mint[slim][exec]: output: x 1024(0x400)
205mint[slim][exec]: output: y 1024(0x400)
206mint[slim][exec]: output: z 1024(0x400)
207mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
208mint[slim][exec]: output: Grid Max Size per Dimension:
209mint[slim][exec]: output: x 4294967295(0xffffffff)
210mint[slim][exec]: output: y 4294967295(0xffffffff)
211mint[slim][exec]: output: z 4294967295(0xffffffff)
212mint[slim][exec]: output: FBarrier Max Size: 32
213mint[slim][exec]: output: *******
214mint[slim][exec]: output: Agent 3
215mint[slim][exec]: output: *******
216mint[slim][exec]: output: Name: gfx90a
217mint[slim][exec]: output: Uuid: GPU-5b81d02ab699960e
218mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
219mint[slim][exec]: output: Vendor Name: AMD
220mint[slim][exec]: output: Feature: KERNEL_DISPATCH
221mint[slim][exec]: output: Profile: BASE_PROFILE
222mint[slim][exec]: output: Float Round Mode: NEAR
223mint[slim][exec]: output: Max Queue Number: 128(0x80)
224mint[slim][exec]: output: Queue Min Size: 64(0x40)
225mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
226mint[slim][exec]: output: Queue Type: MULTI
227mint[slim][exec]: output: Node: 2
228mint[slim][exec]: output: Device Type: GPU
229mint[slim][exec]: output: Cache Info:
230mint[slim][exec]: output: L1: 16(0x10) KB
231mint[slim][exec]: output: L2: 8192(0x2000) KB
232mint[slim][exec]: output: Chip ID: 29711(0x740f)
233mint[slim][exec]: output: ASIC Revision: 1(0x1)
234mint[slim][exec]: output: Cacheline Size: 64(0x40)
235mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
236mint[slim][exec]: output: BDFID: 768
237mint[slim][exec]: output: Internal Node ID: 2
238mint[slim][exec]: output: Compute Unit: 104
239mint[slim][exec]: output: SIMDs per CU: 4
240mint[slim][exec]: output: Shader Engines: 8
241mint[slim][exec]: output: Shader Arrs. per Eng.: 1
242mint[slim][exec]: output: WatchPts on Addr. Ranges:4
243mint[slim][exec]: output: Coherent Host Access: FALSE
244mint[slim][exec]: output: Memory Properties:
245mint[slim][exec]: output: Features: KERNEL_DISPATCH
246mint[slim][exec]: output: Fast F16 Operation: TRUE
247mint[slim][exec]: output: Wavefront Size: 64(0x40)
248mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
249mint[slim][exec]: output: Workgroup Max Size per Dimension:
250mint[slim][exec]: output: x 1024(0x400)
251mint[slim][exec]: output: y 1024(0x400)
252mint[slim][exec]: output: z 1024(0x400)
254mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
255mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
256mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
257mint[slim][exec]: output: Grid Max Size per Dimension:
258mint[slim][exec]: output: x 4294967295(0xffffffff)
259mint[slim][exec]: output: y 4294967295(0xffffffff)
260mint[slim][exec]: output: z 4294967295(0xffffffff)
261mint[slim][exec]: output: Max fbarriers/Workgrp: 32
262mint[slim][exec]: output: Packet Processor uCode:: 71
263mint[slim][exec]: output: SDMA engine uCode:: 8
264mint[slim][exec]: output: IOMMU Support:: None
265mint[slim][exec]: output: Pool Info:
266mint[slim][exec]: output: Pool 1
267mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
268mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
269mint[slim][exec]: output: Allocatable: TRUE
270mint[slim][exec]: output: Alloc Granule: 4KB
271mint[slim][exec]: output: Alloc Recommended Granule:2048KB
272mint[slim][exec]: output: Alloc Alignment: 4KB
273mint[slim][exec]: output: Accessible by all: FALSE
274mint[slim][exec]: output: Pool 2
275mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
276mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
277mint[slim][exec]: output: Allocatable: TRUE
278mint[slim][exec]: output: Alloc Granule: 4KB
279mint[slim][exec]: output: Alloc Recommended Granule:2048KB
280mint[slim][exec]: output: Alloc Alignment: 4KB
281mint[slim][exec]: output: Accessible by all: FALSE
282mint[slim][exec]: output: Pool 3
283mint[slim][exec]: output: Segment: GROUP
284mint[slim][exec]: output: Size: 64(0x40) KB
285mint[slim][exec]: output: Allocatable: FALSE
286mint[slim][exec]: output: Alloc Granule: 0KB
287mint[slim][exec]: output: Alloc Recommended Granule:0KB
288mint[slim][exec]: output: Alloc Alignment: 0KB
289mint[slim][exec]: output: Accessible by all: FALSE
290mint[slim][exec]: output: ISA Info:
291mint[slim][exec]: output: ISA 1
292mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
293mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
294mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
295mint[slim][exec]: output: Default Rounding Mode: NEAR
296mint[slim][exec]: output: Default Rounding Mode: NEAR
297mint[slim][exec]: output: Fast f16: TRUE
298mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
299mint[slim][exec]: output: Workgroup Max Size per Dimension:
300mint[slim][exec]: output: x 1024(0x400)
301mint[slim][exec]: output: y 1024(0x400)
302mint[slim][exec]: output: z 1024(0x400)
303mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
304mint[slim][exec]: output: Grid Max Size per Dimension:
305mint[slim][exec]: output: x 4294967295(0xffffffff)
306mint[slim][exec]: output: y 4294967295(0xffffffff)
307mint[slim][exec]: output: z 4294967295(0xffffffff)
308mint[slim][exec]: output: FBarrier Max Size: 32
309mint[slim][exec]: output: *** Done ***
310mint[slim][exec]: output: ========================================= ROCm System Management Interface =========================================
311mint[slim][exec]: output: =================================================== Concise Info ===================================================
312mint[slim][exec]: output: Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
313mint[slim][exec]: output: (DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
314mint[slim][exec]: output: ====================================================================================================================
315mint[slim][exec]: output: 0 2 0x740f, 22303 45.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
316mint[slim][exec]: output: 1 1 0x740f, 2552 48.0°C 40.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
317mint[slim][exec]: output: ====================================================================================================================
318mint[slim][exec]: output: =============================================== End of ROCm SMI Log ================================================
319mint[slim][exec]: output: TORCH BUILD: 2.7.0.dev20250309+rocm6.3 ecc1272a4b291814d73c785fe3025ef86ffb7f06
320mint[slim][exec]: output: ROCm/HIP Status: True 6.3.42131-fa1d09cbd
321mint[slim][exec]: output: tensor([[ 1.6998, 0.6942],
322mint[slim][exec]: output: [-3.3546, -0.2955]], device='cuda:0')
323mint[slim][exec]: output: INFO 04-28 05:04:06 [__init__.py:239] Automatically detected platform rocm.
324mint[slim][exec]: output: INFO 04-28 05:04:20 [config.py:716] This model supports multiple tasks: {'reward', 'score', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
325mint[slim][exec]: output: INFO 04-28 05:04:20 [arg_utils.py:1691] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
326mint[slim][exec]: output: INFO 04-28 05:04:20 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
327mint[slim][exec]: output: INFO 04-28 05:04:20 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev1+g9420a1f) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
328mint[slim][exec]: output: INFO 04-28 05:04:22 [rocm.py:186] None is not supported in AMD GPUs.
329mint[slim][exec]: output: INFO 04-28 05:04:22 [rocm.py:187] Using ROCmFlashAttention backend.
330mint[slim][exec]: output: INFO 04-28 05:04:22 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
331mint[slim][exec]: output: INFO 04-28 05:04:22 [model_runner.py:1120] Starting to load model facebook/opt-125m...
332mint[slim][exec]: output: INFO 04-28 05:04:23 [weight_utils.py:265] Using model weights format ['*.bin']
333mint[slim][exec]: output: Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
334mint[slim][exec]: output: INFO 04-28 05:04:23 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 0.787238 seconds
335Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
336Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.54it/s]
337Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.54it/s]
338mint[slim][exec]: output: INFO 04-28 05:04:24 [loader.py:458] Loading weights took 0.40 seconds
339mint[slim][exec]: output: INFO 04-28 05:04:24 [model_runner.py:1156] Model loading took 0.2500 GiB and 1.273095 seconds
340mint[slim][exec]: output: INFO 04-28 05:04:26 [worker.py:287] Memory profiling takes 1.67 seconds
341mint[slim][exec]: output: INFO 04-28 05:04:26 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
342mint[slim][exec]: output: INFO 04-28 05:04:26 [worker.py:287] model weights take 0.25GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 56.79GiB.
343mint[slim][exec]: output: INFO 04-28 05:04:26 [executor_base.py:112] # rocm blocks: 103386, # CPU blocks: 7281
344mint[slim][exec]: output: INFO 04-28 05:04:26 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 807.70x
345mint[slim][exec]: output: INFO 04-28 05:04:27 [model_runner.py:1466] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
346Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s]
347Capturing CUDA graph shapes: 3%|| 1/35 [00:00<00:14, 2.32it/s]
348Capturing CUDA graph shapes: 6%|| 2/35 [00:00<00:12, 2.57it/s]
349Capturing CUDA graph shapes: 9%|| 3/35 [00:01<00:11, 2.68it/s]
350Capturing CUDA graph shapes: 11%|█▏ | 4/35 [00:01<00:11, 2.80it/s]
351Capturing CUDA graph shapes: 14%|█▍ | 5/35 [00:01<00:10, 2.85it/s]
352Capturing CUDA graph shapes: 17%|█▋ | 6/35 [00:02<00:10, 2.86it/s]
353Capturing CUDA graph shapes: 20%|██ | 7/35 [00:02<00:09, 2.89it/s]
354Capturing CUDA graph shapes: 23%|██▎ | 8/35 [00:02<00:09, 2.91it/s]
355Capturing CUDA graph shapes: 26%|██▌ | 9/35 [00:03<00:08, 2.94it/s]
356Capturing CUDA graph shapes: 29%|██▊ | 10/35 [00:03<00:08, 2.93it/s]
357Capturing CUDA graph shapes: 31%|███▏ | 11/35 [00:03<00:08, 2.95it/s]
358Capturing CUDA graph shapes: 34%|███▍ | 12/35 [00:04<00:07, 2.96it/s]
359Capturing CUDA graph shapes: 37%|███▋ | 13/35 [00:04<00:07, 2.96it/s]
360Capturing CUDA graph shapes: 40%|████ | 14/35 [00:04<00:07, 2.97it/s]
361Capturing CUDA graph shapes: 43%|████▎ | 15/35 [00:05<00:06, 2.97it/s]
362Capturing CUDA graph shapes: 46%|████▌ | 16/35 [00:05<00:06, 2.96it/s]
363Capturing CUDA graph shapes: 49%|████▊ | 17/35 [00:05<00:06, 2.95it/s]
364Capturing CUDA graph shapes: 51%|█████▏ | 18/35 [00:06<00:05, 2.95it/s]
365Capturing CUDA graph shapes: 54%|█████▍ | 19/35 [00:06<00:05, 2.96it/s]
366Capturing CUDA graph shapes: 57%|█████▋ | 20/35 [00:06<00:05, 2.96it/s]
367Capturing CUDA graph shapes: 60%|██████ | 21/35 [00:07<00:04, 2.95it/s]
368Capturing CUDA graph shapes: 63%|██████▎ | 22/35 [00:07<00:04, 2.94it/s]
369Capturing CUDA graph shapes: 66%|██████▌ | 23/35 [00:07<00:04, 2.93it/s]
370Capturing CUDA graph shapes: 69%|██████▊ | 24/35 [00:08<00:03, 2.94it/s]
371Capturing CUDA graph shapes: 71%|███████▏ | 25/35 [00:08<00:03, 2.95it/s]
372Capturing CUDA graph shapes: 74%|███████▍ | 26/35 [00:08<00:03, 2.96it/s]
373Capturing CUDA graph shapes: 77%|███████▋ | 27/35 [00:09<00:02, 2.95it/s]
374Capturing CUDA graph shapes: 80%|████████ | 28/35 [00:09<00:02, 2.95it/s]
375Capturing CUDA graph shapes: 83%|████████▎ | 29/35 [00:09<00:02, 2.94it/s]
376Capturing CUDA graph shapes: 86%|████████▌ | 30/35 [00:10<00:01, 2.96it/s]
377Capturing CUDA graph shapes: 89%|████████▊ | 31/35 [00:10<00:01, 2.97it/s]
378Capturing CUDA graph shapes: 91%|█████████▏| 32/35 [00:10<00:01, 2.95it/s]
379Capturing CUDA graph shapes: 94%|█████████▍| 33/35 [00:11<00:00, 2.95it/s]
380Capturing CUDA graph shapes: 97%|█████████▋| 34/35 [00:11<00:00, 2.95it/s]
381Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:11<00:00, 2.92it/s]
382mint[slim][exec]: output: INFO 04-28 05:04:39 [model_runner.py:1608] Graph capturing finished in 12 secs, took 0.12 GiB
383mint[slim][exec]: output: INFO 04-28 05:04:39 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 14.87 seconds
384Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
385Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 32.72it/s, est. speed input: 163.88 toks/s, output: 163.81 toks/s]
386mint[slim][exec]: output: As many ordinary people do
387mint[slim][exec]: output: [rank0]:[W428 05:04:40.069509463 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
388cmd=slim info=continue.after mode='exec' exitcode='0'
389cmd=slim state=container.inspection.finishing
390cmd=slim state=container.inspection.artifact.processing
391cmd=slim state=container.inspection.done
392cmd=slim state=building message="building optimized image" engine=internal
393cmd=slim state=completed
394cmd=slim info=results status='MINIFIED' by='2.40X' size.original='62 GB' size.optimized='26 GB'
395cmd=slim info=results has.data='true' image-build-engine='internal' image.name='wmf-debian-vllm:slim' image.size='26 GB' image.id='sha256:c8f3ae351eb21b05b59c4104d383fd03f5303ea228d642527fa4dc1770a1127a' image.digest='sha256:4b378ca62323b6925476edd0dceee5255f845c37c230a842b8e8f16d40ffcedf'
396cmd=slim info=results artifacts.location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/.mint-state/images/5dc2e54d7438f2c3f2f78f5b4b67138bae2d6af90407dda504905a95e6f3b98d/artifacts'
397cmd=slim info=results artifacts.report='creport.json'
398cmd=slim info=results artifacts.dockerfile.reversed='Dockerfile.reversed'
399cmd=slim info=results artifacts.seccomp='wmf-debian-vllm-seccomp.json'
400cmd=slim info=results artifacts.apparmor='wmf-debian-vllm-apparmor-profile'
401cmd=slim state=done
402cmd=slim info=commands message='use the xray command to learn more about the optimize image'
403cmd=slim info=report file='slim.report.json'
404app='mint' message='GitHub Discussions' info='https://github.com/mintoolkit/mint/discussions'
405app='mint' message='Join the CNCF Slack channel to ask questions or to share your feedback' info='https://cloud-native.slack.com/archives/C059QP1RH1S'
406app='mint' message='Join the Discord server to ask questions or to share your feedback' info='https://discord.gg/fAvq4ruKsG'
407kevinbazira@ml-lab1002:~/WMF_vLLM_image/slimtoolkit$

This resulted in a wmf-debian-vllm:slim image that is only ~26GB, a ~2.40x reduction in size from the original wmf-debian-vllm build:

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED          SIZE
wmf-debian-vllm                           slim                                            c8f3ae351eb2   28 minutes ago   25.8GB
wmf-debian-vllm                           latest                                          5dc2e54d7438   3 days ago       61.9GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago     35.9GB
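
For reference, the slimming run above was driven by mint (SlimToolkit). The exact invocation isn't captured in this excerpt; based on the cmd=slim and continue-after exec log entries it would look roughly like the sketch below, with flag names taken from the SlimToolkit docs (treat them as assumptions) and the GPU device/group flags omitted:

# Rough sketch, not the verbatim command: minify the image while the exec
# probe exercises rocminfo, rocm-smi, and a vLLM inference run, so mint
# keeps every file that code path actually touches.
$ mint slim --target wmf-debian-vllm \
    --continue-after exec \
    --exec 'rocminfo && rocm-smi && /app/venv/bin/python -c "..."'
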
6. Testing vLLM in WMF Debian slim container

To verify that functionality was fully retained, I tested the wmf-debian-vllm:slim container with the same inference command I had run in the full-size image, and it succeeded:

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3--device=/dev/kfd --device=/dev/dri \
4--group-add=$(getent group video | cut -d: -f3) \
5--group-add=$(getent group render | cut -d: -f3) \
6--ipc=host \
7--security-opt seccomp=unconfined \
8-v /srv/hf-cache:/home/vllm/.cache/huggingface \
9wmf-debian-vllm:slim /app/venv/bin/python -c "
10from vllm import LLM, SamplingParams; \
11llm = LLM('facebook/opt-125m'); \
12print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
13/app/venv/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
14 warnings.warn(
15INFO 04-28 06:25:52 [__init__.py:239] Automatically detected platform rocm.
16config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 4.21MB/s]
17INFO 04-28 06:26:05 [config.py:716] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
18INFO 04-28 06:26:05 [arg_utils.py:1691] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
19INFO 04-28 06:26:09 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
20INFO 04-28 06:26:09 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev1+g9420a1f) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
21tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 5.83MB/s]
22vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 24.1MB/s]
23merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 43.4MB/s]
24special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 3.43MB/s]
25generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 827kB/s]
26INFO 04-28 06:26:12 [rocm.py:186] None is not supported in AMD GPUs.
27INFO 04-28 06:26:12 [rocm.py:187] Using ROCmFlashAttention backend.
28INFO 04-28 06:26:12 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
29INFO 04-28 06:26:12 [model_runner.py:1120] Starting to load model facebook/opt-125m...
30INFO 04-28 06:26:12 [weight_utils.py:265] Using model weights format ['*.bin']
31Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
32pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:00<00:00, 426MB/s]
33INFO 04-28 06:26:13 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 0.702202 seconds
34Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
35Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.71it/s]
36Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.69it/s]
37
38INFO 04-28 06:26:13 [loader.py:458] Loading weights took 0.18 seconds
39INFO 04-28 06:26:13 [model_runner.py:1156] Model loading took 0.3965 GiB and 1.258954 seconds
40INFO 04-28 06:26:15 [worker.py:287] Memory profiling takes 1.92 seconds
41INFO 04-28 06:26:15 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
42INFO 04-28 06:26:15 [worker.py:287] model weights take 0.40GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 56.44GiB.
43INFO 04-28 06:26:16 [executor_base.py:112] # rocm blocks: 102737, # CPU blocks: 7281
44INFO 04-28 06:26:16 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 802.63x
45INFO 04-28 06:26:16 [model_runner.py:1466] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
46Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00, 3.23it/s]
47INFO 04-28 06:26:27 [model_runner.py:1608] Graph capturing finished in 11 secs, took 0.12 GiB
48INFO 04-28 06:26:27 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 13.96 seconds
49Processed prompts: 100%|████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 31.71it/s, est. speed input: 158.80 toks/s, output: 158.74 toks/s]
50 Also
51[rank0]:[W428 06:26:29.684639841 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

The next step is to test this image with the aya-expanse models.

Testing the aya-expanse-8b model with the wmf-debian-vllm image built in T385173#10771940 returns the following error:

1>>> from vllm import LLM, SamplingParams
2INFO 04-29 00:06:20 [__init__.py:239] Automatically detected platform rocm.
3>>> llm = LLM('CohereForAI/aya-expanse-8b')
4config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 3.24MB/s]
5INFO 04-29 00:06:44 [config.py:716] This model supports multiple tasks: {'reward', 'generate', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
6INFO 04-29 00:06:44 [arg_utils.py:1691] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
7INFO 04-29 00:06:49 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
8INFO 04-29 00:06:49 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev1+g9420a1f) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
9tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 49.2MB/s]
10tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 210MB/s]
11special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.01MB/s]
12generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.27MB/s]
13INFO 04-29 00:06:52 [rocm.py:186] None is not supported in AMD GPUs.
14INFO 04-29 00:06:52 [rocm.py:187] Using ROCmFlashAttention backend.
15INFO 04-29 00:06:52 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
16INFO 04-29 00:06:52 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
17INFO 04-29 00:06:53 [weight_utils.py:265] Using model weights format ['*.safetensors']
18model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:06<00:00, 191MB/s]
19model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:22<00:00, 222MB/s]
20model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:22<00:00, 216MB/s]
21model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:23<00:00, 211MB/s]
22INFO 04-29 00:07:17 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 24.070777 seconds
23model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 88.0MB/s]
24Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
25Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:38<01:55, 38.46s/it]
26Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:40<00:34, 17.10s/it]
27Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:43<00:10, 10.41s/it]
28Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:45<00:00, 7.27s/it]
29Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:45<00:00, 11.38s/it]
30
31INFO 04-29 00:08:03 [loader.py:458] Loading weights took 45.85 seconds
32INFO 04-29 00:08:03 [model_runner.py:1156] Model loading took 15.1387 GiB and 70.528114 seconds
33:0:rocdevice.cpp :3020: 5664519004198d us: Callback: Queue 0x7f11c8200000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
34Aborted

I had faced this hardware exception before (https://phabricator.wikimedia.org/P74816$326), and the solution then was to disable Triton FlashAttention (VLLM_USE_TRITON_FLASH_ATTN=0) so that vLLM would fall back to CK FlashAttention, as shown in T385173#10726495.
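
For quick reference, the fallback can be forced for any of these tests by setting the variable before vLLM starts. A minimal sketch reusing the opt-125m smoke test from above (paths assume the /app/venv layout of this image):

# VLLM_USE_TRITON_FLASH_ATTN=0 tells vLLM to skip the Triton FlashAttention
# kernels and use CK FlashAttention instead; it must be set before the
# engine initializes, i.e. before the Python process starts.
$ VLLM_USE_TRITON_FLASH_ATTN=0 /app/venv/bin/python -c "
from vllm import LLM, SamplingParams; \
llm = LLM('facebook/opt-125m'); \
print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"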

Since I hadn't installed FlashAttention in the initial version of the wmf-debian-vllm image, I added CK FlashAttention to the next iteration of the image as shown below:

1########################################################
2# wmf-debian-vllm: ROCm, PyTorch, FlashAttention, vLLM #
3########################################################
4ARG BASE_IMAGE=docker-registry.wikimedia.org/bookworm:20250413
5FROM ${BASE_IMAGE} AS builder
6
7# — Set proxy env vars required on ml-lab1008 (see: https://phabricator.wikimedia.org/P75284#302759)
8ENV http_proxy=http://208.80.154.74:8080
9ENV https_proxy=http://208.80.154.74:8080
10ENV HTTP_PROXY=$http_proxy
11ENV HTTPS_PROXY=$https_proxy
12
13COPY apt.conf /etc/apt/apt.conf
14
15# — Mirror upstream: pin ROCm packages and create 'render' group
16ARG ROCM_VERSION=6.3.1
17ARG AMDGPU_VERSION=6.3.1
18ARG APT_PREF="Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600"
19RUN groupadd -g 109 render \
20 && printf "$APT_PREF" > /etc/apt/preferences.d/rocm-pin-600
21
22# — Add AMD ROCm & AMDGPU repositories and keys
23RUN mkdir -p /etc/apt/keyrings \
24 && apt-get update -q \
25 && apt-get install -q -y --no-install-recommends wget gnupg ca-certificates apt-transport-https \
26 && wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/rocm.gpg \
27 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/${AMDGPU_VERSION}/ubuntu jammy main" > /etc/apt/sources.list.d/amdgpu.list \
28 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/${ROCM_VERSION} jammy main" > /etc/apt/sources.list.d/rocm.list \
29 && apt-get update -q \
30 # Clean up APT lists and packages used only for adding repos in this layer
31 && apt-get purge --auto-remove -y wget gnupg \
32 && rm -rf /var/lib/apt/lists/*
33
34WORKDIR /app
35RUN mkdir -p /app
36
37# — Install ROCm libs & Python tooling
38RUN apt-get update -q \
39 && apt-get install -q -y \
40 rocm \
41 cmake build-essential \
42 python3 python3-pip python3-dev python3-venv \
43 git curl sudo vim \
44 sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \
45 && rm -rf /var/lib/apt/lists/*
46
47# — Set environment for ROCm and vLLM
48ENV ROCM_PATH=/opt/rocm \
49 VLLM_TARGET_DEVICE=rocm \
50 PYTORCH_ROCM_ARCH=gfx90a \
51 PATH=/opt/rocm/llvm/bin:/opt/rocm/bin:/app/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
52 LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
53
54# — Create a Python virtual environment
55RUN python3 -m venv /app/venv
56ENV PATH="/app/venv/bin:${PATH}"
57
58# — Create a custom temp directory
59RUN mkdir -p /opt/tmp
60ENV TMPDIR=/opt/tmp
61
62# — Install ROCm-enabled PyTorch (into the venv)
63RUN pip install --no-cache-dir --pre torch==2.7.0.dev20250309+rocm6.3 \
64 --index-url https://download.pytorch.org/whl/nightly/rocm6.3
65
66# — Install the AMD SMI Python interface
67RUN pip install --no-cache-dir /opt/rocm/share/amd_smi
68
69# — Install Python build packages required by both FlashAttention and vLLM
70RUN pip install --no-cache-dir setuptools_scm packaging \
71 "cmake<4" ninja wheel setuptools pybind11 Cython
72
73# — Install CK FlashAttention just like upstream (within the venv)
74RUN git clone https://github.com/Dao-AILab/flash-attention.git /app/flash-attn \
75 && cd /app/flash-attn \
76 && git checkout 1a7f4dfa \
77 && git submodule update --init \
78 && GPU_ARCHS=gfx90a python3 setup.py install
79
80# — Clone vLLM, install its Python dependencies, and build it in-place (using the venv)
81ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
82ARG VLLM_BRANCH=main
83RUN git clone --branch ${VLLM_BRANCH} ${VLLM_REPO} /app/vllm
84WORKDIR /app/vllm
85RUN git checkout c53e073 \
86 && git submodule update --init
87RUN pip install --no-cache-dir -r requirements/rocm.txt
88RUN python3 setup.py develop
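
For completeness, producing the image from this Dockerfile is a plain docker build; a sketch assuming the Dockerfile and the referenced apt.conf sit in the current directory:

# Build the wmf-debian-vllm image from the Dockerfile above.
$ docker build -t wmf-debian-vllm .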

Now testing both the aya-expanse 8b and 32b models with the wmf-debian-vllm image succeeded (note: replace the HF_TOKEN in the commands below with your own, as the one shown has since been invalidated):

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM('CohereForAI/aya-expanse-8b'); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14INFO 04-30 15:38:22 [__init__.py:239] Automatically detected platform rocm.
15config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 3.76MB/s]
16INFO 04-30 15:38:36 [config.py:716] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
17INFO 04-30 15:38:42 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
18INFO 04-30 15:38:42 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
19INFO 04-30 15:38:42 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
20tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 24.2MB/s]
21tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 86.6MB/s]
22special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.35MB/s]
23generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.61MB/s]
24INFO 04-30 15:38:44 [rocm.py:186] None is not supported in AMD GPUs.
25INFO 04-30 15:38:44 [rocm.py:187] Using ROCmFlashAttention backend.
26INFO 04-30 15:38:44 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
27INFO 04-30 15:38:44 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
28INFO 04-30 15:38:45 [weight_utils.py:265] Using model weights format ['*.safetensors']
29model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:21<00:00, 232MB/s]
30model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:21<00:00, 236MB/s]
31model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:23<00:00, 210MB/s]
32model-00004-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:31<00:00, 38.9MB/s]
33INFO 04-30 15:39:17 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 31.652447 seconds
34model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 92.9kB/s]
35Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
36Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:02<00:06, 2.01s/it]
37Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.22s/it]
38Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:06<00:02, 2.26s/it]
39Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.71s/it]
40Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.89s/it]
41
42INFO 04-30 15:39:25 [loader.py:458] Loading weights took 7.65 seconds
43INFO 04-30 15:39:25 [model_runner.py:1152] Model loading took 15.1387 GiB and 41.155314 seconds
44INFO 04-30 15:39:53 [worker.py:287] Memory profiling takes 27.08 seconds
45INFO 04-30 15:39:53 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
46INFO 04-30 15:39:53 [worker.py:287] model weights take 15.14GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 39.78GiB.
47INFO 04-30 15:39:53 [executor_base.py:112] # rocm blocks: 20369, # CPU blocks: 2048
48INFO 04-30 15:39:53 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 39.78x
49INFO 04-30 15:39:53 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
50Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:16<00:00, 2.13it/s]
51INFO 04-30 15:40:10 [model_runner.py:1604] Graph capturing finished in 16 secs, took 0.24 GiB
52INFO 04-30 15:40:10 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 44.47 seconds
53Processed prompts: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.43it/s, est. speed input: 47.21 toks/s, output: 47.21 toks/s]
54
55Max Hopp (
56[rank0]:[W430 15:40:11.325586293 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
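
The aya-expanse-32b model is a much tighter fit on a single MI210, so the command below raises gpu_memory_utilization to 1 and caps max_model_len at 5296 to leave room for the KV cache: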

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14INFO 04-30 15:41:54 [__init__.py:239] Automatically detected platform rocm.
15config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 637/637 [00:00<00:00, 4.05MB/s]
16INFO 04-30 15:42:09 [config.py:716] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
17INFO 04-30 15:42:15 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
18INFO 04-30 15:42:15 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
19INFO 04-30 15:42:15 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
20tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 39.2MB/s]
21tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 243MB/s]
22special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.11MB/s]
23generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.30MB/s]
24INFO 04-30 15:42:18 [rocm.py:186] None is not supported in AMD GPUs.
25INFO 04-30 15:42:18 [rocm.py:187] Using ROCmFlashAttention backend.
26INFO 04-30 15:42:18 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
27INFO 04-30 15:42:18 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-32b...
28INFO 04-30 15:42:19 [weight_utils.py:265] Using model weights format ['*.safetensors']
29model-00007-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:25<00:00, 196MB/s]
30model-00004-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:25<00:00, 186MB/s]
31model-00003-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:30<00:00, 164MB/s]
32model-00002-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:32<00:00, 153MB/s]
33model-00009-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:27<00:00, 178MB/s]
34model-00011-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:23<00:00, 207MB/s]
35model-00010-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:28<00:00, 176MB/s]
36model-00014-of-00014.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 805M/805M [00:02<00:00, 271MB/s]
37model-00012-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:25<00:00, 191MB/s]
38model-00013-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:21<00:00, 233MB/s]
39model-00006-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:16<00:00, 64.2MB/s]
40model-00005-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:18<00:00, 63.0MB/s]
41model-00001-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.90G/4.90G [03:59<00:00, 20.4MB/s]
42model-00008-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [06:38<00:00, 12.1MB/s]
43INFO 04-30 15:48:57 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-32b: 398.278957 seconds
44model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 29.3MB/s]
45Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
46Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:07<01:37, 7.51s/it]
47Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:09<00:53, 4.43s/it]
48Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:12<00:38, 3.46s/it]
49Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:14<00:30, 3.01s/it]
50Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:16<00:24, 2.76s/it]
51Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:19<00:20, 2.61s/it]
52Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:21<00:17, 2.53s/it]
53Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:22<00:11, 1.94s/it]
54Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:24<00:09, 1.96s/it]
55Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:26<00:08, 2.06s/it]
56Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:28<00:06, 2.15s/it]
57Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:31<00:04, 2.21s/it]
58Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:33<00:02, 2.23s/it]
59Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:35<00:00, 2.24s/it]
60Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:35<00:00, 2.54s/it]
61
62INFO 04-30 15:49:33 [loader.py:458] Loading weights took 35.95 seconds
63INFO 04-30 15:49:33 [model_runner.py:1152] Model loading took 60.3418 GiB and 435.190638 seconds
64INFO 04-30 15:49:38 [worker.py:287] Memory profiling takes 4.79 seconds
65INFO 04-30 15:49:38 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
66INFO 04-30 15:49:38 [worker.py:287] model weights take 60.34GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.96GiB.
67INFO 04-30 15:49:39 [executor_base.py:112] # rocm blocks: 393, # CPU blocks: 1638
68INFO 04-30 15:49:39 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.19x
69INFO 04-30 15:49:39 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
70Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:22<00:00, 1.56it/s]
71INFO 04-30 15:50:02 [model_runner.py:1604] Graph capturing finished in 22 secs, took 0.24 GiB
72INFO 04-30 15:50:02 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 28.29 seconds
73Processed prompts: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.12it/s, est. speed input: 10.62 toks/s, output: 10.62 toks/s]
74 One of the most important
75[rank0]:[W430 15:50:03.345211347 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

When we added FlashAttention to the wmf-debian-vllm image in T385173#10780983, the image size grew to ~58GB:

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED          SIZE
wmf-debian-vllm                          fa                                              b0de8d1342bf   50 minutes ago   58.2GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago     35.9GB
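
Before slimming, docker history gives a quick per-layer view of where the ~58GB comes from. A minimal sketch using the stock docker CLI (the template fields are standard docker; nothing here is specific to this build):

$ docker history --format "{{.Size}}\t{{.CreatedBy}}" wmf-debian-vllm:fa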

Using the same image slimming steps we used in T385173#10771940, we:

1. identified the source of the large image size, which turned out to be the usual culprits (Python and ROCm dependencies):

root@ml-lab1002:/app# du -sh *
827M	flash-attn
24G	venv
455M	vllm
root@ml-lab1002:/app# du -sh /opt/rocm-6.3.1/*
0	/opt/rocm-6.3.1/amdgcn
264M	/opt/rocm-6.3.1/bin
122M	/opt/rocm-6.3.1/include
27G	/opt/rocm-6.3.1/lib
15M	/opt/rocm-6.3.1/libexec
0	/opt/rocm-6.3.1/llvm
632M	/opt/rocm-6.3.1/share
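
To drill into which individual items dominate these two directories, a size-sorted listing helps. A minimal sketch with standard coreutils (paths taken from the output above):

$ du -sh /opt/rocm-6.3.1/lib/* | sort -rh | head -n 20
$ du -sh /app/venv/lib/python3.11/site-packages/* | sort -rh | head -n 20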

2. identified essential runtime dependencies using the script below:

1#!/bin/bash
2set -euo pipefail # Exit on error, unset var, pipe fail
3
4# --- Configuration ---
5TIMEOUT_SECONDS=180
6# Define the command to run. Using $'' syntax for easier quote handling.
7# NOTE: Commented out 'hipcc --version' as hipcc is a compiler and likely
8# not needed for runtime inference. Include if your specific runtime
9# process actually invokes hipcc.
10COMMAND_TO_RUN=$(cat <<'EOF'
11rocminfo && \
12rocm-smi && \
13# hipcc --version && \
14/app/venv/bin/python -c "
15import sys
16print(f'--- Python Info ---', file=sys.stderr) # Debug output to stderr
17print(f'Python Exec: {sys.executable}', file=sys.stderr)
18print(f'Sys Path: {sys.path}', file=sys.stderr)
19import torch;
20print('TORCH BUILD:', torch.__version__, torch.version.git_version);
21print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip);
22x = torch.randn(2, 2, device='cuda'); y = x @ x; print('Matrix mul result:', y);
23from vllm import LLM, SamplingParams;
24print('Imported vLLM OK', file=sys.stderr)
25# Make sure model cache exists or is writable if needed!
26llm = LLM('CohereForAI/aya-expanse-8b');
27print('LLM Loaded OK', file=sys.stderr)
28result = llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text;
29print('Generated Text:', result)
30print('--- Python Script End ---', file=sys.stderr)"
31EOF
32)
33
34# --- Script Logic ---
35if [[ $# -ne 1 ]]; then
36 echo "Usage: $0 <directory_to_scan>"
37 echo " Scans the specified directory, renaming items to <item>.disabled"
38 echo " if they are not required for the test command to succeed."
39 echo
40 echo "Example: $0 /opt/rocm-6.3.1/lib"
41 echo "Example: $0 /app/venv/lib/python3.11/site-packages"
42 exit 1
43fi
44
45BASE_DIR="$1"
46
47if [[ ! -d "$BASE_DIR" ]]; then
48 echo "Error: Base directory '$BASE_DIR' not found."
49 exit 1
50fi
51
52echo "Scanning directory: $BASE_DIR"
53# Navigate to the directory to simplify mv commands
54cd "$BASE_DIR"
55
56# Use find to handle potential special characters in filenames safely
57# -maxdepth 1: only look in the current directory, not subdirs
58# -mindepth 1: don't process '.' itself
59# -print0 / read -d $'\0': null-delimit filenames for safety
60find . -maxdepth 1 -mindepth 1 -print0 | while IFS= read -r -d $'\0' item_path; do
61 # item_path will be like './libfoo.so' or './torch'
62 item=$(basename "$item_path") # Get name like 'libfoo.so' or 'torch'
63
64 # Skip if it already ends with .disabled or isn't a file/dir we can move
65 if [[ "$item" == *.disabled ]] || [[ ! -e "$item" ]]; then
66 # echo "Skipping '$item' (already disabled or invalid type)"
67 continue
68 fi
69
70 disabled="${item}.disabled"
71
72 # Safety check: skip if the .disabled version somehow already exists
73 if [[ -e "$disabled" ]]; then
74 echo "Warning: Target '$disabled' already exists. Skipping '$item'."
75 continue
76 fi
77
78 echo "=== Testing item: $item ==="
79
80 # Disable the item (file or directory)
81 echo "Disabling '$item' -> '$disabled'"
82 mv "$item" "$disabled"
83
84 echo "Running test command (timeout ${TIMEOUT_SECONDS}s)..."
85 command_output=""
86 exit_code=0
87
88 # Run the command with timeout, capture output and exit code.
89 # Use 'bash -c' to execute the complex command string correctly.
90 # Redirect stderr to stdout (2>&1) to capture all output.
91 # Use '|| exit_code=$?' to capture the exit code even if timeout itself fails (though less likely).
92 command_output=$( timeout "$TIMEOUT_SECONDS" bash -c "$COMMAND_TO_RUN" 2>&1 ) || exit_code=$?
93
94 echo "--- Command Output Start ---"
95 # Only print output if it's not empty
96 if [[ -n "$command_output" ]]; then
97 echo "$command_output"
98 else
99 echo "(No command output)"
100 fi
101 echo "--- Command Output End ---"
102 echo "Exit code: $exit_code"
103
104
105 # --- Decision Logic ---
106 # The primary check is the exit code of the command.
107 # A non-zero exit code indicates failure.
108 # Exit code 124 specifically means the 'timeout' command killed the process,
109 # which we also treat as a failure caused by the missing component.
110 if [[ $exit_code -ne 0 ]]; then
111 echo "❌ Test failed (Exit Code: $exit_code). Restoring '$item'..."
112 # Ensure the disabled item still exists before trying to move it back
113 if [[ -e "$disabled" ]]; then
114 mv "$disabled" "$item"
115 echo "Restored '$item'."
116 else
117 # This shouldn't happen unless something external interfered
118 echo "Error: '$disabled' not found. Cannot restore '$item'. Manual check needed."
119 # Consider exiting here if this is critical: exit 1
120 fi
121 else
122 echo "✅ Test succeeded (Exit Code: 0). Leaving '$item' disabled as '$disabled'."
123 # Item remains named $disabled
124 fi
125 echo "============================"
126 echo # Add a blank line for readability
127done
128
129echo "Script finished scanning $BASE_DIR."
130echo "Items ending in '.disabled' are potentially unnecessary."

and prioritized the two largest directories:

$ export HF_TOKEN=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 # remember to replace this token with yours as I have invalidated this one
$ ./test_packages.sh /opt/rocm-6.3.1/lib
$ ./test_packages.sh /app/venv/lib/python3.11/site-packages
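
Because the smoke test only exercises one code path, an item the scan leaves disabled can later turn out to be necessary, so it is useful to be able to undo a pass in bulk. A minimal sketch assuming the script's .disabled naming convention:

$ find /opt/rocm-6.3.1/lib -maxdepth 1 -name '*.disabled' -print0 |
    while IFS= read -r -d $'\0' f; do
      mv "$f" "${f%.disabled}"   # strip the .disabled suffix to restore the item
    done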

3. added the list of essential packages and paths to an includes.txt file:

1/opt/amdgpu
2/opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
3/opt/rocm-6.3.1/lib/libhsa-runtime64.so.1.14.60301
4/opt/rocm-6.3.1/lib/librocprofiler-register.so.0
5/opt/rocm-6.3.1/lib/librocprofiler-register.so.0.4.0
6/opt/rocm-6.3.1/lib/librocprofiler64v2.so
7/opt/tmp
8/usr
9/lib
10/lib64
11/etc
12/dev
13/bin
14/app/venv/lib/python3.11/site-packages/PIL
15/app/venv/lib/python3.11/site-packages/PyYAML-6.0.2.dist-info
16/app/venv/lib/python3.11/site-packages/__pycache__
17/app/venv/lib/python3.11/site-packages/aiohappyeyeballs
18/app/venv/lib/python3.11/site-packages/aiohttp
19/app/venv/lib/python3.11/site-packages/aiosignal
20/app/venv/lib/python3.11/site-packages/amdsmi
21/app/venv/lib/python3.11/site-packages/annotated_types
22/app/venv/lib/python3.11/site-packages/anyio
23/app/venv/lib/python3.11/site-packages/attr
24/app/venv/lib/python3.11/site-packages/blake3
25/app/venv/lib/python3.11/site-packages/cachetools
26/app/venv/lib/python3.11/site-packages/certifi
27/app/venv/lib/python3.11/site-packages/cloudpickle
28/app/venv/lib/python3.11/site-packages/cpuinfo
29/app/venv/lib/python3.11/site-packages/distro
30/app/venv/lib/python3.11/site-packages/easy-install.pth
31/app/venv/lib/python3.11/site-packages/fastapi
32/app/venv/lib/python3.11/site-packages/filelock
33/app/venv/lib/python3.11/site-packages/filelock-3.16.1.dist-info
34/app/venv/lib/python3.11/site-packages/frozenlist
35/app/venv/lib/python3.11/site-packages/fsspec
36/app/venv/lib/python3.11/site-packages/functorch
37/app/venv/lib/python3.11/site-packages/gguf
38/app/venv/lib/python3.11/site-packages/httpx
39/app/venv/lib/python3.11/site-packages/huggingface_hub
40/app/venv/lib/python3.11/site-packages/huggingface_hub-0.30.2.dist-info
41/app/venv/lib/python3.11/site-packages/idna
42/app/venv/lib/python3.11/site-packages/jinja2
43/app/venv/lib/python3.11/site-packages/jiter
44/app/venv/lib/python3.11/site-packages/markupsafe
45/app/venv/lib/python3.11/site-packages/mpmath
46/app/venv/lib/python3.11/site-packages/msgspec
47/app/venv/lib/python3.11/site-packages/multidict
48/app/venv/lib/python3.11/site-packages/networkx
49/app/venv/lib/python3.11/site-packages/numpy
50/app/venv/lib/python3.11/site-packages/numpy-2.2.5.dist-info
51/app/venv/lib/python3.11/site-packages/numpy.libs
52/app/venv/lib/python3.11/site-packages/openai
53/app/venv/lib/python3.11/site-packages/packaging
54/app/venv/lib/python3.11/site-packages/packaging-25.0.dist-info
55/app/venv/lib/python3.11/site-packages/pillow.libs
56/app/venv/lib/python3.11/site-packages/propcache
57/app/venv/lib/python3.11/site-packages/psutil
58/app/venv/lib/python3.11/site-packages/pydantic
59/app/venv/lib/python3.11/site-packages/pydantic_core
60/app/venv/lib/python3.11/site-packages/pyzmq.libs
61/app/venv/lib/python3.11/site-packages/regex-2024.11.6.dist-info
62/app/venv/lib/python3.11/site-packages/requests
63/app/venv/lib/python3.11/site-packages/requests-2.32.3.dist-info
64/app/venv/lib/python3.11/site-packages/safetensors
65/app/venv/lib/python3.11/site-packages/safetensors-0.5.3.dist-info
66/app/venv/lib/python3.11/site-packages/sentencepiece
67/app/venv/lib/python3.11/site-packages/sniffio
68/app/venv/lib/python3.11/site-packages/starlette
69/app/venv/lib/python3.11/site-packages/sympy
70/app/venv/lib/python3.11/site-packages/tokenizers
71/app/venv/lib/python3.11/site-packages/tokenizers-0.21.1.dist-info
72/app/venv/lib/python3.11/site-packages/torch
73/app/venv/lib/python3.11/site-packages/torch-2.7.0.dev20250309+rocm6.3.dist-info
74/app/venv/lib/python3.11/site-packages/torchgen
75/app/venv/lib/python3.11/site-packages/tqdm
76/app/venv/lib/python3.11/site-packages/tqdm-4.67.1.dist-info
77/app/venv/lib/python3.11/site-packages/transformers
78/app/venv/lib/python3.11/site-packages/triton
79/app/venv/lib/python3.11/site-packages/typing_extensions.py
80/app/venv/lib/python3.11/site-packages/typing_inspection
81/app/venv/lib/python3.11/site-packages/urllib3
82/app/venv/lib/python3.11/site-packages/yaml
83/app/venv/lib/python3.11/site-packages/yarl
84/app/venv/lib/python3.11/site-packages/zmq
85/app/flash-attn
86/app/vllm
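
The site-packages entries in this list do not have to be typed by hand: once the scan finishes, everything it left enabled is by definition required. A minimal sketch that appends those survivors to the file (assuming the scan above has already been run against site-packages):

$ find /app/venv/lib/python3.11/site-packages -maxdepth 1 -mindepth 1 \
    ! -name '*.disabled' | sort >> includes.txt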

4. used docker-slim with the above includes.txt as shown below (replace the HF_TOKEN value with your own; the one shown has been invalidated):

1$ slim build --network host \
2--target wmf-debian-vllm:fa \
3--tag wmf-debian-vllm:fa-slim \
4--http-probe=false \
5--continue-after=exec \
6--env VLLM_USE_TRITON_FLASH_ATTN=0 \
7--env HF_TOKEN=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 \
8--exec="rocminfo && \
9rocm-smi && \
10/app/venv/bin/python -c \"
11import torch; \
12print('TORCH BUILD:', torch.__version__, torch.version.git_version); \
13print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); \
14x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); \
15from vllm import LLM, SamplingParams; \
16aya_llm = LLM('CohereForAI/aya-expanse-8b'); \
17print(aya_llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)\"" \
18--include-shell=true \
19--include-path-file=includes.txt
20cmd=slim info=params include.path='/dev' message='ignoring'
21cmd=slim state=started
22cmd=slim info=cmd.input.params target.image='wmf-debian-vllm:fa' continue.mode='exec' rt.as.user='true' keep.perms='true' tags='wmf-debian-vllm:fa-slim' image-build-engine='internal' target.type='image'
24cmd=slim state=image.inspection.start
25cmd=slim info=image id='sha256:b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f' size.bytes='58161299483' size.human='58 GB'
26cmd=slim info=image.stack index='0' name='docker-registry.wikimedia.org/bookworm:20250413' id='sha256:76769c10bf7aa98746670cbbb9747f0940c8af78491ef6ab1e44df0761e88586'
28cmd=slim info=image.stack index='1' name='wmf-debian-vllm:fa' id='sha256:b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f'
29cmd=slim state=image.inspection.done
30cmd=slim state=container.inspection.start
31cmd=slim info=sensor location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/mint-sensor' filemode='-rwxr-xr-x' version='linux/amd64|ALP|x.1.42.2|29e62e7836de7b1004607c51c502537ffe1969f0|2025-01-16_07:48:54AM|x' volume='mint-sensor.x.1.42.2'
33cmd=slim info=container status='created' name='mintk_865339_20250502192350' id='252c41995ae280e150d98e10624f32ddf3cf8330bef986643faf6401277f3776'
34cmd=slim info=container status='running' name='mintk_865339_20250502192350' id='252c41995ae280e150d98e10624f32ddf3cf8330bef986643faf6401277f3776'
35cmd=slim info=container ip='127.0.0.1' message='obtained IP address'
36cmd=slim info=cmd.startmonitor status='sent'
37cmd=slim info=event.startmonitor.done status='received'
38cmd=slim info=container name='mintk_865339_20250502192350' id='252c41995ae280e150d98e10624f32ddf3cf8330bef986643faf6401277f3776' target.port.list='' target.port.info='' message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER'
40cmd=slim info=continue.after mode='exec' message='provide the expected input to allow the container inspector to continue its execution'
41cmd=slim info=continue.after mode='exec' shell='rocminfo && rocm-smi && /app/venv/bin/python -c "
42import torch; print('TORCH BUILD:', torch.__version__, torch.version.git_version); print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); from vllm import LLM, SamplingParams; aya_llm = LLM('CohereForAI/aya-expanse-8b'); print(aya_llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"'
44mint[slim][exec]: output: ROCk module is loaded
45mint[slim][exec]: output: =====================
46mint[slim][exec]: output: HSA System Attributes
47mint[slim][exec]: output: =====================
48mint[slim][exec]: output: Runtime Version: 1.14
49mint[slim][exec]: output: Runtime Ext Version: 1.6
50mint[slim][exec]: output: System Timestamp Freq.: 1000.000000MHz
51mint[slim][exec]: output: Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
52mint[slim][exec]: output: Machine Model: LARGE
53mint[slim][exec]: output: System Endianness: LITTLE
54mint[slim][exec]: output: Mwaitx: DISABLED
55mint[slim][exec]: output: DMAbuf Support: NO
56mint[slim][exec]: output: ==========
57mint[slim][exec]: output: HSA Agents
58mint[slim][exec]: output: ==========
59mint[slim][exec]: output: *******
60mint[slim][exec]: output: Agent 1
61mint[slim][exec]: output: *******
62mint[slim][exec]: output: Name: AMD EPYC 7643P 48-Core Processor
63mint[slim][exec]: output: Uuid: CPU-XX
64mint[slim][exec]: output: Marketing Name: AMD EPYC 7643P 48-Core Processor
65mint[slim][exec]: output: Vendor Name: CPU
66mint[slim][exec]: output: Feature: None specified
67mint[slim][exec]: output: Profile: FULL_PROFILE
68mint[slim][exec]: output: Float Round Mode: NEAR
69mint[slim][exec]: output: Max Queue Number: 0(0x0)
70mint[slim][exec]: output: Queue Min Size: 0(0x0)
71mint[slim][exec]: output: Queue Max Size: 0(0x0)
72mint[slim][exec]: output: Queue Type: MULTI
73mint[slim][exec]: output: Node: 0
74mint[slim][exec]: output: Device Type: CPU
75mint[slim][exec]: output: Cache Info:
76mint[slim][exec]: output: L1: 32768(0x8000) KB
77mint[slim][exec]: output: Chip ID: 0(0x0)
78mint[slim][exec]: output: ASIC Revision: 0(0x0)
79mint[slim][exec]: output: Cacheline Size: 64(0x40)
80mint[slim][exec]: output: Max Clock Freq. (MHz): 2300
81mint[slim][exec]: output: BDFID: 0
82mint[slim][exec]: output: Internal Node ID: 0
83mint[slim][exec]: output: Compute Unit: 96
84mint[slim][exec]: output: SIMDs per CU: 0
85mint[slim][exec]: output: Shader Engines: 0
86mint[slim][exec]: output: Shader Arrs. per Eng.: 0
87mint[slim][exec]: output: WatchPts on Addr. Ranges:1
88mint[slim][exec]: output: Memory Properties:
89mint[slim][exec]: output: Features: None
90mint[slim][exec]: output: Pool Info:
91mint[slim][exec]: output: Pool 1
92mint[slim][exec]: output: Segment: GLOBAL; FLAGS: FINE GRAINED
93mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
94mint[slim][exec]: output: Allocatable: TRUE
95mint[slim][exec]: output: Alloc Granule: 4KB
96mint[slim][exec]: output: Alloc Recommended Granule:4KB
97mint[slim][exec]: output: Alloc Alignment: 4KB
98mint[slim][exec]: output: Accessible by all: TRUE
99mint[slim][exec]: output: Pool 2
100mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
101mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
102mint[slim][exec]: output: Allocatable: TRUE
103mint[slim][exec]: output: Alloc Granule: 4KB
104mint[slim][exec]: output: Alloc Recommended Granule:4KB
105mint[slim][exec]: output: Alloc Alignment: 4KB
106mint[slim][exec]: output: Accessible by all: TRUE
107mint[slim][exec]: output: Pool 3
108mint[slim][exec]: output: Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
109mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
110mint[slim][exec]: output: Allocatable: TRUE
111mint[slim][exec]: output: Alloc Granule: 4KB
112mint[slim][exec]: output: Alloc Recommended Granule:4KB
113mint[slim][exec]: output: Alloc Alignment: 4KB
114mint[slim][exec]: output: Accessible by all: TRUE
115mint[slim][exec]: output: Pool 4
116mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
117mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
118mint[slim][exec]: output: Allocatable: TRUE
119mint[slim][exec]: output: Alloc Granule: 4KB
120mint[slim][exec]: output: Alloc Recommended Granule:4KB
121mint[slim][exec]: output: Alloc Alignment: 4KB
122mint[slim][exec]: output: Accessible by all: TRUE
123mint[slim][exec]: output: ISA Info:
124mint[slim][exec]: output: *******
125mint[slim][exec]: output: Agent 2
126mint[slim][exec]: output: *******
127mint[slim][exec]: output: Name: gfx90a
128mint[slim][exec]: output: Uuid: GPU-a2c14903fb923a3f
129mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
130mint[slim][exec]: output: Vendor Name: AMD
131mint[slim][exec]: output: Feature: KERNEL_DISPATCH
132mint[slim][exec]: output: Profile: BASE_PROFILE
133mint[slim][exec]: output: Float Round Mode: NEAR
134mint[slim][exec]: output: Max Queue Number: 128(0x80)
135mint[slim][exec]: output: Queue Min Size: 64(0x40)
136mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
137mint[slim][exec]: output: Queue Type: MULTI
138mint[slim][exec]: output: Node: 1
139mint[slim][exec]: output: Device Type: GPU
140mint[slim][exec]: output: Cache Info:
141mint[slim][exec]: output: L1: 16(0x10) KB
142mint[slim][exec]: output: L2: 8192(0x2000) KB
143mint[slim][exec]: output: Chip ID: 29711(0x740f)
144mint[slim][exec]: output: ASIC Revision: 1(0x1)
145mint[slim][exec]: output: Cacheline Size: 64(0x40)
146mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
147mint[slim][exec]: output: BDFID: 49920
148mint[slim][exec]: output: Internal Node ID: 1
149mint[slim][exec]: output: Compute Unit: 104
150mint[slim][exec]: output: SIMDs per CU: 4
151mint[slim][exec]: output: Shader Engines: 8
152mint[slim][exec]: output: Shader Arrs. per Eng.: 1
153mint[slim][exec]: output: WatchPts on Addr. Ranges:4
154mint[slim][exec]: output: Coherent Host Access: FALSE
155mint[slim][exec]: output: Memory Properties:
156mint[slim][exec]: output: Features: KERNEL_DISPATCH
157mint[slim][exec]: output: Fast F16 Operation: TRUE
158mint[slim][exec]: output: Wavefront Size: 64(0x40)
159mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
160mint[slim][exec]: output: Workgroup Max Size per Dimension:
161mint[slim][exec]: output: x 1024(0x400)
162mint[slim][exec]: output: y 1024(0x400)
163mint[slim][exec]: output: z 1024(0x400)
164mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
165mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
166mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
167mint[slim][exec]: output: Grid Max Size per Dimension:
168mint[slim][exec]: output: x 4294967295(0xffffffff)
169mint[slim][exec]: output: y 4294967295(0xffffffff)
170mint[slim][exec]: output: z 4294967295(0xffffffff)
171mint[slim][exec]: output: Max fbarriers/Workgrp: 32
172mint[slim][exec]: output: Packet Processor uCode:: 71
173mint[slim][exec]: output: SDMA engine uCode:: 8
174mint[slim][exec]: output: IOMMU Support:: None
175mint[slim][exec]: output: Pool Info:
176mint[slim][exec]: output: Pool 1
177mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
178mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
179mint[slim][exec]: output: Allocatable: TRUE
180mint[slim][exec]: output: Alloc Granule: 4KB
181mint[slim][exec]: output: Alloc Recommended Granule:2048KB
182mint[slim][exec]: output: Alloc Alignment: 4KB
183mint[slim][exec]: output: Accessible by all: FALSE
184mint[slim][exec]: output: Pool 2
185mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
186mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
187mint[slim][exec]: output: Allocatable: TRUE
188mint[slim][exec]: output: Alloc Granule: 4KB
189mint[slim][exec]: output: Alloc Recommended Granule:2048KB
190mint[slim][exec]: output:
191mint[slim][exec]: output: Alloc Alignment: 4KB
192mint[slim][exec]: output: Accessible by all: FALSE
193mint[slim][exec]: output: Pool 3
194mint[slim][exec]: output: Segment: GROUP
195mint[slim][exec]: output: Size: 64(0x40) KB
196mint[slim][exec]: output: Allocatable: FALSE
197mint[slim][exec]: output: Alloc Granule: 0KB
198mint[slim][exec]: output: Alloc Recommended Granule:0KB
199mint[slim][exec]: output: Alloc Alignment: 0KB
200mint[slim][exec]: output: Accessible by all: FALSE
201mint[slim][exec]: output: ISA Info:
202mint[slim][exec]: output: ISA 1
203mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
204mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
205mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
206mint[slim][exec]: output: Default Rounding Mode: NEAR
207mint[slim][exec]: output: Default Rounding Mode: NEAR
208mint[slim][exec]: output: Fast f16: TRUE
209mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
210mint[slim][exec]: output: Workgroup Max Size per Dimension:
211mint[slim][exec]: output: x 1024(0x400)
212mint[slim][exec]: output: y 1024(0x400)
213mint[slim][exec]: output: z 1024(0x400)
214mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
215mint[slim][exec]: output: Grid Max Size per Dimension:
216mint[slim][exec]: output: x 4294967295(0xffffffff)
217mint[slim][exec]: output: y 4294967295(0xffffffff)
218mint[slim][exec]: output: z 4294967295(0xffffffff)
219mint[slim][exec]: output: FBarrier Max Size: 32
220mint[slim][exec]: output: *******
221mint[slim][exec]: output: Agent 3
222mint[slim][exec]: output: *******
223mint[slim][exec]: output: Name: gfx90a
224mint[slim][exec]: output: Uuid: GPU-5b81d02ab699960e
225mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
226mint[slim][exec]: output: Vendor Name: AMD
227mint[slim][exec]: output: Feature: KERNEL_DISPATCH
228mint[slim][exec]: output: Profile: BASE_PROFILE
229mint[slim][exec]: output: Float Round Mode: NEAR
230mint[slim][exec]: output: Max Queue Number: 128(0x80)
231mint[slim][exec]: output: Queue Min Size: 64(0x40)
232mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
233mint[slim][exec]: output: Queue Type: MULTI
234mint[slim][exec]: output: Node: 2
235mint[slim][exec]: output: Device Type: GPU
236mint[slim][exec]: output: Cache Info:
237mint[slim][exec]: output: L1: 16(0x10) KB
238mint[slim][exec]: output: L2: 8192(0x2000) KB
239mint[slim][exec]: output: Chip ID: 29711(0x740f)
240mint[slim][exec]: output: ASIC Revision: 1(0x1)
241mint[slim][exec]: output: Cacheline Size: 64(0x40)
242mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
243mint[slim][exec]: output: BDFID: 768
244mint[slim][exec]: output: Internal Node ID: 2
245mint[slim][exec]: output: Compute Unit: 104
246mint[slim][exec]: output: SIMDs per CU: 4
247mint[slim][exec]: output: Shader Engines: 8
248mint[slim][exec]: output: Shader Arrs. per Eng.: 1
249mint[slim][exec]: output: WatchPts on Addr. Ranges:4
250mint[slim][exec]: output: Coherent Host Access: FALSE
251mint[slim][exec]: output: Memory Properties:
252mint[slim][exec]: output: Features: KERNEL_DISPATCH
253mint[slim][exec]: output: Fast F16 Operation: TRUE
254mint[slim][exec]: output: Wavefront Size: 64(0x40)
255mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
256mint[slim][exec]: output: Workgroup Max Size per Dimension:
257mint[slim][exec]: output: x 1024(0x400)
258mint[slim][exec]: output: y 1024(0x400)
259mint[slim][exec]: output: z 1024(0x400)
261mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
262mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
263mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
264mint[slim][exec]: output: Grid Max Size per Dimension:
265mint[slim][exec]: output: x 4294967295(0xffffffff)
266mint[slim][exec]: output: y 4294967295(0xffffffff)
267mint[slim][exec]: output: z 4294967295(0xffffffff)
268mint[slim][exec]: output: Max fbarriers/Workgrp: 32
269mint[slim][exec]: output: Packet Processor uCode:: 71
270mint[slim][exec]: output: SDMA engine uCode:: 8
271mint[slim][exec]: output: IOMMU Support:: None
272mint[slim][exec]: output: Pool Info:
273mint[slim][exec]: output: Pool 1
274mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
275mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
276mint[slim][exec]: output: Allocatable: TRUE
277mint[slim][exec]: output: Alloc Granule: 4KB
278mint[slim][exec]: output: Alloc Recommended Granule:2048KB
279mint[slim][exec]: output: Alloc Alignment: 4KB
280mint[slim][exec]: output: Accessible by all: FALSE
281mint[slim][exec]: output: Pool 2
282mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
283mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
284mint[slim][exec]: output: Allocatable: TRUE
285mint[slim][exec]: output: Alloc Granule: 4KB
286mint[slim][exec]: output: Alloc Recommended Granule:2048KB
287mint[slim][exec]: output: Alloc Alignment: 4KB
288mint[slim][exec]: output: Accessible by all: FALSE
289mint[slim][exec]: output: Pool 3
290mint[slim][exec]: output: Segment: GROUP
291mint[slim][exec]: output: Size: 64(0x40) KB
292mint[slim][exec]: output: Allocatable: FALSE
293mint[slim][exec]: output: Alloc Granule: 0KB
294mint[slim][exec]: output: Alloc Recommended Granule:0KB
295mint[slim][exec]: output: Alloc Alignment: 0KB
296mint[slim][exec]: output: Accessible by all: FALSE
297mint[slim][exec]: output: ISA Info:
298mint[slim][exec]: output: ISA 1
299mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
300mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
301mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
302mint[slim][exec]: output: Default Rounding Mode: NEAR
303mint[slim][exec]: output: Default Rounding Mode: NEAR
304mint[slim][exec]: output: Fast f16: TRUE
305mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
306mint[slim][exec]: output: Workgroup Max Size per Dimension:
307mint[slim][exec]: output: x 1024(0x400)
308mint[slim][exec]: output: y 1024(0x400)
309mint[slim][exec]: output: z 1024(0x400)
310mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
311mint[slim][exec]: output: Grid Max Size per Dimension:
312mint[slim][exec]: output: x 4294967295(0xffffffff)
313mint[slim][exec]: output: y 4294967295(0xffffffff)
314mint[slim][exec]: output: z 4294967295(0xffffffff)
315mint[slim][exec]: output: FBarrier Max Size: 32
316mint[slim][exec]: output: *** Done ***
317mint[slim][exec]: output: ========================================= ROCm System Management Interface =========================================
318mint[slim][exec]: output: =================================================== Concise Info ===================================================
319mint[slim][exec]: output: Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
320mint[slim][exec]: output: (DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
321mint[slim][exec]: output: ====================================================================================================================
322mint[slim][exec]: output: 0 2 0x740f, 22303 45.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
323mint[slim][exec]: output: 1 1 0x740f, 2552 48.0°C 40.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
324mint[slim][exec]: output: ====================================================================================================================
325mint[slim][exec]: output: =============================================== End of ROCm SMI Log ================================================
326mint[slim][exec]: output: TORCH BUILD: 2.7.0.dev20250309+rocm6.3 ecc1272a4b291814d73c785fe3025ef86ffb7f06
327mint[slim][exec]: output: ROCm/HIP Status: True 6.3.42131-fa1d09cbd
328mint[slim][exec]: output: tensor([[-3.7145, 1.0186],
329mint[slim][exec]: output: [-0.6336, -4.3516]], device='cuda:0')
330mint[slim][exec]: output: INFO 05-02 19:24:12 [__init__.py:239] Automatically detected platform rocm.
331mint[slim][exec]: output: INFO 05-02 19:24:26 [config.py:716] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
332mint[slim][exec]: output: INFO 05-02 19:24:28 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
333mint[slim][exec]: output: INFO 05-02 19:24:28 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
334mint[slim][exec]: output: INFO 05-02 19:24:28 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
335mint[slim][exec]: output: INFO 05-02 19:24:30 [rocm.py:186] None is not supported in AMD GPUs.
336mint[slim][exec]: output: INFO 05-02 19:24:30 [rocm.py:187] Using ROCmFlashAttention backend.
337mint[slim][exec]: output: INFO 05-02 19:24:31 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
338mint[slim][exec]: output: INFO 05-02 19:24:31 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
339mint[slim][exec]: output: INFO 05-02 19:24:31 [weight_utils.py:265] Using model weights format ['*.safetensors']
340mint[slim][exec]: output: INFO 05-02 19:24:59 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 28.116590 seconds
341Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
342Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.81it/s]
343Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:02, 1.50s/it]
344Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:05<00:01, 1.90s/it]
345Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 2.13s/it]
346Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.89s/it]
347mint[slim][exec]: output: INFO 05-02 19:25:07 [loader.py:458] Loading weights took 7.92 seconds
348mint[slim][exec]: output: INFO 05-02 19:25:07 [model_runner.py:1152] Model loading took 14.9863 GiB and 36.597384 seconds
349mint[slim][exec]: output: INFO 05-02 19:25:52 [worker.py:287] Memory profiling takes 44.70 seconds
350mint[slim][exec]: output: INFO 05-02 19:25:52 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
351mint[slim][exec]: output: INFO 05-02 19:25:52 [worker.py:287] model weights take 14.99GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 40.13GiB.
352mint[slim][exec]: output: INFO 05-02 19:25:52 [executor_base.py:112] # rocm blocks: 20549, # CPU blocks: 2048
353mint[slim][exec]: output: INFO 05-02 19:25:52 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 40.13x
354mint[slim][exec]: output: INFO 05-02 19:25:53 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
355Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s]
356Capturing CUDA graph shapes: 3%|| 1/35 [00:01<00:34, 1.02s/it]
357Capturing CUDA graph shapes: 6%|| 2/35 [00:01<00:23, 1.41it/s]
358Capturing CUDA graph shapes: 9%|| 3/35 [00:01<00:19, 1.66it/s]
359Capturing CUDA graph shapes: 11%|█▏ | 4/35 [00:02<00:16, 1.83it/s]
360Capturing CUDA graph shapes: 14%|█▍ | 5/35 [00:02<00:15, 1.96it/s]
361Capturing CUDA graph shapes: 17%|█▋ | 6/35 [00:03<00:14, 2.04it/s]
362Capturing CUDA graph shapes: 20%|██ | 7/35 [00:03<00:13, 2.10it/s]
363Capturing CUDA graph shapes: 23%|██▎ | 8/35 [00:04<00:12, 2.14it/s]
364Capturing CUDA graph shapes: 26%|██▌ | 9/35 [00:04<00:11, 2.19it/s]
365Capturing CUDA graph shapes: 29%|██▊ | 10/35 [00:05<00:11, 2.22it/s]
366Capturing CUDA graph shapes: 31%|███▏ | 11/35 [00:05<00:10, 2.24it/s]
367Capturing CUDA graph shapes: 34%|███▍ | 12/35 [00:05<00:10, 2.26it/s]
368Capturing CUDA graph shapes: 37%|███▋ | 13/35 [00:06<00:09, 2.27it/s]
369Capturing CUDA graph shapes: 40%|████ | 14/35 [00:06<00:09, 2.28it/s]
370Capturing CUDA graph shapes: 43%|████▎ | 15/35 [00:07<00:08, 2.30it/s]
371Capturing CUDA graph shapes: 46%|████▌ | 16/35 [00:07<00:08, 2.30it/s]
372Capturing CUDA graph shapes: 49%|████▊ | 17/35 [00:08<00:07, 2.31it/s]
373Capturing CUDA graph shapes: 51%|█████▏ | 18/35 [00:08<00:07, 2.32it/s]
374Capturing CUDA graph shapes: 54%|█████▍ | 19/35 [00:08<00:06, 2.32it/s]
375Capturing CUDA graph shapes: 57%|█████▋ | 20/35 [00:09<00:06, 2.33it/s]
376Capturing CUDA graph shapes: 60%|██████ | 21/35 [00:09<00:06, 2.33it/s]
377Capturing CUDA graph shapes: 63%|██████▎ | 22/35 [00:10<00:05, 2.35it/s]
378Capturing CUDA graph shapes: 66%|██████▌ | 23/35 [00:10<00:05, 2.36it/s]
379Capturing CUDA graph shapes: 69%|██████▊ | 24/35 [00:11<00:04, 2.36it/s]
380Capturing CUDA graph shapes: 71%|███████▏ | 25/35 [00:11<00:04, 2.38it/s]
381Capturing CUDA graph shapes: 74%|███████▍ | 26/35 [00:11<00:03, 2.39it/s]
382Capturing CUDA graph shapes: 77%|███████▋ | 27/35 [00:12<00:03, 2.40it/s]
383Capturing CUDA graph shapes: 80%|████████ | 28/35 [00:12<00:02, 2.39it/s]
384Capturing CUDA graph shapes: 83%|████████▎ | 29/35 [00:13<00:02, 2.40it/s]
385Capturing CUDA graph shapes: 86%|████████▌ | 30/35 [00:13<00:02, 2.41it/s]
386Capturing CUDA graph shapes: 89%|████████▊ | 31/35 [00:14<00:01, 2.42it/s]
387Capturing CUDA graph shapes: 91%|█████████▏| 32/35 [00:14<00:01, 2.44it/s]
388Capturing CUDA graph shapes: 94%|█████████▍| 33/35 [00:14<00:00, 2.44it/s]
389Capturing CUDA graph shapes: 97%|█████████▋| 34/35 [00:15<00:00, 2.47it/s]
390Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00, 2.16it/s]
391mint[slim][exec]: output: INFO 05-02 19:26:09 [model_runner.py:1604] Graph capturing finished in 16 secs, took 0.24 GiB
392mint[slim][exec]: output: INFO 05-02 19:26:09 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 61.88 seconds
393Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
394Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 9.40it/s, est. speed input: 46.99 toks/s, output: 46.98 toks/s]
395Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 9.38it/s, est. speed input: 46.99 toks/s, output: 46.98 toks/s]
396mint[slim][exec]: output: | Purple D'or
397mint[slim][exec]: output: [rank0]:[W502 19:26:10.830716644 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
398cmd=slim info=continue.after mode='exec' exitcode='0'
399cmd=slim state=container.inspection.finishing
400cmd=slim state=container.inspection.artifact.processing
401cmd=slim state=container.inspection.done
402cmd=slim state=building message="building optimized image" engine=internal
403cmd=slim state=completed
404cmd=slim info=results status='MINIFIED' by='2.18X' size.original='58 GB' size.optimized='27 GB'
405cmd=slim info=results image.size='27 GB' image.id='sha256:5dd4e262e75f2884f346c2ddbc836ef1173ea8a458f48a96b34fa4b02d4ac565' image.digest='sha256:86c9074ff5290dca052025b1c14f62347b0c243620e8e89bf0c0a94b4bb8ff1d' has.data='true' image-build-engine='internal' image.name='wmf-debian-vllm:fa-slim'
406cmd=slim info=results artifacts.location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/.mint-state/images/b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f/artifacts'
407cmd=slim info=results artifacts.report='creport.json'
408cmd=slim info=results artifacts.dockerfile.reversed='Dockerfile.reversed'
409cmd=slim info=results artifacts.seccomp='wmf-debian-vllm-seccomp.json'
410cmd=slim info=results artifacts.apparmor='wmf-debian-vllm-apparmor-profile'
411cmd=slim state=done
412cmd=slim info=commands message='use the xray command to learn more about the optimize image'
413cmd=slim info=report file='slim.report.json'
414app='mint' message='GitHub Discussions' info='https://github.com/mintoolkit/mint/discussions'
415app='mint' message='Join the CNCF Slack channel to ask questions or to share your feedback' info='https://cloud-native.slack.com/archives/C059QP1RH1S'
416app='mint' message='Join the Discord server to ask questions or to share your feedback' info='https://discord.gg/fAvq4ruKsG'

This resulted in a wmf-debian-vllm:fa-slim image of ~27GB, a ~2.18x reduction from the original ~58GB wmf-debian-vllm:fa build:

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED          SIZE
wmf-debian-vllm                          fa-slim                                         5dd4e262e75f   10 minutes ago   26.7GB
wmf-debian-vllm                          fa                                              b0de8d1342bf   50 minutes ago   58.2GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago     35.9GB
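
To audit what the optimized image actually retained, the xray command that slim recommends in its own output above can be pointed at the new tag. A minimal invocation (artifact locations may vary by slim version):

$ slim xray --target wmf-debian-vllm:fa-slim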

5. tested vLLM in the wmf-debian-vllm:fa-slim container with both the aya-expanse 8b and 32b models:

5.1. aya-expanse-8b was served successfully (replace the HF_TOKEN value below with your own; the one shown has been invalidated):

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm:fa-slim /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM('CohereForAI/aya-expanse-8b'); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14/app/venv/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
15 warnings.warn(
16INFO 05-05 04:38:38 [__init__.py:239] Automatically detected platform rocm.
17config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 3.86MB/s]
18INFO 05-05 04:38:51 [config.py:716] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
19INFO 05-05 04:38:57 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
20INFO 05-05 04:38:57 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
21INFO 05-05 04:38:57 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
22tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 76.0MB/s]
23tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 251MB/s]
24special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.18MB/s]
25generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 2.22MB/s]
26INFO 05-05 04:38:58 [rocm.py:186] None is not supported in AMD GPUs.
27INFO 05-05 04:38:58 [rocm.py:187] Using ROCmFlashAttention backend.
28INFO 05-05 04:38:58 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
29INFO 05-05 04:38:58 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
30INFO 05-05 04:38:59 [weight_utils.py:265] Using model weights format ['*.safetensors']
31model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:04<00:00, 254MB/s]
32model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:21<00:00, 227MB/s]
33model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:21<00:00, 226MB/s]
34model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:23<00:00, 210MB/s]
35INFO 05-05 04:39:23 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 23.896260 seconds
36model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 79.3MB/s]
37Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
38Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:47<02:23, 47.87s/it]
39Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:50<00:41, 20.97s/it]
40Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:52<00:12, 12.48s/it]
41Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:54<00:00, 8.52s/it]
42Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:54<00:00, 13.71s/it]
43
44INFO 05-05 04:40:18 [loader.py:458] Loading weights took 55.19 seconds
45INFO 05-05 04:40:18 [model_runner.py:1152] Model loading took 15.1387 GiB and 79.797683 seconds
46INFO 05-05 04:40:24 [worker.py:287] Memory profiling takes 5.57 seconds
47INFO 05-05 04:40:24 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
48INFO 05-05 04:40:24 [worker.py:287] model weights take 15.14GiB; non_torch_memory takes 0.31GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 39.75GiB.
49INFO 05-05 04:40:24 [executor_base.py:112] # rocm blocks: 20354, # CPU blocks: 2048
50INFO 05-05 04:40:24 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 39.75x
51INFO 05-05 04:40:25 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
52Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:13<00:00, 2.57it/s]
53INFO 05-05 04:40:38 [model_runner.py:1604] Graph capturing finished in 14 secs, took 0.24 GiB
54INFO 05-05 04:40:38 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 20.11 seconds
55Processed prompts: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.62it/s, est. speed input: 38.13 toks/s, output: 38.12 toks/s]
56 I am optimistic about the
57[rank0]:[W505 04:40:39.726893136 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
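
Beyond one-shot generate() calls, the same container flags can be reused to expose vLLM's OpenAI-compatible HTTP server for interactive testing. This is a sketch we have not validated against the slim image (the api_server module ships with vLLM, but the slim trace may not have preserved all of its dependencies; the port choice is arbitrary):

$ docker run --rm --network=host -it \
-e VLLM_USE_TRITON_FLASH_ATTN=0 \
-e HF_TOKEN=<your-token> \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
-v /srv/hf-cache:/home/vllm/.cache/huggingface \
wmf-debian-vllm:fa-slim /app/venv/bin/python -m vllm.entrypoints.openai.api_server \
--model CohereForAI/aya-expanse-8b --port 8000

If it comes up, a plain curl against http://localhost:8000/v1/completions is enough to exercise it.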

5.2. aya-expanse-32b model loading returned a bus error (same HF_TOKEN caveat as above):

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm:fa-slim /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14/app/venv/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
15 warnings.warn(
16INFO 05-05 04:53:04 [__init__.py:239] Automatically detected platform rocm.
17config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 637/637 [00:00<00:00, 3.25MB/s]
18INFO 05-05 04:53:19 [config.py:716] This model supports multiple tasks: {'score', 'embed', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
19INFO 05-05 04:53:24 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
20INFO 05-05 04:53:24 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
21INFO 05-05 04:53:24 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
22tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 39.6MB/s]
23tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 268MB/s]
24special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.16MB/s]
25generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.27MB/s]
26INFO 05-05 04:53:26 [rocm.py:186] None is not supported in AMD GPUs.
27INFO 05-05 04:53:26 [rocm.py:187] Using ROCmFlashAttention backend.
28INFO 05-05 04:53:26 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
29INFO 05-05 04:53:26 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-32b...
30INFO 05-05 04:53:26 [weight_utils.py:265] Using model weights format ['*.safetensors']
31model-00004-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:28<00:00, 172MB/s]
32model-00002-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:35<00:00, 138MB/s]
33model-00006-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:40<00:00, 122MB/s]
34model-00009-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:29<00:00, 166MB/s]
35model-00010-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:27<00:00, 179MB/s]
36model-00007-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:04<00:00, 75.9MB/s]
37model-00003-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:09<00:00, 71.1MB/s]
38model-00008-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [01:11<00:00, 67.1MB/s]
39model-00005-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:12<00:00, 68.3MB/s]
40model-00001-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.90G/4.90G [01:32<00:00, 52.7MB/s]
41model-00014-of-00014.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 805M/805M [00:06<00:00, 134MB/s]
42model-00011-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:20<00:00, 241MB/s]
43model-00013-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:22<00:00, 219MB/s]
44model-00012-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:23<00:00, 205MB/s]
45INFO 05-05 04:55:28 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-32b: 121.835209 seconds
46model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 85.2MB/s]
47Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
48Loading safetensors checkpoint shards: 7% Completed | 1/14 [01:03<13:45, 63.52s/it]
49Loading safetensors checkpoint shards: 14% Completed | 2/14 [01:05<05:30, 27.54s/it]
50Loading safetensors checkpoint shards: 21% Completed | 3/14 [01:08<02:56, 16.04s/it]
51Loading safetensors checkpoint shards: 29% Completed | 4/14 [01:10<01:46, 10.65s/it]
52Loading safetensors checkpoint shards: 36% Completed | 5/14 [01:12<01:08, 7.64s/it]
53Loading safetensors checkpoint shards: 43% Completed | 6/14 [01:15<00:46, 5.84s/it]
54Loading safetensors checkpoint shards: 50% Completed | 7/14 [01:17<00:32, 4.70s/it]
55Loading safetensors checkpoint shards: 57% Completed | 8/14 [01:20<00:23, 3.96s/it]
56Loading safetensors checkpoint shards: 64% Completed | 9/14 [01:22<00:17, 3.45s/it]
57Loading safetensors checkpoint shards: 71% Completed | 10/14 [01:24<00:12, 3.10s/it]
58Loading safetensors checkpoint shards: 79% Completed | 11/14 [01:25<00:07, 2.35s/it]
59Loading safetensors checkpoint shards: 86% Completed | 12/14 [01:27<00:04, 2.26s/it]
60Loading safetensors checkpoint shards: 93% Completed | 13/14 [01:29<00:02, 2.28s/it]
61Loading safetensors checkpoint shards: 100% Completed | 14/14 [01:31<00:00, 2.29s/it]
62Loading safetensors checkpoint shards: 100% Completed | 14/14 [01:31<00:00, 6.57s/it]
63
64INFO 05-05 04:57:01 [loader.py:458] Loading weights took 92.32 seconds
65INFO 05-05 04:57:01 [model_runner.py:1152] Model loading took 60.3418 GiB and 214.928894 seconds
66INFO 05-05 04:57:09 [worker.py:287] Memory profiling takes 8.52 seconds
67INFO 05-05 04:57:09 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
68INFO 05-05 04:57:09 [worker.py:287] model weights take 60.34GiB; non_torch_memory takes 0.31GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.93GiB.
69INFO 05-05 04:57:10 [executor_base.py:112] # rocm blocks: 381, # CPU blocks: 1638
70INFO 05-05 04:57:10 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.15x
71INFO 05-05 04:57:10 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
72Capturing CUDA graph shapes: 97%|█████████████████████████████████████████████████████████████████████████████████████████████████▏ | 34/35 [00:20<00:00, 1.98it/s]
73Bus error

In T385173#10790749, the initial wmf-debian-vllm:fa-slim image was built by tracing the serving of the aya-expanse-8b model with docker-slim. While this slimmed image served the aya-expanse-8b model successfully, it failed with a Bus error when attempting to load the aya-expanse-32b model.

This indicated that docker-slim had likely removed files the larger 32b model depends on, because the corresponding code paths were never exercised during the 8b trace.

To address this, we rebuilt the slim image, modifying the docker-slim --exec command to trace the serving of the aya-expanse-32b model instead:

1$ slim build --network host \
2--target wmf-debian-vllm:latest \
3--tag wmf-debian-vllm:fa-slim \
4--http-probe=false \
5--continue-after=exec \
6--env VLLM_USE_TRITON_FLASH_ATTN=0 \
7--env HF_TOKEN=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 \ # remember to replace this token with yours as I have invalidated this one
8--exec="rocminfo && \
9rocm-smi && \
10/app/venv/bin/python -c \"
11import torch; \
12print('TORCH BUILD:', torch.__version__, torch.version.git_version); \
13print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); \
14x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); \
15from vllm import LLM, SamplingParams; \
16aya_llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); \
17print(aya_llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)\"" \
18--include-shell=true \
19--include-path-file=includes.txt
20cmd=slim info=params include.path='/dev' message='ignoring'
21cmd=slim state=started
22cmd=slim info=cmd.input.params target.type='image' target.image='wmf-debian-vllm:latest' continue.mode='exec' rt.as.user='true' keep.perms='true' tags='wmf-debian-vllm:fa-slim' image-build-engine='internal'
24cmd=slim state=image.inspection.start
25cmd=slim info=image size.bytes='58161299483' size.human='58 GB' id='sha256:b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f'
26cmd=slim info=image.stack id='sha256:76769c10bf7aa98746670cbbb9747f0940c8af78491ef6ab1e44df0761e88586' index='0' name='docker-registry.wikimedia.org/bookworm:20250413'
27cmd=slim info=image.stack index='1' name='wmf-debian-vllm:latest' id='sha256:b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f'
28cmd=slim state=image.inspection.done
29cmd=slim state=container.inspection.start
30cmd=slim info=sensor location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/mint-sensor' filemode='-rwxr-xr-x' version='linux/amd64|ALP|x.1.42.2|29e62e7836de7b1004607c51c502537ffe1969f0|2025-01-16_07:48:54AM|x' volume='mint-sensor.x.1.42.2'
31cmd=slim info=container status='created' name='mintk_1613854_20250505092229' id='5e350de7749d65ab8a3cdd9d2a31df2a596176745bfee54f643f8445b390f39b'
32cmd=slim info=container status='running' name='mintk_1613854_20250505092229' id='5e350de7749d65ab8a3cdd9d2a31df2a596176745bfee54f643f8445b390f39b'
33cmd=slim info=container message='obtained IP address' ip='127.0.0.1'
34cmd=slim info=cmd.startmonitor status='sent'
35cmd=slim info=event.startmonitor.done status='received'
36cmd=slim info=container name='mintk_1613854_20250505092229' id='5e350de7749d65ab8a3cdd9d2a31df2a596176745bfee54f643f8445b390f39b' target.port.list='' target.port.info='' message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER'
37cmd=slim info=continue.after mode='exec' message='provide the expected input to allow the container inspector to continue its execution'
38cmd=slim info=continue.after mode='exec' shell='rocminfo && rocm-smi && /app/venv/bin/python -c "
39import torch; print('TORCH BUILD:', torch.__version__, torch.version.git_version); print('ROCm/HIP Status:', torch.cuda.is_available(), torch.version.hip); x = torch.randn(2, 2, device='cuda'); y = x @ x; print(y); from vllm import LLM, SamplingParams; aya_llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); print(aya_llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"'
40mint[slim][exec]: output: ROCk module is loaded
41mint[slim][exec]: output: =====================
42mint[slim][exec]: output: HSA System Attributes
43mint[slim][exec]: output: =====================
44mint[slim][exec]: output: Runtime Version: 1.14
45mint[slim][exec]: output: Runtime Ext Version: 1.6
46mint[slim][exec]: output: System Timestamp Freq.: 1000.000000MHz
47mint[slim][exec]: output: Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
48mint[slim][exec]: output: Machine Model: LARGE
49mint[slim][exec]: output: System Endianness: LITTLE
50mint[slim][exec]: output: Mwaitx: DISABLED
51mint[slim][exec]: output: DMAbuf Support: NO
52mint[slim][exec]: output: ==========
53mint[slim][exec]: output: HSA Agents
54mint[slim][exec]: output: ==========
55mint[slim][exec]: output: *******
56mint[slim][exec]: output: Agent 1
57mint[slim][exec]: output: *******
58mint[slim][exec]: output: Name: AMD EPYC 7643P 48-Core Processor
59mint[slim][exec]: output: Uuid: CPU-XX
60mint[slim][exec]: output: Marketing Name: AMD EPYC 7643P 48-Core Processor
61mint[slim][exec]: output: Vendor Name: CPU
62mint[slim][exec]: output: Feature: None specified
63mint[slim][exec]: output: Profile: FULL_PROFILE
64mint[slim][exec]: output: Float Round Mode: NEAR
65mint[slim][exec]: output: Max Queue Number: 0(0x0)
66mint[slim][exec]: output: Queue Min Size: 0(0x0)
67mint[slim][exec]: output: Queue Max Size: 0(0x0)
68mint[slim][exec]: output: Queue Type: MULTI
69mint[slim][exec]: output: Node: 0
70mint[slim][exec]: output: Device Type: CPU
71mint[slim][exec]: output: Cache Info:
72mint[slim][exec]: output: L1: 32768(0x8000) KB
73mint[slim][exec]: output: Chip ID: 0(0x0)
74mint[slim][exec]: output: ASIC Revision: 0(0x0)
75mint[slim][exec]: output: Cacheline Size: 64(0x40)
76mint[slim][exec]: output: Max Clock Freq. (MHz): 2300
77mint[slim][exec]: output: BDFID: 0
78mint[slim][exec]: output: Internal Node ID: 0
79mint[slim][exec]: output: Compute Unit: 96
80mint[slim][exec]: output: SIMDs per CU: 0
81mint[slim][exec]: output: Shader Engines: 0
82mint[slim][exec]: output: Shader Arrs. per Eng.: 0
83mint[slim][exec]: output: WatchPts on Addr. Ranges:1
84mint[slim][exec]: output: Memory Properties:
85mint[slim][exec]: output: Features: None
86mint[slim][exec]: output: Pool Info:
87mint[slim][exec]: output: Pool 1
88mint[slim][exec]: output: Segment: GLOBAL; FLAGS: FINE GRAINED
89mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
90mint[slim][exec]: output: Allocatable: TRUE
91mint[slim][exec]: output: Alloc Granule: 4KB
92mint[slim][exec]: output: Alloc Recommended Granule:4KB
93mint[slim][exec]: output: Alloc Alignment: 4KB
94mint[slim][exec]: output: Accessible by all: TRUE
95mint[slim][exec]: output: Pool 2
96mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
97mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
98mint[slim][exec]: output: Allocatable: TRUE
99mint[slim][exec]: output: Alloc Granule: 4KB
100mint[slim][exec]: output: Alloc Recommended Granule:4KB
101mint[slim][exec]: output: Alloc Alignment: 4KB
102mint[slim][exec]: output: Accessible by all: TRUE
103mint[slim][exec]: output: Pool 3
104mint[slim][exec]: output: Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
105mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
106mint[slim][exec]: output: Allocatable: TRUE
107mint[slim][exec]: output: Alloc Granule: 4KB
108mint[slim][exec]: output: Alloc Recommended Granule:4KB
109mint[slim][exec]: output: Alloc Alignment: 4KB
110mint[slim][exec]: output: Accessible by all: TRUE
111mint[slim][exec]: output: Pool 4
112mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
113mint[slim][exec]: output: Size: 395878872(0x1798a1d8) KB
114mint[slim][exec]: output: Allocatable: TRUE
115mint[slim][exec]: output: Alloc Granule: 4KB
116mint[slim][exec]: output: Alloc Recommended Granule:4KB
117mint[slim][exec]: output: Alloc Alignment: 4KB
118mint[slim][exec]: output: Accessible by all: TRUE
119mint[slim][exec]: output: ISA Info:
120mint[slim][exec]: output: *******
121mint[slim][exec]: output: Agent 2
122mint[slim][exec]: output: *******
123mint[slim][exec]: output: Name: gfx90a
124mint[slim][exec]: output: Uuid: GPU-a2c14903fb923a3f
125mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
126mint[slim][exec]: output: Vendor Name: AMD
127mint[slim][exec]: output: Feature: KERNEL_DISPATCH
128mint[slim][exec]: output: Profile: BASE_PROFILE
129mint[slim][exec]: output: Float Round Mode: NEAR
130mint[slim][exec]: output: Max Queue Number: 128(0x80)
131mint[slim][exec]: output: Queue Min Size: 64(0x40)
132mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
133mint[slim][exec]: output: Queue Type: MULTI
134mint[slim][exec]: output: Node: 1
135mint[slim][exec]: output: Device Type: GPU
136mint[slim][exec]: output: Cache Info:
137mint[slim][exec]: output: L1: 16(0x10) KB
138mint[slim][exec]: output: L2: 8192(0x2000) KB
139mint[slim][exec]: output: Chip ID: 29711(0x740f)
140mint[slim][exec]: output: ASIC Revision: 1(0x1)
141mint[slim][exec]: output: Cacheline Size: 64(0x40)
142mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
143mint[slim][exec]: output: BDFID: 49920
144mint[slim][exec]: output: Internal Node ID: 1
145mint[slim][exec]: output: Compute Unit: 104
146mint[slim][exec]: output: SIMDs per CU: 4
147mint[slim][exec]: output: Shader Engines: 8
148mint[slim][exec]: output: Shader Arrs. per Eng.: 1
149mint[slim][exec]: output: WatchPts on Addr. Ranges:4
150mint[slim][exec]: output: Coherent Host Access: FALSE
151mint[slim][exec]: output: Memory Properties:
152mint[slim][exec]: output: Features: KERNEL_DISPATCH
153mint[slim][exec]: output: Fast F16 Operation: TRUE
154mint[slim][exec]: output: Wavefront Size: 64(0x40)
155mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
156mint[slim][exec]: output: Workgroup Max Size per Dimension:
157mint[slim][exec]: output: x 1024(0x400)
158mint[slim][exec]: output: y 1024(0x400)
159mint[slim][exec]: output: z 1024(0x400)
160mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
161mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
162mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
163mint[slim][exec]: output: Grid Max Size per Dimension:
164mint[slim][exec]: output: x 4294967295(0xffffffff)
165mint[slim][exec]: output: y 4294967295(0xffffffff)
166mint[slim][exec]: output: z 4294967295(0xffffffff)
167mint[slim][exec]: output: Max fbarriers/Workgrp: 32
168mint[slim][exec]: output: Packet Processor uCode:: 71
169mint[slim][exec]: output: SDMA engine uCode:: 8
170mint[slim][exec]: output: IOMMU Support:: None
171mint[slim][exec]: output: Pool Info:
172mint[slim][exec]: output: Pool 1
173mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
174mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
175mint[slim][exec]: output: Allocatable: TRUE
176mint[slim][exec]: output: Alloc Granule: 4KB
177mint[slim][exec]: output: Alloc Recommended Granule:2048KB
178mint[slim][exec]: output: Alloc Alignment: 4KB
179mint[slim][exec]: output: Accessible by all: FALSE
180mint[slim][exec]: output: Pool 2
181mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
182mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
183mint[slim][exec]: output: Allocatable: TRUE
184mint[slim][exec]: output: Alloc Granule: 4KB
185mint[slim][exec]: output: Alloc Recommended Granule:2048KB
186mint[slim][exec]: output: Alloc Alignment: 4KB
187mint[slim][exec]: output: Accessible by all: FALSE
188mint[slim][exec]: output: Pool 3
189mint[slim][exec]: output: Segment: GROUP
190mint[slim][exec]: output: Size: 64(0x40) KB
191mint[slim][exec]: output: Allocatable: FALSE
192mint[slim][exec]: output: Alloc Granule: 0KB
193mint[slim][exec]: output: Alloc Recommended Granule:0KB
194mint[slim][exec]: output: Alloc Alignment: 0KB
195mint[slim][exec]: output: Accessible by all: FALSE
196mint[slim][exec]: output: ISA Info:
197mint[slim][exec]: output: ISA 1
198mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
199mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
200mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
201mint[slim][exec]: output: Default Rounding Mode: NEAR
202mint[slim][exec]: output: Default Rounding Mode: NEAR
203mint[slim][exec]: output: Fast f16: TRUE
204mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
205mint[slim][exec]: output: Workgroup Max Size per Dimension:
206mint[slim][exec]: output: x 1024(0x400)
207mint[slim][exec]: output: y 1024(0x400)
208mint[slim][exec]: output: z 1024(0x400)
209mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
210mint[slim][exec]: output: Grid Max Size per Dimension:
211mint[slim][exec]: output: x 4294967295(0xffffffff)
212mint[slim][exec]: output: y 4294967295(0xffffffff)
213mint[slim][exec]: output: z 4294967295(0xffffffff)
214mint[slim][exec]: output: FBarrier Max Size: 32
215mint[slim][exec]: output: *******
216mint[slim][exec]: output: Agent 3
217mint[slim][exec]: output: *******
218mint[slim][exec]: output: Name: gfx90a
219mint[slim][exec]: output: Uuid: GPU-5b81d02ab699960e
220mint[slim][exec]: output: Marketing Name: AMD Instinct MI210
221mint[slim][exec]: output: Vendor Name: AMD
222mint[slim][exec]: output: Feature: KERNEL_DISPATCH
223mint[slim][exec]: output: Profile: BASE_PROFILE
224mint[slim][exec]: output: Float Round Mode: NEAR
225mint[slim][exec]: output: Max Queue Number: 128(0x80)
226mint[slim][exec]: output: Queue Min Size: 64(0x40)
227mint[slim][exec]: output: Queue Max Size: 131072(0x20000)
228mint[slim][exec]: output: Queue Type: MULTI
229mint[slim][exec]: output: Node: 2
230mint[slim][exec]: output: Device Type: GPU
231mint[slim][exec]: output: Cache Info:
232mint[slim][exec]: output: L1: 16(0x10) KB
233mint[slim][exec]: output: L2: 8192(0x2000) KB
234mint[slim][exec]: output: Chip ID: 29711(0x740f)
235mint[slim][exec]: output: ASIC Revision: 1(0x1)
236mint[slim][exec]: output: Cacheline Size: 64(0x40)
237mint[slim][exec]: output: Max Clock Freq. (MHz): 1700
238mint[slim][exec]: output: BDFID: 768
239mint[slim][exec]: output: Internal Node ID: 2
240mint[slim][exec]: output: Compute Unit: 104
241mint[slim][exec]: output: SIMDs per CU: 4
242mint[slim][exec]: output: Shader Engines: 8
243mint[slim][exec]: output: Shader Arrs. per Eng.: 1
244mint[slim][exec]: output: WatchPts on Addr. Ranges:4
245mint[slim][exec]: output: Coherent Host Access: FALSE
246mint[slim][exec]: output: Memory Properties:
247mint[slim][exec]: output: Features: KERNEL_DISPATCH
248mint[slim][exec]: output: Fast F16 Operation: TRUE
249mint[slim][exec]: output: Wavefront Size: 64(0x40)
250mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
251mint[slim][exec]: output: Workgroup Max Size per Dimension:
252mint[slim][exec]: output: x 1024(0x400)
253mint[slim][exec]: output: y 1024(0x400)
254mint[slim][exec]: output: z 1024(0x400)
256mint[slim][exec]: output: Max Waves Per CU: 32(0x20)
257mint[slim][exec]: output: Max Work-item Per CU: 2048(0x800)
258mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
259mint[slim][exec]: output: Grid Max Size per Dimension:
260mint[slim][exec]: output: x 4294967295(0xffffffff)
261mint[slim][exec]: output: y 4294967295(0xffffffff)
262mint[slim][exec]: output: z 4294967295(0xffffffff)
263mint[slim][exec]: output: Max fbarriers/Workgrp: 32
264mint[slim][exec]: output: Packet Processor uCode:: 71
265mint[slim][exec]: output: SDMA engine uCode:: 8
266mint[slim][exec]: output: IOMMU Support:: None
267mint[slim][exec]: output: Pool Info:
268mint[slim][exec]: output: Pool 1
269mint[slim][exec]: output: Segment: GLOBAL; FLAGS: COARSE GRAINED
270mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
271mint[slim][exec]: output: Allocatable: TRUE
272mint[slim][exec]: output: Alloc Granule: 4KB
273mint[slim][exec]: output: Alloc Recommended Granule:2048KB
274mint[slim][exec]: output: Alloc Alignment: 4KB
275mint[slim][exec]: output: Accessible by all: FALSE
276mint[slim][exec]: output: Pool 2
277mint[slim][exec]: output: Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
278mint[slim][exec]: output: Size: 67092480(0x3ffc000) KB
279mint[slim][exec]: output: Allocatable: TRUE
280mint[slim][exec]: output: Alloc Granule: 4KB
281mint[slim][exec]: output: Alloc Recommended Granule:2048KB
282mint[slim][exec]: output: Alloc Alignment: 4KB
283mint[slim][exec]: output: Accessible by all: FALSE
284mint[slim][exec]: output: Pool 3
285mint[slim][exec]: output: Segment: GROUP
286mint[slim][exec]: output: Size: 64(0x40) KB
287mint[slim][exec]: output: Allocatable: FALSE
288mint[slim][exec]: output: Alloc Granule: 0KB
289mint[slim][exec]: output: Alloc Recommended Granule:0KB
290mint[slim][exec]: output: Alloc Alignment: 0KB
291mint[slim][exec]: output: Accessible by all: FALSE
292mint[slim][exec]: output: ISA Info:
293mint[slim][exec]: output: ISA 1
294mint[slim][exec]: output: Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
295mint[slim][exec]: output: Machine Models: HSA_MACHINE_MODEL_LARGE
296mint[slim][exec]: output: Profiles: HSA_PROFILE_BASE
297mint[slim][exec]: output: Default Rounding Mode: NEAR
298mint[slim][exec]: output: Default Rounding Mode: NEAR
299mint[slim][exec]: output: Fast f16: TRUE
300mint[slim][exec]: output: Workgroup Max Size: 1024(0x400)
301mint[slim][exec]: output: Workgroup Max Size per Dimension:
302mint[slim][exec]: output: x 1024(0x400)
303mint[slim][exec]: output: y 1024(0x400)
304mint[slim][exec]: output: z 1024(0x400)
305mint[slim][exec]: output: Grid Max Size: 4294967295(0xffffffff)
306mint[slim][exec]: output: Grid Max Size per Dimension:
307mint[slim][exec]: output: x 4294967295(0xffffffff)
308mint[slim][exec]: output: y 4294967295(0xffffffff)
309mint[slim][exec]: output: z 4294967295(0xffffffff)
310mint[slim][exec]: output: FBarrier Max Size: 32
311mint[slim][exec]: output: *** Done ***
312mint[slim][exec]: output: ========================================= ROCm System Management Interface =========================================
313mint[slim][exec]: output: =================================================== Concise Info ===================================================
314mint[slim][exec]: output: Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
315mint[slim][exec]: output: (DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
316mint[slim][exec]: output: ====================================================================================================================
317mint[slim][exec]: output: 0 2 0x740f, 22303 45.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
318mint[slim][exec]: output: 1 1 0x740f, 2552 48.0°C 40.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
319mint[slim][exec]: output: ====================================================================================================================
320mint[slim][exec]: output: =============================================== End of ROCm SMI Log ================================================
321mint[slim][exec]: output: TORCH BUILD: 2.7.0.dev20250309+rocm6.3 ecc1272a4b291814d73c785fe3025ef86ffb7f06
322mint[slim][exec]: output: ROCm/HIP Status: True 6.3.42131-fa1d09cbd
323mint[slim][exec]: output: tensor([[ 0.2040, -0.3447],
324mint[slim][exec]: output: [-0.1186, 1.0030]], device='cuda:0')
325mint[slim][exec]: output: INFO 05-05 09:22:50 [__init__.py:239] Automatically detected platform rocm.
326mint[slim][exec]: output: INFO 05-05 09:23:05 [config.py:716] This model supports multiple tasks: {'classify', 'embed', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
327mint[slim][exec]: output: INFO 05-05 09:23:07 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
328mint[slim][exec]: output: INFO 05-05 09:23:07 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
329mint[slim][exec]: output: INFO 05-05 09:23:07 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
330mint[slim][exec]: output: INFO 05-05 09:23:08 [rocm.py:186] None is not supported in AMD GPUs.
331mint[slim][exec]: output: INFO 05-05 09:23:08 [rocm.py:187] Using ROCmFlashAttention backend.
332mint[slim][exec]: output: INFO 05-05 09:23:08 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
333mint[slim][exec]: output: INFO 05-05 09:23:08 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-32b...
334mint[slim][exec]: output: INFO 05-05 09:23:09 [weight_utils.py:265] Using model weights format ['*.safetensors']
335mint[slim][exec]: output: INFO 05-05 09:24:53 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-32b: 104.447445 seconds
336Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
337Loading safetensors checkpoint shards: 7% Completed | 1/14 [01:31<19:54, 91.87s/it]
338Loading safetensors checkpoint shards: 14% Completed | 2/14 [01:34<07:50, 39.20s/it]
339Loading safetensors checkpoint shards: 21% Completed | 3/14 [01:36<04:06, 22.37s/it]
340Loading safetensors checkpoint shards: 29% Completed | 4/14 [01:38<02:24, 14.44s/it]
341Loading safetensors checkpoint shards: 36% Completed | 5/14 [01:41<01:30, 10.08s/it]
342Loading safetensors checkpoint shards: 43% Completed | 6/14 [01:43<00:59, 7.43s/it]
343Loading safetensors checkpoint shards: 50% Completed | 7/14 [01:45<00:40, 5.76s/it]
344Loading safetensors checkpoint shards: 57% Completed | 8/14 [01:48<00:27, 4.66s/it]
345Loading safetensors checkpoint shards: 64% Completed | 9/14 [01:48<00:17, 3.42s/it]
346Loading safetensors checkpoint shards: 71% Completed | 10/14 [01:50<00:11, 2.99s/it]
347Loading safetensors checkpoint shards: 79% Completed | 11/14 [01:53<00:08, 2.78s/it]
348Loading safetensors checkpoint shards: 86% Completed | 12/14 [01:55<00:05, 2.63s/it]
349Loading safetensors checkpoint shards: 93% Completed | 13/14 [01:57<00:02, 2.54s/it]
350Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:00<00:00, 2.48s/it]
351Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:00<00:00, 8.58s/it]
352mint[slim][exec]: output: INFO 05-05 09:26:54 [loader.py:458] Loading weights took 120.41 seconds
353mint[slim][exec]: output: INFO 05-05 09:26:54 [model_runner.py:1152] Model loading took 60.1895 GiB and 225.541285 seconds
354mint[slim][exec]: output: INFO 05-05 09:26:59 [worker.py:287] Memory profiling takes 4.69 seconds
355mint[slim][exec]: output: INFO 05-05 09:26:59 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
356mint[slim][exec]: output: INFO 05-05 09:26:59 [worker.py:287] model weights take 60.19GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 1.31GiB.
357mint[slim][exec]: output: INFO 05-05 09:26:59 [executor_base.py:112] # rocm blocks: 537, # CPU blocks: 1638
358mint[slim][exec]: output: INFO 05-05 09:26:59 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.62x
359decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
360Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s]
361Capturing CUDA graph shapes: 3%|| 1/35 [00:01<00:43, 1.27s/it]
362Capturing CUDA graph shapes: 6%|| 2/35 [00:01<00:31, 1.05it/s]
363Capturing CUDA graph shapes: 9%|| 3/35 [00:02<00:26, 1.19it/s]
364Capturing CUDA graph shapes: 11%|█▏ | 4/35 [00:03<00:24, 1.28it/s]
365Capturing CUDA graph shapes: 14%|█▍ | 5/35 [00:04<00:22, 1.35it/s]
366Capturing CUDA graph shapes: 17%|█▋ | 6/35 [00:04<00:20, 1.40it/s]
367Capturing CUDA graph shapes: 20%|██ | 7/35 [00:05<00:19, 1.43it/s]
368Capturing CUDA graph shapes: 23%|██▎ | 8/35 [00:06<00:18, 1.46it/s]
369Capturing CUDA graph shapes: 26%|██▌ | 9/35 [00:06<00:17, 1.48it/s]
370Capturing CUDA graph shapes: 29%|██▊ | 10/35 [00:07<00:16, 1.50it/s]
371Capturing CUDA graph shapes: 31%|███▏ | 11/35 [00:07<00:15, 1.51it/s]
372Capturing CUDA graph shapes: 34%|███▍ | 12/35 [00:08<00:15, 1.51it/s]
373Capturing CUDA graph shapes: 37%|███▋ | 13/35 [00:09<00:14, 1.53it/s]
374Capturing CUDA graph shapes: 40%|████ | 14/35 [00:09<00:13, 1.54it/s]
375Capturing CUDA graph shapes: 43%|████▎ | 15/35 [00:10<00:12, 1.55it/s]
376Capturing CUDA graph shapes: 46%|████▌ | 16/35 [00:11<00:12, 1.55it/s]
377Capturing CUDA graph shapes: 49%|████▊ | 17/35 [00:11<00:11, 1.57it/s]
378Capturing CUDA graph shapes: 51%|█████▏ | 18/35 [00:12<00:10, 1.58it/s]
379Capturing CUDA graph shapes: 54%|█████▍ | 19/35 [00:13<00:10, 1.58it/s]
380Capturing CUDA graph shapes: 57%|█████▋ | 20/35 [00:13<00:09, 1.58it/s]
381Capturing CUDA graph shapes: 60%|██████ | 21/35 [00:14<00:08, 1.58it/s]
382Capturing CUDA graph shapes: 63%|██████▎ | 22/35 [00:14<00:08, 1.59it/s]
383Capturing CUDA graph shapes: 66%|██████▌ | 23/35 [00:15<00:07, 1.60it/s]
384Capturing CUDA graph shapes: 69%|██████▊ | 24/35 [00:16<00:06, 1.61it/s]
385Capturing CUDA graph shapes: 71%|███████▏ | 25/35 [00:16<00:06, 1.62it/s]
386Capturing CUDA graph shapes: 74%|███████▍ | 26/35 [00:17<00:05, 1.63it/s]
387Capturing CUDA graph shapes: 77%|███████▋ | 27/35 [00:18<00:04, 1.64it/s]
388Capturing CUDA graph shapes: 80%|████████ | 28/35 [00:18<00:04, 1.66it/s]
389Capturing CUDA graph shapes: 83%|████████▎ | 29/35 [00:19<00:03, 1.68it/s]
390Capturing CUDA graph shapes: 86%|████████▌ | 30/35 [00:19<00:02, 1.70it/s]
391Capturing CUDA graph shapes: 89%|████████▊ | 31/35 [00:20<00:02, 1.72it/s]
392Capturing CUDA graph shapes: 91%|█████████▏| 32/35 [00:20<00:01, 1.74it/s]
393Capturing CUDA graph shapes: 94%|█████████▍| 33/35 [00:21<00:01, 1.75it/s]
394Capturing CUDA graph shapes: 97%|█████████▋| 34/35 [00:21<00:00, 1.82it/s]
395Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:22<00:00, 1.52it/s]
396mint[slim][exec]: output: INFO 05-05 09:27:22 [model_runner.py:1604] Graph capturing finished in 23 secs, took 0.24 GiB
397mint[slim][exec]: output: INFO 05-05 09:27:22 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 28.77 seconds
398Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
399Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.12it/s, est. speed input: 10.60 toks/s, output: 10.60 toks/s]
400Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.12it/s, est. speed input: 10.60 toks/s, output: 10.60 toks/s]
401mint[slim][exec]: output: ~
402mint[slim][exec]: output: So the
403mint[slim][exec]: output: [rank0]:[W505 09:27:23.096445888 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
404cmd=slim info=continue.after mode='exec' exitcode='0'
405cmd=slim state=container.inspection.finishing
406cmd=slim state=container.inspection.artifact.processing
407cmd=slim state=container.inspection.done
408cmd=slim state=building message="building optimized image" engine=internal
409cmd=slim state=completed
410cmd=slim info=results by='2.18X' size.original='58 GB' size.optimized='27 GB' status='MINIFIED'
411cmd=slim info=results image.name='wmf-debian-vllm:fa-slim' image.size='27 GB' image.id='sha256:9f9341d881d3ed10b3c1ab71e39f93758e494d0472baa1b5a4216c37d7adb8dd' image.digest='sha256:c7a752f0f235e84621093636d6a1ba2a6d36b33900be78383506489a86c88327' has.data='true' image-build-engine='internal'
412cmd=slim info=results artifacts.location='/home/kevinbazira/WMF_vLLM_image/slimtoolkit/dist_linux/.mint-state/images/b0de8d1342bfed519c41d7e52fe6bad5dbff84c19a9240241bec3d8c72b24a0f/artifacts'
413cmd=slim info=results artifacts.report='creport.json'
414cmd=slim info=results artifacts.dockerfile.reversed='Dockerfile.reversed'
415cmd=slim info=results artifacts.seccomp='wmf-debian-vllm-seccomp.json'
416cmd=slim info=results artifacts.apparmor='wmf-debian-vllm-apparmor-profile'
417cmd=slim state=done
418cmd=slim info=commands message='use the xray command to learn more about the optimize image'
419cmd=slim info=report file='slim.report.json'
420app='mint' message='GitHub Discussions' info='https://github.com/mintoolkit/mint/discussions'
421app='mint' message='Join the CNCF Slack channel to ask questions or to share your feedback' info='https://cloud-native.slack.com/archives/C059QP1RH1S'
422app='mint' message='Join the Discord server to ask questions or to share your feedback' info='https://discord.gg/fAvq4ruKsG'

The resulting wmf-debian-vllm:fa-slim image was still ~26GB, as before:

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED             SIZE
wmf-debian-vllm                          fa-slim                                         9f9341d881d3   About an hour ago   26.7GB
wmf-debian-vllm                          fa                                              b0de8d1342bf   5 days ago          58.2GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago        35.9GB

When tested, the wmf-debian-vllm:fa-slim image served both the aya-expanse-8b and aya-expanse-32b models successfully:

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \ # remember to replace this token with yours as I have invalidated this one
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm:fa-slim /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM('CohereForAI/aya-expanse-8b'); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14/app/venv/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
16 warnings.warn(
17INFO 05-05 10:07:41 [__init__.py:239] Automatically detected platform rocm.
18config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 3.78MB/s]
19INFO 05-05 10:07:55 [config.py:716] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
20INFO 05-05 10:08:01 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
21INFO 05-05 10:08:01 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
22INFO 05-05 10:08:01 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
30tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 46.8MB/s]
31tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 210MB/s]
32special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.21MB/s]
33generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.03MB/s]
34INFO 05-05 10:08:02 [rocm.py:186] None is not supported in AMD GPUs.
35INFO 05-05 10:08:02 [rocm.py:187] Using ROCmFlashAttention backend.
36INFO 05-05 10:08:02 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
37INFO 05-05 10:08:02 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
38INFO 05-05 10:08:03 [weight_utils.py:265] Using model weights format ['*.safetensors']
39model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:06<00:00, 200MB/s]
40model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:21<00:00, 230MB/s]
41model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:21<00:00, 229MB/s]
42model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:22<00:00, 224MB/s]
43INFO 05-05 10:08:25 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 22.489289 seconds
44model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 80.9MB/s]
45Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
46Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.76it/s]
47Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:03, 1.51s/it]
48Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:05<00:01, 1.92s/it]
49Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 2.15s/it]
50Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.91s/it]
51
52INFO 05-05 10:08:33 [loader.py:458] Loading weights took 7.99 seconds
53INFO 05-05 10:08:33 [model_runner.py:1152] Model loading took 15.1387 GiB and 31.107758 seconds
54INFO 05-05 10:09:22 [worker.py:287] Memory profiling takes 49.12 seconds
55INFO 05-05 10:09:22 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
56INFO 05-05 10:09:22 [worker.py:287] model weights take 15.14GiB; non_torch_memory takes 0.31GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 39.75GiB.
57INFO 05-05 10:09:23 [executor_base.py:112] # rocm blocks: 20354, # CPU blocks: 2048
58INFO 05-05 10:09:23 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 39.75x
59INFO 05-05 10:09:23 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
60Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:15<00:00, 2.26it/s]
61INFO 05-05 10:09:39 [model_runner.py:1604] Graph capturing finished in 15 secs, took 0.24 GiB
62INFO 05-05 10:09:39 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 65.51 seconds
63Processed prompts: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.21it/s, est. speed input: 36.10 toks/s, output: 36.10 toks/s]
64 January 27,
65[rank0]:[W505 10:09:40.358866264 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

1$ docker run --rm --network=host -it \
2-e VLLM_USE_TRITON_FLASH_ATTN=0 \
3-e HF_TOKEN=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 \ # remember to replace this token with yours as I have invalidated this one
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
9-v /srv/hf-cache:/home/vllm/.cache/huggingface \
10wmf-debian-vllm:fa-slim /app/venv/bin/python -c "
11from vllm import LLM, SamplingParams; \
12llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); \
13print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
14/app/venv/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
16 warnings.warn(
17INFO 05-05 10:13:11 [__init__.py:239] Automatically detected platform rocm.
18config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 637/637 [00:00<00:00, 3.98MB/s]
19INFO 05-05 10:13:25 [config.py:716] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
20INFO 05-05 10:13:31 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
21INFO 05-05 10:13:31 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
22INFO 05-05 10:13:31 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-32b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
31tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 8.64k/8.64k [00:00<00:00, 38.1MB/s]
32tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 262MB/s]
33special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 439/439 [00:00<00:00, 3.28MB/s]
34generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.20MB/s]
35INFO 05-05 10:13:32 [rocm.py:186] None is not supported in AMD GPUs.
36INFO 05-05 10:13:32 [rocm.py:187] Using ROCmFlashAttention backend.
37INFO 05-05 10:13:32 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
38INFO 05-05 10:13:32 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-32b...
39INFO 05-05 10:13:32 [weight_utils.py:265] Using model weights format ['*.safetensors']
40model-00004-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:40<00:00, 119MB/s]
41model-00007-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:43<00:00, 113MB/s]
42model-00006-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:47<00:00, 104MB/s]
43model-00008-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [01:00<00:00, 79.8MB/s]
44model-00009-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:23<00:00, 209MB/s]
45model-00005-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:04<00:00, 76.4MB/s]
46model-00001-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.90G/4.90G [01:08<00:00, 71.8MB/s]
47model-00002-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:11<00:00, 68.6MB/s]
48model-00003-of-00014.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [01:39<00:00, 49.5MB/s]
49model-00014-of-00014.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 805M/805M [00:05<00:00, 157MB/s]
50model-00010-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:22<00:00, 224MB/s]
51model-00011-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:24<00:00, 204MB/s]
52model-00013-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:29<00:00, 168MB/s]
53model-00012-of-00014.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 4.83G/4.83G [00:29<00:00, 163MB/s]
54INFO 05-05 10:15:41 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-32b: 129.136083 seconds
55model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 55.0MB/s]
56Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
57Loading safetensors checkpoint shards:   7% Completed | 1/14 [01:18<17:01, 78.59s/it]
58Loading safetensors checkpoint shards: 14% Completed | 2/14 [01:20<06:44, 33.73s/it]
59Loading safetensors checkpoint shards: 21% Completed | 3/14 [01:23<03:33, 19.40s/it]
60Loading safetensors checkpoint shards: 29% Completed | 4/14 [01:25<02:06, 12.67s/it]
61Loading safetensors checkpoint shards: 36% Completed | 5/14 [01:27<01:20, 8.94s/it]
62Loading safetensors checkpoint shards: 43% Completed | 6/14 [01:30<00:53, 6.69s/it]
63Loading safetensors checkpoint shards: 50% Completed | 7/14 [01:32<00:36, 5.24s/it]
64Loading safetensors checkpoint shards: 57% Completed | 8/14 [01:34<00:25, 4.31s/it]
65Loading safetensors checkpoint shards: 64% Completed | 9/14 [01:37<00:18, 3.70s/it]
66Loading safetensors checkpoint shards: 71% Completed | 10/14 [01:37<00:11, 2.77s/it]
67Loading safetensors checkpoint shards: 79% Completed | 11/14 [01:39<00:07, 2.55s/it]
68Loading safetensors checkpoint shards: 86% Completed | 12/14 [01:42<00:04, 2.49s/it]
69Loading safetensors checkpoint shards: 93% Completed | 13/14 [01:44<00:02, 2.45s/it]
70Loading safetensors checkpoint shards: 100% Completed | 14/14 [01:46<00:00, 2.40s/it]
71Loading safetensors checkpoint shards: 100% Completed | 14/14 [01:46<00:00, 7.64s/it]
72
73INFO 05-05 10:17:29 [loader.py:458] Loading weights took 107.26 seconds
74INFO 05-05 10:17:29 [model_runner.py:1152] Model loading took 60.3418 GiB and 237.032957 seconds
75INFO 05-05 10:17:37 [worker.py:287] Memory profiling takes 7.94 seconds
76INFO 05-05 10:17:37 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
77INFO 05-05 10:17:37 [worker.py:287] model weights take 60.34GiB; non_torch_memory takes 0.31GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.93GiB.
78INFO 05-05 10:17:37 [executor_base.py:112] # rocm blocks: 381, # CPU blocks: 1638
79INFO 05-05 10:17:37 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.15x
80INFO 05-05 10:17:38 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
81Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:19<00:00, 1.76it/s]
82INFO 05-05 10:17:58 [model_runner.py:1604] Graph capturing finished in 20 secs, took 0.24 GiB
83INFO 05-05 10:17:58 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 28.83 seconds
84Processed prompts: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.95it/s, est. speed input: 9.73 toks/s, output: 9.73 toks/s]
85
86I’ve been
87[rank0]:[W505 10:17:59.456651202 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
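A side note on the capacity figures in these logs: with the 32b weights occupying 60.34GiB of the MI210's 63.98GiB, only 0.93GiB is left for the KV cache. Assuming vLLM's default block size of 16 tokens (nothing in these runs overrides it), the reported numbers are self-consistent, as this quick check shows:

$ python3 -c "blocks = 381; block_size = 16; max_model_len = 5296; \
tokens = blocks * block_size; \
print('cacheable tokens:', tokens); \
print('max concurrency: %.2fx' % (tokens / max_model_len))"
cacheable tokens: 6096
max concurrency: 1.15x

The same arithmetic explains the 8b run above: 20354 blocks × 16 tokens = 325664 cacheable tokens, and 325664 / 8192 ≈ 39.75, matching its reported 39.75x maximum concurrency.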

In T385173#10737743, we ran inference latency benchmarks using the upstream ROCm-vLLM image to understand how vLLM performs when serving the aya-expanse-32b model on an AMD MI200 GPU in ml-lab1002. Now that we have ported the upstream image to wmf-debian-vllm, we wanted to verify whether the performance remained consistent. We used the same ROCm MAD framework to run these benchmarks, and the results are shown below:

latency_bar_chart.png (352 KB)

Across all tested input/output lengths, the ported wmf-debian-vllm image shows a latency profile similar to that of the upstream ROCm-vLLM image. The variations between the upstream and ported latencies are minor, which confirms that porting to wmf-debian-vllm did not introduce any meaningful performance regression.
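For anyone who wants to reproduce a rough version of these measurements without the MAD harness, a single-request latency probe in the same style as the tests above works; the prompt and token counts here are illustrative, and ignore_eos forces a fixed output length:

$ docker run --rm --network=host -it \
-e VLLM_USE_TRITON_FLASH_ATTN=0 \
-e HF_TOKEN=<your-token> \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
-v /srv/hf-cache:/home/vllm/.cache/huggingface \
wmf-debian-vllm:fa-slim /app/venv/bin/python -c "
import time; \
from vllm import LLM, SamplingParams; \
llm = LLM(model='CohereForAI/aya-expanse-32b', gpu_memory_utilization=1, max_model_len=5296); \
t = time.perf_counter(); \
llm.generate('Hello, world!', SamplingParams(max_tokens=128, ignore_eos=True)); \
print('end-to-end latency: %.2fs' % (time.perf_counter() - t))"

Note that this only times a single batch-1 request; the MAD framework sweeps input/output lengths and batch sizes, which is what the chart above reports.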

Hi @elukey, following your suggestion in T385173#10538744, we ported the upstream Ubuntu-based ROCm-vLLM image to WMF's Debian Bookworm. The resulting wmf-debian-vllm image, which bundles ROCm, PyTorch, FlashAttention, and vLLM on WMF Debian Bookworm, is ~58GB, and we slimmed it down to ~26GB as shown in T385173#10794682. You can find both the full and slim variants of this image on ml-lab1002.eqiad.wmnet.

We have also published the Dockerfile we used in a GitLab repo: https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm/-/blob/master/Dockerfile

@isarantopoulos and @klausman advised us to get your review of this Dockerfile as we prepare to move the image to the WMF docker registry. We would be grateful if you could take a look when you have a minute.

Thanks in advance!

To make reviewing a bit easier, I would suggest the following: @kevinbazira, can you open a new MR in the same repo and add a "polished" version of the Dockerfile (e.g. removing the proxy env vars etc.)? Then we can do code review as we would anywhere else (and multiple people could take a pass at it). If there is any other suggestion, I am happy to go with it if it helps the process.

@isarantopoulos sure sure, I have created a new MR here: https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm/-/merge_requests/1

cc: @klausman, @elukey

As we prepare the wmf-debian-vllm image for the Wikimedia docker registry in this GitLab MR, we have added multi-stage builds to separate the build and runtime stages, isolated build artifacts like FlashAttention and vLLM in their own stages, combined related RUN commands to minimize layers, and trimmed the runtime dependencies down to what model serving needs. Below is the resulting Dockerfile:

1########################################################
2# wmf-debian-vllm: ROCm, PyTorch, FlashAttention, vLLM #
3# #
4# Note: Multiple RUN commands are intentionally kept #
5# separate to avoid hitting the 4GB (compressed) #
6# Docker layer limit required by the Wikimedia Docker #
7# registry. #
8########################################################
9ARG BASE_IMAGE=docker-registry.wikimedia.org/bookworm:20250413
10FROM ${BASE_IMAGE} AS builder
11
12# — Set proxy env vars required on ml-lab1002 (see: https://phabricator.wikimedia.org/P75284#302759)
13ARG http_proxy
14ENV http_proxy=${http_proxy}
15ENV https_proxy=${http_proxy}
16ENV HTTP_PROXY=${http_proxy}
17ENV HTTPS_PROXY=${http_proxy}
18
19# — Mirror upstream: pin ROCm packages and create 'render' group
20ARG ROCM_VERSION=6.3.1
21ARG AMDGPU_VERSION=6.3.1
22ARG APT_PREF="Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600"
23RUN mkdir -p /app \
24 && groupadd -g 109 render \
25 && printf "$APT_PREF" > /etc/apt/preferences.d/rocm-pin-600
26WORKDIR /app
27
28# — Add AMD ROCm & AMDGPU repositories and keys, and install ROCm libs & Python tooling
29RUN mkdir -p /etc/apt/keyrings \
30 && apt-get update -q \
31 && apt-get install -q -y --no-install-recommends wget gnupg ca-certificates apt-transport-https \
32 && wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/rocm.gpg \
33 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/${AMDGPU_VERSION}/ubuntu jammy main" > /etc/apt/sources.list.d/amdgpu.list \
34 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/${ROCM_VERSION} jammy main" > /etc/apt/sources.list.d/rocm.list \
35 && apt-get update -q \
36 && apt-get install -q -y \
37 rocm \
38 cmake build-essential \
39 python3 python3-pip python3-dev python3-venv \
40 git curl sudo vim \
41 sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \
42 && apt-get purge --auto-remove -y wget gnupg \
43 && rm -rf /var/lib/apt/lists/*
44
45# — Set environment for ROCm and vLLM
46ENV ROCM_PATH=/opt/rocm \
47 VLLM_TARGET_DEVICE=rocm \
48 # For more details on AMD GPU architectures like gfx90a, see: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
49 PYTORCH_ROCM_ARCH=gfx90a \
50 PATH=/opt/rocm/llvm/bin:/opt/rocm/bin:/app/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
51 LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
52
53# — Create a Python virtual environment and a custom temp directory
54RUN python3 -m venv /app/venv \
55 && mkdir -p /opt/tmp
56ENV PATH="/app/venv/bin:${PATH}"
57ENV TMPDIR=/opt/tmp
58
59# — Install ROCm-enabled PyTorch (into the venv)
60RUN pip install --no-cache-dir --pre torch==2.8.0.dev20250508+rocm6.3 \
61 # Using the nightly index (https://download.pytorch.org/whl/nightly/rocm6*) for PyTorch installation due to dependency issues with the stable index (https://download.pytorch.org/whl/rocm6*), as detailed in https://phabricator.wikimedia.org/P75883. This approach aligns with both the vLLM documentation (https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and the ROCm documentation (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html).
62 --index-url https://download.pytorch.org/whl/nightly/rocm6.3
63
64# — Install the AMD SMI Python interface
65RUN pip install --no-cache-dir /opt/rocm/share/amd_smi
66
67# — Install Python build packages required by both FlashAttention and vLLM
68RUN pip install --no-cache-dir setuptools_scm packaging \
69 "cmake<4" ninja wheel setuptools pybind11 Cython
70
71# — Build FlashAttention in a separate stage
72FROM builder AS flashattention-builder
73RUN git clone https://github.com/Dao-AILab/flash-attention.git /app/flash-attn \
74 && cd /app/flash-attn \
75 && git checkout 1a7f4dfa \
76 && git submodule update --init \
77 # For more details on AMD GPU architectures like gfx90a, see: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
78 && GPU_ARCHS=gfx90a python3 setup.py bdist_wheel \
79 && mkdir -p /app/wheels \
80 && mv dist/*.whl /app/wheels \
81 && cd /app \
82 && rm -rf /app/flash-attn
83
84# — Build vLLM in a separate stage
85FROM builder AS vllm-builder
86ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
87ARG VLLM_BRANCH=main
88RUN git clone --branch ${VLLM_BRANCH} ${VLLM_REPO} /app/vllm \
89 && cd /app/vllm \
90 && git checkout c53e073 \
91 && git submodule update --init \
92 && pip install --no-cache-dir -r requirements/rocm.txt \
93 && python3 setup.py bdist_wheel \
94 && mkdir -p /app/wheels \
95 && mv dist/*.whl /app/wheels \
96 && cd /app \
97 && rm -rf /app/vllm
98
99# — Final stage: Create minimal runtime image
100FROM ${BASE_IMAGE} AS runtime
101COPY --from=builder /app/venv /app/venv
102COPY --from=flashattention-builder /app/wheels /app/wheels
103COPY --from=vllm-builder /app/wheels /app/wheels
104
105# — Set proxy env vars in the runtime variant required on ml-lab1002 (see: https://phabricator.wikimedia.org/P75284#302759)
106ARG http_proxy
107ENV http_proxy=${http_proxy}
108ENV https_proxy=${http_proxy}
109ENV HTTP_PROXY=${http_proxy}
110ENV HTTPS_PROXY=${http_proxy}
111
112# — Set runtime environment
113ENV PATH="/app/venv/bin:${PATH}"
114
115# — Install runtime dependencies and Python packages
116RUN apt-get update -q \
117 && apt-get install -q -y --no-install-recommends \
118 python3 \
119 python3-dev \
120 ca-certificates \
121 gcc \
122 libc6-dev \
123 && rm -rf /var/lib/apt/lists/* \
124 && pip install --no-cache-dir /app/wheels/*.whl
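
For reference, here is a minimal sketch of how this dockerfile can be built on ml-lab1002 (the image tag and the proxy URL below are placeholders, not the exact values used):

# Build the image, passing the proxy build-arg required on ml-lab1002
# (PROXY_HOST:PORT is a placeholder for the actual proxy)
$ docker build \
    --build-arg http_proxy=http://PROXY_HOST:PORT \
    -t wmf-debian-vllm:latest .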

The latest wmf-debian-vllm image built from this optimized dockerfile is only ~26.2GB, without using docker-slim as we previously did for the ~58GB image in T385173#10794682:

$ docker images
REPOSITORY                               TAG                                             IMAGE ID       CREATED             SIZE
wmf-debian-vllm                          latest                                          92360f943968   5 minutes ago       26.2GB
rocm/vllm                                rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6   d632a062cd17   3 months ago        35.9GB

Here are the layer sizes:

$ docker history wmf-debian-vllm:latest
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
92360f943968   6 minutes ago    /bin/sh -c apt-get update -q     && apt-get …   2.52GB    
947708683c52   8 minutes ago    /bin/sh -c #(nop)  ENV PATH=/app/venv/bin:/u…   0B        
d04f0318a84c   9 minutes ago    /bin/sh -c #(nop)  ENV HTTPS_PROXY=http://we…   0B        
b77614ede15b   9 minutes ago    /bin/sh -c #(nop)  ENV HTTP_PROXY=http://web…   0B        
82bad66d4278   9 minutes ago    /bin/sh -c #(nop)  ENV https_proxy=http://we…   0B        
a8eb8113fde7   9 minutes ago    /bin/sh -c #(nop)  ENV http_proxy=http://web…   0B        
8899e110119a   9 minutes ago    /bin/sh -c #(nop)  ARG http_proxy               0B        
fea614999ad2   10 minutes ago   /bin/sh -c #(nop) COPY dir:aabf045a9405e27d7…   22.5MB    
e0ac438139c3   10 minutes ago   /bin/sh -c #(nop) COPY dir:7d6e1b3d332c01cfb…   46.7MB    
8de6a72d02f5   10 minutes ago   /bin/sh -c #(nop) COPY dir:5a992406bb8316569…   23.5GB    
76769c10bf7a   4 weeks ago      /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      4 weeks ago      /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      4 weeks ago      /bin/sh -c #(nop) ADD file:73c9d159527bfdc90…   74.8MB 

@elukey, @isarantopoulos: as shown in T385173#10816452, the biggest layer of the wmf-debian-vllm image is ~23.5GB (uncompressed).
I uploaded this image to dockerhub, where the biggest layer is ~4.63GB (compressed): https://hub.docker.com/layers/kevinbazira/wmf-debian-vllm/latest/images/sha256-fa34ab41faa2bafb0b677365a03d8b15f385df2e5e0ec91329485e2682717640

The biggest layer is the venv copy in the runtime variant (the final image): https://phabricator.wikimedia.org/P76040$101

Will we be able to proceed with this compressed size on the wikimedia docker registry?

The compressed layers need to be less than 4GB, so this will not work (a quick way to estimate compressed layer sizes locally is sketched below).
Looking at the largest layer, which is the copy of the venv directory, I see that torch alone takes up 21GB!

cd /app/venv/lib/python3.11/site-packages && du -sh ./* | sort -hr
21G     ./torch
813M    ./triton
193M    ./ray
171M    ./flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so
139M    ./pyarrow
129M    ./llvmlite
119M    ./scipy
114M    ./vllm
102M    ./transformers
78M     ./sympy
78M     ./pandas
77M     ./cmake
74M     ./cv2
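
To sanity-check layers against the 4GB limit before pushing, a rough local estimate is to export the image and gzip each layer tarball. This is a sketch assuming the classic docker save layout with per-layer layer.tar files; gzip only approximates the registry's compression:

# Export the image and gzip each layer tar to estimate its compressed size (bytes)
$ docker save wmf-debian-vllm:latest -o /tmp/wmf-debian-vllm.tar
$ mkdir -p /tmp/layers && tar -xf /tmp/wmf-debian-vllm.tar -C /tmp/layers
$ for l in /tmp/layers/*/layer.tar; do
      printf '%s\t%s\n' "$(gzip -c "$l" | wc -c)" "$l"
  done | sort -nr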

Digging a bit deeper, it seems that if we copy the following top 2 directories separately, we will fall below that limit.

du -sh /app/venv/lib/python3.11/site-packages/torch/lib/*  | sort -hr | head -n 10
10G     /app/venv/lib/python3.11/site-packages/torch/lib/hipblaslt
3.5G    /app/venv/lib/python3.11/site-packages/torch/lib/rocblas
1.6G    /app/venv/lib/python3.11/site-packages/torch/lib/librocsolver.so
1.4G    /app/venv/lib/python3.11/site-packages/torch/lib/librocsparse.so
951M    /app/venv/lib/python3.11/site-packages/torch/lib/libmagma.so

The torch wheel sizes (and the corresponding images) are steadily increasing -- luckily linearly and not exponentially. We should also determine whether this approach remains viable for us going forward, and whether we need to either increase the compressed layer limit or find another way (if one exists). A quick way to track the wheel size over time is sketched below.
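
The ROCm torch wheel size can be checked without installing it, e.g. with pip download against the nightly index already used in the dockerfile above (a sketch; the version pin is the one from our build):

# Download only the torch wheel (no dependencies) and check its size
$ pip download --no-deps --pre torch==2.8.0.dev20250508+rocm6.3 \
    --index-url https://download.pytorch.org/whl/nightly/rocm6.3 -d /tmp/torch-wheel
$ du -h /tmp/torch-wheel/*.whl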

ROCm-enabled PyTorch dependencies like hipblaslt (~10GB) and rocblas (~3.5GB) are the primary contributors to the largest layer in the wmf-debian-vllm image. These large packages are a known unresolved issue upstream: https://github.com/ROCm/ROCm/issues/4224

To meet the wikimedia docker registry's 4GB compressed layer size limit, and following @isarantopoulos's suggestion in T385173#10822741, I have split these two heavy packages out in the build variant and copied them back into the runtime image as separate, smaller layers. Below is the updated dockerfile:

1########################################################
2# wmf-debian-vllm: ROCm, PyTorch, FlashAttention, vLLM #
3# #
4# Note: Multiple RUN commands are intentionally kept #
5# separate to avoid hitting the 4GB (compressed) #
6# Docker layer limit required by the Wikimedia Docker #
7# registry. #
8########################################################
9ARG BASE_IMAGE=docker-registry.wikimedia.org/bookworm:20250413
10FROM ${BASE_IMAGE} AS builder
11
12# — Set proxy env vars required on ml-lab1002 (see: https://phabricator.wikimedia.org/P75284#302759)
13ARG http_proxy
14ENV http_proxy=${http_proxy}
15ENV https_proxy=${http_proxy}
16ENV HTTP_PROXY=${http_proxy}
17ENV HTTPS_PROXY=${http_proxy}
18
19# — Mirror upstream: pin ROCm packages and create 'render' group
20ARG ROCM_VERSION=6.3.1
21ARG AMDGPU_VERSION=6.3.1
22ARG APT_PREF="Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600"
23RUN mkdir -p /app \
24 && groupadd -g 109 render \
25 && printf "$APT_PREF" > /etc/apt/preferences.d/rocm-pin-600
26WORKDIR /app
27
28# — Add AMD ROCm & AMDGPU repositories and keys, and install ROCm libs & Python tooling
29RUN mkdir -p /etc/apt/keyrings \
30 && apt-get update -q \
31 && apt-get install -q -y --no-install-recommends wget gnupg ca-certificates apt-transport-https \
32 && wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/rocm.gpg \
33 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/${AMDGPU_VERSION}/ubuntu jammy main" > /etc/apt/sources.list.d/amdgpu.list \
34 && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/${ROCM_VERSION} jammy main" > /etc/apt/sources.list.d/rocm.list \
35 && apt-get update -q \
36 && apt-get install -q -y \
37 rocm \
38 cmake build-essential \
39 python3 python3-pip python3-dev python3-venv \
40 git curl sudo vim \
41 sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \
42 && apt-get purge --auto-remove -y wget gnupg \
43 && rm -rf /var/lib/apt/lists/*
44
45# — Set environment for ROCm and vLLM
46ENV ROCM_PATH=/opt/rocm \
47 VLLM_TARGET_DEVICE=rocm \
48 # For more details on AMD GPU architectures like gfx90a, see: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
49 PYTORCH_ROCM_ARCH=gfx90a \
50 PATH=/opt/rocm/llvm/bin:/opt/rocm/bin:/app/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
51 LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
52
53# — Create a Python virtual environment and a custom temp directory
54RUN python3 -m venv /app/venv \
55 && mkdir -p /opt/tmp
56ENV PATH="/app/venv/bin:${PATH}"
57ENV TMPDIR=/opt/tmp
58
59# — Install ROCm-enabled PyTorch (into the venv)
60RUN pip install --no-cache-dir --pre torch==2.8.0.dev20250508+rocm6.3 \
61 # Using the nightly index (https://download.pytorch.org/whl/nightly/rocm6*) for PyTorch installation due to dependency issues with the stable index (https://download.pytorch.org/whl/rocm6*), as detailed in https://phabricator.wikimedia.org/P75883. This approach aligns with both the vLLM documentation (https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and the ROCm documentation (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html).
62 --index-url https://download.pytorch.org/whl/nightly/rocm6.3
63
64# — Chunk entire hipblaslt and rocblas directories from PyTorch installation
65# Define paths for clarity. Using Python 3.11 from bookworm base.
66ENV TORCH_LIB_PATH="/app/venv/lib/python3.11/site-packages/torch/lib"
67ENV HIPBLASLT_FULL_PATH="${TORCH_LIB_PATH}/hipblaslt"
68ENV ROCBLAS_FULL_PATH="${TORCH_LIB_PATH}/rocblas"
69
70# — Create parent directory for the torch lib chunks
71RUN mkdir -p /app/torch_lib_chunks
72
73# — Move the entire hipblaslt directory to the chunk location
74RUN if [ -d "${HIPBLASLT_FULL_PATH}" ]; then \
75 mv "${HIPBLASLT_FULL_PATH}" /app/torch_lib_chunks/hipblaslt && \
76 echo "Moved ${HIPBLASLT_FULL_PATH} to /app/torch_lib_chunks/hipblaslt" ; \
77 else \
78 echo "Warning: Directory ${HIPBLASLT_FULL_PATH} not found." ; \
79 fi
80
81# — Move the entire rocblas directory to the chunk location
82RUN if [ -d "${ROCBLAS_FULL_PATH}" ]; then \
83 mv "${ROCBLAS_FULL_PATH}" /app/torch_lib_chunks/rocblas && \
84 echo "Moved ${ROCBLAS_FULL_PATH} to /app/torch_lib_chunks/rocblas" ; \
85 else \
86 echo "Warning: Directory ${ROCBLAS_FULL_PATH} not found." ; \
87 fi
88
89# — Install the AMD SMI Python interface
90RUN pip install --no-cache-dir /opt/rocm/share/amd_smi
91
92# — Install Python build packages required by both FlashAttention and vLLM
93RUN pip install --no-cache-dir setuptools_scm packaging \
94 "cmake<4" ninja wheel setuptools pybind11 Cython
95
96# — Build FlashAttention in a separate stage
97FROM builder AS flashattention-builder
98RUN git clone https://github.com/Dao-AILab/flash-attention.git /app/flash-attn \
99 && cd /app/flash-attn \
100 && git checkout 1a7f4dfa \
101 && git submodule update --init \
103 # For more details on AMD GPU architectures like gfx90a, see: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
103 && GPU_ARCHS=gfx90a python3 setup.py bdist_wheel \
104 && mkdir -p /app/wheels \
105 && mv dist/*.whl /app/wheels \
106 && cd /app \
107 && rm -rf /app/flash-attn
108
109# — Build vLLM in a separate stage
110FROM builder AS vllm-builder
111ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
112ARG VLLM_BRANCH=main
113RUN git clone --branch ${VLLM_BRANCH} ${VLLM_REPO} /app/vllm \
114 && cd /app/vllm \
115 && git checkout c53e073 \
116 && git submodule update --init \
117 && pip install --no-cache-dir -r requirements/rocm.txt \
118 && python3 setup.py bdist_wheel \
119 && mkdir -p /app/wheels \
120 && mv dist/*.whl /app/wheels \
121 && cd /app \
122 && rm -rf /app/vllm
123
124# — Final stage: Create minimal runtime image
125FROM ${BASE_IMAGE} AS runtime
126
127# — Set proxy env vars in the runtime variant required on ml-lab1002 (see: https://phabricator.wikimedia.org/P75284#302759)
128ARG http_proxy
129ENV http_proxy=${http_proxy}
130ENV https_proxy=${http_proxy}
131ENV HTTP_PROXY=${http_proxy}
132ENV HTTPS_PROXY=${http_proxy}
133
134# — Copy venv (main structure; torch/lib will be missing hipblaslt and rocblas because they were moved in builder)
135COPY --from=builder /app/venv /app/venv
136
137# — Define runtime paths corresponding to builder paths. Using Python 3.11 from bookworm base.
138# This directory should typically exist after copying /app/venv.
139ARG TORCH_LIB_PATH="/app/venv/lib/python3.11/site-packages/torch/lib"
140
141# — Copy hipblaslt and rocblas directories back into torch/lib/
142COPY --from=builder /app/torch_lib_chunks/hipblaslt/ "${TORCH_LIB_PATH}/hipblaslt/"
143COPY --from=builder /app/torch_lib_chunks/rocblas/ "${TORCH_LIB_PATH}/rocblas/"
144
145# — Copy pre-built wheels for FlashAttention and vLLM
146COPY --from=flashattention-builder /app/wheels /app/wheels
147COPY --from=vllm-builder /app/wheels /app/wheels
148
149# — Set runtime environment PATH to include venv
150ENV PATH="/app/venv/bin:${PATH}"
151
152# — Install runtime dependencies and Python packages from wheels
153RUN apt-get update -q \
154 && apt-get install -q -y --no-install-recommends \
155 python3 \
156 python3-dev \
157 ca-certificates \
158 gcc \
159 libc6-dev \
160 && rm -rf /var/lib/apt/lists/* \
161 # This pip install will use the venv copied above, now with PyTorch and its libraries restored
162 && pip install --no-cache-dir /app/wheels/*.whl
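
Since the chunked directories must land back exactly where torch expects them, a quick post-build sanity check is to confirm that torch imports and the restored libraries are in place (a sketch; assumes the image was tagged wmf-debian-vllm:latest):

# Verify torch loads and reports its HIP (ROCm) version inside the image
$ docker run --rm wmf-debian-vllm:latest \
    python3 -c "import torch; print(torch.__version__, torch.version.hip)"
# Confirm the copied-back hipblaslt and rocblas directories are present
$ docker run --rm wmf-debian-vllm:latest \
    ls /app/venv/lib/python3.11/site-packages/torch/lib/hipblaslt \
       /app/venv/lib/python3.11/site-packages/torch/lib/rocblas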

The uncompressed layer sizes now show an improvement, with the largest layer being ~10.7GB, down from ~23.5GB as previously reported in T385173#10816452:

$ docker history wmf-debian-vllm:latest
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
1578e7dab5e7   9 minutes ago    |1 TORCH_LIB_PATH=/app/venv/lib/python3.11/s…   2.52GB    
cbeb7a91708a   11 minutes ago   /bin/sh -c #(nop)  ENV PATH=/app/venv/bin:/u…   0B        
0e51a6e0573e   12 minutes ago   /bin/sh -c #(nop) COPY dir:b691cff8ada139f20…   22.5MB    
3e04813fea47   12 minutes ago   /bin/sh -c #(nop) COPY dir:5d6c407f0344156ab…   46.7MB    
9b6880e19ecb   12 minutes ago   /bin/sh -c #(nop) COPY dir:6a7a6baf9a6e470b1…   3.69GB    
5507acdcea0c   13 minutes ago   /bin/sh -c #(nop) COPY dir:22a34b78ee60b28be…   10.7GB    
bd37817781dd   16 minutes ago   /bin/sh -c #(nop)  ARG TORCH_LIB_PATH=/app/v…   0B        
f021b3e36a2d   16 minutes ago   /bin/sh -c #(nop) COPY dir:bc89236df8c5798dc…   9.15GB    
3ffab5948d21   2 hours ago      /bin/sh -c #(nop)  ENV HTTPS_PROXY=http://we…   0B        
a51d1a9b38d9   2 hours ago      /bin/sh -c #(nop)  ENV HTTP_PROXY=http://web…   0B        
c04c10693026   2 hours ago      /bin/sh -c #(nop)  ENV https_proxy=http://we…   0B        
e4c33b2249db   2 hours ago      /bin/sh -c #(nop)  ENV http_proxy=http://web…   0B        
d1e00b2cc76c   2 hours ago      /bin/sh -c #(nop)  ARG http_proxy               0B        
76769c10bf7a   4 weeks ago      /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      4 weeks ago      /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      4 weeks ago      /bin/sh -c #(nop) ADD file:73c9d159527bfdc90…   74.8MB 

The compressed layer sizes on dockerhub also reflect this improvement, with the largest layer now being ~2.61GB: https://hub.docker.com/layers/kevinbazira/wmf-debian-vllm/latest/images/sha256-4a8d8f6fe7cf79429d9b782849466b918f560a2aa982d35bf2931bdf95c2b981
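
The compressed layer sizes can also be read directly from the registry manifest rather than the dockerhub UI (a sketch; the sizes reported in the manifest are in bytes):

# List compressed layer sizes (bytes) from the registry manifest
$ docker manifest inspect kevinbazira/wmf-debian-vllm:latest | grep '"size"'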

NOTE: This is not the first time we have faced the constraint of torch layer sizes exceeding the wikimedia docker registry limit. Over a year ago, similar discussions took place in T359067 and T359569. The resolution at the time was to increase the wikimedia docker registry layer size limit, as detailed in T359569#9654014 and T360637.

Change #1146891 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/docker-images/production-images@master] Add vLLM image

https://gerrit.wikimedia.org/r/1146891

Change #1146991 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] aptrepo: Import AMD ROCm 6.3 packages

https://gerrit.wikimedia.org/r/1146991

Change #1146991 merged by Klausman:

[operations/puppet@production] aptrepo: Import AMD ROCm 6.3 packages

https://gerrit.wikimedia.org/r/1146991

Change #1147739 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] aptrepo: Add two missing deps to thirdparty/rocm63 repo

https://gerrit.wikimedia.org/r/1147739

Change #1147739 merged by Klausman:

[operations/puppet@production] aptrepo: Add two missing deps to thirdparty/rocm63 repo

https://gerrit.wikimedia.org/r/1147739

Change #1146891 merged by Elukey:

[operations/docker-images/production-images@master] Add vLLM image in ML namespace

https://gerrit.wikimedia.org/r/1146891

Change #1227697 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] ml: fix vllm's image builder config

https://gerrit.wikimedia.org/r/1227697

Change #1227697 merged by Elukey:

[operations/docker-images/production-images@master] ml: fix vllm's image builder config

https://gerrit.wikimedia.org/r/1227697

Change #1229531 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_builder: add docker engine settings

https://gerrit.wikimedia.org/r/1229531

Change #1229531 merged by Elukey:

[operations/puppet@production] role::ml_builder: add docker engine settings

https://gerrit.wikimedia.org/r/1229531