
Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs
Open, Needs Triage, Public

Description

As an engineer on the ML team,
I want to have the appropriate software stack to serve LLMs efficiently from our infrastructure using our GPUs, so that LiftWing can power product features. The goals for this quarter are to:

  1. Use vLLM to serve LLMs from the MI210 AMD GPUs
  2. Have the stack ready to do the same using the MI300X GPUs once we acquire them.

The path to do this is to use the rocm/vllm Docker images provided upstream; a minimal serving sketch is shown below.
We also want to investigate whether other frameworks (such as SGLang) would be more suitable.
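As a rough illustration of what "serving" means here, the Python sketch below assumes a vLLM OpenAI-compatible server has already been started from the upstream rocm/vllm container on the GPU host. The endpoint and model name are placeholders, not our production configuration:

```
import requests

# Hypothetical endpoint: a vLLM OpenAI-compatible server started from the
# rocm/vllm container, listening on vLLM's default port 8000.
API_URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "CohereForAI/aya-expanse-32b",  # placeholder model name
    "prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
    "max_tokens": 64,
    "temperature": 0.2,
}

resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```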


Event Timeline

isarantopoulos renamed this task from "Q4 24-25 Simple article summaries: set up the software stack for efficiently serving production LLMs" to "Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs". (Apr 15 2025, 10:32 AM)

What we achieved/learned:
The following benchmarks were run for aya-expanse-32B on two GPUs: MI210 (ml-lab1002) and MI300X (a test machine provided by the vendor).
The image was the upstream-provided rocm/vllm image rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6, and the benchmark used was AMD ROCm's Model Automation and Dashboarding (MAD) framework.
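For context, MAD automates this kind of measurement; the sketch below is not MAD itself, but a minimal hand-rolled Python illustration of measuring request latency and generation throughput against a running vLLM OpenAI-compatible endpoint (URL and model name are placeholders):

```
import time
import requests

API_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL = "CohereForAI/aya-expanse-32b"             # placeholder model name

def bench_once(prompt: str, max_tokens: int = 128) -> None:
    start = time.perf_counter()
    resp = requests.post(
        API_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    # vLLM's non-streaming completions responses include a usage object.
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    print(f"latency: {elapsed:.2f}s, "
          f"throughput: {completion_tokens / elapsed:.1f} tokens/s")

bench_once("Write a one-paragraph summary of the history of Wikipedia.")
```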

Note: the MI210 benchmark was run with FlashAttention disabled due to compatibility issues, which affects the final results. When we port the image, we intend to build it to better support both GPU architectures (gfx90a and gfx942). More information can be found in the related subtask.
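For reference, one way we understand the attention implementation can be toggled in vLLM's ROCm builds (around the 0.6.x versions) is via environment variables such as VLLM_USE_TRITON_FLASH_ATTN. The exact variable and its semantics vary by version, so treat this sketch as an assumption to verify against the vLLM docs, not the configuration we actually used:

```
import os

# Assumption: on ROCm builds of vLLM around 0.6.x, this variable selects
# whether the Triton FlashAttention path is used. It must be set before
# vllm is first imported, since the attention backend is chosen at engine
# initialization time.
os.environ["VLLM_USE_TRITON_FLASH_ATTN"] = "0"

import vllm  # noqa: E402  (imported after the env var on purpose)
print(vllm.__version__)
```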

[Attached: two benchmark result charts (image.png, 361 KB and 331 KB) for the MI210 and MI300X runs]

Next step(s):

The next task toward this goal is to port the upstream docker image https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6/images/sha256-9a12ef62bbbeb5a4c30a01f702c8e025061f575aa129f291a49fbd02d6b4d6c9 to a Debian-based one and upload it to the docker registry from ml-lab1002.

We ported the upstream ROCm vLLM image to build a WMF ROCm vLLM image on ml-lab (T385173#10771940), based on WMF Debian Bookworm with ROCm, PyTorch, and vLLM (P75488). The initial wmf-debian-vllm image (~61GB) was larger than the upstream one (~35GB) due to build dependencies. After identifying the essential runtime dependencies and using docker-slim (P75475, P75478, P75479), we reduced the final image to ~25GB. We verified that vLLM works with the facebook/opt-125m model in both the full and slim wmf-debian-vllm containers (P75492, P75483).
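The verification was essentially a smoke test. A minimal sketch of what such a check can look like with vLLM's offline Python API (the prompt is arbitrary, and this is an illustration rather than the exact script we ran):

```
from vllm import LLM, SamplingParams

# facebook/opt-125m is small enough to confirm quickly that the container's
# ROCm + PyTorch + vLLM stack can load weights and run a forward pass.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```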

We added CK FlashAttention to the wmf-debian-vllm image with ROCm, PyTorch, and vLLM (T385173#10780983), then tested the image with the aya-expanse 8b and 32b models (P75721, P75723); both ran successfully. With FlashAttention the image was ~58GB due to Python and ROCm dependencies; we slimmed it down to ~26GB by keeping only essential runtime dependencies (P75744, P75742). The initial slim image worked with aya-expanse-8b (P75743), but aya-expanse-32b failed with a bus error (P75745). We traced the issue to dependencies that were excluded during slimming because the file-access trace had been generated with the aya-expanse-8b model (T385173#10794682). We then rebuilt the slim image by tracing aya-expanse-32b (P75750); it remained ~26GB and successfully served both models (P75751, P75752).
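The lesson here is that docker-slim only keeps files observed during the traced workload, so the probe has to exercise every model and code path the production container will use. A hypothetical probe script along these lines (the model names are the ones from this task; the script itself is illustrative):

```
import sys
from vllm import LLM, SamplingParams

# Invoke once per model the slim image must serve (e.g. aya-expanse-8b and
# aya-expanse-32b) while docker-slim's sensor records file access, so the
# trace covers every kernel and weight-loading path. Tracing only the 8b
# model missed files the 32b model needed, which caused the bus error.
model_name = sys.argv[1]

llm = LLM(model=model_name)
out = llm.generate(["probe prompt"], SamplingParams(max_tokens=4))
print(model_name, "->", out[0].outputs[0].text)
```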

We have also published the wmf-debian-vllm image build scripts and instructions to a GitLab repo: https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm.

We ran performance benchmarks on the wmf-debian-vllm image, verifying that porting the upstream image didn't cause a performance regression (T385173#10799184). As we prepared to add this image to the Wikimedia docker registry, we optimized it as detailed in T385173#10816452. We further inspected the compressed layers, identified hipblaslt (~10GB) and rocblas (~3.5GB) as the largest, and split those packages into smaller chunks to meet the registry's 4GB compressed-layer limit (T385173#10826281). We prepared a patch (1146891) to add this image to the Wikimedia production-images repo, addressed all review comments, and successfully tested the image on ml-lab1002, resolving infra compatibility issues (P76252, P76288, P76290, P76308) with help from SREs. The patch is ready to merge, but the image build requires more resources (see the Grafana dashboard) than may be available on the build200X hosts, so SREs advise that we build and push this image to the docker registry from ml-lab1002.
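A quick way to sanity-check compressed layer sizes against the registry's 4GB limit is to read them off the image manifest via the Docker Registry v2 HTTP API. A sketch in Python (the registry URL, image name, and tag are placeholders, and anonymous read access is assumed):

```
import requests

REGISTRY = "https://docker-registry.example.org"  # placeholder registry URL
IMAGE = "wmf-debian-vllm"                         # placeholder image name
TAG = "latest"                                    # placeholder tag
LIMIT = 4 * 1024**3                               # 4GB compressed-layer limit

resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    timeout=30,
)
resp.raise_for_status()

# Each manifest layer entry carries the compressed blob size in bytes.
for layer in resp.json()["layers"]:
    size = layer["size"]
    flag = "OVER LIMIT" if size > LIMIT else "ok"
    print(f"{layer['digest'][:24]}  {size / 1024**3:6.2f} GB  {flag}")
```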

Spillover:

  • The slimmed docker image is ready; we still need to push it to the docker registry. The new image no longer has the layer-size limitations we previously faced (no layer is above 4GB).