
Upgrade production vLLM image to use vLLM version >= 0.19
Open, Needs Triage, Public

Description

Update the vLLM production image to a newer version (vLLM 0.19), as done in this WIP patch: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1285395

The deployment of Qwen 3.6-27B fails because the current vLLM base image is amd-vllm014 (vLLM 0.14).
Qwen 3.6-27B-FP8 uses model_type qwen3_5, a new architecture that vLLM 0.14 does not recognize.
vLLM maintains an internal model registry that maps model types to implementation classes; qwen3_5 was added in vLLM 0.17+ and is not present in the 0.14 registry.
Upgrading to vLLM 0.19 resolves the issue for Qwen 3.6-27B because it bundles a newer model registry and an updated transformers dependency, both of which include native qwen3_5 support.
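
For illustration, the failure mode described above can be sketched as a plain dictionary lookup. This is a toy model of a registry, not vLLM's actual code: the registry contents, the class name `Qwen35Model`, and the helper `resolve_model_class` are all hypothetical and exist only to show why an unknown model_type fails on an older registry and resolves on a newer one.

```python
class UnsupportedModelError(ValueError):
    """Raised when a model_type has no entry in the registry."""


# Toy registries (hypothetical contents). The "old" one stands in for a
# release that predates qwen3_5; the "new" one adds an entry for it.
REGISTRY_OLD = {"example_arch": "ExampleModel"}
REGISTRY_NEW = {**REGISTRY_OLD, "qwen3_5": "Qwen35Model"}  # hypothetical class name


def resolve_model_class(config, registry):
    """Look up the implementation class for a model config.

    `config` mimics the parsed config.json of a model and must contain
    a "model_type" key.
    """
    model_type = config["model_type"]
    try:
        return registry[model_type]
    except KeyError:
        raise UnsupportedModelError(
            f"model_type {model_type!r} is not in the registry"
        ) from None


if __name__ == "__main__":
    config = {"model_type": "qwen3_5"}
    print(resolve_model_class(config, REGISTRY_NEW))
    try:
        resolve_model_class(config, REGISTRY_OLD)
    except UnsupportedModelError as err:
        print(f"old registry: {err}")
```

With the old registry the lookup fails before any weights are loaded, which matches the symptom of a deployment that cannot start at all rather than one that starts and then misbehaves.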

Event Timeline

isarantopoulos renamed this task from "Update the vllm to a newer version." to "Upgrade production vLLM image to use vLLM version >= 0.19". Wed, May 20, 6:05 AM

@kevinbazira @DPogorzelski-WMF Since you worked on the previous build of this image, is there any documentation available for this process?
Otherwise, given that it is a special case, as part of this task we should also add documentation to https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Inference_Services/Production_Image_Development

Here is documentation on how to build WMF production base images: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/README.md

Once the vLLM image has been built and tested, Dawid can help publish it to the Wikimedia Docker registry using ml-build1001, as only SREs have the rights to do so.

Hey folks, as an FYI: with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1290808 the /ml prefix in the Docker registry is now served by an S3 backend, which will hopefully be more performant and stable than Swift. The vLLM images were all copied over, so you shouldn't see any issues when pulling (let me know otherwise).

Bonus point: this change makes it a little easier to bump the layer size limit on the Registry only for ML, if needed in the future. This doesn't mean that we can bump the limit to 10G and forget about all the work that Kevin did; that is still needed :) But we'll be able to overcome blockers like not being able to push because one layer exceeds the limit by 500 MB after the Docker image has been optimized, etc.

Let me know if you encounter any issues!

Change #1285395 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] (WIP) ml: add vLLM 0.19.1 image

https://gerrit.wikimedia.org/r/1285395

Not related to the messages above, but we should try to use the latest vLLM version we can here, because it is unclear whether we can support the cope-b model with 0.19 (it requires 0.20). Qwen3.6, on the other hand, is supported by 0.19.
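
To make the version question concrete, here is a minimal sketch of how the target version could be picked from the models we want to serve. The `MIN_VLLM` table only encodes what is stated in this thread (Qwen3.6 works on 0.19, cope-b requires 0.20); these are not verified minimums, and the helper names are hypothetical. `importlib.metadata.version` is the standard-library call for reading an installed package's version.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Parse leading numeric components, e.g. '0.19.1' -> (0, 19, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.19' and '0.19.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Versions as stated in this thread; assumptions, not verified minimums.
MIN_VLLM = {"Qwen3.6-27B": "0.19", "cope-b": "0.20"}


def required_version(models):
    """Return the highest minimum version across the given models."""
    return max((MIN_VLLM[m] for m in models), key=parse_version)


def installed_vllm_version():
    """Return the installed vLLM version string, or None if not installed."""
    try:
        return version("vllm")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    target = required_version(["Qwen3.6-27B", "cope-b"])
    current = installed_vllm_version()
    print(f"target: {target}, installed: {current}")
    if current is not None and not meets_minimum(current, target):
        print("installed vLLM is too old for the requested models")
```

Under these assumptions, an image built on 0.19.1 satisfies the Qwen3.6 entry but not the cope-b one, which is the argument for building on the newest version that works.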