Previously (T357986, T354870), we added a KServe huggingfaceserver to the ML isvc repo and tested it using the huggingface backend, because the base image it relied on (docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-2) doesn't support the vllm backend.
In T385173, we built a new base image (docker-registry.wikimedia.org/amd-vllm085) that supports ROCm-enabled vLLM and would like to use it to:
- build a KServe huggingfaceserver that supports the vllm backend
- run inference with the vllm backend
- compare inference latency between the old huggingfaceserver (huggingface backend) and the new huggingfaceserver (vllm backend)
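For the latency comparison, a minimal sketch of a timing harness might look like the following. It is an illustration only, not the benchmark used for this task: `measure_latencies` and `summarize` are hypothetical helper names, and the `send_request` callable stands in for whatever HTTP call is made to each isvc (endpoint paths and payloads are not specified here, so they are left out).

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(send_request: Callable[[], object],
                      runs: int = 20, warmup: int = 3) -> List[float]:
    """Time `runs` calls of send_request and return latencies in seconds.

    `warmup` calls are made first and not timed, so one-off costs
    (model loading, connection setup) do not skew the results.
    """
    for _ in range(warmup):
        send_request()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Return mean, median and nearest-rank p95 of the latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


if __name__ == "__main__":
    # Placeholder callables: in a real run each would send the same
    # prompt to the old (huggingface) and new (vllm) isvc respectively.
    backends = {
        "huggingface": lambda: time.sleep(0.002),
        "vllm": lambda: time.sleep(0.001),
    }
    for name, call in backends.items():
        stats = summarize(measure_latencies(call, runs=10, warmup=2))
        print(name, {k: round(v, 4) for k, v in stats.items()})
```

Sending the same prompts, with the same generation parameters, to both servers keeps the comparison fair. Reporting the median and p95 alongside the mean helps show whether a difference is consistent or driven by a few slow requests.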
Update
Following T385173#11690913, the wmf-debian-vllm image is now available in the Wikimedia docker registry: https://docker-registry.wikimedia.org/ml/amd-vllm014/tags/
We used this ROCm-enabled vLLM in the embeddings isvc, and it performed better than the HuggingFace transformers backend, as detailed in T395019#11712061.