There is a plan to include a prebuilt model server for LLMs that is very close to what we were discussing and is also based on the vLLM runtime (kserve already has an experimental vLLM runtime).
More specifically, the huggingface model server is the implementation that provides out-of-the-box support for HF models.
Pasting from the README.md:
> The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box. The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification, token-classification, text-generation, text2text generation. Based on the performance requirement, you can choose to perform the inference on a more optimized inference engine like triton inference server and vLLM for text generation.
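For context, a minimal sketch of how we would query such a runtime through kserve's v1 predict protocol is shown below. The host, model name and exact payload shape are assumptions for illustration and may differ depending on the ML task the server is configured for.

```python
import requests

# Hypothetical Lift Wing host and model name, used purely for illustration.
LIFT_WING_HOST = "https://inference.example.org"
MODEL_NAME = "bloom-560m"

# kserve v1 predict protocol: POST /v1/models/<name>:predict with an
# "instances" list. The HF runtime's preprocess/postprocess handlers are
# expected to turn this into the task-specific input (here: text generation).
payload = {"instances": ["Barcelona is a city in"]}

resp = requests.post(
    f"{LIFT_WING_HOST}/v1/models/{MODEL_NAME}:predict",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # the v1 protocol returns {"predictions": [...]}
```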
This involves a custom runtime server, which means that we'll need to mirror/upload the image to our docker registry in order to use it from kserve.
At the moment this seems like the most promising solution, as we won't have to maintain the dependencies ourselves and we can engage more with the community and contribute if something we need isn't supported yet.
The full HF model server with vLLM integration is expected in kserve 0.12, along with the new generate endpoint.
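As a rough idea of what that could look like from a client's perspective, here is a sketch of calling a generate-style endpoint. The path, payload fields and parameter names are assumptions based on the proposed open inference protocol extension for generative models and will need to be confirmed against the actual 0.12 release.

```python
import requests

# Hypothetical host/model; endpoint path and body fields are assumptions.
LIFT_WING_HOST = "https://inference.example.org"
MODEL_NAME = "llama-2-7b"

payload = {
    "text_input": "Write one sentence about Wikipedia.",
    "parameters": {"max_tokens": 64, "temperature": 0.7},  # assumed parameter names
}

resp = requests.post(
    f"{LIFT_WING_HOST}/v2/models/{MODEL_NAME}/generate",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # expected to contain the generated text
```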
As part of this task we'll:
- Add the upstream huggingface model server docker image to WMF's docker registry so that we can use it in Lift Wing.
- Test its support for ROCm and AMD GPUs: if it works out of the box (as recent HF versions suggest) we are good to go; otherwise we'll build another image based on it that includes the ROCm version of PyTorch. That image will use the kserve HF image as its base, and we'll build it with Blubber in the same way we do for the rest of the inference services. A quick verification snippet is sketched after this list.
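To check the ROCm point above, a small snippet like the following (run inside the candidate image on a node with an AMD GPU) should tell us whether the bundled PyTorch build supports ROCm. This is a sketch assuming a standard PyTorch installation inside the image.

```python
import torch

# Run inside the candidate image on a node with an AMD GPU.
print("PyTorch version:", torch.__version__)
print("ROCm (HIP) build:", torch.version.hip)        # None on CUDA/CPU-only builds
print("GPU available:", torch.cuda.is_available())   # ROCm devices are exposed via the CUDA API
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```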