Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T353337 Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models
Open | | isarantopoulos | T354257 Investigate inference optimization frameworks for Large Language Models (LLMs)
Event Timeline
There is a plan in kserve to include a prebuilt model server for LLMs that is very close to what we were discussing and is also based on vLLM (kserve already has an experimental vLLM runtime).
More specifically, the huggingface model server is the implementation that provides out-of-the-box support for HF models.
Pasting from the README.md:
The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box. The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification, token-classification, text-generation, text2text generation. Based on the performance requirement, you can choose to perform the inference on a more optimized inference engine like triton inference server and vLLM for text generation.
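To make the per-task handler idea from the README concrete, here is a minimal sketch of how preprocess and post-process steps could be selected by ML task. This is not the kserve implementation: the `HANDLERS` table, the function names and the toy "model" callables are all hypothetical and only illustrate the dispatch pattern.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical handler pair: (preprocess, postprocess). These names are
# illustrative only and are not taken from the kserve code base.
Handler = Tuple[Callable[[str], Dict[str, Any]], Callable[[Any], Dict[str, Any]]]


def preprocess_text(text: str) -> Dict[str, Any]:
    # Stand-in for tokenization: split the input on whitespace.
    return {"tokens": text.split()}


def postprocess_classification(scores: List[float]) -> Dict[str, Any]:
    # Report the index of the highest score as the predicted label.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {"label": best, "score": scores[best]}


def postprocess_generation(tokens: List[str]) -> Dict[str, Any]:
    # Join generated tokens back into a single string.
    return {"text": " ".join(tokens)}


# One entry per supported task; unsupported tasks are rejected in run().
HANDLERS: Dict[str, Handler] = {
    "text-classification": (preprocess_text, postprocess_classification),
    "text-generation": (preprocess_text, postprocess_generation),
}


def run(task: str, text: str, model: Callable[[Dict[str, Any]], Any]) -> Dict[str, Any]:
    # Pick the handlers for the task, then chain preprocess -> model -> postprocess.
    if task not in HANDLERS:
        raise ValueError(f"unsupported task: {task}")
    preprocess, postprocess = HANDLERS[task]
    return postprocess(model(preprocess(text)))
```

Calling `run("text-classification", "some text", model)` with any callable that returns a list of scores exercises the classification path; swapping the task string selects a different post-process step without touching the caller.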
Using this runtime involves a custom runtime server, which means that we'll need to mirror/upload the image to our docker registry in order to use it from kserve.
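As a rough sketch of what pointing kserve at a mirrored image could look like, the snippet below builds an `InferenceService` manifest with a custom predictor container and prints it as JSON (which `kubectl apply -f` also accepts). The registry host, image path, tag and service name are placeholders I made up for illustration, and any server arguments the runtime needs are deliberately left out.

```python
import json

# Hypothetical values: the registry host, image path and tag below are
# placeholders, not real locations.
MIRRORED_IMAGE = "docker-registry.example.org/kserve/huggingfaceserver:TAG"


def build_inference_service(name: str, image: str) -> dict:
    # Minimal InferenceService using a custom container as the predictor.
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "containers": [
                    {"name": "kserve-container", "image": image}
                ]
            }
        },
    }


# "hf-llm" is a made-up service name.
manifest = build_inference_service("hf-llm", MIRRORED_IMAGE)
print(json.dumps(manifest, indent=2))
```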
So far this seems like the most promising solution: we won't have to maintain the dependencies ourselves, and we can engage more with the community and contribute if we need something that isn't supported yet.
The full HF model server with vLLM integration is expected in kserve 0.12, together with the new generate endpoint.
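For reference, a call to the generate endpoint might be built roughly as follows. This is a sketch under assumptions: the `/v2/models/<name>/generate` path and the `text_input`/`parameters` body follow my reading of the generate extension proposal for the open inference protocol, and the final kserve 0.12 API may differ. The host, model name and the `max_tokens` parameter are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_generate_request(
    base_url: str,
    model_name: str,
    prompt: str,
    parameters: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    # Assumed path and body shape; check the released kserve docs before relying on them.
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/generate"
    body = {"text_input": prompt, "parameters": parameters or {}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, model name and parameter.
request = build_generate_request(
    "http://localhost:8080", "my-llm", "Hello", {"max_tokens": 20}
)
```

Passing `request` to `urllib.request.urlopen` would send it once a server is actually running at that address.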