Investigate inference optimization frameworks for Large Language Models (LLMs)
Open, Needs Triage | Public | 3 Estimated Story Points

Event Timeline

isarantopoulos renamed this task from Investigate inference optimization frameworks for Large models to Investigate inference optimization frameworks for Large Language Models (LLMs). Jan 3 2024, 2:39 PM
isarantopoulos updated the task description.
calbon set the point value for this task to 3.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

There is a plan to include a prebuilt model server for LLMs in KServe that is very close to what we were discussing and is also based on vLLM (KServe already has an experimental vLLM runtime).
More specifically, the Hugging Face model server is the implementation intended to support HF models out of the box.
Pasting from the README.md:

The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box. The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification, token-classification, text-generation, text2text generation. Based on the performance requirement, you can choose to perform the inference on a more optimized inference engine like triton inference server and vLLM for text generation.

This involves a custom runtime server, which means that we'll need to mirror/upload the image to our Docker registry in order to use it from KServe.
So far this seems like the most promising solution: we won't have to maintain the dependencies ourselves, and we can engage more with the community and contribute if we need something that isn't supported yet.
Full HF model server support with vLLM integration is expected in KServe 0.12, together with the new generate endpoint.
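
For reference, here is a rough sketch of what a client request to the generate endpoint might look like. The URL path (`/v2/models/<name>/generate`) and the `text_input` / `parameters` fields are assumptions based on the proposed generate extension of the open inference protocol and may differ in the released version; the host, the model name and the `build_generate_request` helper are hypothetical. The snippet only builds the request and does not send it.

```python
import json
from urllib.parse import quote


def build_generate_request(base_url, model_name, prompt, max_tokens=64):
    """Build (url, body, headers) for a generate-style request.

    Assumption: the endpoint path and payload field names follow the
    proposed generate extension; verify against the deployed KServe version.
    """
    # Hypothetical path layout: <base>/v2/models/<model_name>/generate
    url = f"{base_url.rstrip('/')}/v2/models/{quote(model_name, safe='')}/generate"
    # Assumed payload shape: a prompt plus free-form generation parameters
    payload = {"text_input": prompt, "parameters": {"max_tokens": max_tokens}}
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


if __name__ == "__main__":
    # Hypothetical host and model name, for illustration only
    url, body, headers = build_generate_request(
        "http://localhost:8080", "my-llm", "What is KServe?"
    )
    print(url)
    print(body.decode("utf-8"))
```

The resulting tuple can be passed to any HTTP client (for example `urllib.request.Request(url, data=body, headers=headers)`) once the actual endpoint is confirmed.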