Investigate inference optimization frameworks for Large Language Models (LLMs)
Open, Needs Triage · Public · 3 Estimated Story Points

Assigned To

Authored By

	isarantopoulos
	Jan 3 2024, 11:58 AM

Description

Resources:

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T353337 Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models
		Open		isarantopoulos	T354257 Investigate inference optimization frameworks for Large Language Models (LLMs)

Event Timeline

isarantopoulos created this task. Jan 3 2024, 11:58 AM

Restricted Application added a subscriber: Aklapper. Jan 3 2024, 11:58 AM

isarantopoulos renamed this task from Investigate inference optimization frameworks for Large models to Investigate inference optimization frameworks for Large Language Models (LLMs). Jan 3 2024, 2:39 PM

isarantopoulos added a parent task: T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models.

isarantopoulos updated the task description.

isarantopoulos updated the task description. Jan 3 2024, 2:47 PM

ChromboKen subscribed. Jan 5 2024, 7:29 AM

calbon assigned this task to isarantopoulos. Jan 9 2024, 3:23 PM

calbon set the point value for this task to 3.

calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

fbalicchia subscribed. Jan 9 2024, 5:13 PM

There is a plan in KServe to include a prebuilt model server for LLMs that is very close to what we were discussing and is also based on vLLM (KServe already has an experimental vLLM runtime).
More specifically, the Hugging Face model server is the implementation that provides out-of-the-box support for HF models.
Pasting from the README.md:

The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box. The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification, token-classification, text-generation, text2text generation. Based on the performance requirement, you can choose to perform the inference on a more optimized inference engine like triton inference server and vLLM for text generation.

This involves a custom runtime server, which means that we'll need to mirror/upload the image to our Docker registry in order to use it from KServe.
So far this seems like the most promising solution: we won't have to maintain the dependencies ourselves, and we can engage more with the community and contribute if we need something that isn't supported yet.
The full HF model server with vLLM integration is expected in KServe 0.12 with the new generate endpoint.
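
To make the mirroring step concrete, here is a minimal sketch of the usual pull/tag/push sequence for copying an upstream image into a private registry. The image name `kserve/huggingfaceserver:latest` and the registry host `registry.example.org` are placeholders I made up for illustration, not the real image tag or our actual registry, and the helper `mirror_commands` is hypothetical. It only builds the command lists and does not run Docker.

```python
from __future__ import annotations


def mirror_commands(source_image: str, target_registry: str) -> list[list[str]]:
    """Build the docker commands that copy source_image into target_registry.

    Both arguments are illustrative placeholders; nothing is executed here.
    """
    # A leading path component is treated as a registry host when it looks
    # like one (contains '.' or ':' or is 'localhost'); it is then dropped so
    # the repository path can be re-rooted under the target registry.
    first, sep, rest = source_image.partition("/")
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    repo_path = rest if has_host else source_image

    target_image = f"{target_registry.rstrip('/')}/{repo_path}"
    return [
        ["docker", "pull", source_image],
        ["docker", "tag", source_image, target_image],
        ["docker", "push", target_image],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in mirror_commands("kserve/huggingfaceserver:latest", "registry.example.org"):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the real image name and registry are known. Whether the team's registry accepts direct pushes or needs a separate build pipeline is not covered by this sketch.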

isarantopoulos mentioned this in T354870: Deploy 7b parameter models from HF. Feb 19 2024, 3:52 PM
