
Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models
Open, Needs Triage, Public

Description

As an engineer,
I want to optimize the inference performance of transformer models served with PyTorch on AMD GPUs (MI100), so that I can achieve faster predictions with Large Language Models.
My goal is to identify and mitigate performance bottlenecks by leveraging techniques like quantization and efficient/smart batching, and to explore the boundaries of this specific GPU: what is the largest model we can host, and how fast does it run?
As part of this task we also want to document the extent of ROCm support in the libraries commonly used for inference optimization (e.g. accelerate, bitsandbytes, vLLM) and narrow down the options we have for AMD GPUs.
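The "largest model we can host" question above can be approached with a back-of-the-envelope memory estimate. A minimal sketch, assuming a simple pad-free weight count: the 20% margin for activations/KV cache is an illustrative assumption, not a measured value (the MI100 has 32 GiB of HBM2):

```python
# Rough GPU memory estimate for hosting an LLM at different precisions.
# The 1.2x overhead margin for activations and KV cache is an assumed
# illustrative figure, not a measurement.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Memory needed just for the weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

def fits_on_gpu(n_params: float, dtype: str, gpu_gib: float = 32.0,
                overhead: float = 1.2) -> bool:
    """Check whether weights plus a ~20% margin fit on a single GPU
    (32 GiB default matches the MI100's HBM2 capacity)."""
    return weight_memory_gib(n_params, dtype) * overhead <= gpu_gib

# A 7B model in fp16 needs ~13 GiB of weights and fits comfortably;
# a 13B model in fp16 (~24 GiB of weights) leaves little headroom,
# and in fp32 it does not fit at all.
for n_params, name in [(7e9, "7B"), (13e9, "13B")]:
    for dtype in ("fp16", "int8"):
        print(name, dtype,
              round(weight_memory_gib(n_params, dtype), 1),
              fits_on_gpu(n_params, dtype))
```

The same arithmetic explains why quantization (int8/int4) is the main lever for pushing beyond 13B parameters on this card.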

Resources: PyTorch Model Inference optimization checklist, Huggingface GPU Inference optimization

Event Timeline

isarantopoulos renamed this task from Goal: Inference Optimization for Hugging face models to Goal: Inference Optimization for Hugging face/Pytorch models. Dec 13 2023, 3:55 PM

We now have access to perform operations on running pods in ml-staging-codfw (edit/exec/delete), so we can start working directly on the GPU.

Current status from the relevant subtask:
At the moment we are working on how to best serve 7B-parameter models on a GPU. We are using the Hugging Face serving runtime provided by KServe.
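For context, deploying a model through the KServe Hugging Face runtime looks roughly like the sketch below; the service name, model ID, and the `amd.com/gpu` resource name (from the AMD device plugin) are illustrative assumptions, not our production config:

```yaml
# Hypothetical InferenceService using KServe's huggingface runtime.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo            # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm-demo
        - --model_id=some-org/some-7b-model   # placeholder model ID
      resources:
        limits:
          amd.com/gpu: "1"  # request one AMD GPU via the device plugin
```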
We plan to experiment with hosting larger models (>13B) and to explore the tradeoffs between serving a 7B model and a 13B model compressed via quantization or simple downcasting, which would have similar serving times.
If we remain in the area of 7B-parameter models, it is unlikely that we will use a vanilla version for any use case; however, fine-tuned versions of these LLMs may be good candidates for specific use cases.
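The "efficient/smart batching" mentioned in the description can be sketched as length bucketing: sort pending requests by length before batching so that pad-to-longest batches waste little padding. Character length here is a stand-in for token length, and the request strings are illustrative, not a real API:

```python
# Sketch of length-bucketed ("smart") batching: group requests of
# similar length so each pad-to-longest batch wastes little padding.
from typing import List

def bucket_requests(requests: List[str], max_batch: int = 8) -> List[List[str]]:
    """Sort requests by length, then slice into batches of at most max_batch."""
    ordered = sorted(requests, key=len)
    return [ordered[i:i + max_batch] for i in range(0, len(ordered), max_batch)]

def padding_waste(batch: List[str]) -> int:
    """Padding (in characters) added when the batch is padded to its longest item."""
    longest = max(len(r) for r in batch)
    return sum(longest - len(r) for r in batch)

# Mixed-length arrivals: naive arrival-order batching pads short
# prompts up to the long ones; bucketing avoids most of that.
requests = ["hi", "x" * 200, "a short prompt", "y" * 190]
naive = [requests[:2], requests[2:]]            # arrival order
smart = bucket_requests(requests, max_batch=2)  # length-bucketed
print(sum(map(padding_waste, naive)), sum(map(padding_waste, smart)))
```

On the example above, bucketing cuts total padding from hundreds of characters to a few dozen, which translates directly into fewer wasted FLOPs per batch on the GPU.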

calbon renamed this task from Goal: Inference Optimization for Hugging face/Pytorch models to Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models. Apr 16 2024, 2:51 PM