As an engineer,
I want to optimize the inference performance of Transformers models using PyTorch on AMD GPUs (MI100), so that I can achieve faster predictions from Large Language Models.
My goal is to identify and mitigate performance bottlenecks using techniques such as quantization and efficient/smart batching, and to explore the limits of this specific GPU: what is the largest model we can host, and how fast does it run? (See the sketch below.)
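A minimal sketch of the two techniques named above, quantized loading plus batched generation, using the Transformers/bitsandbytes/Accelerate stack. The model id is a placeholder, and whether the bitsandbytes 8-bit path actually works on ROCm/MI100 is one of the open questions this task should answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder; swap in the target LLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # many causal LMs ship without a pad token
tokenizer.padding_side = "left"             # left padding is preferred for generation

# 8-bit quantization via bitsandbytes; ROCm support for this path is
# exactly what we need to verify as part of this task.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # requires accelerate
    torch_dtype=torch.float16,
)

# Simple batched inference: pad a batch of prompts into one tensor and generate once.
prompts = [
    "ROCm support for inference is",
    "The largest model an MI100 can host is",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Measuring tokens/second and peak memory for this loop at increasing batch sizes and model sizes gives the "how large / how fast" numbers the story asks for.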
As part of this task we also want to document the extent of ROCm support in the libraries commonly used for inference optimization (e.g. Accelerate, bitsandbytes, vLLM) and narrow down the options available for AMD GPUs.
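As a starting point for that documentation, a small check of what the installed PyTorch build reports on the MI100; on ROCm builds the `torch.cuda` API is routed to HIP, so AMD GPUs show up through the same calls.

```python
import torch

# torch.version.hip is set on ROCm builds and None on CUDA builds.
print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    # VRAM is the first constraint on "largest model we can host"
    # (an MI100 has 32 GB of HBM2).
    print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GiB")
```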
Resources: PyTorch Model Inference optimization checklist, Hugging Face GPU Inference optimization