As an engineer,
I want to deploy a 7b parameter model from HuggingFace on our MI100 AMD GPU on ml-staging,
so that I can identify potential challenges and bottlenecks in deploying such models.
As part of this task I will deploy falcon-7b, one of the open source models available under the Apache 2.0 license.
After successfully deploying the model we want to record inference latency for the "vanilla" version and then incrementally experiment with inference optimization techniques like 8-bit/4-bit quantization and Flash Attention 2, to verify that they can be used on our GPU and record the improvements.
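To record inference latency consistently across the "vanilla" and optimized variants, a small timing harness along these lines could be used (a sketch; the predict callable stands in for the actual model server call and is not our real code):

```python
import statistics
import time

def measure_latency(predict, prompt, runs=5, warmup=1):
    """Time repeated calls to a predict callable and report basic stats."""
    for _ in range(warmup):
        predict(prompt)  # warm-up calls are excluded from the measurement
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(prompt)
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "p50_s": statistics.median(samples),
        "max_s": max(samples),
    }

# Example with a stand-in predict function:
stats = measure_latency(lambda p: p.upper(), "What is Wikipedia?", runs=3)
```

Running the same harness before and after enabling each optimization gives comparable numbers for the task.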
Description
Details
Related Objects
- Mentioned In
  - rMLIS1b0ac4aa9215: huggingface: bump vllm to 0.4.3
  - T360822: [Epic] LLM integration for task summaries in baseline metrics tool
  - rMLISe846331344a2: llm: update transformers module
- Mentioned Here
  - T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0)
  - T354257: Investigate inference optimization frameworks for Large Language Models (LLMs)
  - T334583: [Spike] Run models and frameworks on AMD GPU and identify challenges
Event Timeline
We start by deploying the falcon 7b model (after the transformers update) so that we can continue where we left off a couple of months ago: https://phabricator.wikimedia.org/T334583.
At the time we hit a wall with our hardware (GPU with 16GB VRAM): we could load a 7b model but would run into OOM errors while running inference. Quantization also wasn't working on our AMD GPUs at the time (at least not the way we tried it).
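A back-of-the-envelope calculation makes the 16GB wall plausible: the weights alone of a 7b model in fp16 nearly fill the card before any activations or KV cache are allocated. A rough sketch, not an exact accounting:

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory needed just to hold the model weights."""
    return n_params * bytes_per_param / 1024**3

PARAMS = 7e9  # falcon-7b, roughly

fp16_gb = weight_memory_gb(PARAMS, 2)    # ~13.0 GB: barely fits in 16 GB VRAM
int8_gb = weight_memory_gb(PARAMS, 1)    # ~6.5 GB: why 8-bit quantization helps
int4_gb = weight_memory_gb(PARAMS, 0.5)  # ~3.3 GB: 4-bit halves it again
```

This is also why the quantization experiments matter: they are what would leave headroom for inference-time allocations on a 16GB card.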
Change 989831 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] llm: update transformers module
Change 989913 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] WIP:ml-services: deploy falcon 7b on GPU
Change 989831 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] llm: update transformers module
Change 989913 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy falcon 7b on GPU
Change 990044 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging
Change 990044 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging
Change 990699 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase limitranges
Change 990699 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase limitranges
Change 990705 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory
Change 990705 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory
Falcon, llama and mistral (and mixtral) models have been incorporated into the transformers library, so we no longer need to use trust_remote_code=True.
In the ml-staging deployment we're still getting some OOM (out of memory) errors, but locally I was able to run it on CPU. Using the transformers code (without the trust_remote_code argument, which would cause legacy code to run), prediction takes ~65 seconds for a length of 100 tokens, whereas the legacy code takes 76s.
Continuing to do some memory profiling to figure out why we get OOM errors.
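The improvement from the in-library code path works out to roughly 14% less wall-clock time; simple arithmetic on the figures above:

```python
def tokens_per_second(n_tokens, seconds):
    return n_tokens / seconds

native = tokens_per_second(100, 65)  # ~1.54 tok/s with the in-library code
legacy = tokens_per_second(100, 76)  # ~1.32 tok/s with the trust_remote_code path
speedup_pct = (76 - 65) / 76 * 100   # ~14.5% reduction in latency
```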
Change 1003488 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ
At the moment I'm working on utilizing quantization on our AMD GPUs. AutoGPTQ seems like a prominent solution for ROCm.
However I'm facing issues with dependencies.
INFO:datasets:PyTorch version 2.0.1+rocm5.4.2 available.
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 139, in <module>
    model = llm_class(model_name)
  File "/srv/app/llm/model.py", line 28, in __init__
    self.model, self.tokenizer = self.load()
  File "/srv/app/llm/model.py", line 47, in load
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/modeling_utils.py", line 3026, in from_pretrained
    from optimum.gptq import GPTQQuantizer
  File "/opt/lib/python/site-packages/optimum/gptq/__init__.py", line 15, in <module>
    from .quantizer import GPTQQuantizer, load_quantized_model
  File "/opt/lib/python/site-packages/optimum/gptq/quantizer.py", line 42, in <module>
    if is_auto_gptq_available():
  File "/opt/lib/python/site-packages/optimum/utils/import_utils.py", line 121, in is_auto_gptq_available
    raise ImportError(
ImportError: Found an incompatible version of auto-gptq. Found version 0.4.2+rocm5.4.2, but only version above 0.4.99 are supported
Versions above 0.4.99 are supported, but only 0.4.2 is available for rocm5.4.2. I'll investigate whether we can use a newer ROCm version.
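The ImportError comes from a plain version gate in optimum; the comparison it performs can be reproduced like this (a sketch of the logic, not optimum's actual code):

```python
def parse_version(v):
    """Turn '0.4.2+rocm5.4.2' into a comparable tuple, dropping the local '+rocm...' part."""
    base = v.split("+")[0]
    return tuple(int(x) for x in base.split("."))

AUTOGPTQ_MINIMUM_VERSION = (0, 4, 99)

def is_auto_gptq_compatible(installed):
    return parse_version(installed) > AUTOGPTQ_MINIMUM_VERSION

# The only wheel built for rocm5.4.2 fails the gate:
ok = is_auto_gptq_compatible("0.4.2+rocm5.4.2")  # False
```

So any fix has to come from a newer auto-gptq build, which in turn requires a newer ROCm.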
From what I've seen so far, there are a lot of moving parts in the area of LLM deployments. Things are moving fast, but at the same time they aren't stable.
In order to end up with more stable deployments, I would recommend we pause the falcon deployment and put our efforts into https://phabricator.wikimedia.org/T354257.
Change #1017858 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct
Change #1017858 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct
Change #1018633 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct
Change #1018633 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct
Change #1018646 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources
Change #1018646 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources and increase memory
Change #1003488 abandoned by Ilias Sarantopoulos:
[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ
Reason:
We are focusing on working with the huggingface image so this is not needed at the moment
Change #1047106 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy llama3
Change #1047106 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy llama3
I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:
time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -X POST -d '{"model": "llama3", "prompt": "What is Wikipedia?", "stream":false, "max_tokens": 50 }' -H "Host: llama3.experimental.wikimedia.org" -H "Content-Type: application/json"
{"id":"aa9fe1cc-641e-487e-abab-2d8ddb5e7ea9","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":" Wikipedia is a free online encyclopedia that allows anyone with an internet connection to access and contribute to its vast repository of knowledge. It was founded in 2001 by Jimmy Wales and Larry Sanger, and has since become one of the most popular and widely"}],"created":1718783716,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":50,"prompt_tokens":5,"total_tokens":55}}

real 0m2.359s
user 0m0.022s
sys 0m0.014s
The above utilizes the MI-100 AMD GPU using the huggingface backend.
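The same request can be issued from Python with only the standard library; a sketch, with the endpoint, Host header, and model name taken from the curl call above:

```python
import json
import urllib.request

def build_completion_request(base_url, host, model, prompt, max_tokens=50):
    """Build an OpenAI-style /v1/completions request for the staging endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/openai/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Host": host, "Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "https://inference-staging.svc.codfw.wmnet:30443",
    "llama3.experimental.wikimedia.org",
    "llama3",
    "What is Wikipedia?",
)
# resp = json.load(urllib.request.urlopen(req))  # then resp["choices"][0]["text"]
```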
Currently working on trying to overcome the following error on model server start when trying to use the vllm backend:
kubectl logs llama3-predictor-00003-deployment-869b5f958-hd4b7
2024-06-19 08:02:38.600 1 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to local
WARNING 06-19 08:02:38 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 06-19 08:02:38 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-19 08:02:39 pynccl_utils.py:17] Failed to import NCCL library: Cannot find librccl.so.1 in the system.
INFO 06-19 08:02:39 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 06-19 08:02:39 selector.py:37] Using ROCmFlashAttention backend.
INFO 06-19 08:02:53 model_runner.py:175] Loading model weights took 14.9595 GB
2024-06-19 08:02:53.739 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined
The issue seems to be related to the current working directory and appears to have been fixed in the latest vllm release.
Change #1047448 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct
Change #1047448 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct
Change #1048012 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3
Change #1048012 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3
I bumped into a fork of the vllm project by ROCm, which has its own releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3, and if it fails I'll follow the official instructions for vllm and ROCm. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image; the only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt that is needed to run anything). I'll try both ways and provide an update.
Got the same error with vllm==0.4.3 so I'll try to follow the documentation and see if anyone else is experiencing this issue.
WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined
Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503:
With the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs I am exploring 2 alternatives:
- vllm docs: the recommended way is to build it from source and use the ROCm docker image variant provided in the repo.
- ROCm docs: suggest a simpler route of cloning the ROCm fork of vllm and running:
  PYTORCH_ROCM_ARCH=gfx90a python setup.py install
  where gfx90a is the architecture for the MI200 series; the ROCm fork offers a different docker image as well.
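For reference, the gfx target to pass differs per GPU generation; a small lookup as I understand it from the ROCm docs (worth double-checking before baking into a build):

```python
# LLVM gfx targets for the AMD Instinct cards relevant to us
# (assumed from ROCm documentation; verify before relying on it).
ROCM_ARCH = {
    "MI100": "gfx908",
    "MI210": "gfx90a",
    "MI250": "gfx90a",
}

def build_env(gpu):
    """Environment variable to set when building the ROCm vllm fork from source."""
    return {"PYTORCH_ROCM_ARCH": ROCM_ARCH[gpu]}
```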
I'm exploring which is the simplest solution and easier to maintain.
If we decide to use the vllm engine as an inference framework, we can move its installation to a base pytorch image. However, for the time being I would avoid adding it there, as versions are changing quite often.
One other thing to figure out is the version discrepancy between vllm (latest version v0.4.3) and the ROCm vllm fork (latest version 0.4.0), which ends up being inconsistent with the huggingfaceserver python module requirements (vllm = { version = "^0.4.2", optional = true }). This can be tackled if we use our fork of kserve, but that adds one more custom step to the build/update process.
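The discrepancy is visible by checking both versions against the caret constraint; a minimal sketch of poetry-style caret semantics (not the actual resolver):

```python
def parse(v):
    return tuple(int(x) for x in v.lstrip("v").split("."))

def satisfies_caret(version, constraint):
    """Poetry-style '^x.y.z': >= x.y.z and below the next breaking release
    (for 0.y.z versions, the minor number is the breaking boundary)."""
    lo = parse(constraint)
    upper = (0, lo[1] + 1, 0) if lo[0] == 0 else (lo[0] + 1, 0, 0)
    return lo <= parse(version) < upper

upstream_ok = satisfies_caret("0.4.3", "0.4.2")  # True: upstream vllm satisfies ^0.4.2
fork_ok = satisfies_caret("0.4.0", "0.4.2")      # False: the ROCm fork does not
```

So pinning to the ROCm fork would require relaxing the constraint in our kserve fork, which is the extra build/update step mentioned above.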