
Deploy 7b parameter models from HF
Open, Needs Triage, Public, 4 Estimated Story Points

Description

As an engineer,
I want to deploy a 7b parameter model from HuggingFace on our MI100 AMD GPU on ml-staging,
so that I can identify potential challenges and bottlenecks in deploying such models.
As part of this task I will deploy falcon-7b, one of the open source models available under an Apache 2.0 license.
After successfully deploying the model, we want to record inference latency for the "vanilla" version and then incrementally experiment with inference optimization techniques like 8-bit/4-bit quantization and Flash Attention 2, to verify that these can be used on our GPU and to record the improvements.
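
As a rough sketch, the optimized load path we want to benchmark against the vanilla fp16 deployment would look something like the following. Whether bitsandbytes 8-bit loading and Flash Attention 2 actually work on the MI100/ROCm stack is exactly what this task should verify, so treat this as an assumption rather than a working recipe.

# Sketch of the optimized load path to benchmark against the vanilla fp16 one.
# Whether 8-bit loading (bitsandbytes) and Flash Attention 2 work on ROCm is an
# open question for this task, not an established fact.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    attn_implementation="flash_attention_2",                    # FA2 kernels, if supported
    device_map="auto",
)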

Event Timeline

We start by deploying the falcon 7b model (after the transformers update) so that we can continue where we left this work a couple of months ago: https://phabricator.wikimedia.org/T334583.
At the time we hit a wall with our hardware (a GPU with 16GB of VRAM): we could load a 7b model but would run into OOM errors while running inference. This is not surprising, since at fp16 the 7b weights alone take roughly 14GB, leaving only about 2GB of headroom for activations and the KV cache. Quantization also wasn't working on our AMD GPUs at the time (at least not the way we tried it).

Change 989831 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: update transformers module

https://gerrit.wikimedia.org/r/989831

Change 989913 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] WIP:ml-services: deploy falcon 7b on GPU

https://gerrit.wikimedia.org/r/989913

Change 989831 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: update transformers module

https://gerrit.wikimedia.org/r/989831

Change 989913 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy falcon 7b on GPU

https://gerrit.wikimedia.org/r/989913

Change 990044 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging

https://gerrit.wikimedia.org/r/990044

Change 990044 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging

https://gerrit.wikimedia.org/r/990044

Change 990699 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase limitranges

https://gerrit.wikimedia.org/r/990699

Change 990699 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase limitranges

https://gerrit.wikimedia.org/r/990699

Change 990705 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory

https://gerrit.wikimedia.org/r/990705

Change 990705 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory

https://gerrit.wikimedia.org/r/990705

Falcon, llama and mistral (and mixtral) models have been incorporated into the transformers library, so we no longer need to use trust_remote_code=True.
In the ml-staging deployment we're still getting some OOM (out of memory) errors, but locally I was able to run it on CPU. Using the native transformers code (i.e. without the trust_remote_code argument, which would cause the legacy code to run), prediction takes ~65 seconds for a length of 100 tokens, whereas the legacy code takes 76s.
I'm continuing to do some memory profiling to figure out why we get OOM errors.
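
For reference, a minimal sketch of the kind of CPU timing comparison described above (model id, prompt and dtype are illustrative; the actual measurement lives in the llm model server code):

# Load falcon-7b through the native transformers implementation (no
# trust_remote_code) and time generation of 100 new tokens on CPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("Wikipedia is", return_tensors="pt")
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(f"generation took {time.perf_counter() - start:.1f}s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))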

Change 1003488 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ

https://gerrit.wikimedia.org/r/1003488

At the moment I'm working on utilizing quantization on our AMD GPUs. AutoGPTQ seems like a prominent solution for ROCm.

However, I'm facing dependency issues:

INFO:datasets:PyTorch version 2.0.1+rocm5.4.2 available.
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 139, in <module>
    model = llm_class(model_name)
  File "/srv/app/llm/model.py", line 28, in __init__
    self.model, self.tokenizer = self.load()
  File "/srv/app/llm/model.py", line 47, in load
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/modeling_utils.py", line 3026, in from_pretrained
    from optimum.gptq import GPTQQuantizer
  File "/opt/lib/python/site-packages/optimum/gptq/__init__.py", line 15, in <module>
    from .quantizer import GPTQQuantizer, load_quantized_model
  File "/opt/lib/python/site-packages/optimum/gptq/quantizer.py", line 42, in <module>
    if is_auto_gptq_available():
  File "/opt/lib/python/site-packages/optimum/utils/import_utils.py", line 121, in is_auto_gptq_available
    raise ImportError(
ImportError: Found an incompatible version of auto-gptq. Found version 0.4.2+rocm5.4.2, but only version above 0.4.99 are supported

Versions above 0.4.99 are required, but only 0.4.2 is available with rocm5.4.2. I'll investigate whether we can use a newer ROCm version.
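
For context, quantization is wired up through transformers and optimum roughly as below; the from_pretrained() call is what triggers the auto-gptq version check that fails in the traceback above. The model id and calibration dataset are illustrative, and the sketch assumes a compatible auto-gptq build for ROCm exists.

# GPTQ quantization through transformers + optimum. from_pretrained() delegates
# to optimum's GPTQQuantizer, which requires a compatible auto-gptq build --
# the import check that currently fails with the 0.4.2+rocm5.4.2 wheel.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)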

From what I've seen so far, there are a lot of moving parts in the area of LLM deployments. Things are moving fast, but at the same time they aren't stable.
In order to end up with more stable deployments, I would recommend we pause the falcon deployment and focus our efforts on https://phabricator.wikimedia.org/T354257.

Change #1017858 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

Change #1017858 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

Change #1018633 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018633 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018646 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources

https://gerrit.wikimedia.org/r/1018646

Change #1018646 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources and increase memory

https://gerrit.wikimedia.org/r/1018646

Change #1003488 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ

Reason:

We are focusing on working with the huggingface image, so this is not needed at the moment.

https://gerrit.wikimedia.org/r/1003488

Change #1047106 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

Change #1047106 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:

time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -X POST -d '{"model": "llama3", "prompt": "What is Wikipedia?", "stream":false, "max_tokens": 50 }' -H  "Host: llama3.experimental.wikimedia.org" -H "Content-Type: application/json"
{"id":"aa9fe1cc-641e-487e-abab-2d8ddb5e7ea9","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":" Wikipedia is a free online encyclopedia that allows anyone with an internet connection to access and contribute to its vast repository of knowledge. It was founded in 2001 by Jimmy Wales and Larry Sanger, and has since become one of the most popular and widely"}],"created":1718783716,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":50,"prompt_tokens":5,"total_tokens":55}}
real	0m2.359s
user	0m0.022s
sys	0m0.014s

The above utilizes the AMD MI100 GPU with the huggingface backend.
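
For convenience, the same request from Python (assuming the requests library and a host that trusts the internal CA; endpoint and Host header are as in the curl call above):

# Equivalent of the curl call above, hitting the OpenAI-compatible completions
# endpoint exposed by the huggingface backend on ml-staging.
import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={"Host": "llama3.experimental.wikimedia.org"},
    json={
        "model": "llama3",
        "prompt": "What is Wikipedia?",
        "stream": False,
        "max_tokens": 50,
    },
)
print(response.json()["choices"][0]["text"])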
I'm currently working on overcoming the following error on model server start when trying to use the vllm backend:

kubectl logs llama3-predictor-00003-deployment-869b5f958-hd4b7
2024-06-19 08:02:38.600 1 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to local
WARNING 06-19 08:02:38 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 06-19 08:02:38 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-19 08:02:39 pynccl_utils.py:17] Failed to import NCCL library: Cannot find librccl.so.1 in the system.
INFO 06-19 08:02:39 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 06-19 08:02:39 selector.py:37] Using ROCmFlashAttention backend.
INFO 06-19 08:02:53 model_runner.py:175] Loading model weights took 14.9595 GB
2024-06-19 08:02:53.739 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined

The issue seems to be related to the current working directory and appears to have been fixed in the latest vllm release.
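
A quick, illustrative check from inside the pod to confirm the working-directory hypothesis: if the interpreter resolves vllm to a source tree under the current working directory instead of site-packages, the compiled ops extension is missing and symbols like vllm_ops never get defined.

# Check whether vllm is imported from site-packages or from the cwd.
import os
import vllm

print("cwd:         ", os.getcwd())
print("vllm version:", vllm.__version__)
print("vllm path:   ", vllm.__file__)  # should point into site-packages, not the cwd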

Change #1047448 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct

https://gerrit.wikimedia.org/r/1047448

Change #1047448 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct

https://gerrit.wikimedia.org/r/1047448

Change #1048012 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3

https://gerrit.wikimedia.org/r/1048012

Change #1048012 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3

https://gerrit.wikimedia.org/r/1048012

I came across a fork of the vllm project maintained by ROCm, which has its own releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3, and if that fails I'll follow the official instructions for vllm and ROCm. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image; the only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt it is needed to run anything). I'll try both ways and provide an update.

Got the same error with vllm==0.4.3, so I'll follow the documentation and see if anyone else is experiencing this issue.

WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined

Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503:
with the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs, I am exploring two alternatives. The first is building vllm from source, e.g.

PYTORCH_ROCM_ARCH=gfx90a python setup.py install

where gfx90a is the architecture for the MI200 series; the second is the dedicated docker image offered by the ROCm fork.
I'm exploring which is the simplest solution and the easiest to maintain.
If we decide to use the vllm engine as an inference framework, we can move its installation to a base pytorch image. However, for the time being I would avoid adding it there, as versions are changing quite often.
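
Once a build is in place, a minimal smoke test of the engine could look like the following (model path and sampling settings are illustrative):

# Minimal vllm smoke test to verify a ROCm build on the MI210 before wiring it
# into the huggingface server.
from vllm import LLM, SamplingParams

llm = LLM(model="/mnt/models", dtype="float16")
params = SamplingParams(max_tokens=50, temperature=0.0)

outputs = llm.generate(["What is Wikipedia?"], params)
print(outputs[0].outputs[0].text)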

One other thing to figure out is the version discrepancy between vllm (latest version 0.4.3) and the ROCm vllm fork (latest version 0.4.0), which ends up being inconsistent with the huggingfaceserver python module requirement - vllm = { version = "^0.4.2", optional = true } - this can be tackled if we use our fork of kserve, but that adds one more custom step to the build/update process.