
Deploy 7b parameter models from HF
Open, Needs Triage, Public, 4 Estimated Story Points

Description

As an engineer,
I want to deploy a 7b parameter model from HuggingFace on our MI100 AMD GPU on ml-staging,
so that I can identify potential challenges and bottlenecks in deploying such models.
As part of this task I will deploy falcon-7b, one of the open source models available under an Apache 2.0 license.
After successfully deploying the model, we want to record inference latency for the "vanilla" version and then incrementally experiment with inference optimization techniques like 8-bit/4-bit quantization and Flash Attention 2, to verify that these can be used on our GPU and to record the improvements.
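
As a rough sketch, the optimized load path we want to benchmark against the vanilla fp16 deployment would look something like the following. Whether bitsandbytes 8-bit loading and Flash Attention 2 actually work on the MI100/ROCm stack is exactly what this task should verify, so treat this as an assumption rather than a working recipe.

# Sketch of the optimized load path to benchmark against the vanilla fp16 one.
# Whether 8-bit loading (bitsandbytes) and Flash Attention 2 work on ROCm is an
# open question for this task, not an established fact.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    attn_implementation="flash_attention_2",                    # FA2 kernels, if supported
    device_map="auto",
)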

Event Timeline

We start by deploying the falcon 7b model (after the transformers update) so that we can continue where we left this work a couple of months ago: https://phabricator.wikimedia.org/T334583.
At the time we hit a wall with our hardware (a GPU with 16GB of VRAM): we could load a 7b model but would run into OOM errors while running inference. This is not surprising, since at fp16 the 7b weights alone take roughly 14GB, leaving only about 2GB of headroom for activations and the KV cache. Quantization also wasn't working on our AMD GPUs at the time (at least not the way we tried it).

Change 989831 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: update transformers module

https://gerrit.wikimedia.org/r/989831

Change 989913 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] WIP:ml-services: deploy falcon 7b on GPU

https://gerrit.wikimedia.org/r/989913

Change 989831 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: update transformers module

https://gerrit.wikimedia.org/r/989831

Change 989913 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy falcon 7b on GPU

https://gerrit.wikimedia.org/r/989913

Change 990044 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging

https://gerrit.wikimedia.org/r/990044

Change 990044 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase limitranges for ml-staging

https://gerrit.wikimedia.org/r/990044

Change 990699 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase limitranges

https://gerrit.wikimedia.org/r/990699

Change 990699 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase limitranges

https://gerrit.wikimedia.org/r/990699

Change 990705 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory

https://gerrit.wikimedia.org/r/990705

Change 990705 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase falcon-7b pod memory

https://gerrit.wikimedia.org/r/990705

Falcon, llama and mistral (and mixtral) models have been incorporated into the transformers library, so we no longer need to use trust_remote_code=True.
In the ml-staging deployment we're still getting some OOM (out of memory) errors, but locally I was able to run it on CPU. Using the native transformers code (i.e. without the trust_remote_code argument, which would cause the legacy code to run), prediction takes ~65 seconds for a length of 100 tokens, whereas the legacy code takes 76s.
I'm continuing to do some memory profiling to figure out why we get OOM errors.
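
For reference, a minimal sketch of the kind of CPU timing comparison described above (model id, prompt and dtype are illustrative; the actual measurement lives in the llm model server code):

# Load falcon-7b through the native transformers implementation (no
# trust_remote_code) and time generation of 100 new tokens on CPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("Wikipedia is", return_tensors="pt")
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(f"generation took {time.perf_counter() - start:.1f}s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))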

Change 1003488 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ

https://gerrit.wikimedia.org/r/1003488

At the moment I'm working on utilizing quantization on our AMD GPUs. AutoGPTQ seems like a prominent solution for ROCm.

However, I'm facing dependency issues:

INFO:datasets:PyTorch version 2.0.1+rocm5.4.2 available.
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 139, in <module>
    model = llm_class(model_name)
  File "/srv/app/llm/model.py", line 28, in __init__
    self.model, self.tokenizer = self.load()
  File "/srv/app/llm/model.py", line 47, in load
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/lib/python/site-packages/transformers/modeling_utils.py", line 3026, in from_pretrained
    from optimum.gptq import GPTQQuantizer
  File "/opt/lib/python/site-packages/optimum/gptq/__init__.py", line 15, in <module>
    from .quantizer import GPTQQuantizer, load_quantized_model
  File "/opt/lib/python/site-packages/optimum/gptq/quantizer.py", line 42, in <module>
    if is_auto_gptq_available():
  File "/opt/lib/python/site-packages/optimum/utils/import_utils.py", line 121, in is_auto_gptq_available
    raise ImportError(
ImportError: Found an incompatible version of auto-gptq. Found version 0.4.2+rocm5.4.2, but only version above 0.4.99 are supported

Versions above 0.4.99 are required, but only 0.4.2 is available with rocm5.4.2. I'll investigate whether we can use a newer ROCm version.
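
For context, quantization is wired up through transformers and optimum roughly as below; the from_pretrained() call is what triggers the auto-gptq version check that fails in the traceback above. The model id and calibration dataset are illustrative, and the sketch assumes a compatible auto-gptq build for ROCm exists.

# GPTQ quantization through transformers + optimum. from_pretrained() delegates
# to optimum's GPTQQuantizer, which requires a compatible auto-gptq build --
# the import check that currently fails with the 0.4.2+rocm5.4.2 wheel.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)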

From what I've seen so far, there are a lot of moving parts in the area of LLM deployments. Things are moving fast, but at the same time they aren't stable.
In order to end up with more stable deployments, I would recommend we pause the falcon deployment and focus our efforts on https://phabricator.wikimedia.org/T354257.

Change #1017858 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

Change #1017858 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

Change #1018633 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018633 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018646 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources

https://gerrit.wikimedia.org/r/1018646

Change #1018646 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources and increase memory

https://gerrit.wikimedia.org/r/1018646

Change #1003488 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] llm: enable quantization with AutoGPTQ

Reason:

We are focusing on working with the huggingface image, so this is not needed at the moment.

https://gerrit.wikimedia.org/r/1003488

Change #1047106 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

Change #1047106 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:

time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -X POST -d '{"model": "llama3", "prompt": "What is Wikipedia?", "stream":false, "max_tokens": 50 }' -H  "Host: llama3.experimental.wikimedia.org" -H "Content-Type: application/json"
{"id":"aa9fe1cc-641e-487e-abab-2d8ddb5e7ea9","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":" Wikipedia is a free online encyclopedia that allows anyone with an internet connection to access and contribute to its vast repository of knowledge. It was founded in 2001 by Jimmy Wales and Larry Sanger, and has since become one of the most popular and widely"}],"created":1718783716,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":50,"prompt_tokens":5,"total_tokens":55}}
real	0m2.359s
user	0m0.022s
sys	0m0.014s

The above utilizes the AMD MI100 GPU with the huggingface backend.
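
For convenience, the same request from Python (assuming the requests library and a host that trusts the internal CA; endpoint and Host header are as in the curl call above):

# Equivalent of the curl call above, hitting the OpenAI-compatible completions
# endpoint exposed by the huggingface backend on ml-staging.
import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={"Host": "llama3.experimental.wikimedia.org"},
    json={
        "model": "llama3",
        "prompt": "What is Wikipedia?",
        "stream": False,
        "max_tokens": 50,
    },
)
print(response.json()["choices"][0]["text"])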
I'm currently working on overcoming the following error on model server start when trying to use the vllm backend:

kubectl logs llama3-predictor-00003-deployment-869b5f958-hd4b7
2024-06-19 08:02:38.600 1 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to local
WARNING 06-19 08:02:38 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 06-19 08:02:38 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-19 08:02:39 pynccl_utils.py:17] Failed to import NCCL library: Cannot find librccl.so.1 in the system.
INFO 06-19 08:02:39 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 06-19 08:02:39 selector.py:37] Using ROCmFlashAttention backend.
INFO 06-19 08:02:53 model_runner.py:175] Loading model weights took 14.9595 GB
2024-06-19 08:02:53.739 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined

The issue seems to be related to the current working directory and appears to have been fixed in the latest vllm release.
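
A quick, illustrative check from inside the pod to confirm the working-directory hypothesis: if the interpreter resolves vllm to a source tree under the current working directory instead of site-packages, the compiled ops extension is missing and symbols like vllm_ops never get defined.

# Check whether vllm is imported from site-packages or from the cwd.
import os
import vllm

print("cwd:         ", os.getcwd())
print("vllm version:", vllm.__version__)
print("vllm path:   ", vllm.__file__)  # should point into site-packages, not the cwd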

Change #1047448 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct

https://gerrit.wikimedia.org/r/1047448

Change #1047448 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: switch to llama3-8B-instruct

https://gerrit.wikimedia.org/r/1047448

Change #1048012 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3

https://gerrit.wikimedia.org/r/1048012

Change #1048012 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: bump vllm to 0.4.3

https://gerrit.wikimedia.org/r/1048012

I came across a fork of the vllm project maintained by ROCm, which has its own releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3, and if that fails I'll follow the official instructions for vllm and ROCm. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image; the only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt it is needed to run anything). I'll try both ways and provide an update.

Got the same error with vllm==0.4.3, so I'll follow the documentation and see if anyone else is experiencing this issue.

WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined

Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503:
with the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs, I am exploring two alternatives. The first is building vllm from source, e.g.

PYTORCH_ROCM_ARCH=gfx90a python setup.py install

where gfx90a is the architecture for the MI200 series; the second is the dedicated docker image offered by the ROCm fork.
I'm exploring which is the simplest solution and the easiest to maintain.
If we decide to use the vllm engine as an inference framework, we can move its installation to a base pytorch image. However, for the time being I would avoid adding it there, as versions are changing quite often.
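
Once a build is in place, a minimal smoke test of the engine could look like the following (model path and sampling settings are illustrative):

# Minimal vllm smoke test to verify a ROCm build on the MI210 before wiring it
# into the huggingface server.
from vllm import LLM, SamplingParams

llm = LLM(model="/mnt/models", dtype="float16")
params = SamplingParams(max_tokens=50, temperature=0.0)

outputs = llm.generate(["What is Wikipedia?"], params)
print(outputs[0].outputs[0].text)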

One other thing to figure out is the version discrepancy between vllm (latest version 0.4.3) and the ROCm vllm fork (latest version 0.4.0), which ends up being inconsistent with the huggingfaceserver python module requirement - vllm = { version = "^0.4.2", optional = true } - this can be tackled if we use our fork of kserve, but that adds one more custom step to the build/update process.