
Compare performance of KServe huggingfaceserver with HuggingFace vs vLLM backend
Closed, Resolved | Public

Description

Previously (T357986, T354870), we added a KServe huggingfaceserver to the ML isvc repo and tested it using the huggingface backend, because the base image it relied on (docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-2) doesn't support the vllm backend.

In T385173, we built a new base image (docker-registry.wikimedia.org/amd-vllm085) that supports ROCm-enabled vLLM and would like to use it to:

  • build a KServe huggingfaceserver that supports the vllm backend
  • run inference with the vllm backend
  • compare inference latency between the old huggingfaceserver (huggingface backend) and the new huggingfaceserver (vllm backend)

Update

Following T385173#11690913, the wmf-debian-vllm image is now available in the Wikimedia docker registry: https://docker-registry.wikimedia.org/ml/amd-vllm014/tags/

We used this ROCm-enabled vLLM image in the embeddings isvc, and it performed better than the HuggingFace transformers backend, as detailed in T395019#11712061.

Event Timeline

To accurately compare the performance of the KServe huggingfaceserver with the huggingface and vllm backends, the ideal testing ground would be our staging environment, which closely mirrors production. However, the new vllm-compatible base image (docker-registry.wikimedia.org/amd-vllm085) has not yet been pushed to the Wikimedia docker registry (see T394778 for work in progress), so these initial comparison tests were conducted on ml-lab1002, where the image is currently accessible.

The following sections detail the setup and results for each server configuration.

Old KServe huggingfaceserver (huggingface backend)

1. Built the old huggingfaceserver using the Dockerfile below:

FROM docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-2

USER root

WORKDIR /srv/app

# Proxy settings, supplied at build time via --build-arg http_proxy=...
ARG http_proxy
ENV https_proxy=${http_proxy}
ENV http_proxy=${http_proxy}
ENV PYTHONPATH=/srv/app:/opt/lib/python/site-packages:/opt/lib/venv/lib/python3.11/site-packages

RUN apt-get update && apt-get install -y build-essential git curl python3-venv
# Wikimedia's KServe fork (liftwing branch) and the inference-services repo
RUN git clone --branch liftwing https://github.com/wikimedia/kserve.git kserve_repo
RUN git clone https://github.com/wikimedia/machinelearning-liftwing-inference-services.git
RUN pip install --break-system-packages -r machinelearning-liftwing-inference-services/src/models/huggingface_modelserver/requirements.txt
# Pin transformers to v4.52.1 because v4.52.2 causes https://github.com/huggingface/transformers/issues/38269
RUN pip install --break-system-packages transformers==4.52.1

2. Started this model-server to serve the aya-expanse-8b model with the huggingface backend:

# Export your own Hugging Face token as HF_TOKEN first; the token originally posted here has been invalidated.
$ docker run --rm --network=host -it \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN="$HF_TOKEN" \
  --device=/dev/kfd --device=/dev/dri \
  --group-add=$(getent group video | cut -d: -f3) \
  --group-add=$(getent group render | cut -d: -f3) \
  --ipc=host \
  --security-opt seccomp=unconfined \
  -v /srv/hf-cache:/home/vllm/.cache/huggingface \
  kserve-huggingfaceserver:hf \
  python3 -m huggingfaceserver --model_id=CohereForAI/aya-expanse-8b --model_name=aya-expanse-8b --backend=huggingface
config.json: 100%|██████████| 634/634 [00:00<00:00, 8.95MB/s]
2025-05-22 11:36:00.689 1 kserve INFO [__main__.py:load_model():204] Loading generative model for task 'text_generation' in torch.float16
2025-05-22 11:36:01.383 1 kserve INFO [generative_model.py:load():206] Decoder-only model detected. Setting padding side to left.
tokenizer_config.json: 100%|██████████| 8.64k/8.64k [00:00<00:00, 67.1MB/s]
tokenizer.json: 100%|██████████| 12.8M/12.8M [00:00<00:00, 149MB/s]
special_tokens_map.json: 100%|██████████| 439/439 [00:00<00:00, 3.09MB/s]
2025-05-22 11:36:03.590 1 kserve INFO [generative_model.py:load():223] Successfully loaded tokenizer
model.safetensors.index.json: 100%|██████████| 21.0k/21.0k [00:00<00:00, 81.7MB/s]
model-00004-of-00004.safetensors: 100%|██████████| 1.22G/1.22G [00:09<00:00, 136MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [00:24<00:00, 205MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:26<00:00, 183MB/s]
model-00001-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:26<00:00, 183MB/s]
Fetching 4 files: 100%|██████████| 4/4 [00:27<00:00, 6.76s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:07<00:00, 1.96s/it]
generation_config.json: 100%|██████████| 137/137 [00:00<00:00, 1.28MB/s]
2025-05-22 11:36:53.844 1 kserve INFO [generative_model.py:load():244] Successfully loaded huggingface model from path CohereForAI/aya-expanse-8b
2025-05-22 11:37:22.163 1 kserve INFO [model_server.py:register_model():406] Registering model: aya-expanse-8b
2025-05-22 11:37:22.164 1 kserve INFO [model_server.py:start():276] Setting max asyncio worker threads as 32
2025-05-22 11:37:22.164 1 kserve INFO [model_server.py:serve():282] Starting uvicorn with 1 workers
2025-05-22 11:37:22.210 uvicorn.error INFO: Started server process [1]
2025-05-22 11:37:22.211 uvicorn.error INFO: Waiting for application startup.
2025-05-22 11:37:22.216 1 kserve INFO [server.py:start():68] Starting gRPC server on [::]:8081
2025-05-22 11:37:22.217 uvicorn.error INFO: Application startup complete.
2025-05-22 11:37:22.217 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

3. Queried the isvc with a test completion request:

$ time curl --noproxy "*" -v localhost:8080/openai/v1/completions -H "Content-Type: application/json" -d '{
 "model":"aya-expanse-8b",
 "prompt":"Hello, world!",
 "max_tokens":50,
 "stream":false
 }'
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /openai/v1/completions HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.88.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 105
>
< HTTP/1.1 200 OK
< date: Thu, 22 May 2025 11:41:09 GMT
< server: uvicorn
< content-length: 474
< content-type: application/json
<
* Connection #0 to host localhost left intact
{"id":"f63ebd9b-c46d-4e5f-86d9-843e06b4e377","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"\nI’m a new blogger, and I’m excited to share my experiences and thoughts with you. I’m a 20-something year old living in the beautiful city of Toronto, Canada. I’m a huge fan of"}],"created":1747914075,"model":"aya-expanse-8b","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":50,"prompt_tokens":5,"total_tokens":55}}
real 0m5.456s
user 0m0.009s
sys 0m0.005s
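
A single `time curl` run can be noisy, so one option is to repeat the same request a few times and look at the median. The following is a minimal sketch using only the Python standard library; the helper names (`build_request`, `time_requests`, `summarize`) and the run/warm-up counts are illustrative assumptions, not part of the setup described in this task.

```python
import json
import statistics
import time
import urllib.request


def build_request(url, model, prompt, max_tokens=50):
    """Build the same JSON completion request that the curl command sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_requests(url, model, prompt, runs=10, warmup=1):
    """Send the request warmup + runs times; return the latencies (seconds) of the timed runs."""
    # An empty ProxyHandler bypasses any configured proxy, like curl --noproxy "*".
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    latencies = []
    for i in range(warmup + runs):
        request = build_request(url, model, prompt)
        start = time.perf_counter()
        with opener.open(request) as response:
            response.read()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warm-up runs
            latencies.append(elapsed)
    return latencies


def summarize(latencies):
    """Return run count plus min, median, and max latency in seconds."""
    return {
        "runs": len(latencies),
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }
```

With a model server listening on localhost:8080, `summarize(time_requests("http://localhost:8080/openai/v1/completions", "aya-expanse-8b", "Hello, world!"))` would report the spread across runs rather than a single sample.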

New KServe huggingfaceserver (vllm backend)

1. Built the new huggingfaceserver using the Dockerfile below:

FROM docker-registry.wikimedia.org/amd-vllm085:gfx90arocm6.3.1pytorch2.8.0flash-attn2.7.4vllm0.8.5-1

USER root

WORKDIR /srv/app

# Proxy settings, supplied at build time via --build-arg http_proxy=...
ARG http_proxy
ENV https_proxy=${http_proxy}
ENV http_proxy=${http_proxy}
ENV PYTHONPATH=/srv/app:/srv/venv/lib/python3.11/site-packages:/srv/venv/lib64/python3.11/site-packages

RUN apt-get update && apt-get install -y build-essential git curl
# Wikimedia's KServe fork (liftwing-vllm branch) and the inference-services repo
RUN git clone --branch liftwing-vllm https://github.com/wikimedia/kserve.git kserve_repo
RUN git clone https://github.com/wikimedia/machinelearning-liftwing-inference-services.git
RUN pip install -r machinelearning-liftwing-inference-services/src/models/huggingface_modelserver/requirements.txt

2. Started this model-server to serve the aya-expanse-8b model with the vllm backend:

# Export your own Hugging Face token as HF_TOKEN first; the token originally posted here has been invalidated.
$ docker run --rm --network=host -it \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  --device=/dev/kfd --device=/dev/dri \
  --group-add=$(getent group video | cut -d: -f3) \
  --group-add=$(getent group render | cut -d: -f3) \
  --ipc=host \
  --security-opt seccomp=unconfined \
  -v /srv/hf-cache:/home/vllm/.cache/huggingface \
  kserve-huggingfaceserver:vllm \
  python3 -m huggingfaceserver --model_id=CohereForAI/aya-expanse-8b --model_name=aya-expanse-8b --backend=vllm
INFO 05-22 11:46:39 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-22 11:46:39 [__init__.py:239] Automatically detected platform rocm.
config.json: 100%|██████████| 634/634 [00:00<00:00, 3.88MB/s]
2025-05-22 11:46:42.489 1 kserve INFO [model_server.py:register_model():398] Registering model: aya-expanse-8b
2025-05-22 11:46:42.491 1 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
INFO 05-22 11:46:54 [config.py:716] This model supports multiple tasks: {'score', 'reward', 'generate', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 05-22 11:46:55 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-22 11:46:55 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-22 11:46:55 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='CohereForAI/aya-expanse-8b', speculative_config=None, tokenizer='CohereForAI/aya-expanse-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=CohereForAI/aya-expanse-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|██████████| 8.64k/8.64k [00:00<00:00, 38.3MB/s]
tokenizer.json: 100%|██████████| 12.8M/12.8M [00:00<00:00, 327MB/s]
special_tokens_map.json: 100%|██████████| 439/439 [00:00<00:00, 3.24MB/s]
generation_config.json: 100%|██████████| 137/137 [00:00<00:00, 1.02MB/s]
INFO 05-22 11:46:56 [rocm.py:186] None is not supported in AMD GPUs.
INFO 05-22 11:46:56 [rocm.py:187] Using ROCmFlashAttention backend.
[W522 11:46:56.869303270 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 05-22 11:46:56 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-22 11:46:56 [model_runner.py:1120] Starting to load model CohereForAI/aya-expanse-8b...
INFO 05-22 11:46:57 [weight_utils.py:265] Using model weights format ['*.safetensors']
model-00004-of-00004.safetensors: 100%|██████████| 1.22G/1.22G [00:14<00:00, 83.0MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:15<00:00, 318MB/s]
model-00001-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:15<00:00, 307MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [00:29<00:00, 169MB/s]
INFO 05-22 11:47:27 [weight_utils.py:281] Time spent downloading weights for CohereForAI/aya-expanse-8b: 29.806498 seconds
model.safetensors.index.json: 100%|██████████| 21.0k/21.0k [00:00<00:00, 78.0MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.78it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:02, 1.49s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:05<00:01, 1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 2.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.90s/it]

INFO 05-22 11:47:35 [loader.py:458] Loading weights took 7.95 seconds
INFO 05-22 11:47:35 [model_runner.py:1152] Model loading took 15.1406 GiB and 38.408186 seconds
INFO 05-22 11:48:17 [worker.py:287] Memory profiling takes 42.26 seconds
INFO 05-22 11:48:17 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 05-22 11:48:17 [worker.py:287] model weights take 15.14GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 39.78GiB.
INFO 05-22 11:48:17 [executor_base.py:112] # rocm blocks: 20368, # CPU blocks: 2048
INFO 05-22 11:48:17 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 39.78x
INFO 05-22 11:48:18 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:18<00:00, 1.88it/s]
INFO 05-22 11:48:37 [model_runner.py:1604] Graph capturing finished in 19 secs, took 0.24 GiB
INFO 05-22 11:48:37 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 61.82 seconds
2025-05-22 11:48:37.722 1 kserve INFO [utils.py:build_async_engine_client_from_engine_args():123] V0 AsyncLLMEngine build complete
2025-05-22 11:48:37.775 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-05-22 11:48:37.775 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-05-22 11:48:37.793 1 uvicorn.error INFO: Started server process [1]
2025-05-22 11:48:37.793 1 uvicorn.error INFO: Waiting for application startup.
2025-05-22 11:48:37.799 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-05-22 11:48:37.800 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-05-22 11:48:37.800 1 uvicorn.error INFO: Application startup complete.
2025-05-22 11:48:37.800 1 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
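
The memory figures in the startup log above can be cross-checked with a little arithmetic. The sketch below recomputes the KV cache budget and the "maximum concurrency" figure from the logged values. The block size of 16 tokens is an assumption (vLLM's usual default); it is not printed in this log, although it reproduces the logged 39.78x figure.

```python
# Values copied from the vLLM startup log.
total_gpu_memory_gib = 63.98
gpu_memory_utilization = 0.90
model_weights_gib = 15.14
non_torch_gib = 0.28
activation_peak_gib = 2.38
num_gpu_blocks = 20368  # "# rocm blocks" in the log
max_seq_len = 8192

# Assumption: 16 tokens per KV cache block (not shown in the log).
block_size_tokens = 16

# Memory vLLM is allowed to use on the GPU.
budget_gib = total_gpu_memory_gib * gpu_memory_utilization

# What remains for the KV cache after weights, non-torch memory and activations.
kv_cache_gib = budget_gib - model_weights_gib - non_torch_gib - activation_peak_gib

# Total tokens the KV cache can hold, and how many full-length requests fit at once.
kv_cache_tokens = num_gpu_blocks * block_size_tokens
max_concurrency = kv_cache_tokens / max_seq_len

print(f"budget: {budget_gib:.2f} GiB")            # log reports 57.59 GiB (computed from unrounded inputs)
print(f"KV cache: {kv_cache_gib:.2f} GiB")        # log reports 39.78 GiB
print(f"KV cache tokens: {kv_cache_tokens}")
print(f"max concurrency: {max_concurrency:.2f}x")  # log reports 39.78x
```

The small difference in the budget comes from the log rounding its inputs to two decimals before printing them.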

3. Queried the isvc with a test completion request:

$ time curl --noproxy "*" -v localhost:8080/openai/v1/completions -H "Content-Type: application/json" -d '{
 "model":"aya-expanse-8b",
 "prompt":"Hello, world!",
 "max_tokens":50,
 "stream":false
 }'
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /openai/v1/completions HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.88.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 105
>
< HTTP/1.1 200 OK
< date: Thu, 22 May 2025 11:51:39 GMT
< server: uvicorn
< content-length: 580
< content-type: application/json
<
* Connection #0 to host localhost left intact
{"id":"cmpl-959b0577923649938e9a4808b6b4c59d","object":"text_completion","created":1747914700,"model":"aya-expanse-8b","choices":[{"index":0,"text":"\nThis site will allow you to share your designs, available as pictures and mentioned information and project ideas. Better yet, you can post your own design, or your comments and knowledge on the topic, or even request content.\nEnjoy )\nWelcome","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":55,"completion_tokens":50,"prompt_tokens_details":null}}
real 0m0.834s
user 0m0.010s
sys 0m0.003s

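This initial comparison on ml-lab1002 shows a clear latency improvement with the vllm backend. For the same request (5 prompt tokens, 50 completion tokens), the huggingfaceserver with the vllm backend responded in ~0.834s, compared to ~5.456s with the huggingface backend, which is roughly 6.5x faster. Each figure comes from a single request, so the result is indicative rather than conclusive.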
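
As a quick worked example, the two `time curl` results can be turned into a speedup factor and an end-to-end generation rate. The timings and token counts are taken from the outputs above; the tokens-per-second figures are derived here and include HTTP and prompt-processing overhead, so they are not pure decode rates.

```python
# Wall-clock times ("real") and completion token counts from the two curl runs.
completion_tokens = 50
hf_seconds = 5.456    # huggingface backend
vllm_seconds = 0.834  # vllm backend

speedup = hf_seconds / vllm_seconds
hf_tokens_per_second = completion_tokens / hf_seconds
vllm_tokens_per_second = completion_tokens / vllm_seconds

print(f"speedup: {speedup:.1f}x")                             # 6.5x
print(f"huggingface: {hf_tokens_per_second:.1f} tokens/s")    # 9.2 tokens/s
print(f"vllm: {vllm_tokens_per_second:.1f} tokens/s")         # 60.0 tokens/s
```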

More rigorous performance tests using tools like locust will be conducted in our staging environment once the new docker-registry.wikimedia.org/amd-vllm085 base image is deployed to the docker registry (tracked in T394778). This will provide a better understanding of the performance gains in an environment closer to production.
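
To illustrate the kind of numbers such a load test produces (requests per second and median latency), here is a minimal standard-library sketch. It is not Locust and not the test plan for staging; `run_load`, its parameters, and the injected `send` callable are hypothetical names chosen for this example.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, total_requests=100, concurrency=8):
    """Call send() total_requests times across a pool of worker threads.

    Returns the request count, overall requests per second, and median
    per-request latency in seconds.
    """

    def timed_call(_):
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "requests": total_requests,
        "rps": total_requests / wall_seconds,
        "median_latency_s": statistics.median(latencies),
    }
```

Here `send` would be any zero-argument function that issues one completion request against the isvc, for example a small wrapper around `urllib.request.urlopen` that POSTs the same JSON payload as the curl command. Running it once per backend with identical settings gives directly comparable RPS and median latency values.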

In T418976#11705174, we migrated the embeddings isvc inference backend from HuggingFace Transformers to vLLM. The locust load test results show that RPS increased by ~3x, and median latency decreased by ~5x. These results are similar to the initial performance improvements we observed in T395019#10847999.

kevinbazira updated the task description.
kevinbazira moved this task from Blocked to 2025-2026 Q2 Done on the Machine-Learning-Team board.