Paste P76837

Simple vLLM + KServe InferenceService (isvc) serving the microsoft/Phi-4-mini-instruct LLM on ml-lab1002

Authored by kevinbazira on Jun 2 2025, 2:45 PM.
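The simple_vllm_model.py script itself is not included in this paste; only its console output is. For context, here is a minimal sketch of what such a script could look like, assuming it wraps a vLLM LLM instance in a custom kserve.Model. The class name, payload shape, and predict route are assumptions; the actual script registers OpenAI endpoints (see the log below), so its implementation likely differs.

# Hypothetical sketch only -- not the actual simple_vllm_model.py from ml-lab1002.
# Assumes a local model copy at /srv/app/models/Phi-4-mini-instruct (path taken from the log below).
from typing import Dict

from kserve import Model, ModelServer
from vllm import LLM, SamplingParams

MODEL_PATH = "/srv/app/models/Phi-4-mini-instruct"


class SimpleVLLMModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.llm = None
        self.load()

    def load(self):
        # Load the weights once at startup; max_model_len matches the
        # max_seq_len=4096 reported in the engine config below.
        self.llm = LLM(model=MODEL_PATH, max_model_len=4096)
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Expects {"prompt": "...", "max_tokens": N} and returns the generated text.
        params = SamplingParams(max_tokens=payload.get("max_tokens", 256))
        outputs = self.llm.generate([payload["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


if __name__ == "__main__":
    ModelServer().start([SimpleVLLMModel("simple-vllm")])

Running the actual script on ml-lab1002 produced the following output: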
$ python3 simple_vllm_model.py
INFO 06-02 13:24:22 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-02 13:24:23 [__init__.py:239] Automatically detected platform rocm.
INFO 06-02 13:24:24 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 06-02 13:24:40 [config.py:716] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-02 13:24:48 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 06-02 13:24:48 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 06-02 13:24:48 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='/srv/app/models/Phi-4-mini-instruct', speculative_config=None, tokenizer='/srv/app/models/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/srv/app/models/Phi-4-mini-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 06-02 13:24:48 [rocm.py:186] None is not supported in AMD GPUs.
INFO 06-02 13:24:48 [rocm.py:187] Using ROCmFlashAttention backend.
[W602 13:24:48.833832012 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 06-02 13:24:48 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-02 13:24:48 [model_runner.py:1120] Starting to load model /srv/app/models/Phi-4-mini-instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.30s/it]
INFO 06-02 13:24:54 [loader.py:458] Loading weights took 4.91 seconds
INFO 06-02 13:24:54 [model_runner.py:1152] Model loading took 8.5215 GiB and 5.464753 seconds
INFO 06-02 13:24:55 [worker.py:287] Memory profiling takes 0.85 seconds
INFO 06-02 13:24:55 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
INFO 06-02 13:24:55 [worker.py:287] model weights take 8.52GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 1.84GiB; the rest of the memory reserved for KV Cache is 53.33GiB.
INFO 06-02 13:24:55 [executor_base.py:112] # rocm blocks: 27305, # CPU blocks: 2048
INFO 06-02 13:24:55 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 106.66x
INFO 06-02 13:24:56 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:16<00:00, 2.10it/s]
INFO 06-02 13:25:13 [model_runner.py:1604] Graph capturing finished in 17 secs, took 0.20 GiB
INFO 06-02 13:25:13 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 18.56 seconds
2025-06-02 13:25:13.410 1996 kserve INFO [model_server.py:register_model():398] Registering model: simple-vllm
2025-06-02 13:25:13.411 1996 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
2025-06-02 13:25:13.429 1996 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-06-02 13:25:13.430 1996 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-06-02 13:25:13.441 1996 uvicorn.error INFO: Started server process [1996]
2025-06-02 13:25:13.441 1996 uvicorn.error INFO: Waiting for application startup.
2025-06-02 13:25:13.444 1996 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-06-02 13:25:13.444 1996 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-06-02 13:25:13.445 1996 uvicorn.error INFO: Application startup complete.
2025-06-02 13:25:13.445 1996 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
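
Once the server is up ("OpenAI endpoints registered", HTTP on 0.0.0.0:8080, gRPC on 8081), the model can be queried over the OpenAI-compatible API. Below is a hedged client example: the /openai/v1/chat/completions route and the "simple-vllm" model name are assumptions based on the registration lines above, and the exact path depends on the KServe release.

# Hypothetical client call -- route and payload shape are assumptions, not taken from the paste.
import requests

resp = requests.post(
    "http://localhost:8080/openai/v1/chat/completions",
    json={
        # Registered model name from the log; some setups expect the served_model_name instead.
        "model": "simple-vllm",
        "messages": [{"role": "user", "content": "Give a one-line summary of KServe."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json())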