
kevinbazira (Kevin Bazira, KBazira)
Software Engineer (Machine Learning)

User Details

User Since
Aug 3 2019, 6:58 AM (358 w, 5 d)
Availability
Available
IRC Nick
kevinbazira
LDAP User
Kevin Bazira
MediaWiki User
KBazira (WMF) [ Global Accounts ]

Recent Activity

Today

kevinbazira added a comment to T427497: Deploy CoPE-B-A4B on LiftWing.

We have run load tests for the cope-b-a4b isvc. It sustained ~32 requests/second with a median latency of ~36 ms and no failures across 3,830 requests, as shown below:

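| Metric | POST /v1/models/cope-b-a4b:predict | Aggregated |
| --- | --- | --- |
| Request Count | 3830 | 3830 |
| Failure Count | 0 | 0 |
| Median Response Time (ms) | 36 | 36 |
| Average Response Time (ms) | 48.31993621008048 | 48.31993621008048 |
| Min Response Time (ms) | 22.236770018935204 | 22.236770018935204 |
| Max Response Time (ms) | 1550.042748451233 | 1550.042748451233 |
| Average Content Size (bytes) | 63.0 | 63.0 |
| Requests/s | 32.13438715363878 | 32.13438715363878 |
| Failures/s | 0.0 | 0.0 |
| 50% | 36 | 36 |
| 66% | 45 | 45 |
| 75% | 52 | 52 |
| 80% | 58 | 58 |
| 90% | 76 | 76 |
| 95% | 100 | 100 |
| 98% | 190 | 190 |
| 99% | 280 | 280 |
| 99.9% | 740 | 740 |
| 99.99% | 1600 | 1600 |
| 100% | 1600 | 1600 |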
Thu, Jun 18, 9:52 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)

Yesterday

kevinbazira added a comment to T427497: Deploy CoPE-B-A4B on LiftWing.

Thanks for the clarification, @Tchanders! The cope-b-a4b isvc response has been trimmed to violation, p_violation, p_safe as shown below:

$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/cope-b-a4b:predict" -X POST \
-d '{ "content": "CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW!!!", "policy": "Content must not contain spam, phishing attempts, or deceptive links." }' \
-H  "Host: cope-b-a4b.experimental.wikimedia.org" \
-H "Content-Type: application/json" --http1.1
{"violation":1,"p_violation":1.0,"p_safe":1.522997974471263e-8}
real	0m0.042s
user	0m0.011s
sys	0m0.004s
$ 
$ 
$ 
$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/cope-b-a4b:predict" -X POST \
-d '{ 
"content": "The library opens at 9am on weekdays and 10am on weekends.",
"policy": "Content must not contain spam, phishing attempts, or deceptive links."
 }' \
-H  "Host: cope-b-a4b.experimental.wikimedia.org" \
-H "Content-Type: application/json" --http1.1
{"violation":0,"p_violation":1.538173465229056e-7,"p_safe":0.9999998807907176}
real	0m0.045s
user	0m0.015s
sys	0m0.000s
$ 
$ 
$ 
$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/cope-b-a4b:predict" -X POST \
-d '{ 
"content": "Check out my new blog where I review productivity apps. Link in my bio!",
"policy": "Content must not contain spam, phishing attempts, or deceptive links."
 }' \
-H  "Host: cope-b-a4b.experimental.wikimedia.org" \
-H "Content-Type: application/json" --http1.1
{"violation":0,"p_violation":8.939699493298122e-6,"p_safe":0.999991059383269}
real	0m0.045s
user	0m0.015s
sys	0m0.000s
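
For callers that prefer Python over curl, the same request can be sketched with the standard library. This is a minimal illustration, not the service's official client: the URL and Host header are copied from the curl commands above, while the helper names (`build_request`, `parse_prediction`, `classify`) and the 10-second timeout are assumptions. It also assumes the internal CA certificate is trusted by the Python runtime.

```python
import json
import urllib.request

# Endpoint and Host header copied from the curl examples above.
URL = "https://inference.svc.eqiad.wmnet:30443/v1/models/cope-b-a4b:predict"
HOST = "cope-b-a4b.experimental.wikimedia.org"

# Fields present in the trimmed response shown above.
REQUIRED_FIELDS = {"violation", "p_violation", "p_safe"}


def build_request(content: str, policy: str) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl calls."""
    body = json.dumps({"content": content, "policy": policy}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={"Host": HOST, "Content-Type": "application/json"},
    )


def parse_prediction(raw: bytes) -> dict:
    """Validate and normalise the violation/p_violation/p_safe response."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return {
        "violation": bool(data["violation"]),
        "p_violation": float(data["p_violation"]),
        "p_safe": float(data["p_safe"]),
    }


def classify(content: str, policy: str, timeout: float = 10.0) -> dict:
    """Send one request to the isvc (needs network access to the endpoint)."""
    request = build_request(content, policy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read())
```

Keeping request construction and response parsing separate from the network call means both can be checked offline against the sample responses above.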
Wed, Jun 17, 8:59 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created P94204 example of gpt-oss-safeguard-20b isvc prompt and response with confidence score.
Wed, Jun 17, 6:20 AM · Machine-Learning-Team

Tue, Jun 16

kevinbazira added a comment to T427497: Deploy CoPE-B-A4B on LiftWing.

A vLLM 0.22.1 base image was published in T428577. This enabled us to migrate the cope-b-a4b model-server from HF transformers to vLLM. The latest cope-b-a4b isvc has been deployed in the prod experimental ns:

$ kubectl logs cope-b-a4b-predictor-00001-deployment-57d5857fdd-zxkdx
+ source common_settings.sh
+++ /srv/venv/bin/python -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=model.py
+ exec /srv/venv/bin/python model.py
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
INFO:root:Loading vLLM model...
INFO 06-16 13:59:48 [utils.py:278] non-default args: {'dtype': 'bfloat16', 'max_model_len': 16384, 'disable_log_stats': True, 'model': '/mnt/models'}
INFO 06-16 13:59:58 [model.py:617] Resolved architecture: Gemma4ForCausalLM
INFO 06-16 13:59:58 [model.py:1752] Using max model len 16384
INFO 06-16 13:59:59 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 06-16 13:59:59 [config.py:100] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
INFO 06-16 13:59:59 [vllm.py:977] Asynchronous scheduling is enabled.
INFO 06-16 13:59:59 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
WARNING 06-16 14:00:04 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
(EngineCore pid=128) INFO 06-16 14:00:11 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/mnt/models, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False, 'fuse_mla_dual_rms_norm': False, 'fuse_rope_kvcache': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=128) INFO 06-16 14:00:13 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.67.19.34:37883 backend=nccl
(EngineCore pid=128) INFO 06-16 14:00:13 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=128) INFO 06-16 14:00:13 [gpu_model_runner.py:5037] Starting to load model /mnt/models...
(EngineCore pid=128) INFO 06-16 14:00:14 [rocm.py:507] Using TRITON_ATTN backend (selected via --attention-backend).
(EngineCore pid=128) WARNING 06-16 14:00:14 [activation.py:349] [ROCm] PyTorch's native GELU with tanh approximation is unstable with torch.compile. For native implementation, fallback to 'none' approximation. The custom kernel implementation is unaffected.
(EngineCore pid=128) INFO 06-16 14:00:14 [unquantized.py:285] Using TRITON Unquantized MoE backend out of potential backends: ['ROCm AITER', 'TRITON', 'BATCHED_TRITON'].
(EngineCore pid=128) WARNING 06-16 14:00:14 [compilation.py:1303] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(EngineCore pid=128) INFO 06-16 14:00:14 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 47.00 GiB. Available RAM: 1405.13 GiB.
(EngineCore pid=128) INFO 06-16 14:00:14 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:01<00:19, 1.99s/it]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:04<00:21, 2.37s/it]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:07<00:19, 2.44s/it]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:09<00:17, 2.49s/it]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:12<00:15, 2.50s/it]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:14<00:12, 2.53s/it]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:17<00:10, 2.52s/it]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:19<00:07, 2.54s/it]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:22<00:05, 2.52s/it]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:24<00:02, 2.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:26<00:00, 2.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:26<00:00, 2.37s/it]
(EngineCore pid=128)
(EngineCore pid=128) INFO 06-16 14:00:40 [default_loader.py:397] Loading weights took 26.26 seconds
(EngineCore pid=128) INFO 06-16 14:00:40 [unquantized.py:341] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=128) INFO 06-16 14:00:41 [gpu_model_runner.py:5132] Model loading took 47.42 GiB memory and 26.679009 seconds
(EngineCore pid=128) INFO 06-16 14:00:49 [backends.py:1089] Using cache directory: /srv/.cache/vllm/torch_compile_cache/99fcc63784/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=128) INFO 06-16 14:00:49 [backends.py:1148] Dynamo bytecode transform time: 7.18 s
(EngineCore pid=128) INFO 06-16 14:00:59 [backends.py:378] Cache the graph of compile range (1, 16384) for later use
(EngineCore pid=128) INFO 06-16 14:01:21 [backends.py:393] Compiling a graph for compile range (1, 16384) takes 32.23 s
(EngineCore pid=128) INFO 06-16 14:01:23 [decorators.py:708] saved AOT compiled function to /srv/.cache/vllm/torch_compile_cache/torch_aot_compile/f2b68215aafdf5c86b836bde9dab705f4bd5a24e89ab6bc4933dbb9ad4391bbd/rank_0_0/model
(EngineCore pid=128) INFO 06-16 14:01:23 [monitor.py:53] torch.compile took 41.53 s in total
(EngineCore pid=128) WARNING 06-16 14:01:24 [fused_moe.py:1073] Using default MoE config. Performance might be sub-optimal! Config file not found at /srv/venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=704,device_name=AMD_Instinct_MI300X.json
(EngineCore pid=128) INFO 06-16 14:01:26 [monitor.py:81] Initial profiling/warmup run took 2.70 s
(EngineCore pid=128) INFO 06-16 14:01:29 [gpu_worker.py:466] Available KV cache memory: 123.23 GiB
(EngineCore pid=128) INFO 06-16 14:01:29 [kv_cache_utils.py:1733] GPU KV cache size: 586,815 tokens
(EngineCore pid=128) INFO 06-16 14:01:29 [kv_cache_utils.py:1734] Maximum concurrency for 16,384 tokens per request: 35.82x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:06<00:00, 7.74it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:09<00:00, 5.27it/s]
(EngineCore pid=128) INFO 06-16 14:01:47 [gpu_model_runner.py:6456] Graph capturing finished in 17 secs, took 3.40 GiB
(EngineCore pid=128) INFO 06-16 14:01:47 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=128) INFO 06-16 14:01:47 [core.py:302] init engine (profile, create kv cache, warmup model) took 65.71 s (compilation: 41.53 s)
(EngineCore pid=128) INFO 06-16 14:01:49 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=128) INFO 06-16 14:01:49 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
INFO:root:Model loaded successfully!
INFO:kserve:Registering model: cope-b-a4b
INFO:kserve:Setting max asyncio worker threads as 32
INFO:kserve:OpenAI endpoints registered
INFO:kserve:Time series endpoints not registered
INFO:kserve:Starting uvicorn with 1 workers
INFO:uvicorn.error:Started server process [1]
INFO:uvicorn.error:Waiting for application startup.
INFO:kserve:Starting gRPC server with 4 workers
INFO:kserve:Starting gRPC server on [::]:8081
INFO:uvicorn.error:Application startup complete.
INFO:uvicorn.error:Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
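
The log above reports a GPU KV cache size of 586,815 tokens and a maximum concurrency of 35.82x for 16,384-token requests. The two numbers are consistent with a simple ratio of cache capacity to maximum request length, which the sketch below reproduces. This is only an arithmetic reading of the log lines; vLLM's internal calculation may take block granularity and per-layer cache layouts into account, and the function name is illustrative.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


# cope-b-a4b on the MI300X (values from the log above).
cope = max_concurrency(586_815, 16_384)
print(f"cope-b-a4b: {cope:.2f}x")  # 35.82x, matching the log line

# facebook/opt-125m test run reported elsewhere in this feed:
# 1,691,616 cache tokens with a 2,048-token max model length.
opt = max_concurrency(1_691_616, 2_048)
print(f"opt-125m: {opt:.2f}x")  # 825.98x, matching that run's log line
```

In practice most requests are far shorter than the 16,384-token limit, so the number of requests that fit in the cache at once is usually higher than this worst-case figure.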

Tue, Jun 16, 2:17 PM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created P94190 migrated cope-b-a4b model-server deployed successfully in prod experimental ns on MI300x GPU using vLLM 0.22.1.
Tue, Jun 16, 2:06 PM · Machine-Learning-Team

Mon, Jun 15

kevinbazira created P94140 [cope-b-a4b] common_settings.sh defaults to python3.11 path (/usr/bin/python3) yet the vLLM 0.22.1 image now uses python3.12 path (/srv/venv/bin/python).
Mon, Jun 15, 1:10 PM · Machine-Learning-Team
kevinbazira updated the task description for T427497: Deploy CoPE-B-A4B on LiftWing.
Mon, Jun 15, 10:58 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream from In Progress to Done on the Machine-Learning-Team (Q4 FY2025-26) board.
Mon, Jun 15, 10:57 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira closed T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream as Resolved.
Mon, Jun 15, 10:56 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing

Fri, Jun 12

kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

Fri, Jun 12, 4:43 PM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research

Thu, Jun 11

kevinbazira added a comment to T426766: Upgrade production vLLM image to use vLLM version >= 0.19.

A vLLM 0.22.1 image has been published to the docker registry in: T428577#12008949

Thu, Jun 11, 9:39 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.

The updated WMF Debian vLLM image that supports the latest upstream software stack as of June 2026 is now available in the wikimedia docker registry: https://docker-registry.wikimedia.org/ml/amd-vllm022/tags/

Thu, Jun 11, 9:37 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira added a comment to T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.

This image successfully served both facebook/opt-125m and Qwen/Qwen3.6-27B LLMs as shown below:

$ docker run --rm --network=host -it \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
63fb5c7fcd30 /srv/venv/bin/python -c "
from vllm import LLM, SamplingParams; \
llm = LLM('facebook/opt-125m'); \
print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
INFO 06-10 17:12:46 [utils.py:278] non-default args: {'disable_log_stats': True, 'model': 'facebook/opt-125m'}
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 6.52MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 06-10 17:12:57 [model.py:617] Resolved architecture: OPTForCausalLM
INFO 06-10 17:12:57 [model.py:1752] Using max model len 2048
INFO 06-10 17:12:58 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-10 17:12:58 [vllm.py:977] Asynchronous scheduling is enabled.
INFO 06-10 17:12:58 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 3.66MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 112MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 97.3MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 2.40MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 930kB/s]
WARNING 06-10 17:12:58 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
(EngineCore pid=477) INFO 06-10 17:13:04 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=facebook/opt-125m, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=477) INFO 06-10 17:13:06 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.64.152.11:37955 backend=nccl
(EngineCore pid=477) INFO 06-10 17:13:06 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=477) INFO 06-10 17:13:06 [gpu_model_runner.py:5037] Starting to load model facebook/opt-125m...
(EngineCore pid=477) INFO 06-10 17:13:07 [rocm.py:552] Found incompatible backend(s) [TURBOQUANT] with AttentionType.DECODER. Overriding with ROCM_ATTN out of potential backends: ['ROCM_ATTN', 'TRITON_ATTN'].
(EngineCore pid=477) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=477) INFO 06-10 17:13:08 [weight_utils.py:603] Time spent downloading weights for facebook/opt-125m: 1.600762 seconds
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.17it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.16it/s]
(EngineCore pid=477)
(EngineCore pid=477) INFO 06-10 17:13:09 [default_loader.py:397] Loading weights took 0.20 seconds
(EngineCore pid=477) INFO 06-10 17:13:09 [gpu_model_runner.py:5132] Model loading took 0.24 GiB memory and 2.016705 seconds
(EngineCore pid=477) INFO 06-10 17:13:11 [backends.py:1089] Using cache directory: /srv/.cache/vllm/torch_compile_cache/f64c3edfed/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=477) INFO 06-10 17:13:11 [backends.py:1148] Dynamo bytecode transform time: 1.66 s
(EngineCore pid=477) INFO 06-10 17:13:12 [backends.py:378] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=477) INFO 06-10 17:13:15 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 3.87 s
(EngineCore pid=477) INFO 06-10 17:13:15 [decorators.py:708] saved AOT compiled function to /srv/.cache/vllm/torch_compile_cache/torch_aot_compile/002cb7b67b8ebb41717ce5226a8ef81d7dd619df4a8f0c3395906b5491b10778/rank_0_0/model
(EngineCore pid=477) INFO 06-10 17:13:15 [monitor.py:53] torch.compile took 5.83 s in total
(EngineCore pid=477) INFO 06-10 17:13:16 [monitor.py:81] Initial profiling/warmup run took 0.60 s
(EngineCore pid=477) INFO 06-10 17:13:20 [gpu_worker.py:466] Available KV cache memory: 58.08 GiB
(EngineCore pid=477) INFO 06-10 17:13:20 [kv_cache_utils.py:1733] GPU KV cache size: 1,691,616 tokens
(EngineCore pid=477) INFO 06-10 17:13:20 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 825.98x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 59.13it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 57.22it/s]
(EngineCore pid=477) INFO 06-10 17:13:22 [gpu_model_runner.py:6456] Graph capturing finished in 2 secs, took 1.49 GiB
(EngineCore pid=477) INFO 06-10 17:13:22 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=477) INFO 06-10 17:13:22 [core.py:302] init engine (profile, create kv cache, warmup model) took 12.84 s (compilation: 5.83 s)
(EngineCore pid=477) INFO 06-10 17:13:23 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=477) INFO 06-10 17:13:23 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.42it/s]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore pid=477) WARNING 06-10 17:13:23 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=477) WARNING 06-10 17:13:24 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _fwd_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
Processed prompts: 100%|███████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.91s/it, est. speed input: 2.61 toks/s, output: 2.61 toks/s]
 That is my dad.
(EngineCore pid=477) INFO 06-10 17:13:25 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=477) INFO 06-10 17:13:25 [core.py:1289] Shutdown complete
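
The startup phases reported in the log above (weight loading, torch.compile, engine initialisation) can be pulled out with a few regular expressions when comparing runs. The sketch below is illustrative only: the patterns match the message wording seen in this v0.22.1 log and may need adjusting for other vLLM versions, and the dictionary keys are made-up names.

```python
import re

# Patterns keyed by an illustrative name; wording taken from the log above.
PATTERNS = {
    "weights_load_s": re.compile(r"Loading weights took ([\d.]+) seconds"),
    "model_load_s": re.compile(
        r"Model loading took [\d.]+ GiB memory and ([\d.]+) seconds"
    ),
    "torch_compile_s": re.compile(r"torch\.compile took ([\d.]+) s in total"),
    "engine_init_s": re.compile(r"init engine \(.*?\) took ([\d.]+) s"),
}


def extract_timings(log_text: str) -> dict:
    """Return the first match of each timing pattern, in seconds."""
    timings = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            timings[name] = float(match.group(1))
    return timings


# Four lines copied from the facebook/opt-125m run above.
SAMPLE = """\
(EngineCore pid=477) INFO 06-10 17:13:09 [default_loader.py:397] Loading weights took 0.20 seconds
(EngineCore pid=477) INFO 06-10 17:13:09 [gpu_model_runner.py:5132] Model loading took 0.24 GiB memory and 2.016705 seconds
(EngineCore pid=477) INFO 06-10 17:13:15 [monitor.py:53] torch.compile took 5.83 s in total
(EngineCore pid=477) INFO 06-10 17:13:22 [core.py:302] init engine (profile, create kv cache, warmup model) took 12.84 s (compilation: 5.83 s)
"""

timings = extract_timings(SAMPLE)
for name, seconds in timings.items():
    print(f"{name}: {seconds}")
```

Note that the engine initialisation figure already includes compilation time, as the log line itself states, so the two values should not be added together.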

1$ docker run --rm --network=host -it \
2-e http_proxy=$http_proxy \
3-e https_proxy=$https_proxy \
4--device=/dev/kfd --device=/dev/dri \
5--group-add=$(getent group video | cut -d: -f3) \
6--group-add=$(getent group render | cut -d: -f3) \
7--ipc=host \
8--security-opt seccomp=unconfined \
963fb5c7fcd30 /srv/venv/bin/python -c "
10from vllm import LLM, SamplingParams; \
11llm = LLM('Qwen/Qwen3.6-27B', max_model_len=32768, gpu_memory_utilization=0.95, max_num_seqs=128); \
12print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
INFO 06-10 17:50:39 [utils.py:278] non-default args: {'max_model_len': 32768, 'gpu_memory_utilization': 0.95, 'max_num_seqs': 128, 'disable_log_stats': True, 'model': 'Qwen/Qwen3.6-27B'}
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.31k/4.31k [00:00<00:00, 12.8MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 390/390 [00:00<00:00, 1.92MB/s]
INFO 06-10 17:50:51 [model.py:617] Resolved architecture: Qwen3_5ForConditionalGeneration
INFO 06-10 17:50:51 [model.py:1752] Using max model len 32768
INFO 06-10 17:50:51 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-10 17:50:51 [vllm.py:977] Asynchronous scheduling is enabled.
INFO 06-10 17:50:51 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 16.7k/16.7k [00:00<00:00, 37.9MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.72M/6.72M [00:00<00:00, 115MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.35M/3.35M [00:00<00:00, 171MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:00<00:00, 38.5MB/s]
chat_template.jinja: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 7.76k/7.76k [00:00<00:00, 21.9MB/s]
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 202/202 [00:00<00:00, 1.28MB/s]
video_preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 3.72MB/s]
[transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
WARNING 06-10 17:51:01 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/srv/venv/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /srv/venv/lib/python3.12/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
 warn(
(EngineCore pid=527) INFO 06-10 17:51:07 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='Qwen/Qwen3.6-27B', speculative_config=None, tokenizer='Qwen/Qwen3.6-27B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.6-27B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 256, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=527) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=527) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=527) INFO 06-10 17:51:09 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.64.152.11:52117 backend=nccl
(EngineCore pid=527) INFO 06-10 17:51:09 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=527) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=527) INFO 06-10 17:51:15 [gpu_model_runner.py:5037] Starting to load model Qwen/Qwen3.6-27B...
(EngineCore pid=527) INFO 06-10 17:51:16 [rocm.py:606] Using Flash Attention backend for ViT model.
(EngineCore pid=527) WARNING 06-10 17:51:16 [activation.py:728] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(EngineCore pid=527) INFO 06-10 17:51:16 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=527) INFO 06-10 17:51:16 [qwen_gdn_linear_attn.py:228] Using Triton/FLA GDN prefill kernel (requested=auto, head_k_dim=None).
(EngineCore pid=527) INFO 06-10 17:51:16 [rocm.py:552] Found incompatible backend(s) [TURBOQUANT] with AttentionType.DECODER. Overriding with ROCM_ATTN out of potential backends: ['ROCM_ATTN', 'TRITON_ATTN'].
(EngineCore pid=527) WARNING 06-10 17:51:16 [compilation.py:1303] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
model.safetensors.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 112k/112k [00:00<00:00, 344MB/s]
(EngineCore pid=527) INFO 06-10 17:54:11 [weight_utils.py:603] Time spent downloading weights for Qwen/Qwen3.6-27B: 174.922396 seconds
(EngineCore pid=527) INFO 06-10 17:54:11 [weight_utils.py:922] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 51.75 GiB. Available RAM: 366.91 GiB.
(EngineCore pid=527) INFO 06-10 17:54:11 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/15 [00:01<00:22, 1.60s/it]
Loading safetensors checkpoint shards: 13% Completed | 2/15 [00:03<00:22, 1.69s/it]
Loading safetensors checkpoint shards: 20% Completed | 3/15 [00:05<00:20, 1.73s/it]
Loading safetensors checkpoint shards: 27% Completed | 4/15 [00:06<00:19, 1.74s/it]
Loading safetensors checkpoint shards: 33% Completed | 5/15 [00:08<00:17, 1.75s/it]
Loading safetensors checkpoint shards: 40% Completed | 6/15 [00:10<00:15, 1.78s/it]
Loading safetensors checkpoint shards: 47% Completed | 7/15 [00:12<00:14, 1.81s/it]
Loading safetensors checkpoint shards: 53% Completed | 8/15 [00:14<00:13, 1.89s/it]
Loading safetensors checkpoint shards: 60% Completed | 9/15 [00:16<00:11, 1.86s/it]
Loading safetensors checkpoint shards: 67% Completed | 10/15 [00:17<00:09, 1.83s/it]
Loading safetensors checkpoint shards: 73% Completed | 11/15 [00:19<00:07, 1.81s/it]
Loading safetensors checkpoint shards: 80% Completed | 12/15 [00:21<00:05, 1.80s/it]
Loading safetensors checkpoint shards: 87% Completed | 13/15 [00:23<00:03, 1.72s/it]
Loading safetensors checkpoint shards: 93% Completed | 14/15 [00:24<00:01, 1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:25<00:00, 1.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:25<00:00, 1.69s/it]
(EngineCore pid=527)
(EngineCore pid=527) INFO 06-10 17:54:37 [default_loader.py:397] Loading weights took 25.35 seconds
(EngineCore pid=527) INFO 06-10 17:54:37 [gpu_model_runner.py:5132] Model loading took 51.1 GiB memory and 201.167954 seconds
(EngineCore pid=527) INFO 06-10 17:54:37 [interface.py:649] Setting attention block size to 784 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=527) INFO 06-10 17:54:37 [interface.py:673] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=527) INFO 06-10 17:54:38 [gpu_model_runner.py:6136] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=527) INFO 06-10 17:54:48 [backends.py:1089] Using cache directory: /srv/.cache/vllm/torch_compile_cache/cc01156159/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=527) INFO 06-10 17:54:48 [backends.py:1148] Dynamo bytecode transform time: 8.01 s
(EngineCore pid=527) INFO 06-10 17:54:51 [backends.py:378] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=527) INFO 06-10 17:55:34 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 45.55 s
(EngineCore pid=527) INFO 06-10 17:55:36 [decorators.py:708] saved AOT compiled function to /srv/.cache/vllm/torch_compile_cache/torch_aot_compile/0f16e4fcd10f6a7a86bc79d9feb083a5b3472a09cf196adf6871bd2fb02cf4b5/rank_0_0/model
(EngineCore pid=527) INFO 06-10 17:55:36 [monitor.py:53] torch.compile took 55.84 s in total
(EngineCore pid=527) INFO 06-10 17:56:38 [monitor.py:81] Initial profiling/warmup run took 62.19 s
(EngineCore pid=527) INFO 06-10 17:56:43 [gpu_worker.py:466] Available KV cache memory: 6.73 GiB
(EngineCore pid=527) INFO 06-10 17:56:43 [kv_cache_utils.py:1733] GPU KV cache size: 101,944 tokens
(EngineCore pid=527) INFO 06-10 17:56:43 [kv_cache_utils.py:1734] Maximum concurrency for 32,768 tokens per request: 3.11x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00, 6.43it/s]
Capturing CUDA graphs (decode, FULL): 0%| | 0/19 [00:00<?, ?it/s](EngineCore pid=527) WARNING 06-10 17:56:49 [chunked_prefill_paged_decode.py:414] Cannot use ROCm custom paged attention kernel, falling back to Triton implementation.
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:03<00:00, 4.78it/s]
(EngineCore pid=527) INFO 06-10 17:56:53 [gpu_model_runner.py:6456] Graph capturing finished in 10 secs, took 4.64 GiB
(EngineCore pid=527) INFO 06-10 17:56:53 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=527) INFO 06-10 17:56:53 [core.py:302] init engine (profile, create kv cache, warmup model) took 135.86 s (compilation: 55.84 s)
(EngineCore pid=527) INFO 06-10 17:56:54 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=527) INFO 06-10 17:56:54 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 43.52it/s]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore pid=527) WARNING 06-10 17:56:54 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _zero_kv_blocks_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=527) WARNING 06-10 17:56:54 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=527) WARNING 06-10 17:56:55 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _causal_conv1d_fwd_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=527) WARNING 06-10 17:56:55 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _fused_post_conv_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=527) WARNING 06-10 17:56:57 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _fwd_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
Processed prompts: 100%|███████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.94s/it, est. speed input: 1.02 toks/s, output: 1.27 toks/s]
 I'm now working on
(EngineCore pid=527) INFO 06-10 17:56:58 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=527) INFO 06-10 17:56:58 [core.py:1289] Shutdown complete
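
For anyone reading the numbers in this log, here is a rough sanity check of two of the reported figures. It is only a sketch: it assumes the "Maximum concurrency" value is simply the KV cache capacity divided by the per-request maximum, which matches the logged 3.11x but may not be exactly how vLLM computes it for hybrid attention/mamba models. All values are copied from the log.

```python
# Values copied from the log output.
kv_cache_tokens = 101_944   # "GPU KV cache size: 101,944 tokens"
max_model_len = 32_768      # max_model_len passed to LLM(...)

# Assumption: concurrency ~= cache capacity / longest possible request.
max_concurrency = kv_cache_tokens / max_model_len
print(f"max concurrency: {max_concurrency:.2f}x")

# Start-up breakdown: most of the model loading time is the weight download.
download_s = 174.922396     # "Time spent downloading weights"
load_s = 25.35              # "Loading weights took 25.35 seconds"
reported_total_s = 201.167954
accounted_s = download_s + load_s
print(f"download + load: {accounted_s:.2f} s of {reported_total_s:.2f} s")
```

Since roughly 175 of the 201 seconds are download time, pre-populating the Hugging Face cache in the container would likely remove most of the model loading delay.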

Thu, Jun 11, 7:46 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira added a comment to T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.

Using the official upstream pre-built wheels, I've upgraded the wmf-debian-vllm image to support the latest vLLM software stack as of June 2026. The key updates are:

Thu, Jun 11, 7:27 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira added a comment to T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.

Following T428577#11998794, we added ROCm 7.2.0 packages to the Wikimedia bookworm mirror as shown here: https://apt-browser.toolforge.org/bookworm-wikimedia/thirdparty/amd-rocm72/

Thu, Jun 11, 7:01 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing

Wed, Jun 10

kevinbazira created P94022 vllm serving Qwen/Qwen3.6-27B in wmf-debian-vllm container that supports ROCm 7.2 and vLLM 0.22 on MI210 GPU.
Wed, Jun 10, 5:59 PM · Machine-Learning-Team
kevinbazira edited P93986 vllm serving facebook/opt-125m in wmf-debian-vllm container that supports ROCm 7.2 and vLLM 0.22 on MI210 GPU.
Wed, Jun 10, 5:14 PM · Machine-Learning-Team
kevinbazira created P93986 vllm serving facebook/opt-125m in wmf-debian-vllm container that supports ROCm 7.2 and vLLM 0.22 on MI210 GPU.
Wed, Jun 10, 9:02 AM · Machine-Learning-Team

Tue, Jun 9

kevinbazira added a comment to T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.

The Wikimedia bookworm mirror currently contains ROCm 7.0 as the latest packages:

Wikimedia bookworm mirror with ROCm 7.0 - Screenshot from 2026-06-09 13-54-38 (1,850×1,129 px, 533 KB)

Tue, Jun 9, 11:27 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira moved T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
Tue, Jun 9, 11:14 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira edited projects for T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream, added: Machine-Learning-Team (Q4 FY2025-26); removed Machine-Learning-Team.
Tue, Jun 9, 11:14 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira created T428577: Update WMF Debian vLLM image to use pre-built wheels from upstream.
Tue, Jun 9, 11:12 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Lift-Wing
kevinbazira attached a referenced file: F87590062: Library page that shows pre-generated articles in TTS prototype - 2026-06-09 08-00-35.mp4.
Tue, Jun 9, 7:21 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T428435: TTS batch generation pipeline for Featured Articles.

We have also added a library page to the prototype so that you can browse what has been generated so far:

Tue, Jun 9, 7:20 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T428435: TTS batch generation pipeline for Featured Articles.

We have started running the batch-generation pipeline using the steps documented in the project README:
https://gitlab.wikimedia.org/toolforge-repos/wiki-tts/-/tree/f95b9c0c0d4642ea2b95d5995f2747b3f20596e7#7-batch-tts-generation-pipeline

Tue, Jun 9, 7:19 AM · Machine-Learning-Team (Q4 FY2025-26)

Mon, Jun 8

kevinbazira moved T427262: Add static audio URL for native OS audio players in TTS prototype from In Progress to Done on the Machine-Learning-Team (Q4 FY2025-26) board.
Mon, Jun 8, 11:22 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427262: Add static audio URL for native OS audio players in TTS prototype, a subtask of T424378: Explore options to run TTS models for evaluation, as Resolved.
Mon, Jun 8, 11:21 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427262: Add static audio URL for native OS audio players in TTS prototype as Resolved.

Closing this task as this feature is now live. Please feel free to re-open if needed.

Mon, Jun 8, 11:21 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T427488: Add word-level timestamps in TTS prototype from In Progress to Done on the Machine-Learning-Team (Q4 FY2025-26) board.
Mon, Jun 8, 11:19 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427488: Add word-level timestamps in TTS prototype as Resolved.
Mon, Jun 8, 11:19 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427488: Add word-level timestamps in TTS prototype, a subtask of T424378: Explore options to run TTS models for evaluation, as Resolved.
Mon, Jun 8, 11:19 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T428435: TTS batch generation pipeline for Featured Articles from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
Mon, Jun 8, 11:18 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T428435: TTS batch generation pipeline for Featured Articles.
Mon, Jun 8, 11:18 AM · Machine-Learning-Team (Q4 FY2025-26)

Fri, Jun 5

kevinbazira claimed T427497: Deploy CoPE-B-A4B on LiftWing.
Fri, Jun 5, 3:45 PM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

Fri, Jun 5, 11:41 AM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research

Thu, Jun 4

kevinbazira added a comment to T427497: Deploy CoPE-B-A4B on LiftWing.

We have updated the liftwing_client to support this new cope-b-a4b endpoint and shared it with the PSI team to continue using it to fine-tune their policies.

Thu, Jun 4, 7:03 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T427497: Deploy CoPE-B-A4B on LiftWing.

The zentropi-ai/cope-b-a4b docs list three hosting options: the zentropi API, vLLM, and HF transformers. Since our vLLM base image doesn't yet support cope-b-a4b (see P93623), we built the model-server on HF transformers instead, which loads and runs the model successfully (see P93624).

Thu, Jun 4, 6:30 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created P93837 cope-b-a4b LLM loaded successfully in prod experimental ns on MI300x GPU.
Thu, Jun 4, 6:22 AM · Machine-Learning-Team

Wed, Jun 3

kevinbazira edited P93624 HF transformers successfully loaded cope-b-a4b LLM.
Wed, Jun 3, 8:58 AM · Machine-Learning-Team
kevinbazira added a subtask for T418267: Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing: T427497: Deploy CoPE-B-A4B on LiftWing.
Wed, Jun 3, 7:02 AM · WE4.12 Content policy model evaluation, Machine-Learning-Team (Q4 FY2025-26), Goal
kevinbazira added a parent task for T427497: Deploy CoPE-B-A4B on LiftWing: T418267: Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing.
Wed, Jun 3, 7:02 AM · Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to P93623 vllm0.14 fails to load cope-b-a4b LLM.

HF transformers successfully loaded in: P93624

Wed, Jun 3, 4:07 AM · Machine-Learning-Team
kevinbazira created P93624 HF transformers successfully loaded cope-b-a4b LLM.
Wed, Jun 3, 4:04 AM · Machine-Learning-Team
kevinbazira added a comment to P93623 vllm0.14 fails to load cope-b-a4b LLM.

The cope-b-a4b model requires vLLM ≥ 0.20.2 based on: https://huggingface.co/zentropi-ai/cope-b-a4b#system-requirements

Wed, Jun 3, 3:40 AM · Machine-Learning-Team
kevinbazira created P93623 vllm0.14 fails to load cope-b-a4b LLM.
Wed, Jun 3, 3:38 AM · Machine-Learning-Team

Tue, Jun 2

kevinbazira added a comment to T427488: Add word-level timestamps in TTS prototype.

Sharing this from Slack for posterity:

Tue, Jun 2, 11:17 AM · Machine-Learning-Team (Q4 FY2025-26)

Mon, Jun 1

kevinbazira added a comment to T427488: Add word-level timestamps in TTS prototype.

We have also added a demo of this feature in the TTS prototype UI. Now when you play a section's audio, the transcript below the audio player highlights each spoken word in real time so that you can follow along:

Mon, Jun 1, 4:44 PM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T427488: Add word-level timestamps in TTS prototype.

We have added word-level timestamps to the TTS prototype. Each audio section (.mp3) now comes with a companion WebVTT caption file (.vtt) with per-word start and end times:

| Endpoint | Purpose | Responses |
| https://wiki-tts.toolforge.org/audio/Earth/Lead.mp3 | Serving of .mp3 (existed) | HTTP 200 if .mp3 exists, HTTP 404 if .mp3 doesn't exist on disk |
| https://wiki-tts.toolforge.org/audio/Earth/Lead.vtt | Serving of .vtt (new) | HTTP 200 if .vtt exists, HTTP 404 if .vtt doesn't exist on disk |
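
As a rough illustration of how a client could consume such a caption file, the sketch below parses per-word cues and looks up the word being spoken at a given playback time. The sample text and the helper names (`parse_word_cues`, `word_at`) are hypothetical; the actual .vtt files served by the prototype may lay out cues differently (for example, with cue identifiers or `mm:ss.ttt` timestamps, which this regex does not handle).

```python
import re

# Hypothetical sample; not copied from the prototype's output.
SAMPLE_VTT = """WEBVTT

00:00:00.000 --> 00:00:00.420
Earth

00:00:00.420 --> 00:00:00.610
is
"""

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt" followed by one line of cue text.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})\n(.+)"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp components (strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_word_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, word) tuples."""
    cues = []
    for match in CUE_RE.finditer(vtt_text):
        g = match.groups()
        cues.append((to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8].strip()))
    return cues


def word_at(cues, t):
    """Return the word spoken at playback time t (seconds), or None."""
    for start, end, word in cues:
        if start <= t < end:
            return word
    return None


cues = parse_word_cues(SAMPLE_VTT)
print(cues)
print(word_at(cues, 0.5))
```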
Mon, Jun 1, 4:38 PM · Machine-Learning-Team (Q4 FY2025-26)

Fri, May 29

kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

Fri, May 29, 11:16 AM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research
kevinbazira updated the task description for T426756: Fix text normalization edge cases in TTS prototype.
Fri, May 29, 10:47 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T426756: Fix text normalization edge cases in TTS prototype.

We found another text normalization edge case while working on T427488: Add word-level timestamps in TTS prototype.

Fri, May 29, 10:46 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira updated the task description for T427488: Add word-level timestamps in TTS prototype.
Fri, May 29, 7:06 AM · Machine-Learning-Team (Q4 FY2025-26)

Thu, May 28

kevinbazira moved T427488: Add word-level timestamps in TTS prototype from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
Thu, May 28, 9:23 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T427488: Add word-level timestamps in TTS prototype.
Thu, May 28, 9:22 AM · Machine-Learning-Team (Q4 FY2025-26)

Wed, May 27

kevinbazira created P93273 TTS service runs into OOM in a 2GB RAM container when word-level timestamps feature is activated.
Wed, May 27, 3:33 PM · Machine-Learning-Team
kevinbazira added a comment to T427262: Add static audio URL for native OS audio players in TTS prototype.

We have added a new endpoint that provides static audio URLs alongside the existing endpoint that handles on-demand generation and serving:

| Endpoint | Purpose | Responses |
| https://wiki-tts.toolforge.org/audio?article=Earth&section=Lead | On-demand generation + serving of .mp3 (existed) | HTTP 200 if .mp3 exists, HTTP 202 queue generation if .mp3 doesn't exist on disk, HTTP 404 if article/section doesn't exist on Wikipedia |
| https://wiki-tts.toolforge.org/audio/Earth/Lead.mp3 | Serving of .mp3 (new) | HTTP 200 if .mp3 exists, HTTP 404 if .mp3 doesn't exist on disk |
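
To make the status codes concrete, here is a minimal client-side sketch of polling the on-demand endpoint until the audio is ready. The function names, retry count, and delay are assumptions for illustration, and the HTTP call is injected as `get_status` so the example runs without network access; a real client would plug in its own HTTP library there.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://wiki-tts.toolforge.org/audio"


def on_demand_url(article, section):
    """Build the on-demand endpoint URL for one article section."""
    return f"{BASE_URL}?{urlencode({'article': article, 'section': section})}"


def wait_for_audio(url, get_status, retries=5, delay_s=2.0, sleep=time.sleep):
    """Poll url until the audio exists.

    get_status(url) must return the HTTP status code as an int.
    """
    for _ in range(retries):
        status = get_status(url)
        if status == 200:      # .mp3 exists and is being served
            return "ready"
        if status == 404:      # article/section doesn't exist on Wikipedia
            return "not found"
        if status == 202:      # generation queued; try again later
            sleep(delay_s)
            continue
        raise RuntimeError(f"unexpected status {status}")
    return "still generating"


# Demo with a fake status source instead of real HTTP requests.
fake_statuses = iter([202, 202, 200])
result = wait_for_audio(
    on_demand_url("Earth", "Lead"),
    get_status=lambda url: next(fake_statuses),
    sleep=lambda seconds: None,
)
print(result)
```

Native OS audio players generally cannot follow this 202-then-retry flow, which is why the static URL only ever answers 200 or 404.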
Wed, May 27, 9:06 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira updated the task description for T427262: Add static audio URL for native OS audio players in TTS prototype.
Wed, May 27, 9:05 AM · Machine-Learning-Team (Q4 FY2025-26)

Tue, May 26

kevinbazira moved T427262: Add static audio URL for native OS audio players in TTS prototype from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
Tue, May 26, 10:51 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T427262: Add static audio URL for native OS audio players in TTS prototype.
Tue, May 26, 10:51 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T427173: Add section heading announcements in TTS prototype from In Progress to Done on the Machine-Learning-Team (Q4 FY2025-26) board.
Tue, May 26, 7:32 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427173: Add section heading announcements in TTS prototype, a subtask of T424378: Explore options to run TTS models for evaluation, as Resolved.
Tue, May 26, 7:31 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T427173: Add section heading announcements in TTS prototype as Resolved.
Tue, May 26, 7:31 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T426756: Fix text normalization edge cases in TTS prototype from In Progress to Done on the Machine-Learning-Team (Q4 FY2025-26) board.
Tue, May 26, 7:01 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T426756: Fix text normalization edge cases in TTS prototype, a subtask of T424378: Explore options to run TTS models for evaluation, as Resolved.
Tue, May 26, 7:01 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira closed T426756: Fix text normalization edge cases in TTS prototype as Resolved.
Tue, May 26, 7:01 AM · Machine-Learning-Team (Q4 FY2025-26)

Mon, May 25

kevinbazira added a comment to T427173: Add section heading announcements in TTS prototype.

Following T427173#11952147, we investigated how industry-leading TTS engines control pause durations and found that Google Cloud TTS, Amazon Polly, and Azure TTS use a <break> tag, an element of the W3C Speech Synthesis Markup Language (SSML) that adds or modifies pauses and silences in the generated audio.
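
For reference, a section heading followed by a pause could be expressed in SSML roughly as sketched below. The helper name and the 700 ms default are made up for illustration, and this only applies to engines that accept SSML input; it is not a description of what the prototype does.

```python
from xml.sax.saxutils import escape


def heading_with_pause(heading, body, pause_ms=700):
    """Wrap a heading and body in SSML with a <break> between them."""
    return (
        "<speak>"
        f'{escape(heading)}<break time="{pause_ms}ms"/>'
        f"{escape(body)}"
        "</speak>"
    )


ssml = heading_with_pause("History", "Earth formed over 4.5 billion years ago.", 500)
print(ssml)
```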

Mon, May 25, 11:17 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T427173: Add section heading announcements in TTS prototype.

The Kokoro model demo shows we can use punctuation (; : , . ! ? — … " ( ) “ ”) to add pauses and intonation between words.

Mon, May 25, 11:09 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira moved T427173: Add section heading announcements in TTS prototype from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
Mon, May 25, 7:11 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T427173: Add section heading announcements in TTS prototype.
Mon, May 25, 7:10 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T426766: Upgrade production vLLM image to use vLLM version >= 0.19.

@kevinbazira @DPogorzelski-WMF Since you worked on the previous build of this image, is there any documentation available for this process?

Mon, May 25, 4:39 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)

Fri, May 22

kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

Fri, May 22, 12:59 PM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research

Thu, May 21

kevinbazira updated the task description for T426756: Fix text normalization edge cases in TTS prototype.
Thu, May 21, 11:43 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T426756: Fix text normalization edge cases in TTS prototype.

We have also added a nemo_whitelist.tsv. Without it, NeMo would treat unrecognised domain-specific terms such as "UNESCO" as regular words, causing the TTS service to read them as "an eh sko" rather than "yoo neh sko". The whitelist either preserves such terms as-is or maps them to a phonetic spelling, so the TTS service pronounces them correctly. The list can be expanded whenever we notice new custom, Wikipedia-specific, or domain-specific terms that aren't being spoken well.

>>> from wiki_tts.text import clean_spoken_text, init_nemo
>>> 
>>> init_nemo() 
 NeMo-text-processing :: INFO     :: Post processing graph was restored from /tmp/wiki-tts-nemo-grammars/en_tn_post_processing.far.
 NeMo-text-processing :: INFO     :: ClassifyFst.fst was restored from /tmp/wiki-tts-nemo-grammars/en_tn_True_deterministic_cased_nemo_whitelist.tsv_tokenize.far.
 NeMo-text-processing :: INFO     :: VerbalizeFinalFst graph was restored from /tmp/wiki-tts-nemo-grammars/en_tn_True_deterministic_verbalizer.far.
>>> 
>>> # Custom/Wikipedia-specific terms: Domain vocabulary not handled by general-purpose text normalization
>>> clean_spoken_text("NASA launched a mission.")
'NASA launched a mission.'
>>> clean_spoken_text("UNESCO declared a world heritage site.")
'yoo neh sko declared a world heritage site.'
>>> clean_spoken_text("DNA and RNA are nucleic acids.")
'DNA and RNA are nucleic acids.'
>>> clean_spoken_text("AI technology is advancing.")
'AI technology is advancing.'
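
The idea behind the whitelist can be sketched in plain Python as a lookup from written form to spoken form. This is only an illustration: NeMo compiles its whitelist into FST grammars (the .far files in the log above) rather than doing regex substitution, and the two-column "written, spoken" layout and the entries below are assumptions, not the contents of the real nemo_whitelist.tsv.

```python
import csv
import io
import re

# Hypothetical whitelist contents: written form <TAB> spoken form.
WHITELIST_TSV = "UNESCO\tyoo neh sko\nNASA\tNASA\n"


def load_whitelist(tsv_text):
    """Parse tab-separated 'written, spoken' rows into a dict."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def apply_whitelist(text, whitelist):
    """Replace whole-word whitelist terms with their spoken form."""
    if not whitelist:
        return text
    # Longest terms first so longer entries win over shorter prefixes.
    terms = sorted(whitelist, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    return pattern.sub(lambda m: whitelist[m.group(1)], text)


whitelist = load_whitelist(WHITELIST_TSV)
print(apply_whitelist("UNESCO declared a world heritage site.", whitelist))
print(apply_whitelist("NASA launched a mission.", whitelist))
```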
Thu, May 21, 11:42 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T426756: Fix text normalization edge cases in TTS prototype.

Following T426756#11944373, we integrated the NeMo text processing library into the TTS prototype since it handles most of the nuanced text normalization edge cases outlined in this task's description.

Thu, May 21, 11:40 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira updated the task description for T426756: Fix text normalization edge cases in TTS prototype.
Thu, May 21, 11:39 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T426756: Fix text normalization edge cases in TTS prototype.

One thing we found while fixing subscript/superscript edge cases is that Wikipedia's plain-text extract API:
https://en.wikipedia.org/w/api.php?action=query&titles=Square_metre&prop=extracts&explaintext=1&format=json
strips all formatting. Content like m<sub>2</sub> (i.e. m₂) and m<sup>2</sup> (i.e. m²) both arrive as m2. This ends up being read as "m two". The digit is pronounced correctly, but the superscript meaning ("squared") is lost.
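
The loss of information can be reproduced with a small sketch. It assumes that tag stripping behaves roughly like a naive regex removal. The strip_tags helper is hypothetical and is a stand-in for the extract API, not its actual implementation.

```python
import re

# Matches any HTML tag such as <sub>, </sub>, <sup>, </sup>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Remove all tags and keep only the text between them."""
    return TAG_RE.sub("", html)


subscript = "m<sub>2</sub>"
superscript = "m<sup>2</sup>"

# Both inputs collapse to the same plain text, so the distinction is gone.
print(strip_tags(subscript))
print(strip_tags(superscript))
print(strip_tags(subscript) == strip_tags(superscript))
```

Once both forms have collapsed to the same string, no downstream text normalization step can tell "m squared" apart from "m sub two".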

Thu, May 21, 11:37 AM · Machine-Learning-Team (Q4 FY2025-26)

Wed, May 20

kevinbazira updated the task description for T426756: Fix text normalization edge cases in TTS prototype.
Wed, May 20, 3:54 PM · Machine-Learning-Team (Q4 FY2025-26)

May 19 2026

kevinbazira moved T426756: Fix text normalization edge cases in TTS prototype from Backlog to In Progress on the Machine-Learning-Team (Q4 FY2025-26) board.
May 19 2026, 2:25 PM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T426756: Fix text normalization edge cases in TTS prototype.
May 19 2026, 2:25 PM · Machine-Learning-Team (Q4 FY2025-26)

May 15 2026

kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

May 15 2026, 9:57 AM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research
kevinbazira added a comment to T424378: Explore options to run TTS models for evaluation.

In T424378#11903226, the focus was on vertical scaling. However, after the discussions in T425804#11913283 and T425804#11914308, this project was allocated 24Gi RAM to enable horizontal scaling. This approach involves deploying 10 small worker replicas (1 CPU and 2Gi RAM each) alongside the web server, as vertical scaling is not currently supported on Toolforge. Below are the steps I used to host this TTS prototype on Toolforge:

May 15 2026, 9:48 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira added a comment to T425804: Request increased quota for wiki-tts Toolforge tool.

Thanks a lot, everyone : )

May 15 2026, 7:50 AM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira added a comment to P92483 Test horizontal scaling of TTS prototype celery workers on toolforge.

Following the quota bump requested in T425804#11914308, we are able to run 10 replicas with 1 CPU and 2Gi RAM each:

$ toolforge jobs delete celery-worker
$ toolforge jobs run celery-worker \
--command "export ORT_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1 NUMEXPR_NUM_THREADS=1 && cd ~/www/python/src && ~/www/python/venv/bin/celery -A worker worker --pool solo --loglevel=info" \
--image python3.11 \
--continuous \
--replicas 10 \
--mem 2Gi \
--cpu 1
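
As a rough sanity check on the replica count, the memory arithmetic can be sketched as below. The inputs are assumptions: a 24Gi namespace limit (the figure mentioned in T424378), 512Mi already used by the web server (taken from the earlier quota output), and 2Gi per worker. The max_replicas helper is hypothetical, and the real scheduler may account for requests and limits differently.

```python
GIB = 1024  # MiB per GiB


def max_replicas(limit_mib, used_mib, per_replica_mib):
    """Return how many replicas of a given size fit in the remaining memory quota."""
    if per_replica_mib <= 0:
        raise ValueError("per_replica_mib must be positive")
    free_mib = max(limit_mib - used_mib, 0)
    return free_mib // per_replica_mib


# Assumed inputs: 24Gi limit, 512Mi in use, 2Gi per worker.
print(max_replicas(24 * GIB, 512, 2 * GIB))  # 11

# Same worker size under the earlier 8Gi limit.
print(max_replicas(8 * GIB, 512, 2 * GIB))  # 3
```

Under these assumptions, 10 replicas fit within the new limit with a little headroom, while the earlier 8Gi limit leaves room for only 3 workers of this size.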
May 15 2026, 7:49 AM · Machine-Learning-Team

May 13 2026

kevinbazira added a comment to T425804: Request increased quota for wiki-tts Toolforge tool.

+1 approved

thanks!

May 13 2026, 4:19 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira updated the task description for T425804: Request increased quota for wiki-tts Toolforge tool.
May 13 2026, 4:17 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)

May 12 2026

kevinbazira added a comment to T425804: Request increased quota for wiki-tts Toolforge tool.

Do you have the option of scaling horizontally, having many small pods rather than one giant one? That could be as simple as "toolforge jobs run --replicas N". In case you want to go this route, I'm also tagging @fnegri who might be able to provide more detailed support.

Thanks for the suggestion, @Andrew. I tested the horizontal scaling approach (toolforge jobs run --replicas N), and it works, as detailed in P92483. I was able to scale up to 3 replicas before hitting our current 8Gi namespace memory limit.

May 12 2026, 6:08 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira created P92483 Test horizontal scaling of TTS prototype celery workers on toolforge.
May 12 2026, 6:02 PM · Machine-Learning-Team
kevinbazira added a comment to T425909: Request creation of wikitts VPS project.

Thanks @komla! I shared feedback from the ML and APPs team: T425804#11913105

May 12 2026, 2:56 PM · Machine-Learning-Team (Q4 FY2025-26), Cloud-VPS (Project-requests)
kevinbazira added a comment to T425804: Request increased quota for wiki-tts Toolforge tool.

Hi folks! We are discussing this task during our weekly meeting.

We would really like to support you on toolforge rather than move you to cloud-vps, but to do that we need to plan and implement some backend changes that will take a few weeks. Is it possible for you to wait a bit until we have a plan?

May 12 2026, 2:55 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)

May 11 2026

kevinbazira added a comment to T425804: Request increased quota for wiki-tts Toolforge tool.

Thanks, everyone, for your suggestions. We have moved forward with T425909: Request creation of wikitts VPS project.

May 11 2026, 8:09 AM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira added a project to T425909: Request creation of wikitts VPS project: Machine-Learning-Team (Q4 FY2025-26).
May 11 2026, 8:07 AM · Machine-Learning-Team (Q4 FY2025-26), Cloud-VPS (Project-requests)
kevinbazira added a project to T425804: Request increased quota for wiki-tts Toolforge tool: Machine-Learning-Team (Q4 FY2025-26).
May 11 2026, 8:06 AM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira added a parent task for T425909: Request creation of wikitts VPS project: T424378: Explore options to run TTS models for evaluation.
May 11 2026, 8:02 AM · Machine-Learning-Team (Q4 FY2025-26), Cloud-VPS (Project-requests)
kevinbazira added a subtask for T424378: Explore options to run TTS models for evaluation: T425909: Request creation of wikitts VPS project.
May 11 2026, 8:02 AM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T425909: Request creation of wikitts VPS project.
May 11 2026, 8:00 AM · Machine-Learning-Team (Q4 FY2025-26), Cloud-VPS (Project-requests)

May 8 2026

kevinbazira added a comment to T419288: Q4 FY2025-26 Goal: Text-to-Speech.

Weekly Update:

May 8 2026, 6:09 PM · Patch-For-Review, Goal, Machine-Learning-Team (Q4 FY2025-26), Research
kevinbazira added a parent task for T425804: Request increased quota for wiki-tts Toolforge tool: T424378: Explore options to run TTS models for evaluation.
May 8 2026, 5:55 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira added a subtask for T424378: Explore options to run TTS models for evaluation: T425804: Request increased quota for wiki-tts Toolforge tool.
May 8 2026, 5:55 PM · Machine-Learning-Team (Q4 FY2025-26)
kevinbazira created T425804: Request increased quota for wiki-tts Toolforge tool.
May 8 2026, 5:53 PM · Machine-Learning-Team (Q4 FY2025-26), Toolforge (Quota-requests)
kevinbazira added a comment to T424378: Explore options to run TTS models for evaluation.

Following T424378#11888422, we started preparing to host the TTS prototype on Toolforge. We found that the celery-worker job can currently run with at most --mem 4Gi --cpu 2 on Toolforge because of the resource quotas shown below:

tools.wiki-tts@tools-bastion-15:~$ kubectl describe resourcequotas
Name:                   tool-wiki-tts
Namespace:              tool-wiki-tts
Resource                Used   Hard
--------                ----   ----
configmaps              4      10
count/cronjobs.batch    0      50
count/deployments.apps  1      16
count/jobs.batch        0      15
limits.cpu              500m   16
limits.memory           512Mi  8Gi
persistentvolumeclaims  0      0
pods                    1      16
requests.cpu            125m   16
requests.memory         256Mi  8Gi
secrets                 4      64
services                1      16
services.nodeports      0      0
May 8 2026, 5:16 PM · Machine-Learning-Team (Q4 FY2025-26)