
Host Qwen 3.6-27B as an inference service
Open, MediumPublic

Description

Summary

Deploy Qwen 3.6-27B (official FP8 quantized weights, ~30 GB) on Lift Wing using vLLM AsyncLLMEngine with tensor parallelism across 2 GPUs, supporting configurable reasoning and non-reasoning modes per request.

Technical notes

Model: Qwen/Qwen3.6-27B-FP8 (Apache 2.0), 27B dense parameters, hybrid attention (Gated DeltaNet + Gated Attention), 262K native context. Official fine-grained FP8 quantization (block size 128). Text-only serving with vision encoder disabled. Same FP8 model used everywhere.

Serving: vLLM AsyncLLMEngine via the same KServe + Blubber pattern as gpt-oss-safeguard-20b. Initial base image is amd-vllm014 (vLLM 0.14); if FP8 support is absent, a new base image with vLLM >= 0.19.0 will be needed. Reasoning mode toggled via reasoning field in the request payload. GPU_MEMORY_UTILIZATION set to 0.85, MAX_MODEL_LEN to 32768 (conservative).

Deployment: 2x MI300X GPU partitions (2 amd.com/gpu). There are no MI300X nodes in the staging (codfw) cluster, so both testing and production run in the experimental namespace on ml-serve-eqiad (ml-serve1012-15). Follow the same deployment-charts pattern as gpt-oss-safeguard-20b in helmfile.d/ml-services/experimental/values-ml-serve-eqiad.yaml, with these env overrides: MODEL_NAME=qwen36-27b, STORAGE_URI=s3://wmf-ml-models/llm/qwen36-27b/, TRUST_REMOTE_CODE=True, DTYPE=auto, GPU_MEMORY_UTILIZATION=0.85, MAX_MODEL_LEN=32768, TENSOR_PARALLEL_SIZE=2. Model weights need to be uploaded to S3.

Acceptance criteria

  • Model server loads Qwen/Qwen3.6-27B-FP8 and serves predictions via vLLM AsyncLLMEngine
  • Upload model to swift -> s3://wmf-ml-models/llm/Qwen3.6-27B-FP8
  • CI pipeline publishes the machinelearning-liftwing-inference-services-qwen36 image
  • Service deployed in experimental namespace on ml-serve-eqiad and verified with curl

Event Timeline

Change #1284645 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP)feat: add qwen36-27b model server for Qwen 3.6 FP8 inference

https://gerrit.wikimedia.org/r/1284645

Change #1285374 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) qwen36-27b: add model server for Qwen 3.6 FP8 inference

https://gerrit.wikimedia.org/r/1285374

Change #1285375 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) qwen36-27b: add model server for Qwen 3.6 FP8 inference

https://gerrit.wikimedia.org/r/1285375

Change #1285374 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] (WIP) qwen36-27b: add model server for Qwen 3.6 FP8 inference

Reason:

duplicate patch

https://gerrit.wikimedia.org/r/1285374

Change #1284645 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] (WIP) qwen36-27b: add model server for Qwen 3.6 FP8 inference

Reason:

duplicate patch

https://gerrit.wikimedia.org/r/1284645

Change #1285395 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] (WIP) ml: add vLLM 0.19.1 image

https://gerrit.wikimedia.org/r/1285395

Change #1286313 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[integration/config@master] inference-services: Add qwen36 llm model CI pipelines.

https://gerrit.wikimedia.org/r/1286313

Change #1286313 merged by jenkins-bot:

[integration/config@master] inference-services: Add qwen36 llm model CI/CD pipelines.

https://gerrit.wikimedia.org/r/1286313

Mentioned in SAL (#wikimedia-releng) [2026-05-12T11:50:45Z] <James_F> Zuul: [machinelearning/liftwing/inference-services] Add qwen36 llm model CI/CD pipelines, for T425680

The model has been uploaded to swift under s3://wmf-ml-models/llm/Qwen3.6-27B-FP8

@gkyziridis has added the CI pipelines for this service.
The patch that adds the model is now ready for review https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1285375

This is an initial report for the qwen model deployment, generated with the optimize-model skill.
I found it pretty detailed, so I am pasting it here.

Qwen36-27B Optimization Report

Current State

Model: Qwen 3.6 27B FP8 (~30GB, TP=2 on MI300X)
Code: src/models/qwen36/model_server/model.py
Base Image: amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14
Deployment Chart: None yet — needs to be created in operations/deployment-charts/helmfile.d/ml-services/experimental/


Findings by Layer

Inference Backend

| Finding | Severity | Detail |
| --- | --- | --- |
| AsyncLLMEngine used correctly | - | Non-blocking continuous batching; correct choice |
| enable_prefix_caching=True | - | Reduces latency for repeated prompts |
| reasoning_parser="qwen3" | - | Correct for Qwen3 thinking mode |
| FP8 quantization | - | Good match for MI300X; model is pre-quantized, no runtime overhead |

vLLM Configuration — 2 Critical, 2 Major

| Finding | Severity | Detail |
| --- | --- | --- |
| max_model_len=262144 | Critical | Reserves KV cache for full 262K context. A 27B model at 262K context on MI300X with TP=2 would OOM. Typical Lift Wing usage is well under 64K tokens. Recommend: 65536 |
| No max_num_seqs / max_num_batched_tokens | Critical | Not passed to AsyncEngineArgs. Defaults to 256/2048 internally, but can't be tuned via env vars in the chart. Must add env var support in code |
| gpu_memory_utilization=0.85 | Major | Leaves ~29GB idle on 192GB HBM3. With TP=2 overhead, 0.90-0.92 is safe. Recommend: 0.90 |
| No block_size config | Major | Default 16 is suboptimal for AMD. The gpt-oss-safeguard-20b reference uses 64. Recommend: 64, configurable via env |

ROCm / GPU Tuning — 3 Critical, 1 Major

| Finding | Severity | Detail |
| --- | --- | --- |
| Missing VLLM_ROCM_USE_AITER=1 | Critical | AITER is available in the base image but not enabled. Without it, flash attention falls back to Triton. For Qwen, the ROCm guide recommends AITER MHA (standard, not Unified Attention). Must be set in chart env vars. |
| Missing TORCH_BLAS_PREFER_HIPBLASLT=1 | Critical | Optimized GEMM kernels for MI300X. The gpt-oss-safeguard-20b reference sets this. Significant throughput impact. |
| Missing P2P communication config | Critical | TP=2 requires proper P2P setup. Missing: HSA_FORCE_FINE_GRAIN_PCIE=1, HSA_ENABLE_IPC_MODE_LEGACY=0, NCCL_P2P_DISABLE=0, NCCL_SOCKET_IFNAME=lo, VLLM_HOST_IP=127.0.0.1, VLLM_SKIP_P2P_CHECK=0, TORCH_SYMM_MEM_DISABLE_MULTICAST=1 |
| Missing /dev/shm volume | Major | TP=2 needs shared memory for NCCL/RCCL collective ops. The reference config uses emptyDir with sizeLimit: 8Gi. Without it, falls back to slower socket-based communication. |

External Dependencies — Clean

No findings. Qwen36 has no external API calls — pure text-in/text-out LLM. No caching, retry, or timeout concerns.

KServe / Model Server

| Finding | Severity | Detail |
| --- | --- | --- |
| ModelServer().start([model]), no workers | - | Correct for GPU vLLM. AsyncLLMEngine handles concurrency internally. Adding workers would duplicate the engine. |
| preprocess is synchronous | ⚠️ Suggestion | Tokenizer apply_chat_template is fast (~ms). Not worth making async. Only matters if external API calls were added. |

Python / Code — 1 Major

| Finding | Severity | Detail |
| --- | --- | --- |
| AsyncEngineArgs missing key parameters | Major | max_num_seqs, max_num_batched_tokens, and block_size are not passed to the engine. These control throughput vs. latency tradeoffs and must be tunable without code changes. Add env var support. |
| from distutils.util import strtobool | ⚠️ Suggestion | distutils was deprecated in Python 3.10 and removed in Python 3.12. Use a custom helper or str(payload.get("reasoning", False)).lower() in ("true", "1", "yes"). This is cosmetic on older Python versions, but the import fails on 3.12 and later. |
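
A minimal sketch of such a custom helper, assuming the reasoning flag may arrive either as a JSON boolean or as a string. The name parse_bool is hypothetical and not taken from model.py; the accepted strings mirror the ones distutils' strtobool recognised.

```python
# Hypothetical replacement for distutils.util.strtobool (not from model.py).
_TRUE = {"y", "yes", "t", "true", "on", "1"}
_FALSE = {"n", "no", "f", "false", "off", "0"}


def parse_bool(value, default=False):
    """Interpret a request-payload value as a boolean."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    # Reject unknown strings instead of silently treating them as False.
    raise ValueError(f"invalid truth value: {value!r}")


payload = {"prompt": "What is 2+2?", "reasoning": "true"}
print(parse_bool(payload.get("reasoning")))
```

Unlike the one-line lower() check, this rejects unrecognised values instead of mapping them to False, which may or may not be the behaviour wanted for a public endpoint.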

Kubernetes Resources — Needs Creation

The deployment chart doesn't exist. The reference is gpt-oss-safeguard-20b in llm/values-ml-serve-eqiad.yaml. For experimental staging in codfw, add to experimental/values-ml-staging-codfw.yaml:

yaml
qwen36-27b:
  predictor:
    image: "machinelearning-liftwing-inference-services-qwen36"
    image_version: "<ci-published-tag>"
    custom_env:
      - name: MODEL_NAME
        value: "qwen36-27b"
      - name: STORAGE_URI
        value: "s3://wmf-ml-models/qwen36-27b/"
      - name: TRUST_REMOTE_CODE
        value: "False"
      - name: GPU_MEMORY_UTILIZATION
        value: "0.90"
      - name: MAX_MODEL_LEN
        value: "65536"
      - name: TENSOR_PARALLEL_SIZE
        value: "2"
      - name: MAX_NUM_BATCHED_TOKENS
        value: "32768"
      - name: MAX_NUM_SEQS
        value: "256"
      - name: BLOCK_SIZE
        value: "64"
      - name: VLLM_ROCM_USE_AITER
        value: "1"
      - name: TORCH_BLAS_PREFER_HIPBLASLT
        value: "1"
      - name: HSA_FORCE_FINE_GRAIN_PCIE
        value: "1"
      - name: HSA_ENABLE_IPC_MODE_LEGACY
        value: "0"
      - name: NCCL_P2P_DISABLE
        value: "0"
      - name: NCCL_SOCKET_IFNAME
        value: lo
      - name: NCCL_SHM_DISABLE
        value: "0"
      - name: NCCL_DEBUG
        value: "INFO"
      - name: VLLM_SKIP_P2P_CHECK
        value: "0"
      - name: VLLM_HOST_IP
        value: "127.0.0.1"
      - name: RCCL_MSCCL_ENABLE
        value: "0"
      - name: MAX_JOBS
        value: "1"
      - name: AMDGCN_USE_BUFFER_OPS
        value: "0"
      - name: TORCH_SYMM_MEM_DISABLE_MULTICAST
        value: "1"
    container:
      resources:
        limits:
          cpu: "16"
          memory: 80Gi
          amd.com/gpu: "2"
        requests:
          cpu: "16"
          memory: 80Gi
          amd.com/gpu: "2"
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
    config:
      maxReplicas: 1
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "mi300x-experiments"
          effect: "NoSchedule"
        - key: "dedicated"
          operator: "Equal"
          value: "mi300x-experiments"
          effect: "NoExecute"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "kubernetes.io/hostname"
                    operator: "In"
                    values:
                      - "ml-serve1012.eqiad.wmnet"
                      - "ml-serve1013.eqiad.wmnet"
                      - "ml-serve1014.eqiad.wmnet"
                      - "ml-serve1015.eqiad.wmnet"

Summary of Required Code Changes

src/models/qwen36/model_server/model.py

  1. Add max_num_seqs, max_num_batched_tokens, block_size to __init__ and AsyncEngineArgs
  2. Change MAX_MODEL_LEN default: 262144 → 65536
  3. Change GPU_MEMORY_UTILIZATION default: 0.85 → 0.90
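
The first item could look roughly like the sketch below. It is not the code in model.py: the function name engine_kwargs_from_env is hypothetical, the defaults are the values from the task description (32768, 0.85, TP=2), and the optional settings are only added when the env var is present so that vLLM's own defaults apply otherwise.

```python
import os

# Optional tunables: env var name -> engine argument name.
_OPTIONAL_INT_ARGS = (
    ("MAX_NUM_SEQS", "max_num_seqs"),
    ("MAX_NUM_BATCHED_TOKENS", "max_num_batched_tokens"),
    ("BLOCK_SIZE", "block_size"),
)


def engine_kwargs_from_env(env=None):
    """Collect engine settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    kwargs = {
        "max_model_len": int(env.get("MAX_MODEL_LEN", "32768")),
        "gpu_memory_utilization": float(env.get("GPU_MEMORY_UTILIZATION", "0.85")),
        "tensor_parallel_size": int(env.get("TENSOR_PARALLEL_SIZE", "2")),
    }
    for env_name, arg_name in _OPTIONAL_INT_ARGS:
        raw = env.get(env_name)
        if raw:  # unset or empty: leave the engine default untouched
            kwargs[arg_name] = int(raw)
    return kwargs


print(engine_kwargs_from_env({"BLOCK_SIZE": "64", "MAX_NUM_SEQS": "128"}))
```

The resulting dict is meant to be unpacked into the engine arguments, for example AsyncEngineArgs(model=..., **engine_kwargs_from_env()), assuming the installed vLLM version accepts these keyword names.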

Deployment Chart

Add qwen36-27b to experimental/values-ml-staging-codfw.yaml as shown above.


That is interesting indeed! Some things are off though, so it is worth reviewing it with some judgement.

  1. This refers to a deployment but we're just talking about the code for now.
  2. The vLLM Configuration section is totally off because it assumes that we will use 192GB of VRAM, whereas we use GPU partitioning and plan to start with 2-4 partitions.

All these parameters are configurable via env vars so we can adjust them as needed.

What I find really useful is the connection between max_model_len, the KV cache, and the GPU memory utilization parameter, but I would definitely do the math for a couple of iterations to make sure that the above makes sense.
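
A back-of-the-envelope version of that math, as a sketch only. Every number below is a placeholder: the partition size, the overhead, and the attention shape (layer count, KV heads, head dimension) are not taken from the Qwen3.6-27B config, and only full-attention layers should be counted because the Gated DeltaNet layers keep a fixed-size state rather than a per-token cache. The formula itself is the standard one for a key/value cache.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_attn_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all full-attention layers."""
    # Factor 2 covers the key tensor and the value tensor.
    return 2 * num_attn_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(mem_per_gpu_gib, num_gpus, gpu_memory_utilization,
                      weights_gib, overhead_gib, bytes_per_token):
    """Rough number of tokens the KV cache can hold across all GPUs."""
    budget_gib = (mem_per_gpu_gib * num_gpus * gpu_memory_utilization
                  - weights_gib - overhead_gib)
    if budget_gib <= 0:
        return 0  # weights and overhead alone exceed the allowed memory
    return int(budget_gib * GIB // bytes_per_token)


# Placeholder shape and sizes, for illustration only.
per_token = kv_bytes_per_token(num_attn_layers=16, num_kv_heads=4, head_dim=256)
for utilization in (0.85, 0.90):
    capacity = kv_token_capacity(48, 2, utilization, weights_gib=30,
                                 overhead_gib=4, bytes_per_token=per_token)
    for max_model_len in (32768, 65536):
        # Last column: how many full-length sequences fit at the same time.
        print(utilization, max_model_len, capacity, capacity // max_model_len)
```

Plugging in the real partition size and the values from the model's config.json should show quickly whether a given MAX_MODEL_LEN leaves room for more than a handful of concurrent full-length requests.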

I've implemented the streaming responses by using the OpenAIChatAdapter. Although this works locally I'm pretty sure that we're going to face some issues in the cluster as requests might be terminated. Looking forward to testing it though!

This can be tested by disabling curl buffering (-N) and setting the "stream" parameter in the payload:

curl -N localhost:8080/openai/v1/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-0.6b","prompt":"What is 2+2?","max_tokens":50,"stream":true}'

This is what the responses look like. The generated text can be extracted from the "text" field of each entry in the "choices" list.

data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","created":1778748470,"model":"qwen3-0.6b","choices":[{"index":0,"text":" Is","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","created":1778748470,"model":"qwen3-0.6b","choices":[{"index":0,"text":" it","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","created":1778748470,"model":"qwen3-0.6b","choices":[{"index":0,"text":" possible","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}
....
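
A minimal sketch of that extraction, assuming the stream is consumed line by line and that each event is a single "data: <json>" line as in the output above. The helper name extract_text is hypothetical, and the sample chunks are trimmed copies of the ones shown (unused fields dropped).

```python
import json


def extract_text(sse_lines):
    """Concatenate the "text" of every choice in a completions stream."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("text") or "")
    return "".join(parts)


sample = [
    'data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","choices":[{"index":0,"text":" Is","finish_reason":null}]}',
    "",
    'data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","choices":[{"index":0,"text":" it","finish_reason":null}]}',
    "",
    'data: {"id":"ae6d8ca612fb4e3c","object":"text_completion","choices":[{"index":0,"text":" possible","finish_reason":null}]}',
]
print(repr(extract_text(sample)))
```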

Change #1287362 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] (WIP)ml-services: add qwen36-27b to experimental

https://gerrit.wikimedia.org/r/1287362

Change #1285375 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] qwen36-27b: add model server for Qwen 3.6 FP8 inference

https://gerrit.wikimedia.org/r/1285375

Change #1287362 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add qwen36-27b to experimental ns

https://gerrit.wikimedia.org/r/1287362

The initial attempt at this deployment unfortunately fails because the current vllm + transformers versions do not support this architecture: model_type `qwen3_5` is not supported by vllm 0.14. I have reverted the deployment-charts patch (so basically undeployed the whole thing).
The deployment results in a CrashLoopBackOff and the logs show the following stack trace:

Traceback (most recent call last):
  File "/srv/app/model.py", line 404, in <module>
    model.load()
  File "/srv/app/model.py", line 93, in load
    raise kserve.errors.ModelMissingError(error_message)
kserve.errors.ModelMissingError: Failed to load model. Reason: 1 validation error for ModelConfig
  Value error, The checkpoint you are trying to load has model type `qwen3_5` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
    For further information visit https://errors.pydantic.dev/2.12/v/value_error
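
A failure like this could in principle be caught before deploying by comparing the checkpoint's model_type with what the image supports. The sketch below only illustrates the idea: ensure_supported is a hypothetical helper, and the set of supported types is supplied by the caller rather than queried from vLLM or Transformers.

```python
import json


def ensure_supported(config_json, supported_types):
    """Fail fast if the checkpoint's model_type is not in the supported set."""
    model_type = json.loads(config_json).get("model_type")
    if model_type not in supported_types:
        raise RuntimeError(
            f"model_type {model_type!r} is not supported by this image; "
            f"supported: {sorted(supported_types)}"
        )
    return model_type


# Illustrative inputs: a config.json excerpt and a made-up supported set.
checkpoint_config = '{"model_type": "qwen3_5"}'
try:
    ensure_supported(checkpoint_config, {"qwen3", "gpt_oss"})
except RuntimeError as err:
    print(err)
```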

I see the following alternatives (which can also be combined):

  1. Update the vllm production image to a newer version (vllm 0.19) as done in the WIP patch https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1285395
  2. Switch to an older model that is supported by the current stack and work our way towards 1 incrementally. A good candidate would be https://huggingface.co/Qwen/Qwen3-14B or https://huggingface.co/Qwen/Qwen3-32B with more partitions

I would recommend starting with option 2: deploy a model that uses the same architecture as the embedding model we already have, and then move on to updating the base vllm image.

Change #1289372 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[operations/deployment-charts@master] ml-services: Deploy qwen3-14b model in experimental ns.

https://gerrit.wikimedia.org/r/1289372

Change #1289372 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Deploy qwen3-14b model in experimental ns.

https://gerrit.wikimedia.org/r/1289372

Update

The deployment of Qwen3-14B model on experimental eqiad was successful.
Streaming is working as well via openai/v1/chat/completions:
API call with "stream": true:

curl -Ns https://inference.svc.eqiad.wmnet:30443/openai/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Host: qwen3-14b.experimental.wikimedia.org" \
  -d '{
    "model": "qwen3-14b",
    "messages": [
      {"role": "user", "content": "Explain gravity in one sentence."}
    ],
    "max_tokens": 50,
    "stream": true
  }'

Response:

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":"Gravity","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":" is","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":" the","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":" fundamental","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

...

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":" between","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

data: {"id":"8cecd046abf71d62","object":"chat.completion.chunk","created":1779278509,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":" them","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

data: [DONE]
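
On the client side, the chat chunks above can be folded back into a single message. The sketch below is an assumption about how a consumer might do it, not part of the service: collect_chat_stream is a hypothetical name, and it keeps reasoning output separate from the answer because the chunks carry "reasoning" and "reasoning_content" next to "content". The sample chunks are trimmed copies of the ones shown.

```python
import json


def collect_chat_stream(sse_lines):
    """Accumulate delta content and reasoning until the [DONE] sentinel."""
    content, reasoning = [], []
    finish_reason = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        for choice in json.loads(payload).get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content.append(delta["content"])
            # Both field names appear in the chunks; use whichever is set.
            thought = delta.get("reasoning") or delta.get("reasoning_content")
            if thought:
                reasoning.append(thought)
            finish_reason = choice.get("finish_reason") or finish_reason
    return {
        "content": "".join(content),
        "reasoning": "".join(reasoning),
        "finish_reason": finish_reason,
    }


sample = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Gravity","reasoning":null,"reasoning_content":null},"finish_reason":null}]}',
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":" is","reasoning":null,"reasoning_content":null},"finish_reason":null}]}',
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":" the","reasoning":null,"reasoning_content":null},"finish_reason":null}]}',
    "data: [DONE]",
]
print(collect_chat_stream(sample))
```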

Classic KServe v1 API call:

curl -s -i https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-14b:predict \
-X POST \
-H "Content-Type: application/json" \
-H "Host: qwen3-14b.experimental.wikimedia.org" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 50}'

Response:

HTTP/2 200 
content-length: 117
content-type: application/json
date: Wed, 20 May 2026 12:04:46 GMT
server: istio-envoy
x-envoy-upstream-service-time: 75

{"model_name":"qwen3-14b","response":"The capital of France is **Paris**.","prompt_tokens":19,"completion_tokens":10}

The text generation endpoint openai/v1/completions uses "prompt" instead of "messages":

curl -Ns https://inference.svc.eqiad.wmnet:30443/openai/v1/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Host: qwen3-14b.experimental.wikimedia.org" \
  -d '{
    "model": "qwen3-14b",
    "prompt": "Explain gravity in one sentence.",
    "max_tokens": 50,
    "stream": true
  }'

Response:

data: {"id":"88d30502576326c3","object":"text_completion","created":1779364754,"model":"qwen3-14b","choices":[{"index":0,"text":" Gravity","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"88d30502576326c3","object":"text_completion","created":1779364754,"model":"qwen3-14b","choices":[{"index":0,"text":" is","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"88d30502576326c3","object":"text_completion","created":1779364754,"model":"qwen3-14b","choices":[{"index":0,"text":" the","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

...

data: {"id":"86e68e07b53363e6","object":"text_completion","created":1779364817,"model":"qwen3-14b","choices":[{"index":0,"text":".","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"86e68e07b53363e6","object":"text_completion","created":1779364817,"model":"qwen3-14b","choices":[{"index":0,"text":" Let","logprobs":null,"finish_reason":null,"stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":null,"system_fingerprint":null}

data: {"id":"86e68e07b53363e6","object":"text_completion","created":1779364817,"model":"qwen3-14b","choices":[{"index":0,"text":" me","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_token_ids":null,"token_ids":null}],"usage":{"prompt_tokens":7,"total_tokens":57,"completion_tokens":50,"prompt_tokens_details":null},"system_fingerprint":null}

data: [DONE]

Change #1289996 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[operations/deployment-charts@master] api-gateway: Configure qwen3-14b in api gateway.

https://gerrit.wikimedia.org/r/1289996

Change #1290019 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] gateway-check: inference post-migration cleanup

https://gerrit.wikimedia.org/r/1290019

Change #1290019 merged by Clément Goubert:

[operations/puppet@production] gateway-check: inference post-migration cleanup

https://gerrit.wikimedia.org/r/1290019

Change #1289996 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: Configure qwen3-14b in rest-gateway

https://gerrit.wikimedia.org/r/1289996

rest-gateway routing is operational:

cgoubert@deploy1003$ curl -s https://rest-gateway.discovery.wmnet:4113/service/lw/inference/v1/models/qwen3-14b/openai/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
    "model": "qwen3-14b",
    "messages": [
      {"role": "user", "content": "Explain gravity in one sentence."}
    ],
    "max_tokens": 50,
    "stream": true
  }'
data: {"id":"8d973f418089162a","object":"chat.completion.chunk","created":1779790003,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":"Gravity","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}
[...]
cgoubert@deploy2003$ curl  https://rest-gateway.discovery.wmnet:4113/service/lw/inference/v1/models/qwen3-14b:predict   -X POST   -H "Content-Type: application/json"      -d '{"prompt": "What is the capital of France?", "max_tokens": 50}'
{"model_name":"qwen3-14b","response":"The capital of France is **Paris**.","prompt_tokens":19,"completion_tokens":10}

so is the ATS patch:

❯ curl https://api.wikimedia.org/service/lw/inference/v1/models/qwen3-14b:predict   -X POST   -H "Content-Type: application/json"      -d '{"prompt": "What is the capital of France?", "max_tokens": 50}'
{"model_name":"qwen3-14b","response":"The capital of France is Paris.","prompt_tokens":19,"completion_tokens":8}
❯ curl https://api.wikimedia.org/service/lw/inference/v1/models/qwen3-14b/openai/v1/chat/completions   -X POST   -H "Content-Type: application/json"      -d '{
    "model": "qwen3-14b",
    "messages": [
      {"role": "user", "content": "Explain gravity in one sentence."}
    ],
    "max_tokens": 50,
    "stream": true
  }'
data: {"id":"879ea166c28ad33f","object":"chat.completion.chunk","created":1779790199,"model":"qwen3-14b","choices":[{"index":0,"delta":{"role":"assistant","content":"Gravity","reasoning":null,"reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null,"token_ids":null}],"usage":null,"prompt_token_ids":null,"system_fingerprint":null}

Thanks for the work here! One thing I noticed: SSE responses are buffered when going through ATS.

Through rest-gateway directly (internal), chunks arrive incrementally over the full generation window, so it is actually streaming. However, through api.wikimedia.org all chunks for a 500-token generation land in a single ~50 ms burst at the end, all sharing the same created timestamp.

curl -Ns https://api.wikimedia.org/service/lw/inference/v1/models/qwen3-14b/openai/v1/chat/completions \
  -X POST -H "Content-Type: application/json" \
  -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Write a 500-word essay on ML."}],"max_tokens":600,"stream":true}'
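
One way to make the buffering check repeatable is to record when each chunk arrives and look at the spread. This is a sketch under assumptions: both helper names are hypothetical, the 0.5 s threshold is arbitrary, and in real use the lines would come from an HTTP client that yields the response body line by line as it is received.

```python
import time


def arrival_times(lines, clock=time.monotonic):
    """Timestamp every SSE data line at the moment it is read."""
    return [clock() for line in lines if line.strip().startswith("data:")]


def looks_buffered(times, min_spread_s=0.5):
    """True when all chunks arrived within a burst shorter than min_spread_s."""
    if len(times) < 2:
        return False  # not enough chunks to judge
    return (times[-1] - times[0]) < min_spread_s


# Synthetic timestamps in seconds: a burst at the end vs. incremental arrival.
print(looks_buffered([10.00, 10.01, 10.05]))
print(looks_buffered([10.0, 11.2, 14.9]))
```

Running the same measurement against rest-gateway and api.wikimedia.org with a long generation would give a direct comparison of the two paths.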

If this turns out to be something more difficult we should track it separately.
@Clement_Goubert do you know what would be required in order to achieve this? I'm wondering if adding no-transform to the response cache-control we defined would be enough to get an unbuffered response through ATS

response_headers_to_add:
  - key: cache-control
    value: no-cache, no-transform

After some testing, both the rest-gateway and ATS stream the response correctly. The issue is in the upper-layer of the edge cache, where the whole of api.wikimedia.org has a normal caching configuration vs a pipe configuration. See the difference between api.wikimedia.org and stream.wikimedia.org.

One thing we can do is add a pipe configuration for that subpath, but if you plan on adding more streaming APIs, it may be smarter to have a standard path for streaming APIs, for instance /service/lw/inference/v1/streaming/.+

I'll tag in Traffic in case they have ideas or objections.

Change #1293746 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] cache::text: pipe caching for lw streaming API

https://gerrit.wikimedia.org/r/1293746

Change #1293746 merged by Clément Goubert:

[operations/puppet@production] cache::text: pipe caching for lw streaming API

https://gerrit.wikimedia.org/r/1293746

I've merged the fix and tested it on the cache servers I hit from the outside: https://api.wikimedia.org/service/lw/inference/v1/models/qwen3-14b/openai/v1/chat/completions now streams correctly. The fix should be live on all cache servers within 30 minutes.

jijiki triaged this task as Medium priority. Wed, May 27, 10:27 AM

Cool! Moving it to Radar on our side; feel free to ping me on this task if you need us again.