Summary
Deploy Qwen 3.6-27B (official FP8 quantized weights, ~30 GB) on Lift Wing using vLLM AsyncLLMEngine with tensor parallelism across 2 GPUs, supporting configurable reasoning and non-reasoning modes per request.
Technical notes
Model: Qwen/Qwen3.6-27B-FP8 (Apache 2.0), 27B dense parameters, hybrid attention (Gated DeltaNet + Gated Attention), 262K native context. Official fine-grained FP8 quantization (block size 128). Text-only serving with the vision encoder disabled. The same FP8 checkpoint is used for both testing and production.
Serving: vLLM AsyncLLMEngine via the same KServe + Blubber pattern as gpt-oss-safeguard-20b. The initial base image is amd-vllm014 (vLLM 0.14); if it lacks the required FP8 support, a new base image with vLLM >= 0.19.0 will be needed. Reasoning mode is toggled per request via the reasoning field in the request payload. GPU_MEMORY_UTILIZATION is set to 0.85 and MAX_MODEL_LEN to 32768 (conservative compared with the 262K native context).
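The per-request toggle described above could be read on the server side roughly as follows. This is a minimal sketch, not the model server's actual code: only the reasoning field name comes from this task. The helper name reasoning_enabled, the boolean type of the field, the non-reasoning default and the prompt key are all assumptions.

```python
from typing import Any, Mapping


def reasoning_enabled(payload: Mapping[str, Any], default: bool = False) -> bool:
    """Return the per-request reasoning flag, falling back to `default`.

    Assumption: `reasoning` is a JSON boolean and a missing field means
    non-reasoning mode. The real payload schema may differ.
    """
    value = payload.get("reasoning", default)
    if not isinstance(value, bool):
        raise ValueError("'reasoning' must be a JSON boolean")
    return value


# Two hypothetical request bodies; only the `reasoning` key is taken from the task.
fast = {"prompt": "Summarize this edit.", "reasoning": False}
slow = {"prompt": "Summarize this edit.", "reasoning": True}
```

Rejecting non-boolean values early keeps a typo such as "reasoning": "false" from silently enabling the slower mode.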
Deployment: 2x MI300X GPU partitions (2 amd.com/gpu). There are no MI300X nodes in the staging (codfw) cluster, so both testing and production run in the experimental namespace on ml-serve-eqiad (ml-serve1012-15). Follow the same deployment-charts pattern as gpt-oss-safeguard-20b in helmfile.d/ml-services/experimental/values-ml-serve-eqiad.yaml, with these env overrides: MODEL_NAME=qwen36-27b, STORAGE_URI=s3://wmf-ml-models/llm/qwen36-27b/, TRUST_REMOTE_CODE=True, DTYPE=auto, GPU_MEMORY_UTILIZATION=0.85, MAX_MODEL_LEN=32768, TENSOR_PARALLEL_SIZE=2. Model weights need to be uploaded to S3.
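To illustrate how the env overrides listed above might turn into engine settings, here is a hedged sketch that parses them into a plain dict. The dict keys follow vLLM engine argument names (served_model_name, dtype, trust_remote_code, gpu_memory_utilization, max_model_len, tensor_parallel_size), but the mapping from env vars to those arguments, the helper names and the fallback defaults are assumptions. The actual model server may wire them differently.

```python
import os
from typing import Any, Mapping


def _as_bool(raw: str) -> bool:
    """Parse values such as 'True', 'true' or '1' into a Python bool."""
    return raw.strip().lower() in {"1", "true", "yes"}


def engine_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Build engine keyword arguments from the deployment's env overrides.

    Hypothetical helper: the fallback defaults mirror the values in this task.
    """
    utilization = float(env.get("GPU_MEMORY_UTILIZATION", "0.85"))
    if not 0.0 < utilization <= 1.0:
        raise ValueError("GPU_MEMORY_UTILIZATION must be in (0, 1]")
    return {
        "served_model_name": env["MODEL_NAME"],
        "dtype": env.get("DTYPE", "auto"),
        "trust_remote_code": _as_bool(env.get("TRUST_REMOTE_CODE", "False")),
        "gpu_memory_utilization": utilization,
        "max_model_len": int(env.get("MAX_MODEL_LEN", "32768")),
        "tensor_parallel_size": int(env.get("TENSOR_PARALLEL_SIZE", "2")),
    }


# The overrides from this task, as they would appear in the container environment.
overrides = {
    "MODEL_NAME": "qwen36-27b",
    "TRUST_REMOTE_CODE": "True",
    "DTYPE": "auto",
    "GPU_MEMORY_UTILIZATION": "0.85",
    "MAX_MODEL_LEN": "32768",
    "TENSOR_PARALLEL_SIZE": "2",
}
kwargs = engine_kwargs_from_env(overrides)
```

Parsing the values explicitly matters because every env var arrives as a string. Without it, TRUST_REMOTE_CODE=False would still be truthy.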
Acceptance criteria
- Model server loads Qwen/Qwen3.6-27B-FP8 and serves predictions via vLLM AsyncLLMEngine
- Model weights are uploaded to Swift at s3://wmf-ml-models/llm/Qwen3.6-27B-FP8
- CI pipeline publishes the machinelearning-liftwing-inference-services-qwen36 image
- Service deployed in experimental namespace on ml-serve-eqiad and verified with curl
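
The curl verification in the last criterion could also be scripted. The sketch below only builds the request and does not send it. It assumes the service exposes a KServe V1-style /v1/models/<name>:predict route and is reached through a gateway that routes on the Host header. The function name, the route choice, the example.org hostnames and the body fields other than reasoning are hypothetical placeholders, not the real Lift Wing endpoints.

```python
import json
import urllib.request


def build_check_request(
    base_url: str, host_header: str, model_name: str, body: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a smoke test.

    Assumption: a KServe V1-style predict route; swap in the route the
    model server actually exposes.
    """
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Host": host_header},
        method="POST",
    )


# Placeholder endpoint and host; replace with the real values before use.
req = build_check_request(
    "https://inference.example.org:30443/",
    "qwen36-27b.experimental.example.org",
    "qwen36-27b",
    {"prompt": "ping", "reasoning": False},
)
# To send it: urllib.request.urlopen(req, timeout=60)
```

Running the check once with reasoning set to false and once with true would exercise both modes named in the summary.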