
Deploy CoPE-A on LiftWing
Open, Needs Triage, Public

Description

We'd like to be able to evaluate the performance of CoPE-A on LiftWing. We will also want the ability to use custom weights that can be downloaded from Zentropi.ai

Event Timeline

kostajh updated the task description.

I've managed to spin up the CoPE-A model on the ml-lab1002 machine on a single MI210 GPU and tested it with a sample request.

Some important early findings:

  1. First, gpt-oss-safeguard-20b is more memory-efficient despite being the "bigger" model, thanks to its MoE architecture and the fact that it is published with mxfp4 quantization support. This lets us run it on a 16 GB VRAM GPU.
  2. CoPE-A-9B requires 17.27 GiB of VRAM just to store the model weights on the GPU, which is problematic because our MI300X GPUs are partitioned into 16 GB VRAM instances on LiftWing. This could be addressed by quantizing the current BF16 weights; however, the MI210 GPU does not support FP8 quantization, so we would need to apply and test quantization directly on an MI300X (which might be hard, as we currently don't have a development environment with an MI300X). See the back-of-envelope VRAM estimate below.
  3. gpt-oss-safeguard gives reasoning and rationale, whereas CoPE-A-9B is designed to only give binary yes/no answers.

Based on these findings, deploying CoPE-A on LiftWing will definitely require more development time, as we can't just run it out of the box the way we could with gpt-oss-safeguard-20b.
We should also consider whether we want to pursue this direction if CoPE-A supports only binary answers.
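
For reference, here is a quick back-of-envelope estimate of the weight memory. This is only a sketch: the ~9.24B parameter count for Gemma-2-9B is my assumption, and it ignores KV cache, activations and runtime overhead.

# Rough estimate of VRAM needed just to store the model weights.
# Assumption: Gemma-2-9B has roughly 9.24e9 parameters.
params = 9.24e9
bytes_per_param = {'bf16': 2, 'int8': 1, 'mxfp4': 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f'{dtype}: ~{gib:.2f} GiB')

# bf16 comes out at ~17.2 GiB, i.e. above a 16 GB partition,
# while int8 (or lower) would fit comfortably.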


Here are the steps to run the model on ml-lab1002:

  1. You need to log into HuggingFace and accept licenses for Gemma-2 and CoPE-A to download model weights!
  2. Download models:
python3 -m venv venv
source venv/bin/activate
pip install huggingface_hub

export HF_TOKEN="your_token_here"

https_proxy="http://webproxy:8080" python -c \
  "from huggingface_hub import snapshot_download; snapshot_download('google/gemma-2-9b')"

https_proxy="http://webproxy:8080" python -c \
  "from huggingface_hub import snapshot_download; snapshot_download('zentropi-ai/cope-a-9b')"
  3. Merge the LoRA adapter
https_proxy="http://webproxy:8080" python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-9b', torch_dtype=torch.bfloat16, device_map='cpu'
)
model = PeftModel.from_pretrained(base, 'zentropi-ai/cope-a-9b')
merged = model.merge_and_unload()

merged.save_pretrained('./cope-a-9b-merged')
AutoTokenizer.from_pretrained('google/gemma-2-9b').save_pretrained('./cope-a-9b-merged')
print('Done')
"
  4. Fix tokenizer compatibility to match the transformers version installed in our vLLM image:
python -c "
import json

with open('./cope-a-9b-merged/tokenizer_config.json', 'r') as f:
    config = json.load(f)

if 'extra_special_tokens' in config and isinstance(config['extra_special_tokens'], list):
    config['extra_special_tokens'] = {t: t for t in config['extra_special_tokens']}

with open('./cope-a-9b-merged/tokenizer_config.json', 'w') as f:
    json.dump(config, f, indent=2)

print('Fixed')
"
  5. Start vLLM
docker run --device=/dev/kfd --device=/dev/dri \
  --group-add 105 \
  --group-add video \
  --security-opt seccomp=unconfined \
  --shm-size 16g \
  -e HF_HUB_OFFLINE=1 \
  -v $(pwd)/cope-a-9b-merged:/model \
  -p 8000:8000 \
  docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14 \
  vllm serve /model \
      --dtype bfloat16 \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.9
  6. Test the model

Create sample_request.json:

{
  "model": "/model",
  "prompt": "INSTRUCTIONS\n============\n\nExamine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer \"1\" if yes, and \"0\" if no.\n\nPOLICY\n======\n\n# Criteria\n## Overview\nDetect spam content.\n\n## Definition of Labels\n### (SP): Spam\n#### Includes\n- Unsolicited promotional content with urgency tactics\n- Phishing attempts or deceptive links\n#### Excludes\n- Genuine product reviews\n\nCONTENT\n=======\nCLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW!!!\n\nANSWER\n======\n",
  "max_tokens": 1,
  "temperature": 0
}

Run the query:

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d @sample_request.json | python -m json.tool
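
The same request can also be issued from Python. This is a minimal sketch that builds the CoPE-A prompt from content and policy strings using the template from sample_request.json above; the classify() helper is hypothetical and only meant for ad-hoc testing against the local vLLM server.

import json
import urllib.request

# Hypothetical helper for ad-hoc testing: wraps content + policy in the
# CoPE-A prompt format used in sample_request.json and queries the local
# vLLM completions endpoint.
PROMPT_TEMPLATE = (
    'INSTRUCTIONS\n============\n\n'
    'Examine the given POLICY and determine if the given CONTENT meets the '
    'criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.\n\n'
    'POLICY\n======\n\n{policy}\n\n'
    'CONTENT\n=======\n{content}\n\n'
    'ANSWER\n======\n'
)

def classify(content, policy, url='http://localhost:8000/v1/completions'):
    payload = {
        'model': '/model',
        'prompt': PROMPT_TEMPLATE.format(policy=policy, content=content),
        'max_tokens': 1,
        'temperature': 0,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The single generated token should be '1' (violation) or '0'.
    return body['choices'][0]['text'].strip()

print(classify(
    'CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW!!!',
    '# Criteria\n## Overview\nDetect spam content.',
))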

Update on quantization experiments

I attempted to quantize CoPE-A-9B to fit within the 16 GB of VRAM available on our partitioned MI300X GPUs on LiftWing. I've tried two approaches, but both appear to be blocked on the MI210 GPU.

To make progress with quantization and verify that we can host it on LiftWing, we would need access to an MI300X GPU for experimentation. However, the only MI300X GPUs we currently have are deployed as production LiftWing servers, which are not well suited for experimentation work.


Details of the approaches tried:

1. GPTQ INT8 quantization

I performed GPTQ INT8 quantization on ml-lab1002 using the GPTQModel library. The quantization itself was successful, reducing model weights from 17.21 GB to 9.65 GB (44% reduction) - this would comfortably fit in 16 GB VRAM. However, serving the quantized model with vLLM 0.14 on MI210 is blocked due to a three-way dtype incompatibility:

  • vLLM's GPTQ kernels require --dtype float16
  • MI210 (gfx90a) on our vLLM build does not support float16 and forces bfloat16
  • Gemma-2 architecture also blocks float16 due to "numerical instability"

There is no workaround for this conflict in vLLM 0.14 on ROCm. The quantized model is saved on ml-lab1002 in /home/bwojtowicz/cope-a-9b-merged-int8/ and may work on MI300X if the vLLM build there supports float16.

GPTQ quantization reproduction steps

Install dependencies:

source venv/bin/activate
pip install --upgrade pip setuptools wheel
SKIP_ROCM_VERSION_CHECK=1 https_proxy="http://webproxy:8080" \
  pip install -v gptqmodel --no-build-isolation
https_proxy="http://webproxy:8080" pip install datasets

Quantize (takes ~8 hours on a single MI210):

https_proxy="http://webproxy:8080" python -c "
from gptqmodel import GPTQModel, QuantizeConfig
from datasets import load_dataset

calibration_dataset = load_dataset(
    'allenai/c4',
    data_files='en/c4-train.00001-of-01024.json.gz',
    split='train',
).select(range(256))['text']

quant_config = QuantizeConfig(bits=8, group_size=128)

model = GPTQModel.load('./cope-a-9b-merged', quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save('./cope-a-9b-merged-int8')
"

Copy tokenizer files and fix compatibility:

cp ./cope-a-9b-merged/tokenizer* ./cope-a-9b-merged-int8/
cp ./cope-a-9b-merged/special_tokens_map.json ./cope-a-9b-merged-int8/ 2>/dev/null || true

python -c "
import json
with open('./cope-a-9b-merged-int8/tokenizer_config.json', 'r') as f:
    config = json.load(f)
if 'extra_special_tokens' in config and isinstance(config['extra_special_tokens'], list):
    config['extra_special_tokens'] = {t: t for t in config['extra_special_tokens']}
with open('./cope-a-9b-merged-int8/tokenizer_config.json', 'w') as f:
    json.dump(config, f, indent=2)
"

Attempt to serve (fails with dtype error):

docker run --device=/dev/kfd --device=/dev/dri \
  --group-add 105 \
  --group-add video \
  --security-opt seccomp=unconfined \
  --shm-size 16g \
  -e HF_HUB_OFFLINE=1 \
  -v $(pwd)/cope-a-9b-merged-int8:/model \
  -p 8000:8000 \
  docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14 \
  vllm serve /model \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

2. bitsandbytes INT8 quantization

As an alternative, I tried bitsandbytes on-the-fly INT8 quantization (--quantization bitsandbytes --load-format bitsandbytes), which would quantize the BF16 model at load time without an offline quantization step. This is explicitly blocked in vLLM 0.14 on ROCm:

bitsandbytes quantization is currently not supported in rocm.

The error occurs at config validation before any model loading begins.

bitsandbytes reproduction steps

No offline quantization needed - just add the flags to the serve command:

docker run --device=/dev/kfd --device=/dev/dri \
  --group-add 105 \
  --group-add video \
  --security-opt seccomp=unconfined \
  --shm-size 16g \
  -e HF_HUB_OFFLINE=1 \
  -v $(pwd)/cope-a-9b-merged:/model \
  -p 8000:8000 \
  docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14 \
  vllm serve /model \
    --dtype bfloat16 \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Change #1249948 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b.

https://gerrit.wikimedia.org/r/1249948

Change #1250529 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[integration/config@master] inference-services: Split policy violation CI into separate model jobs.

https://gerrit.wikimedia.org/r/1250529

Change #1250529 merged by jenkins-bot:

[integration/config@master] inference-services: Split policy violation CI into separate model jobs.

https://gerrit.wikimedia.org/r/1250529

Mentioned in SAL (#wikimedia-releng) [2026-03-11T11:12:34Z] <hashar> Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1250529 "inference-services: Split policy violation CI into separate model jobs." - T418832

Small update on the progress.

First of all, I was _very_ wrong earlier about the partitioning of our MI300X GPUs and what is available on production LiftWing. Our MI300X GPUs are actually partitioned into 8x24 GB (not 12x16 GB), which means we can run the unquantized CoPE-A-9B model on a single GPU partition on production LiftWing!
Quantization would only be needed to run CoPE-A-9B on staging, where we are limited to 16 GB of VRAM.
Thus, we will deploy straight to production :)

I have developed a model server for the CoPE-A-9B model here: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1249948.
Once we merge this, I will deploy the model to production LiftWing only, and it will be accessible for user testing!

Note that the base CoPE-A-9B model returns just a binary 1/0 answer. Once we deploy the base model, I will open a follow-up patch that extends it to also return a confidence score in the response, derived from the model's last activation layer.
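
For illustration, here is a minimal sketch of how such a confidence score could be derived from the last layer's logits, by renormalising the probabilities of the "0" and "1" answer tokens. This is only my assumption of the shape of the follow-up patch, not its actual code, and it assumes "0" and "1" each encode to a single token in the Gemma-2 tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: turn the last layer's next-token logits into a confidence score by
# renormalising over the two possible answer tokens, '0' and '1'.
tokenizer = AutoTokenizer.from_pretrained('./cope-a-9b-merged')
model = AutoModelForCausalLM.from_pretrained(
    './cope-a-9b-merged', torch_dtype=torch.bfloat16, device_map='cpu'
)

def classify_with_confidence(prompt):
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Assumption: '0' and '1' each map to a single token id.
    zero_id = tokenizer.encode('0', add_special_tokens=False)[-1]
    one_id = tokenizer.encode('1', add_special_tokens=False)[-1]
    probs = torch.softmax(next_token_logits[[zero_id, one_id]].float(), dim=-1)
    violation = int(torch.argmax(probs))
    return {'violation': violation, 'confidence': round(float(probs[violation]), 4)}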

Change #1249948 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b.

https://gerrit.wikimedia.org/r/1249948

Change #1251272 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Add CoPE-A-9B experimental deployment.

https://gerrit.wikimedia.org/r/1251272

Change #1251272 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Add CoPE-A-9B experimental deployment.

https://gerrit.wikimedia.org/r/1251272

The CoPE-A-9B model is now deployed on LiftWing.

One can query the endpoint as shown below. The current input format was developed based on the Hugging Face reference, where the user can pass content and policy parameters.

curl -s -X POST \
  "https://inference.svc.eqiad.wmnet:30443/v1/models/cope-a-9b:predict" \
  -H "Host: cope-a-9b.experimental.wikimedia.org" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW!!!",
    "policy": "Content must not contain spam, phishing attempts, or deceptive links."
  }'

The current output structure returns a binary 1/0 response indicating whether the content violates the policy.

{"violation":1}

In the next steps, I will extend the model server to also return a confidence score.

After deployment, the CoPE-A-9B model server was successfully processing small requests of fewer than 500 input tokens.

However, when testing the model server with example policies from the GitLab project of close to 8,000 tokens, the server froze: it stopped responding and stopped producing any logs, without crashing. We have to investigate the root cause of this freezing behaviour and fix it.

Change #1253480 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B.

https://gerrit.wikimedia.org/r/1253480

Change #1253480 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B.

https://gerrit.wikimedia.org/r/1253480

After lowering the maximum input token length to 4096, we seem to be able to process all incoming requests. I will look into optimizations that would allow larger input lengths, but the current 4096-token limit should already be good enough for testing our policies.

Change #1254119 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] policy-violation: Extend CoPE server to return confidence scores.

https://gerrit.wikimedia.org/r/1254119

Change #1271577 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: trigger policy-violation image builds on shared requirements.txt changes

https://gerrit.wikimedia.org/r/1271577