
Explore gpt-oss-safeguard-20b
Closed, Resolved · Public

Description

In an upcoming request, we will be interested in deploying LLM-based content filtering and tagging to reduce the burden on moderators, both by reducing the amount of noise entering the wikis and by providing more focused signals that let moderators handle abuse quickly.

As preliminary work before taking up this initiative, we want to explore the performance of gpt-oss-safeguard-20b in this context. This will give us more information with which to scope the initiative.

Event Timeline

What have I done so far

Deployed and tested OpenAI's open-weight safety classification model (gpt-oss-safeguard-20b) on ml-lab1002 using an AMD Instinct MI210 GPU with our custom vLLM 0.14 Docker image.

Some early gathered information on gpt-oss-safeguard-20b:

  • Model: MoE architecture, 21B total / 3.6B active params, Apache 2.0 license
  • Uses 14.3 GiB VRAM in BF16 (mxfp4 quantization auto-detected)
  • Serves via OpenAI-compatible /v1/chat/completions API with custom "bring-your-own-policy" system prompts

So far I have tested it with very small requests of ~200 prompt tokens, which resulted in ~100 completion tokens. The average latency for such requests was ~1s.

How to reproduce

  1. Download the model via Hugging Face
python3 -m venv venv
source venv/bin/activate
https_proxy="http://webproxy:8080" pip install huggingface_hub
https_proxy="http://webproxy:8080" python -c \
  "from huggingface_hub import snapshot_download; snapshot_download('openai/gpt-oss-safeguard-20b')"
  2. Download tiktoken encodings
mkdir -p ~/tiktoken_encodings
https_proxy="http://webproxy:8080" wget -O ~/tiktoken_encodings/o200k_base.tiktoken \
  "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
  3. Get the snapshot hash of the model
SNAPSHOT_HASH=$(ls ~/.cache/huggingface/hub/models--openai--gpt-oss-safeguard-20b/snapshots/)
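Alternatively, since huggingface_hub's snapshot_download returns the local snapshot path, the hash can be captured directly in Python. A small optional sketch (run inside the venv from step 1, with the same https_proxy set if the model is not yet cached):

from huggingface_hub import snapshot_download

# snapshot_download is idempotent: if the model is already cached it simply
# returns the local snapshot path, whose last component is the hash.
path = snapshot_download("openai/gpt-oss-safeguard-20b")
snapshot_hash = path.rstrip("/").split("/")[-1]
print(snapshot_hash)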
  4. Start the vLLM server
docker run --device=/dev/kfd --device=/dev/dri \
  --group-add 105 --group-add video \
  --security-opt seccomp=unconfined \
  --shm-size 16g \
  -e HF_HUB_OFFLINE=1 \
  -e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
  -v ~/.cache/huggingface/hub/models--openai--gpt-oss-safeguard-20b:/model-cache \
  -v ~/tiktoken_encodings:/tiktoken_encodings \
  -p 8000:8000 \
  docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14 \
  vllm serve /model-cache/snapshots/$SNAPSHOT_HASH \
      --dtype bfloat16 --max-model-len 8192 --gpu-memory-utilization 0.9
  5. Create sample_request.json. Make sure the model snapshot in the request matches your SNAPSHOT_HASH!
{
  "model": "/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574",
  "messages": [
    {
      "role": "system",
      "content": "You are a content safety classifier.\nReasoning: medium\n\n## Spam Policy\n\n### INSTRUCTIONS\nClassify the following content as spam or not spam. Return JSON: {\"violation\": 0 or 1, \"rationale\": \"brief explanation\"}\n0 = not spam, 1 = spam\n\n### VIOLATES (1)\n- Unsolicited promotional content\n- Repetitive messages\n- Phishing attempts\n- Deceptive links\n\n### SAFE (0)\n- Genuine questions or discussions\n- Relevant information sharing\n- Normal conversation\n\nContent: [INPUT]\nAnswer:"
    },
    {
      "role": "user",
      "content": "CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW before offer expires!!!"
    }
  ]
}
  6. Send the request to the server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @sample_request.json
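The same request can also be sent from Python, which makes it easier to time calls and pull the model's verdict out of the response. A minimal sketch, assuming the server from step 4 is listening on localhost:8000 and the requests package is installed; parsing the completion as JSON only works because the policy prompt explicitly asks for JSON output, so a fallback is included:

import json
import time
import requests

with open("sample_request.json") as f:
    payload = json.load(f)

start = time.time()
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(f"latency: {time.time() - start:.2f}s")

# The spam policy asks the model to answer with JSON like
# {"violation": 0 or 1, "rationale": "..."}; parse it back out,
# falling back to the raw text if the model deviates from the format.
content = resp.json()["choices"][0]["message"]["content"]
try:
    verdict = json.loads(content)
    print("violation:", verdict["violation"])
    print("rationale:", verdict["rationale"])
except (json.JSONDecodeError, KeyError):
    print("unparsed output:", content)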

Next Steps

  1. We should build some early intuition about the model's behaviour by collecting a small set of real oversighted/spam/vandalism content and running it through the model along with appropriate policies. We should check whether the model catches these cases and how confident its predictions are (a minimal sketch of such a check follows this list).
  2. Start working on deploying gpt-oss-safeguard-20b on LiftWing to assess its performance in real conditions. We should also answer the question of how our end users would access the model and how we can make sure that only specific users have access to it.
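For item 1, a minimal sketch of what such a spot check could look like, assuming the vLLM server from the steps above is still running on localhost:8000 and that requests is installed. The sample texts, labels, and spam_policy.txt file are hypothetical placeholders for real oversighted/spam/vandalism content and a real policy:

import json
import os
import requests

MODEL = "/model-cache/snapshots/" + os.environ["SNAPSHOT_HASH"]
POLICY = open("spam_policy.txt").read()  # hypothetical policy file

# Hypothetical labeled examples; real samples would come from the wikis.
SAMPLES = [
    {"text": "CLICK HERE TO WIN $10000!!! http://totallylegit.biz", "label": 1},
    {"text": "Which sources support the population figure?", "label": 0},
]

correct = 0
for sample in SAMPLES:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": POLICY},
                {"role": "user", "content": sample["text"]},
            ],
        },
        timeout=120,
    )
    content = resp.json()["choices"][0]["message"]["content"]
    # Assumes the policy asks for JSON output, as in sample_request.json.
    predicted = json.loads(content)["violation"]
    correct += int(predicted == sample["label"])

print(f"accuracy: {correct}/{len(SAMPLES)}")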

Thank you for kicking off this exploration, @BWojtowicz-WMF.

> We should also answer the question of how our end users would access the model and how we can make sure that only specific users have access to it.

We usually achieve this by providing a LiftWing internal endpoint that can only be accessed by tools that run within the WMF infrastructure. A recent example of this is the embeddings isvc as shown in T412338#11494832 and T412338#11480803.

> Start working on deploying gpt-oss-safeguard-20b on LiftWing to assess its performance in real conditions.

As we work towards a LiftWing deployment, below are preliminary results that can help us understand how the gpt-oss-safeguard-20b model performs with different input and output token sizes. Similar to T385173#10737743, I used ROCm's Model Automation and Dashboarding (MAD) framework to run latency benchmarking, following the steps below, customized for the downloaded gpt-oss-safeguard-20b model running in the vLLM 0.14 container on ml-lab1002.

1. Run the container interactively with GPU access, the WMF proxy set, and the downloaded model mounted:

$ docker run --network=host -it \
--user root \
-e http_proxy=http://webproxy.eqiad.wmnet:8080 \
-e https_proxy=http://webproxy.eqiad.wmnet:8080 \
-e SNAPSHOT_HASH=$SNAPSHOT_HASH \
-e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
--shm-size 16g \
-v ~/.cache/huggingface/hub/models--openai--gpt-oss-safeguard-20b:/model-cache \
-v /home/kevinbazira/gpt-oss-safeguard/tiktoken_encodings:/tiktoken_encodings \
--entrypoint=/bin/bash \
docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14

2. Install git, vim, and curl in the container:

$ apt-get update && apt-get install -y git vim curl

3. Confirm this vLLM environment can serve the model before we run benchmarks:

$ vllm serve /model-cache/snapshots/$SNAPSHOT_HASH \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
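A quick way to confirm readiness from another shell in the container is to poll the OpenAI-compatible /v1/models endpoint until it responds. A small sketch, assuming the default vLLM port 8000 and that requests is installed:

import time
import requests

# Poll /v1/models until the server answers; vLLM lists the served model ids.
for _ in range(60):
    try:
        r = requests.get("http://localhost:8000/v1/models", timeout=5)
        if r.ok:
            print([m["id"] for m in r.json()["data"]])
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise RuntimeError("vLLM server did not become ready in time")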

4. Clone and set up MAD, specifically targeting the vLLM benchmarking tool to run standalone benchmarks:

$ git clone https://github.com/ROCm/MAD.git
$ cd MAD
$ pip install -r requirements.txt
$ cd scripts/vllm

5. Create a custom MAD config file that covers the desired input and output token sizes for latency benchmarking. These options should be able to run in this environment, as shown in step 3:

$ echo '# custom latency configs for the downloaded gpt-oss-safeguard-20b model
- benchmark: latency
  model: "/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574"
  tp: 1
  inp: "1 64 128 256 512 1024 2048"
  out: "1 64 128 256 512 1024 2048"
  bs: "1"
  dtype: auto
  extra_args:
    gpu-memory-utilization: 0.9
    max-model-len: 8192
' > configs/custom_latency.yaml
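For reference, MAD appears to sweep the inp and out lists pairwise: the config above yields 7 × 7 = 49 latency runs at batch size 1, which matches the 49 result files listed in step 7. A tiny sketch of the sweep (the file-name pattern is inferred from that listing, not documented behaviour):

from itertools import product

inp_sizes = [1, 64, 128, 256, 512, 1024, 2048]
out_sizes = [1, 64, 128, 256, 512, 1024, 2048]

# One latency run per (input_len, output_len) pair at batch size 1;
# results appear to land in <hash>_latency_<tp>_<inp>_<out>_<bs>.json.
runs = list(product(inp_sizes, out_sizes))
print(len(runs))  # 49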

6. Run the latency benchmark with the custom config file created above:

$ export MAD_DATAHOME="/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574"
$ ./run.sh \
  --model_repo "/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574" \
  --config configs/custom_latency.yaml \
  --benchmark latency

7. After the benchmarking tool has completed (about 1.5 hours), it will have created the .json files and the .csv shown below:

$ ls
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_1024_1.json  8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_64_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_128_1.json   8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_1024_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_1_1.json     8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_128_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_2048_1.json  8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_1_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_256_1.json   8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_2048_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_512_1.json   8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_256_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1024_64_1.json    8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_512_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_1024_1.json   8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_256_64_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_128_1.json    8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_1024_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_1_1.json      8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_128_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_2048_1.json   8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_1_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_256_1.json    8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_2048_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_512_1.json    8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_256_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_128_64_1.json     8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_512_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_1024_1.json     8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_512_64_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_128_1.json      8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_1024_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_1_1.json        8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_128_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_2048_1.json     8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_1_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_256_1.json      8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_2048_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_512_1.json      8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_256_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_1_64_1.json       8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_512_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_1024_1.json  8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_64_64_1.json
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_128_1.json   configs
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_1_1.json     run.sh
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_2048_1.json  run_vllm.py
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_256_1.json   vllm_json_to_csv.py
8a11e17b25c973a24099d4016bf2e17dd7ec1574_latency_1_2048_512_1.json

$ ls ../
dummy             megatron-lm                                        primus                     pyt_hy_video             pyt_wan2.1_inference  sglang_disagg
huggingface_bert  mochi                                              pyt_chai1_inference        pyt_janus_pro_inference  pyt_xdit              vllm
huggingface_gpt2  ncf                                                pyt_clip_inference         pyt_mpt30b_training      pytorch_train
jax-maxtext       perf_8a11e17b25c973a24099d4016bf2e17dd7ec1574.csv  pyt_huggingface_diffusers  pyt_semianalysis_models  sglang

8. The perf_8a11e17b25c973a24099d4016bf2e17dd7ec1574.csv file is what we are interested in; we will analyze it with the custom visualize_latency.py script below:

1"""
2This script analyzes an LLM latency benchmarking report produced by the ROCm MAD framework
3(2026 results CSV format) and generates a bar chart (latency_bar_chart.png) that shows the
4relationship between latency (in ms) and token sizes (both input_len and output_len)for
5batch size = 1.
6"""
7
8import pandas as pd
9import matplotlib.pyplot as plt
10import seaborn as sns
11import os
12
13# Load the CSV into a DataFrame.
14csv_filename = "perf_8a11e17b25c973a24099d4016bf2e17dd7ec1574.csv"
15df = pd.read_csv(csv_filename)
16
17# Ensure output directory exists
18output_dir = "charts"
19os.makedirs(output_dir, exist_ok=True)
20
21
22# Filter relevant rows
23df_filtered = df[
24 (df["benchmark"] == "latency")
25 & (df["metric"] == "latency")
26 & (df["unit"] == "ms")
27 & (df["bs"] == 1)
28].copy()
29
30
31# Rename columns for clarity
32df_filtered = df_filtered.rename(
33 columns={
34 "inp": "input_len",
35 "out": "output_len",
36 "performance": "latency_ms",
37 }
38)
39
40# Convert seconds to milliseconds (MAD currently mislabels unit)
41df_filtered["latency_ms"] = pd.to_numeric(df_filtered["latency_ms"]) * 1000
42
43# Create a label that combines input_len and output_len.
44df_filtered["scenario"] = df_filtered.apply(
45 lambda row: f"{row['input_len']} / {row['output_len']}", axis=1
46)
47
48# Sort for nicer plots
49df_filtered = df_filtered.sort_values(
50 by=["input_len", "output_len"]
51)
52
53# Color handling
54num_scenarios = df_filtered["scenario"].nunique()
55cmap = plt.colormaps["YlGnBu"]
56
57# Avoid division by zero
58denom = max(num_scenarios - 1, 1)
59colors = [cmap(i / denom) for i in range(num_scenarios)]
60
61# Bar Plot
62fig, ax = plt.subplots(figsize=(10, 6))
63
64sns.barplot(
65 x="scenario",
66 y="latency_ms",
67 data=df_filtered,
68 hue="scenario",
69 dodge=False,
70 palette=colors,
71 legend=False,
72 ax=ax,
73)
74
75plt.xlabel("Input / Output Length")
76plt.ylabel("Latency (ms)")
77
78plt.title(
79 "LLM Inference Latency vs Input/Output Length",
80 fontsize=12,
81 fontweight="bold",
82 pad=20,
83)
84
85# Optional subtitle. Edit as needed
86fig.text(
87 0.55,
88 0.93,
89 "(model: gpt-oss-safeguard-20b, batch: 1, dtype: auto, unquantized, gpu: mi210 x 1)",
90 fontsize=10,
91 ha="center",
92 va="center",
93)
94
95plt.grid(axis="y", linestyle="--", alpha=0.7)
96plt.xticks(rotation=90, ha="center")
97
98# Y-axis padding
99max_latency = df_filtered["latency_ms"].max()
100plt.ylim(0, max_latency * 1.2)
101
102# Annotate bars
103for bar in ax.patches:
104 height = bar.get_height()
105 ax.annotate(
106 f"{height:.2f}",
107 xy=(bar.get_x() + bar.get_width() / 2, height),
108 xytext=(0, 3),
109 textcoords="offset points",
110 ha="center",
111 va="bottom",
112 fontsize=8,
113 rotation=90,
114 )
115
116plt.tight_layout()
117plt.savefig(f"{output_dir}/latency_bar_chart.png", dpi=300)
118print(f"Saved chart: {output_dir}/latency_bar_chart.png")
119plt.close()

To produce the bar chart, run:

$ pip install pandas matplotlib seaborn
$ python visualize_latency.py

9. LLM inference latency bar chart:

latency_bar_chart.png (1×3 px, 387 KB)

OpenAI partnered with ROOST to create a model-community to support trust & safety practitioners as they use the gpt-oss-safeguard model. I have explored ROOST's gpt model-server on ml-lab1002 as shown below:

1. Run the vLLM 0.14 container interactively with GPU access, the WMF proxy set, and the downloaded model mounted:

$ docker run --network=host -it \
--user root \
-e http_proxy=http://webproxy.eqiad.wmnet:8080 \
-e https_proxy=http://webproxy.eqiad.wmnet:8080 \
-e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
--shm-size 16g \
-v ~/.cache/huggingface/hub/models--openai--gpt-oss-safeguard-20b:/model-cache \
-v /home/kevinbazira/gpt-oss-safeguard/tiktoken_encodings:/tiktoken_encodings \
--entrypoint=/bin/bash \
docker-registry.wikimedia.org/ml/amd-vllm014:gfx90agfx942rocm7.0.0pytorch2.10.0mori0.1flash-attn2.8.3aiter0.1.7vllm0.14

2. Prepare the ROOST model-server.
I had to replace the default upstream server.py implementation, as it causes a multiprocessing crash (detailed in P88981).

$ apt-get update && apt-get install -y git vim curl
$ git clone https://github.com/roostorg/model-community.git
$ cd model-community/gpt
$ # pip install -r requirements.txt # not required since the container already has these deps installed: vllm, openai-harmony, fastapi, uvicorn, pydantic
$ rm server.py # remove the current implementation as it causes this error: https://phabricator.wikimedia.org/P88981 
$ echo 'paste content from: https://phabricator.wikimedia.org/P88981#358539' > server.py
$ python3 server.py

3. Test ROOST's inference client:
From a separate terminal, I executed the client script against the running server. The model successfully reasoned through the prompt and returned a verdict.

$ docker exec -it e681b01c0a46 /bin/bash
$ cd model-community/gpt
$ time python3 client.py
Reasoning: [TextContent(text='We need to check policy. The user is asking for instructions to produce cocaine. They want to know chemicals needed to make cocaine so they can remove them. That is a request for a manufacturing guide or details to produce illegal drug. The policy states that providing such instructions is disallowed under "Content Enabling Drug Distribution" and "Content About Manufacturing" etc. This is a request for drug manufacture. It\'s a request for chemical list needed to produce cocaine. That is a request for instructions to produce an illegal drug. This is disallowed. Therefore respond with "Violating".')]
Verdict: [TextContent(text='Violating')]

real	0m1.793s
user	0m0.086s
sys	0m0.017s

Since gpt-oss-safeguard relies entirely on a provided policy to evaluate content, we will need the PSI team to provide a sample policy based on Wikipedia's guidelines so that we can build a prototype they can test and confirm meets their requirements.
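In the meantime, a hedged sketch of what such a policy-driven request could look like, mirroring the structure of sample_request.json above. The vandalism policy text here is a hypothetical placeholder, not an actual Wikipedia policy:

import json

# Hypothetical placeholder until the PSI team provides a real policy.
WIKI_POLICY = """You are a content safety classifier.
Reasoning: medium

## Vandalism Policy

### INSTRUCTIONS
Classify the following edit as vandalism or not. Return JSON:
{"violation": 0 or 1, "rationale": "brief explanation"}

### VIOLATES (1)
- Deliberate introduction of false information
- Page blanking or nonsense insertions

### SAFE (0)
- Good-faith edits, even if imperfect

Content: [INPUT]
Answer:"""

request = {
    "model": "/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574",
    "messages": [
        {"role": "system", "content": WIKI_POLICY},
        {"role": "user", "content": "Replaced the whole article with 'lol'."},
    ],
}

with open("wiki_policy_request.json", "w") as f:
    json.dump(request, f, indent=2)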

Performance tuning recommendations to note from the OpenAI docs (a small sketch applying them follows this list):

  • Through harmony, you can control how deeply oss-safeguard reasons by setting the reasoning_effort parameter in your system message to low, medium, or high. The model uses medium by default if it is not set. Higher reasoning effort allows oss-safeguard to consider more factors, trace through multiple policy sections, and handle complex interactions between rules. Lower effort provides faster responses for straightforward classifications.
  • Use https://platform.openai.com/tokenizer to determine the length of your prompt. gpt-oss-safeguard can produce reasonable output with policies of ~10,000 tokens, but early testing suggests the optimal range is 400–600 tokens.
  • Since the model uses reasoning, you should leave plenty of room for output tokens and ideally not cap the maximum output tokens, to give the model enough room to reason through the policies. If you want to limit reasoning time, consider setting the reasoning effort to low instead.
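A small sketch applying these notes: the same request is sent at low and high reasoning effort (set via the Reasoning line in the system message, per the first bullet) with no max_tokens cap, and the latency and completion-token usage are compared. The shortened policy text is a placeholder, and the local vLLM server from earlier is assumed:

import time
import requests

# Shortened placeholder policy; {{...}} escapes literal braces for .format().
POLICY = (
    "You are a content safety classifier.\n"
    "Reasoning: {effort}\n\n"
    "Classify the content as spam (1) or not spam (0). "
    'Return JSON: {{"violation": 0 or 1, "rationale": "brief explanation"}}'
)

for effort in ("low", "high"):
    start = time.time()
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "/model-cache/snapshots/8a11e17b25c973a24099d4016bf2e17dd7ec1574",
            "messages": [
                {"role": "system", "content": POLICY.format(effort=effort)},
                {"role": "user", "content": "WIN A FREE IPHONE!!! http://x.biz"},
            ],
            # Deliberately no max_tokens cap: leave room for reasoning.
        },
        timeout=300,
    )
    usage = resp.json()["usage"]
    print(effort, f"{time.time() - start:.1f}s,",
          usage["completion_tokens"], "completion tokens")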

Following yesterday's discussion between the ML and PSI teams, we wanted to understand more about this model's multilingual support. I have done some digging and found multilingual performance metrics in OpenAI's gpt-oss-safeguard technical report. Below is a screenshot of the relevant metrics:

multilingual performance (Screenshot from 2026-02-24 10-51-28).png (917×1 px, 335 KB)

As we can see in the screenshot, the multilingual evaluations did not directly assess performance on content classification with a provided policy. To better understand how this model performs in a given language, we might have to run our own benchmarks with real Wikipedia policies in the language(s) we plan to support.

Just an FYI that we're going to test this model for T414816: [WE1.7.3] Exploration of automated verifiability checks. Very glad you all are exploring this as well, as the more data points the better. If you have any requests as far as settings etc., let me know; we'll be using 3rd-party APIs for the testing, so we don't have to worry about implementation details just yet. English only for now, unfortunately, on our dataset.

Closing this task as the exploration phase is complete. The key outcomes from this task:

  • Deployed and tested gpt-oss-safeguard-20b on ml-lab1002 with our custom vLLM 0.14 Docker image
  • Gathered early performance data and ran latency benchmarks across various input/output token sizes using ROCm's MAD framework
  • Investigated multilingual support limitations from OpenAI's technical report
  • Explored Roost's model-server and identified the multiprocessing crash issue

All of this fed directly into T418350, where the model is now deployed in LiftWing's experimental namespace and being tested by users. Remaining work (validation, load testing, production deployment) is tracked there.