Deploy gpt-oss-safeguard-20b on LiftWing
Closed, Resolved, Public

Description

Step 1 - Model Artifact Storage:

  • Confirm licensing allows redistribution (model is marketed as open-weights).
  • Store the model weights and tiktoken encodings in S3; they are available under s3://wmf-ml-models/gpt-oss-safeguard-20b/. We skipped publishing to analytics because this is not our own model, as we also did with the aya models.

Step 2 - Integrate Prototype into KServe:

Step 3 - Validate Prototype:

  • Test the prototype locally to confirm it works as expected.
  • Involve test users at this stage so decisions can be made early enough.

Step 4 - Build Production Model-Server:

  • Build the production model-server with support for the custom policies.
  • Ensure it accepts the expected input, runs preprocessing, and returns the expected output.

Step 5 - Publish to Wikimedia Docker Registry:

  • Dockerize the production model-server.
  • Set up CI/CD to publish it to the Wikimedia Docker registry.

Step 6 - Deploy to LiftWing Staging (Experimental):

  • Deploy the model-server in the LiftWing experimental namespace.
  • Ensure it loads the model from S3, accepts the expected input format, runs preprocessing, and returns the expected output.
  • Enable test users to test the model-server via a LiftWing endpoint.

Step 7 - Validate on LiftWing Staging:

  • Using the LW experimental namespace endpoint, validate the production model-server to confirm it works as expected.
  • Iterate with Step 4 as needed.

Step 8 - Load Testing on LiftWing Staging:

  • Run load tests on the production model-server hosted in LW staging to confirm it meets performance requirements.
  • Iterate with Step 4 as needed.
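
As a rough illustration of the metrics such a load test reports (throughput, median and p90 latency, failure rate), here is a minimal Python sketch. The function names, the input shape, and the nearest-rank percentile method are assumptions made for illustration; this task does not specify the load-testing tool or how it computes percentiles.

```python
from statistics import median


def percentile_nearest_rank(sorted_values, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method."""
    if not sorted_values:
        raise ValueError("no samples")
    # ceil(pct * n / 100), computed with integer arithmetic to avoid float rounding
    rank = -(-pct * len(sorted_values) // 100)
    return sorted_values[max(rank, 1) - 1]


def summarize_load_test(latencies_ms, failures, duration_s):
    """Summarize one load-test run.

    latencies_ms: latency of every request sent, in milliseconds
    failures:     number of requests that failed
    duration_s:   wall-clock duration of the run, in seconds
    """
    total = len(latencies_ms)
    ordered = sorted(latencies_ms)
    return {
        "total_requests": total,
        "requests_per_s": total / duration_s,
        "median_ms": median(ordered),
        "p90_ms": percentile_nearest_rank(ordered, 90),
        "failure_rate": failures / total,
    }
```

A summary like this could be compared against the agreed performance requirements before moving on to Step 9.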

Step 9 - Deploy to Production:

  • Deploy the model-server in the LiftWing production namespace to provide an internal production endpoint for wider use.

Step 10 - Documentation:

  • Document how the inference service hosted on LiftWing can be accessed via an internal endpoint.
  • Share documentation with consuming teams.

Step 11 - Support & Maintenance:

  • Iterate through previous steps as needed based on the optimization required.
  • Provide ongoing support for the inference service to address issues, improvements, and optimizations.

Details

Other Assignee
BWojtowicz-WMF
Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +7 -0
operations/deployment-charts | master | +97 -0
integration/config | master | +2 -0
operations/deployment-charts | master | +13 -4
operations/deployment-charts | master | +16 -3
machinelearning/liftwing/inference-services | main | +1 -1
operations/deployment-charts | master | +8 -0
operations/deployment-charts | master | +10 -8
machinelearning/liftwing/inference-services | main | +7 -0
operations/deployment-charts | master | +3 -3
machinelearning/liftwing/inference-services | main | +5 -0
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +6 -6
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +6 -0
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +21 -9
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +5 -0
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +0 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +4 -0
operations/deployment-charts | master | +11 -5
machinelearning/liftwing/inference-services | main | +9 -1
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +5 -0
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +6 -0
operations/deployment-charts | master | +4 -2
machinelearning/liftwing/inference-services | main | +7 -1
operations/deployment-charts | master | +48 -26
operations/deployment-charts | master | +2 -4
operations/deployment-charts | master | +9 -15
machinelearning/liftwing/inference-services | main | +123 -4
integration/config | master | +10 -0
machinelearning/liftwing/inference-services | main | +238 -0

Event Timeline

Change #1254273 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations.

https://gerrit.wikimedia.org/r/1254273

Change #1254719 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: remove fuse_rope_kvcache config from gpt model-server

https://gerrit.wikimedia.org/r/1254719

Change #1254719 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: remove fuse_rope_kvcache config from gpt model-server

https://gerrit.wikimedia.org/r/1254719

Change #1254856 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877

https://gerrit.wikimedia.org/r/1254856

Change #1254856 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877

https://gerrit.wikimedia.org/r/1254856

Change #1254914 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server

https://gerrit.wikimedia.org/r/1254914

Change #1254914 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server

https://gerrit.wikimedia.org/r/1254914

Change #1254933 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag

https://gerrit.wikimedia.org/r/1254933

Change #1254933 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag

https://gerrit.wikimedia.org/r/1254933

Change #1254958 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: enable concurrent request handling in gpt model-server

https://gerrit.wikimedia.org/r/1254958

Change #1254958 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: enable concurrent request handling in gpt model-server

https://gerrit.wikimedia.org/r/1254958

Change #1254967 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports concurrent request handling

https://gerrit.wikimedia.org/r/1254967

Change #1254967 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports concurrent request handling

https://gerrit.wikimedia.org/r/1254967

Change #1256062 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable max_num_seqs flag for gpt model-server

https://gerrit.wikimedia.org/r/1256062

Change #1256062 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable max_num_seqs flag for gpt model-server

https://gerrit.wikimedia.org/r/1256062

Change #1256273 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable max_num_seqs

https://gerrit.wikimedia.org/r/1256273

Change #1256273 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable max_num_seqs

https://gerrit.wikimedia.org/r/1256273

Change #1256363 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency

https://gerrit.wikimedia.org/r/1256363

Change #1256363 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency

https://gerrit.wikimedia.org/r/1256363

Change #1258647 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment

https://gerrit.wikimedia.org/r/1258647

Change #1258647 merged by Dpogorzelski:

[operations/deployment-charts@master] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment

https://gerrit.wikimedia.org/r/1258647

Change #1266023 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1266023

Change #1266023 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1266023

Change #1266154 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable tensor_parallel_size flag to gpt model-server

https://gerrit.wikimedia.org/r/1266154

Change #1266154 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable tensor_parallel_size flag to gpt model-server

https://gerrit.wikimedia.org/r/1266154

Change #1266195 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag

https://gerrit.wikimedia.org/r/1266195

Change #1266195 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag

https://gerrit.wikimedia.org/r/1266195

Change #1266857 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server

https://gerrit.wikimedia.org/r/1266857

Change #1266857 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server

https://gerrit.wikimedia.org/r/1266857
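
The flags added in the changes above (max_num_batched_tokens, max_num_seqs, tensor_parallel_size, disable_custom_all_reduce) share their names with vLLM engine arguments. Below is a minimal sketch of how a model-server might make such flags configurable through the environment. The environment variable names, the helper functions, and the "unset means keep the engine default" behaviour are hypothetical and are not taken from the actual model-server code.

```python
import os


def _env_int(env, name):
    """Parse an integer variable; return None when it is unset or empty."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else None


def _env_bool(env, name):
    """Parse a boolean variable; return None when it is unset or empty."""
    raw = env.get(name)
    if raw in (None, ""):
        return None
    return raw.strip().lower() in ("1", "true", "yes")


def engine_kwargs_from_env(env=None):
    """Collect engine keyword arguments from (hypothetical) environment variables.

    Only variables that are actually set end up in the result, so anything
    left unset keeps the engine's own default.
    """
    env = os.environ if env is None else env
    candidates = {
        "max_num_batched_tokens": _env_int(env, "MAX_NUM_BATCHED_TOKENS"),
        "max_num_seqs": _env_int(env, "MAX_NUM_SEQS"),
        "tensor_parallel_size": _env_int(env, "TENSOR_PARALLEL_SIZE"),
        "disable_custom_all_reduce": _env_bool(env, "DISABLE_CUSTOM_ALL_REDUCE"),
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

Keeping these values in the environment would let the deployment chart tune batching and parallelism without rebuilding the image, which matches the pattern of paired model-server and deployment-charts changes in this timeline.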

Change #1266905 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1266905

Change #1266905 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1266905

Change #1268445 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc

https://gerrit.wikimedia.org/r/1268445

Change #1268445 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc

https://gerrit.wikimedia.org/r/1268445

Just quickly chiming in to add another use-case for gpt-oss-safeguard-20b. I'll be reporting in more detail shortly, but we've found it to be quite effective in T414816: [WE1.7.3] Exploration of automated verifiability checks. Namely, I ran our dataset of 119 claims+sources through the model; average latency was <3 seconds and performance was quite high. A representative example of the prompt can be found in P90328, and working code is here: https://gitlab.wikimedia.org/repos/research/source-verification/-/blob/main/notebooks/02b_%5Bround_2%5D_liftwing_pipeline.ipynb?ref_type=heads

Thanks to @kevinbazira for getting me set up via some example code in P90327!

Change #1271480 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] policy-violation: upgrade model-server to use kserve v0.17

https://gerrit.wikimedia.org/r/1271480

Change #1271480 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] policy-violation: upgrade model-server to use kserve v0.17

https://gerrit.wikimedia.org/r/1271480

Change #1271577 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: trigger policy-violation image builds on shared requirements.txt changes

https://gerrit.wikimedia.org/r/1271577

Thank you Isaac for sharing another use-case for the gpt-oss-safeguard-20b isvc. We are working on optimizing its inference speed, and we'll keep you posted on improvements that could benefit your use-case as they roll out.

Change #1276643 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1276643

Change #1276643 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance

https://gerrit.wikimedia.org/r/1276643

Below is a consolidated report of the optimizations we have implemented and their corresponding performance results for the gpt-oss-safeguard-20b isvc that is running on MI300x GPUs in the experimental namespace:

Inter-GPU communication | GPU(s) | Requests/s (Throughput) | Median Latency | p90 | Failure Rate | Total Requests | Report
- | 1 | 13.58 | 4000ms | 5200ms | 0.00% | 1617 | P89906#L50
SHM | 2 | 15.24 | 3300ms | 4300ms | 0.00% | 1816 | P89906#L88
SHM | 4 | 16.59 | 2800ms | 3600ms | 0.00% | 1976 | P89906#L102
P2P | 2 | 16.34 | 2900ms | 3700ms | 0.00% | 1946 | P89906#L109
P2P | 4 | 16.37 | 2900ms | 3700ms | 0.00% | 1950 | P89906#L116
P2P + SHM | 2 | 16.53 | 2800ms | 3600ms | 0.00% | 1968 | P89906#L130
P2P + SHM | 4 | 16.17 | 2900ms | 3700ms | 0.00% | 1927 | P89906#L137

Details of the load tests for each optimization can be found in: P89906

We observed performance improvements as we moved from a single-GPU setup (4000ms median latency at 13.58 RPS) to multi-GPU setups with SHM (T421105, bbf0196f69fc, 80d4a32c1160) and P2P (T421461, 701db6424673). Overall, enabling P2P communication showed an improvement over SHM on the 2-GPU setup, reducing median latency from 3300ms to 2900ms and improving throughput from 15.24 RPS to 16.34 RPS. The best 2-GPU results came from combining P2P with SHM, which brought median latency down further to 2800ms and raised throughput to 16.53 RPS. With just 2 GPUs on P2P + SHM, we now get roughly the same performance that previously required 4 GPUs on SHM.
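
To make the comparison above easier to check, here is a small Python sketch that computes relative changes between configurations. The throughput and median-latency figures are copied from the report table; the dictionary layout and function names are only illustrative.

```python
# (throughput in requests/s, median latency in ms), copied from the report table.
# Keys are (inter-GPU communication, number of GPUs).
RESULTS = {
    ("none", 1): (13.58, 4000),
    ("SHM", 2): (15.24, 3300),
    ("SHM", 4): (16.59, 2800),
    ("P2P", 2): (16.34, 2900),
    ("P2P", 4): (16.37, 2900),
    ("P2P + SHM", 2): (16.53, 2800),
    ("P2P + SHM", 4): (16.17, 2900),
}


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


def compare(baseline, candidate):
    """Compare two configurations; negative latency change means faster."""
    (rps_a, lat_a), (rps_b, lat_b) = RESULTS[baseline], RESULTS[candidate]
    return {
        "throughput_pct": round(pct_change(rps_a, rps_b), 1),
        "median_latency_pct": round(pct_change(lat_a, lat_b), 1),
    }
```

For instance, `compare(("none", 1), ("P2P + SHM", 2))` gives roughly +21.7% throughput and -30% median latency, while `compare(("SHM", 4), ("P2P + SHM", 2))` shows the two setups within about half a percent of each other on throughput, with identical median latency.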

As requested by the PSI team on Slack, we are going to deploy this inference service in LiftWing production with its current performance while we continue to optimize it in the experimental namespace.

Change #1277934 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment

https://gerrit.wikimedia.org/r/1277934

Change #1278182 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: deploy gpt isvc in prod

https://gerrit.wikimedia.org/r/1278182

Change #1277934 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment

https://gerrit.wikimedia.org/r/1277934

Change #1271577 merged by jenkins-bot:

[integration/config@master] inference-services: trigger policy-violation image builds for requirements.txt

https://gerrit.wikimedia.org/r/1271577

Change #1278182 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy gpt isvc in prod

https://gerrit.wikimedia.org/r/1278182

Change #1279238 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015

https://gerrit.wikimedia.org/r/1279238

Change #1279238 merged by jenkins-bot:

[operations/deployment-charts@master] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015

https://gerrit.wikimedia.org/r/1279238

Following T418350#11856639, we bumped up k8s resources in the prod llm namespace to match those used by the gpt-oss-safeguard-20b isvc in the experimental namespace.

This inference service has now been deployed in LiftWing production (eqiad). It can be accessed by tools that run within the WMF production network, as shown below:

$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/gpt-oss-safeguard-20b:predict" -X POST \
-d '{ "messages": [{"role": "user", "content": "CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW before offer expires!!!"}], "developer_prompt": "You are a content safety classifier.\n## Spam Policy\n\n### INSTRUCTIONS\nClassify the following content as spam or not spam. Return JSON: {\"violation\": 0 or 1, \"rationale\": \"brief explanation\"}\n0 = not spam, 1 = spam\n\n### VIOLATES (1)\n- Unsolicited promotional content\n- Repetitive messages\n- Phishing attempts\n- Deceptive links\n\n### SAFE (0)\n- Genuine questions or discussions\n- Relevant information sharing\n- Normal conversation", "max_tokens": 4096, "temperature": 0.7, "top_p": 0.95 }' \
-H  "Host: gpt-oss-safeguard-20b.llm.wikimedia.org" -H "Content-Type: application/json" --http1.1

{"reasoning":"[TextContent(text='The user provided a message: \"CLICK HERE TO WIN $10000!!! Visit http://totallylegit.biz NOW before offer expires!!!\"\\n\\nWe need to classify as spam or not spam. This is promotional content, urging to click a link to win money. It\\'s unsolicited promotional content, obviously spam. So violation 1. Provide rationale.')]","verdict":"[TextContent(text='{\"violation\": 1, \"rationale\": \"The message is a promotional spam urging a click to win money, containing a suspicious link.\"}')]"}
real	0m0.515s
user	0m0.011s
sys	0m0.004s
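
For callers who prefer Python over curl, the request above could be sketched as follows. This is not the liftwing_client; it is a minimal standard-library example. The URL, Host header, and payload fields are copied from the curl call. The function names are made up for illustration, and the parsing assumes that the "reasoning" and "verdict" fields always arrive wrapped as "[TextContent(text='...')]" (a Python repr), as in the sample output. It also assumes the calling host already trusts the internal CA.

```python
import ast
import json
import re
import urllib.request

# Endpoint and Host header copied from the curl example.
URL = "https://inference.svc.eqiad.wmnet:30443/v1/models/gpt-oss-safeguard-20b:predict"
HOST = "gpt-oss-safeguard-20b.llm.wikimedia.org"

# Matches the quoted Python string literal that follows "text=" in the wrapper.
_TEXT_LITERAL = re.compile(r"""text=('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")""", re.DOTALL)


def build_payload(content, developer_prompt, max_tokens=4096, temperature=0.7, top_p=0.95):
    """Build the request body in the shape used by the curl example."""
    return {
        "messages": [{"role": "user", "content": content}],
        "developer_prompt": developer_prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def extract_text(field):
    """Unwrap "[TextContent(text='...')]"; return the field unchanged if it does not match."""
    match = _TEXT_LITERAL.search(field)
    return ast.literal_eval(match.group(1)) if match else field


def parse_response(body):
    """Return (reasoning_text, verdict_dict) from a decoded response body."""
    reasoning = extract_text(body["reasoning"])
    verdict = json.loads(extract_text(body["verdict"]))
    return reasoning, verdict


def classify(content, developer_prompt, timeout=60):
    """POST to the endpoint; only reachable from inside the WMF production network."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(content, developer_prompt)).encode("utf-8"),
        headers={"Host": HOST, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_response(json.load(response))
```

With the spam policy from the curl example as `developer_prompt`, `classify(...)` would return the reasoning text and a dict with the "violation" and "rationale" keys that the policy asks for, provided the model returns valid JSON as its verdict.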

We have also updated the liftwing_client to use this prod endpoint and shared it with the PSI team.

kevinbazira claimed this task.
kevinbazira updated Other Assignee, added: BWojtowicz-WMF; removed: kevinbazira.
kevinbazira updated the task description. (Show Details)