Step 1 - Model Artifact Storage:
- Confirm licensing allows redistribution (model is marketed as open-weights).
- Store the model weights and tiktoken encodings in S3 - available under s3://wmf-ml-models/gpt-oss-safeguard-20b/. Skipped publishing to analytics since it's not our proprietary model - we did the same with aya models.
Step 2 - Integrate Prototype into KServe:
- Review existing FastAPI + vLLM server: https://github.com/roostorg/model-community/tree/main/gpt (T417860#11640606)
- Integrate Roost's GPT model-server into a custom KServe model-server that follows LiftWing ISVC patterns.
Step 3 - Validate Prototype:
- Test the prototype locally to confirm it works as expected.
- Involve test users at this stage so decisions can be made early enough.
Step 4 - Build Production Model-Server:
- Build the production model-server with support for the custom policies.
- Ensure it accepts the expected input, runs preprocessing, and returns the expected output.
Step 5 - Publish to Wikimedia Docker Registry:
- Dockerize the production model-server.
- Set up CI/CD to publish it to the Wikimedia Docker registry.
Step 6 - Deploy to LiftWing Staging (Experimental):
- Deploy the model-server in the LiftWing experimental namespace.
- Ensure it loads the model from S3, accepts the expected input format, runs preprocessing, and returns the expected output.
- Enable test users to test the model-server via a LiftWing endpoint.
Step 7 - Validate on LiftWing Staging:
- Using the LW experimental namespace endpoint, validate the production model-server to confirm it works as expected.
- Iterate with Step 4 as needed.
Step 8 - Load Testing on LiftWing Staging:
- Run load tests on the production model-server hosted in LW staging to confirm it meets performance requirements.
- Iterate with Step 4 as needed.
Step 9 - Deploy to Production:
- Deploy the model-server in the LiftWing production namespace to provide an internal production endpoint for wider use.
Step 10 - Documentation:
- Document how the inference service hosted on LiftWing can be accessed via an internal endpoint.
- Share documentation with consuming teams.
Step 11 - Support & Maintenance:
- Iterate through previous steps as needed based on the optimization required.
- Provide ongoing support for the inference service to address issues, improvements, and optimizations.