- Make sure that prefix caching is enabled. This should be a big win, since the same policy prompt is shared across all requests
- Explore how the reasoning effort setting affects latency and output length
- Verify we are running the fast attention path and optimal quantization. Also compare FP16 with BF16.
- Try tuning the `--max-model-len` parameter closer to the context length we actually expect (policy prompt plus content and output)
- Dive deeper into the multilingual performance of the model
- Write a summary report of the optimizations made
- (Optional, if we really can’t reach good latency) Explore multi-GPU parallelism
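As a rough starting point for the serving-side items above, here is a minimal sketch that assembles a launch command with the relevant knobs so the BF16 and FP16 variants can be compared side by side. It assumes the deployment is served with vLLM's `vllm serve` CLI (suggested by `--max-model-len`, but not confirmed here). The `ServeConfig` class, the model id, and the default values are illustrative placeholders, not the actual LiftWing configuration.

```python
import shlex
from dataclasses import dataclass


@dataclass
class ServeConfig:
    """Hypothetical container for the tuning knobs discussed in this task."""
    model: str = "openai/gpt-oss-safeguard-20b"  # assumed model id
    max_model_len: int = 8192      # placeholder: policy prompt + content + output
    dtype: str = "bfloat16"        # use "float16" for the FP16 comparison run
    prefix_caching: bool = True    # shared policy prompt should benefit from this
    tensor_parallel_size: int = 1  # raise only for the optional multi-GPU test


def build_serve_args(cfg: ServeConfig) -> list[str]:
    """Return the argv list for a `vllm serve` invocation (assumed CLI)."""
    args = [
        "vllm", "serve", cfg.model,
        "--max-model-len", str(cfg.max_model_len),
        "--dtype", cfg.dtype,
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
    ]
    # Pass the flag explicitly in both cases so the choice is visible in logs.
    if cfg.prefix_caching:
        args.append("--enable-prefix-caching")
    else:
        args.append("--no-enable-prefix-caching")
    return args


if __name__ == "__main__":
    # Print one command per dtype so both runs use otherwise identical settings.
    for dtype in ("bfloat16", "float16"):
        print(shlex.join(build_serve_args(ServeConfig(dtype=dtype))))
```

Keeping the settings in one structure makes it easier to record exactly which combination produced each latency measurement for the summary report.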
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T418267 Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing |
| Open | | None | T418351 Optimize gpt-oss-safeguard-20b LiftWing deployment |