Optimize gpt-oss-safeguard-20b LiftWing deployment
Open, Needs Triage, Public

Description

  • Make sure that prefix caching is enabled. This should be a big win, since all requests share the same policy prompt
  • Explore how the reasoning effort setting affects latency and output length
  • Verify we are running the fast attention path and optimal quantization. Also compare FP16 with BF16.
  • Try tuning the --max-model-len parameter to be closer to our expected policy length
  • Dive deeper into the multilingual performance of the model
  • Write a summary report of the optimizations made
  • (Optional, if we really can’t reach good latency) Explore multi-GPU parallelism
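
As a starting point for the prefix caching and --max-model-len items, here is a rough sketch of the arithmetic involved. It is not taken from the deployment: the function names and all token counts are hypothetical placeholders, and rounding to a multiple of 16 is only a convenience, not a known requirement of the serving stack. Real values should come from tokenizing the actual policy prompt and sampling production requests.

```python
# Rough budgeting sketch. All numbers are hypothetical placeholders,
# not measurements from the LiftWing deployment.


def suggest_max_model_len(policy_tokens, max_content_tokens,
                          max_output_tokens, multiple=16):
    """Sum the per-request token budget and round up to a multiple.

    The budget covers the shared policy prompt, the content being
    classified, and the generated output (including any reasoning tokens).
    The rounding step is an assumption made for tidiness only.
    """
    total = policy_tokens + max_content_tokens + max_output_tokens
    # Ceiling division, then scale back up to the nearest multiple.
    return -(-total // multiple) * multiple


def cacheable_prefix_fraction(policy_tokens, content_tokens):
    """Fraction of prompt tokens that belong to the shared policy prefix.

    This is an upper bound on the share of prefill work that prefix
    caching could skip, assuming the policy comes first in the prompt.
    """
    return policy_tokens / (policy_tokens + content_tokens)


if __name__ == "__main__":
    # Hypothetical example: 1200-token policy, up to 600 tokens of content,
    # up to 512 tokens of output.
    print(suggest_max_model_len(1200, 600, 512))
    print(round(cacheable_prefix_fraction(1200, 600), 3))
```

The second function only gives an upper bound: the actual saving depends on the policy prompt being byte-identical across requests and appearing before any request-specific text. Higher reasoning effort would increase the output budget, so the two items interact.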