We want to establish a load testing process for LLMs deployed on Lift Wing ml-staging. This should include:
- Standard load testing to measure request latency percentiles, consistent with our other models
- Visualization of latency patterns across varying input/output token sizes, similar to our ML-lab benchmark (T382343)
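As a starting point, a minimal sketch of the percentile measurement piece, using only the standard library. The endpoint URL and payload shape here are placeholders, not the actual Lift Wing API; a real run would use the ml-staging inference URL and the model's request schema.

```python
import json
import math
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoint; the real ml-staging URL and payload will differ.
ENDPOINT = "https://inference-staging.example.org/v1/models/llm:predict"

def latency_percentile(latencies_ms, p):
    """Nearest-rank percentile over a list of latencies (milliseconds)."""
    ordered = sorted(latencies_ms)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

def time_request(prompt):
    """Send one request and return its wall-clock latency in milliseconds."""
    payload = json.dumps({"prompt": prompt}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

def run_load_test(prompt, total_requests=100, concurrency=10):
    """Fire requests at a fixed concurrency and summarize latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(time_request, [prompt] * total_requests))
    return {
        "p50": latency_percentile(latencies, 50),
        "p90": latency_percentile(latencies, 90),
        "p99": latency_percentile(latencies, 99),
        "mean": statistics.fmean(latencies),
    }

# Example (against a live endpoint):
# print(run_load_test("Translate 'hello' to French.", total_requests=50))
```

A dedicated tool like Locust would handle ramp-up, request distribution, and reporting for us; this sketch only illustrates what "latency percentiles under fixed concurrency" means for one endpoint.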
Before implementation, we should:
- Investigate how other teams load test LLMs
- Explore how those approaches can be adapted to our infrastructure (e.g. Locust run via the statboxes)
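For the token-size dimension, a rough sketch of the sweep the visualization would be built on. The grid values and the whitespace "tokens" are assumptions for illustration; the real benchmark (cf. T382343) would use the deployed model's tokenizer, and `send_fn` stands in for whatever client (Locust task, plain HTTP) ends up doing the request.

```python
import time

# Hypothetical sweep grid; adjust to the sizes the ML-lab benchmark covers.
INPUT_SIZES = [32, 128, 512, 1024]
MAX_NEW_TOKENS = [16, 64, 256]

def make_prompt(n_tokens, filler="lorem"):
    """Build a prompt of roughly n_tokens whitespace-separated tokens."""
    return " ".join([filler] * n_tokens)

def sweep(send_fn):
    """Measure latency for every (input size, output cap) pair.

    send_fn(prompt, max_new_tokens) performs one request; it is injected so
    the sweep can be exercised against a stub as well as against ml-staging.
    """
    rows = []
    for n_in in INPUT_SIZES:
        for n_out in MAX_NEW_TOKENS:
            prompt = make_prompt(n_in)
            start = time.perf_counter()
            send_fn(prompt, n_out)
            latency_ms = (time.perf_counter() - start) * 1000
            rows.append({"input_tokens": n_in,
                         "max_new_tokens": n_out,
                         "latency_ms": latency_ms})
    return rows  # e.g. dump to CSV and plot latency vs. token sizes
```

Each row is one point on the latency-vs-token-size plot; repeating the sweep N times per cell would give the percentile bands rather than single samples.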