In task T406179, the ML team deployed the revertrisk-wikidata inference service in LiftWing. Subsequently, in task T409388, the Enterprise team conducted load tests to simulate their traffic and shared the following results:
Run 1
• Duration: ~67.3 mins
• Total Requests: 87,595
• Success: 77,205 (88.14%)
• Failures: 10,390 (11.86%)
• Actual RPS: 21.7
• Requests/hour: 78,109
• Target achievement: 52.07% of 150K/hour
• p90 latency (first 200 successes): ~5.7s
Run 2
• Duration: ~67.1 mins
• Total Requests: 75,885
• Success: 64,292 (84.72%)
• Failures: 11,593 (15.28%)
• Actual RPS: 18.85
• Requests/hour: 67,866
• Target achievement: 45.24% of 150K/hour
The Enterprise team's results show a p90 latency of ~5.7s (measured over the first 200 successful requests), more than ten times the ~500ms target; both runs also achieved only around half of the 150K requests/hour throughput goal.
Below are the options we plan to explore to optimize the revertrisk-wikidata isvc and meet the latency target:
- Enable multi-worker processing in KServe
- Parallelize asynchronous calls to Wikidata API
- Improve error handling and retry logic for Wikidata API requests
- Cache Wikidata API responses to reduce redundant calls
- Enable GPU inference if CPU inference is a bottleneck
- Adjust deployment configurations to improve resource allocation and k8s autoscaling
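To illustrate the second option, the sketch below contrasts sequential and concurrent fetching with `asyncio.gather`. The `fetch_revision` coroutine is a hypothetical stand-in that simulates network latency with `asyncio.sleep`; a real implementation would call the Wikidata API with an async HTTP client such as aiohttp.

```python
import asyncio
import time


# Hypothetical stand-in for one Wikidata API request; the 0.1s sleep
# simulates network latency. A real version would use aiohttp against
# the Wikidata API instead.
async def fetch_revision(rev_id: int) -> dict:
    await asyncio.sleep(0.1)
    return {"rev_id": rev_id, "content": f"revision-{rev_id}"}


async def fetch_sequential(rev_ids: list[int]) -> list[dict]:
    # Awaiting each call in turn: total latency is the sum of all calls.
    return [await fetch_revision(r) for r in rev_ids]


async def fetch_parallel(rev_ids: list[int]) -> list[dict]:
    # asyncio.gather schedules all coroutines concurrently, so total
    # latency approaches that of the slowest single call.
    return await asyncio.gather(*(fetch_revision(r) for r in rev_ids))


if __name__ == "__main__":
    rev_ids = list(range(5))

    start = time.perf_counter()
    asyncio.run(fetch_sequential(rev_ids))
    seq = time.perf_counter() - start

    start = time.perf_counter()
    asyncio.run(fetch_parallel(rev_ids))
    par = time.perf_counter() - start

    print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
```

With five simulated 100ms calls, the sequential path takes roughly 0.5s while the gathered path takes roughly 0.1s; the same pattern applies when the isvc needs several Wikidata lookups per prediction request.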