The reference quality models (reference-risk and reference-need) are bundled in a single service deployed in the revision-models namespace.
We have observed increased latencies in the preprocess phase of the reference-need model, specifically p75 latency spikes. These can be verified on the KServe Inference Services dashboard in Grafana.
Additionally, we have detected CPU throttling, which may be contributing to these delays. This is easily observed on the Kubernetes pod details Grafana dashboard.
The service receives steady traffic in the range of 20-35 req/s.
To tackle the latencies we have:
- increased the maxReplicas of the service, initially from 3->5 and then from 5->8
- increased the CPU requests/limits of the kserve-container from 12->16 cores
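For reference, the scaling and resource changes above correspond to settings like the following in the InferenceService spec (a sketch only; the actual manifest, field layout, and predictor name in the deployment repo may differ):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: reference-need        # hypothetical name
  namespace: revision-models
spec:
  predictor:
    minReplicas: 1            # assumed; not stated above
    maxReplicas: 8            # raised from 3 -> 5 -> 8
    containers:
      - name: kserve-container
        resources:
          requests:
            cpu: "16"         # raised from 12
          limits:
            cpu: "16"         # throttling kicks in when usage hits this limit
```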
Although the above steps mitigate the issue for a while, they don't fix it: latencies go up again from time to time.
Looking into the code of the service, the following takes place in preprocessing:
- a request is made to mwapi to extract information about the current revision (e.g. page_id, revision_id, user_id)
- using the information from the first request, 3 additional requests are made to mwapi concurrently via asyncio.gather
So in total, 4 requests are made to mwapi for every request to reference-need.
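The call pattern above can be sketched as follows. This is a minimal simulation, not the service's actual code: mwapi_request is a hypothetical stand-in for the real async MediaWiki API client, and the three follow-up endpoint names are illustrative. The point is the shape of the fan-out: one blocking call, then three concurrent ones.

```python
import asyncio

async def mwapi_request(endpoint: str, delay: float = 0.01) -> dict:
    # Hypothetical stand-in for an mwapi call; simulates network latency.
    await asyncio.sleep(delay)
    return {"endpoint": endpoint}

async def preprocess(rev_id: int) -> dict:
    # First call: fetch revision metadata (page_id, revision_id, user_id).
    # Everything else waits on this, so its latency is always on the critical path.
    meta = await mwapi_request(f"revision/{rev_id}")

    # Three follow-up calls issued concurrently with asyncio.gather, so their
    # latencies overlap; the slowest of the three dominates this phase.
    extra = await asyncio.gather(
        mwapi_request("page_info"),
        mwapi_request("user_info"),
        mwapi_request("revision_content"),
    )
    return {"meta": meta, "extra": list(extra)}

features = asyncio.run(preprocess(12345))
```

Because the first request gates the other three, the best-case preprocess latency is roughly latency(first call) + max(latency of the three gathered calls), which makes every request to reference-need sensitive to mwapi tail latency.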




