On Sunday 27 July 2025 at 1:32 AM CEST, we received an alert for the reference-need model stating that Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas. The source can be investigated in this Grafana dashboard: https://grafana.wikimedia.org/goto/dQsugKwHR?orgId=1.
The dashboard shows that one of the replicas was unavailable for ~30 minutes, after which it was scheduled correctly and the issue did not recur.
Although the issue did not persist, it may be a symptom of the cluster having too few free resources to schedule all of our replicas. This is more pronounced for the reference-need service, as it requests 22 CPUs and 6Gi of memory.
We should investigate whether we can lower the resource requests for the reference-need service or, alternatively, lower its autoscaling ceiling.
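
To illustrate why a large CPU request can leave a replica unschedulable even when the cluster as a whole has spare capacity, here is a minimal Python sketch of a per-node fit check. The node names, capacities, and existing requests are hypothetical and are not taken from the eqiad cluster. The 16-CPU figure is only an example of a lowered request. The real Kubernetes scheduler also considers other factors (for example affinity rules and taints) that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu: float     # cores the node can hand out to pods
    allocatable_mem_gi: float  # GiB the node can hand out to pods
    requested_cpu: float       # cores already requested by scheduled pods
    requested_mem_gi: float    # GiB already requested by scheduled pods


def fits(node, cpu_request, mem_request_gi):
    """Return True if the pod's requests fit into the node's free capacity."""
    free_cpu = node.allocatable_cpu - node.requested_cpu
    free_mem = node.allocatable_mem_gi - node.requested_mem_gi
    return free_cpu >= cpu_request and free_mem >= mem_request_gi


def schedulable_nodes(nodes, cpu_request, mem_request_gi):
    """Return the names of all nodes on which the pod would fit."""
    return [n.name for n in nodes if fits(n, cpu_request, mem_request_gi)]


# Hypothetical cluster state: 51 CPUs are free in total,
# but no single node has 22 free CPUs.
NODES = [
    Node("node-a", 48, 128, 30, 40),  # 18 CPUs free
    Node("node-b", 48, 128, 35, 60),  # 13 CPUs free
    Node("node-c", 48, 128, 28, 90),  # 20 CPUs free
]

# Current request from this report: 22 CPUs and 6Gi of memory.
print(schedulable_nodes(NODES, 22, 6))  # []

# Hypothetical lowered request: 16 CPUs and 6Gi of memory.
print(schedulable_nodes(NODES, 16, 6))  # ['node-a', 'node-c']
```

Under these assumed numbers the 22-CPU replica stays pending until enough capacity frees up on a single node, while a smaller request fits immediately. This is the kind of effect that lowering the requests would aim for.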
