
Investigate reference-need persistently unavailable replicas alert
Closed, Resolved, Public

Description

On Sunday, 27 July 2025, at 1:32 AM CEST, we received an alert for the reference-need model stating that Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas. The source can be investigated in this Grafana dashboard: https://grafana.wikimedia.org/goto/dQsugKwHR?orgId=1.

We can see that one of the replicas was unavailable for ~30 minutes, after which it was scheduled correctly and the issue did not recur.


Though the issue did not persist, it can be a symptom of having too few resources in our cluster to schedule all our replicas correctly. This is more pronounced for the reference-need service, as it requests 22 CPUs and 6Gi of memory.

We should investigate whether we can lower the resource requests for the reference-need service or potentially lower its autoscaling ceiling.
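
To illustrate why a large request is harder to place, here is a minimal sketch in Python. It assumes the 22 CPU / 6Gi request applies per replica, and it relies on the fact that a pod must fit entirely on a single node. The node names and free capacities are hypothetical and are not taken from the eqiad cluster; the check also ignores taints, affinity rules, and other scheduler constraints.

```python
from dataclasses import dataclass


@dataclass
class NodeFree:
    """Free (unrequested) capacity on one node. Values used below are hypothetical."""
    name: str
    cpu: float      # free CPU cores
    mem_gib: float  # free memory in GiB


def schedulable_nodes(nodes, cpu_request, mem_request_gib):
    """Return the names of nodes whose free CPU and memory both cover the request."""
    return [
        n.name
        for n in nodes
        if n.cpu >= cpu_request and n.mem_gib >= mem_request_gib
    ]


# Hypothetical free capacity per node (illustration only).
nodes = [
    NodeFree("node-a", cpu=16, mem_gib=64),
    NodeFree("node-b", cpu=20, mem_gib=48),
    NodeFree("node-c", cpu=12, mem_gib=80),
]

# Aggregate free CPU exceeds 22 cores, yet no single node can host the replica.
total_free_cpu = sum(n.cpu for n in nodes)
fits_current = schedulable_nodes(nodes, cpu_request=22, mem_request_gib=6)

# A smaller (hypothetical) request of 8 CPUs fits on every node.
fits_smaller = schedulable_nodes(nodes, cpu_request=8, mem_request_gib=6)

print(total_free_cpu, fits_current, fits_smaller)
```

In this made-up scenario the cluster has plenty of free CPU in total, but the replica stays unscheduled until one node frees up enough cores, which is consistent with a replica being unavailable for a while and then scheduling correctly.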

Event Timeline

Resolving this as it was a one-time incident. The underlying concern about reference-need's high resource requests (22 CPUs, 6Gi memory) and their impact on cluster scheduling is now tracked as part of T414431, where we are optimizing resource utilization across all ISVCs.