Refill tool stuck "waiting for an available worker"
Closed, Resolved, Public

Description

The tool appears to be stuck and needs a prod to get it started again.

Event Timeline

Curb_Safe_Charmer subscribed.

@Bstorm apparently no longer works for WMF, so looking for someone else in the Cloud Services Team who can help.

I have run 'webservice restart', but the tool is still showing "waiting for an available worker".

Curb_Safe_Charmer added a subscriber: TheresNoTime.

Yesterday evening (UK time), my attempts to restart the API service using 'webservice restart' didn't help, so I pinged @TheresNoTime on IRC and she quickly jumped on the problem, resolving it by deleting and redeploying it. I gather she'll provide a write-up in due course.

Curb_Safe_Charmer triaged this task as Medium priority.

So minus the running about trying to figure out what was going on, what resolved it was:

On refill

  • kubectl get pods
  • kubectl delete pods {name of pod}

On refill-api

  • kubectl get pods
  • kubectl delete pods {name of pod}

They will then auto-recreate.

I also deleted the worker deployment and re-deployed it:

  • kubectl get deployments
  • kubectl delete deployments {name of deployment}
  • kubectl apply -f worker-deployment.yml

I want to figure something out with T309103: Set up monitoring for refill so we don't end up relying on user reports 😄

If the worker is using a custom k8s deployment, consider configuring liveness/readiness probes so that Kubernetes restarts the container when it gets stuck.
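
As a rough illustration of that suggestion, the excerpt below sketches how probes might be added to a file like worker-deployment.yml. Everything specific in it is an assumption rather than something taken from the actual refill configuration: the deployment and container names, the image, the heartbeat file path, and the timings are all placeholders. It also assumes the worker process touches a heartbeat file periodically while it is healthy, which the real worker may not do.

```yaml
# Hypothetical excerpt; names, image, paths, and timings are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: refill-worker                # assumed deployment name
spec:
  replicas: 1
  selector:
    matchLabels:
      name: refill-worker
  template:
    metadata:
      labels:
        name: refill-worker
    spec:
      containers:
        - name: worker               # assumed container name
          image: example/refill-worker:latest   # placeholder image
          livenessProbe:
            exec:
              # Succeeds only if the (assumed) heartbeat file was modified
              # within the last 5 minutes; exits non-zero otherwise.
              command:
                - /bin/sh
                - -c
                - find /tmp/worker-heartbeat -mmin -5 | grep -q .
            initialDelaySeconds: 60  # give the worker time to start
            periodSeconds: 60        # check once a minute
            timeoutSeconds: 10
            failureThreshold: 3      # restart after 3 consecutive failures
          readinessProbe:
            exec:
              # Marks the pod ready once the heartbeat file exists.
              command: ["/bin/sh", "-c", "test -f /tmp/worker-heartbeat"]
            initialDelaySeconds: 10
            periodSeconds: 30
```

With a liveness probe along these lines, the kubelet restarts the container after the failure threshold is reached, which would replace the manual 'kubectl delete pods' step above. A failing readiness probe does not trigger a restart; it only marks the pod as not ready.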