
`webservice restart` isn't actually restarting the pods
Closed, Resolved (Public)

Description

tools.shorturls@tools-sgebastion-08:~/www/rust$ webservice restart
Your job is not running, starting...............
tools.shorturls@tools-sgebastion-08:~/www/rust$ webservice status
Your webservice of type golang111 is running on backend kubernetes
tools.shorturls@tools-sgebastion-08:~/www/rust$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
shorturls-768f6874cb-kgjvl   1/1     Running   0          21d

The job was definitely running before, even though `webservice restart` reported that it wasn't.

Event Timeline

Looking at https://k8s-status.toolforge.org/namespaces/tool-shorturls/, the deployment is 1y15w6d old. The pod is 18h39m28s old but it only has the tools.wmflabs.org labels, so this is indeed the problem mentioned in cloud-announce.
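To illustrate why a pod that carries only the old labels can look "not running", here is a minimal sketch of equality-based label selection in Python. It is a toy model, not the webservice code: the label keys and the helper names `matches` and `select_pods` are assumptions made up for illustration, and the real keys may differ.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """Equality-based selector: every key/value pair in the selector
    must be present on the object's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(pods: Dict[str, Labels], selector: Labels) -> List[str]:
    """Return the names of the pods whose labels satisfy the selector."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


# Hypothetical label keys, for illustration only.
OLD_LABELS = {"tools.wmflabs.org/webservice": "true"}
NEW_SELECTOR = {"toolforge.org/webservice": "true"}

# A pod created from an old pod template carries only the old labels.
pods = {"shorturls-768f6874cb-kgjvl": dict(OLD_LABELS)}

print(select_pods(pods, OLD_LABELS))    # ['shorturls-768f6874cb-kgjvl']
print(select_pods(pods, NEW_SELECTOR))  # []
```

Under these assumptions, a tool that looks pods up with the new selector gets an empty list even though the pod is up, which would be consistent with the "Your job is not running" message above.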

Yep:

tools.shorturls@tools-sgebastion-08:~$ webservice stop
Stopping webservice
tools.shorturls@tools-sgebastion-08:~$ webservice start
Starting webservice....
tools.shorturls@tools-sgebastion-08:~$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
shorturls-7b7858d794-jzxvw   1/1     Running   0          37s
tools.shorturls@tools-sgebastion-08:~$ kubectl get deployments
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
shorturls   1/1     1            1           45s
tools.shorturls@tools-sgebastion-08:~$ webservice restart
Restarting...
tools.shorturls@tools-sgebastion-08:~$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
shorturls-7b7858d794-lw6rj   1/1     Running   0          41s

Sorry about the trouble, thanks everyone.

Should we make a note on wikitech somewhere about this lurking issue? I'd bet that Kunal will not be the last person to stumble over it. I vaguely remembered that this would be an issue with Kubernetes selectors but had already forgotten the known issue.

Ideally we'd just patch webservice to realize this and automatically fully migrate to the new labels.

I'm not sure how good of an idea it would be to have webservice restart sometimes perform an (unexpected) stop/start. I do think some sort of warning message would be appropriate when the backend is in STATE_PENDING. See also the discussion in https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20211013.txt

It might be useful to have a webservice restart --hard that just performs a stop/start in one command.
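As a rough sketch of what such a one-command stop/start could look like, the Python wrapper below runs `webservice stop` and then `webservice start`. Note that `--hard` is only a proposal in this thread, not an existing flag; the function names are hypothetical, and the sketch assumes that `webservice` signals failure through a non-zero exit code.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def hard_restart(run: Runner = run_command) -> bool:
    """Stop the webservice, then start it.

    Mirrors `webservice stop && webservice start`: the start step is
    skipped if the stop step fails. Assumes exit code 0 means success.
    """
    if run(["webservice", "stop"]) != 0:
        return False
    return run(["webservice", "start"]) == 0


# Dry run with a recording runner, so nothing is executed for real.
recorded: List[List[str]] = []


def recording_runner(argv: Sequence[str]) -> int:
    recorded.append(list(argv))
    return 0


print(hard_restart(recording_runner))  # True
print(recorded)  # [['webservice', 'stop'], ['webservice', 'start']]
```

Passing the runner in as a parameter keeps the sketch testable without touching a real tool account; calling `hard_restart()` with no arguments would invoke the actual `webservice` command.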

Legoktm added a project: Documentation.

I think it is unfortunate that webservice restart has different semantics than webservice stop && webservice start; maybe that should be its own task.

I agree with either having webservice emit some help message or fixing it itself when it encounters this state.

> I'm not sure how good of an idea it would be to have webservice restart sometimes perform an (unexpected) stop/start.

At least IMO, I'd prefer that over having webservice restart not actually restart the service :S

> I think it is unfortunate that webservice restart has different semantics than webservice stop && webservice start; maybe that should be its own task.

Giving them the same semantics would risk unnecessary reloads of the ingress layer (which would be a Problem) and make restarts uglier for scaled-up apps. Ideally, restarts would move to a rolling deployment in the future, don't you think? Unfortunately, the assumption that all labels would be stable only holds if we never actually change them. I just wish I'd known they couldn't be changed on the fly in the pod templates. Sorry!

Originally, restart was exactly the same as stop && start. On the Kubernetes backend it was later changed to the current behavior of deleting the pod and letting the replica set recreate it, because that is much lighter weight and also less buggy (T140415).
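The difference between the two restart styles can be sketched with a toy model in Python. This is not the webservice implementation: the `Deployment` class, the function names, and the label keys are all hypothetical, and the model assumes a single replica. It only aims to show why recreating a pod from an existing deployment keeps that deployment's old template labels, while a stop/start creates a fresh deployment.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Labels = Dict[str, str]


@dataclass
class Deployment:
    """Toy deployment: one replica, pods are represented by their labels."""
    template_labels: Labels
    pods: List[Labels] = field(default_factory=list)

    def reconcile(self) -> None:
        """Replica-set behaviour: recreate a missing pod from the template."""
        if not self.pods:
            self.pods.append(dict(self.template_labels))


def restart_by_pod_deletion(deployment: Deployment) -> Deployment:
    """Current restart style: delete the pod, let the replica set recreate it."""
    deployment.pods.clear()
    deployment.reconcile()
    return deployment


def restart_by_stop_start(current_labels: Labels) -> Deployment:
    """stop && start style: discard the deployment and create a new one."""
    deployment = Deployment(template_labels=dict(current_labels))
    deployment.reconcile()
    return deployment


# Hypothetical label keys, for illustration only.
OLD = {"tools.wmflabs.org/webservice": "true"}
NEW = {"toolforge.org/webservice": "true"}

old_deployment = Deployment(template_labels=dict(OLD))
old_deployment.reconcile()

print(restart_by_pod_deletion(old_deployment).pods)  # [{'tools.wmflabs.org/webservice': 'true'}]
print(restart_by_stop_start(NEW).pods)               # [{'toolforge.org/webservice': 'true'}]
```

In this model the lightweight restart never touches the pod template, so a deployment created before a label change keeps producing pods with the old labels until it is stopped and started again.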