
`webservice restart` regression with backend=kubernetes in webservice 0.51
Closed, Resolved · Public

Description

Using webservice 0.47 from inside a pod works as expected:

$ webservice restart
******************************************************************************
Note that access.log is no longer enabled by default (see https://w.wiki/9go)
******************************************************************************
Restarting webservice...
$ 

Using webservice 0.51 from tools-sgebastion-08:

$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 230, in <module>
    start(job, 'Your job is not running, starting')
  File "/usr/local/bin/webservice", line 95, in start
    job.request_start()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 642, in request_start
    pykube.Deployment(self.api, self._get_deployment()).create()
  File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
    self.api.raise_for_status(r)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
    raise HTTPError(payload["message"])
pykube.exceptions.HTTPError: deployments.extensions "fourohfour" already exists

The pod actually seems to be restarted as hoped. However, it looks like the delete of the Deployment done in KubernetesBackend.request_stop() is failing. The guard that checks for an existing Deployment in KubernetesBackend.request_start() also seems to be failing, so maybe the root problem is that something changed such that KubernetesBackend._find_obj(pykube.Deployment, self.webservice_label_selector) always comes up empty?
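
For reference, the lookup being blamed here is a pykube label-selector query. A minimal sketch of what such a query roughly looks like (the kubeconfig path, namespace, and label names below are assumptions for illustration, not the actual webservice code):

import pykube

# Assumed tool kubeconfig location and namespace; adjust as needed.
api = pykube.HTTPClient(
    pykube.KubeConfig.from_file("/data/project/fourohfour/.kube/config")
)

# Hypothetical selector; webservice builds its own webservice_label_selector.
selector = {"name": "fourohfour"}

found = list(
    pykube.Deployment.objects(api).filter(namespace="fourohfour", selector=selector)
)
# If the selector contains a label the existing Deployment does not carry,
# 'found' is empty: the "already exists" guard never trips, request_stop()
# deletes nothing, and the later create() fails exactly as in the traceback.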

Related, but separate: the Docker images seem not to have been updated to use the latest webservice package. That is actually kind of nice in this instance, as it let me make this comparison and see that this is a regression in webservice and not some other problem with the legacy k8s cluster.

Event Timeline

bd808 triaged this task as High priority. Nov 10 2019, 12:18 AM

This seems to affect webservice --backend=kubernetes ... start as well:

# verify that nothing is currently running
$ kubectl get deployments
$ kubectl get replicasets
$ kubectl get pods

# start up a python3.5 webservice
$ webservice --backend=kubernetes python3.5 start
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 218, in <module>
    start(job, 'Starting webservice')
  File "/usr/local/bin/webservice", line 95, in start
    job.request_start()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 647, in request_start
    pykube.Service(self.api, self._get_svc()).create()
  File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
    self.api.raise_for_status(r)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
    raise HTTPError(payload["message"])
pykube.exceptions.HTTPError: services "fourohfour" already exists

# Check on the state of k8s
$ kubectl get deployments
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
fourohfour   1         1         1            1           9s
$ kubectl get replicasets
NAME                    DESIRED   CURRENT   READY     AGE
fourohfour-2046109565   1         1         1         15s
$ kubectl get pods
NAME                          READY     STATUS    RESTARTS   AGE
fourohfour-2046109565-nom8v   1/1       Running   0          21s

So things were created as expected, but webservice blew up in the process. This crash has another very unfortunate side effect: it prevents $HOME/service.manifest from being written/updated. That means subsequent webservice reload calls will also blow up, even when run from inside a pod where there is a working webservice command, and webservice status ends up really confused.

Having run this every which way repeatedly in testing, I now think I *did* see this, but I thought it was an odd one-off because it didn't seem consistent.
What happens isn't about the Deployment in the second case; it's the Service object, and only on the old cluster. If you do a kubectl delete service --all it will clear up. That's why you missed it in the start phase: you only checked for deployments and their descendants, not services. It also shows up in that traceback above for start.

It seems that somehow _find_obj() either doesn't correctly find the service or doesn't delete it. I honestly cannot quite tell why, because it uses the same pykube code for every object... and it works on the newer cluster. The only difference is an additional label. Now that you've pointed this out, it is even more baffling to me because it also missed a Deployment in the restart.

We should probably roll back the deployment in tools (there must be some way to do that in aptly) and figure out what the issue with service objects is. The labels are set object-wide, so it shouldn't be possible for them to be different unless there is a regression in pykube with this number of labels, or maybe in the way we concatenate them into a label matcher?

I can try to figure out how to roll back now.

Wait... maybe that's it: you restarted, but the labels it looks for are different, because it doesn't look objects up by name, it uses labels. There is absolutely no reason not to use the object's name unless pykube is incapable of it.

Yes, that makes sense: by adding a new label, I broke deletion of old objects that didn't have the new label, because the lookup matches on the ENTIRE list of labels. Maybe the easiest fix is a quick patch and deploy.

Also, by running again with the old version, you changed the labels a second time. Yes, it all makes sense. Fix coming.
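
To spell out the failure mode (the label names below are hypothetical, just to illustrate Kubernetes equality-based selector semantics): an object only matches a selector when it carries every label in the selector, so objects created by 0.47 are invisible to a 0.51 selector that includes the new label.

# Hypothetical label sets; the real names/values in webservice differ.
old_labels = {"name": "fourohfour"}                          # created by 0.47
new_selector = {"name": "fourohfour", "webservice": "true"}  # used by 0.51

def matches(obj_labels, selector):
    # Equality-based selectors match only when every selector key/value
    # pair is present on the object.
    return all(obj_labels.get(k) == v for k, v in selector.items())

matches(old_labels, new_selector)    # False: 0.47 objects are never found or deleted
matches(new_selector, new_selector)  # True: only 0.51-created objects match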

Workaround for users until deploy:
Delete ALL existing objects for the webservice:

  • kubectl delete service <toolname>
  • kubectl delete deployment <toolname>
  • kubectl delete rs --all (only if the webservice's ReplicaSets are the only ReplicaSets the tool has)
  • kubectl delete pod --all (only if the webservice's pods are the only pods the tool has)

Then run webservice as normal. Everything will have the same labels, unless webservice inside the pod changes them for some reason.

Change 549990 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] breakfix: set the label selector to a subset of actual labels

https://gerrit.wikimedia.org/r/549990

Change 549990 merged by Bstorm:
[operations/software/tools-webservice@master] breakfix: set the label selector to a subset of actual labels

https://gerrit.wikimedia.org/r/549990
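
The shape of the fix suggested by the patch title (a sketch under that assumption, not the actual diff): keep applying the full label set to new objects, but build the selector from a stable subset of labels that pre-0.51 objects also carry.

# Sketch only; names are illustrative, not the actual webservice attributes.
labels = {
    "name": "fourohfour",
    "toolforge": "tool",   # hypothetical newer label added in 0.51
}
# Selector restricted to labels that older objects also have, so
# _find_obj()/delete can still see objects created by earlier versions.
label_selector = {"name": "fourohfour"}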

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T01:45:13Z] <bstorm_> deploying bugfix for webservice in tools and toolsbeta T237836

In testing, the new version is able to correctly find all webservice pods and delete them in a sensible fashion (on both the old and new clusters).

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T02:10:20Z] <bd808> Building new Docker images for T237836

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T02:17:14Z] <bd808> Building new Docker images for T237836 (retrying after cleaning out old images on tools-docker-builder-06)

bd808 assigned this task to Bstorm.

Docker containers are updated:

$ kubectl exec -it fourohfour-2046109565-7x8v1 -- /bin/bash
$ dpkg -l|grep webservice
ii  toollabs-webservice               0.52                           all          Infrastructure for running webservices on tools.wmflabs.org

Restarts are obviously needed for running containers to pick up the new images, but I think we can let that happen organically.