
k8s webservice restart failure with `ValueError: get() more than one object; use filter`
Closed, Resolved · Public

Description

tools.precise-tools@tools-bastion-05:~/www/python/src$ git apply ~/169b40bff497eff780df28ba010cdf3564e62cdd.patch; webservice restart
Restarting webserviceTraceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 369, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 185, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
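
For context, pykube's Query.get() enforces exactly-one-match semantics, roughly like this (a paraphrase of the behaviour implied by the traceback, not the verbatim pykube source):

# Paraphrase of pykube's Query.get() semantics (not the verbatim source):
# zero matches raise ObjectDoesNotExist (which webservice's _find_obj()
# catches), but two or more matches raise the ValueError seen above.
def get(self, *args, **kwargs):
    clone = self.filter(*args, **kwargs)  # narrow the query by selector
    objects = list(clone)                 # evaluate the API call
    if len(objects) == 1:
        return objects[0]
    if not objects:
        raise ObjectDoesNotExist('get() returned zero objects')
    raise ValueError("get() more than one object; use filter")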

Related Objects

Mentioned In
T234617: Toolforge. introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T158244: Improve `webservice status` output
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T190696: [Gsoc 2018] Proposal for Toolforge webservice command Improvement
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
T140415: `webservice restart` does not always wait for service to stop before trying to start again
T177603: Proposal: Improvements for the Toolforge 'webservice' command
T175768: Improvements for the Toolforge 'webservice' command
T160353: Still issues with node.js webservice
T158152: tool.spiarticleanalyzer: webservice stop yields ValueError: get() more than one object
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)
Mentioned Here
T234617: Toolforge. introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T158244: Improve `webservice status` output
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
P5016 k8s for i in {1..5} webservice stop; webservice start and webservice restart with kubectl get {pods,services,deployments} for debugging T156626
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)

Event Timeline


This is an annoying bug for experienced users and can be insurmountable for newbies. Let's not make the solution more complex than it needs to be.

CMIIW: When webservice stop has finished (with a Kubernetes backend), kubectl get pods should not list any webservice pods, so that would be the testable requirement. So maybe add a STATE_STOPPING for webservices when they are Terminating, but not gone yet. The logic in `webservice`'s stop() can then remain the same.
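
A minimal sketch of that idea, assuming the pykube-based backend quoted later in this task (STATE_STOPPING is hypothetical; the real Backend class only defines STOPPED, PENDING and RUNNING):

def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        # Hypothetical: a pod that is shutting down carries a
        # metadata.deletionTimestamp (see the debugging further down).
        if pod.obj['metadata'].get('deletionTimestamp'):
            return Backend.STATE_STOPPING  # does not exist today
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED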

After cloning the repository and adding some prints, the output of PYTHONPATH=. scripts/webservice --backend=kubernetes stop && kubectl get pods is:

svc =  <Service typoscan>
deployment =  <Deployment typoscan>
pod =  <Pod typoscan-3292915034-mokde>
pod.obj.status.phase =  u'Running'
Stopping webservicesvc =  None
deployment =  None

NAME                        READY     STATUS        RESTARTS   AGE
typoscan-3292915034-mokde   1/1       Terminating   0          30s

I. e., "Terminating" pods have svc and deployment set to None = tough luck.

So let's see what webservice start is looking at.

(Fun fact: Starting webservice start from a home directory on NFS is enough to cause a sufficient delay for the bug not to show up. Moving to /tmp works reliably for reproducing.)

> I. e., "Terminating" pods have svc and deployment set to None = tough luck.

FYI T156626#2984863

> So let's see what webservice start is looking at.

It is checking the status of the pod, but it cannot determine that since there are multiple pods running.

AFAICT a "Terminating" pod has pod.obj['status']['phase'] set to Running:

tools.typoscan@tools-bastion-03:/tmp/tools-webservice$ webservice --backend=kubernetes php5.6 start && webservice --backend=kubernetes stop && kubectl get pods && PYTHONPATH=. scripts/webservice --backend=kubernetes php5.6 status
Starting webservice...
Stopping webservice
NAME                        READY     STATUS        RESTARTS   AGE
typoscan-3292915034-mzhbw   1/1       Terminating   0          5s
<Pod typoscan-3292915034-mzhbw>
pod =  <Pod typoscan-3292915034-mzhbw>
pod.obj =  {u'status': {u'containerStatuses': [{u'restartCount': 0, u'name': u'webservice', u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'imageID': u'docker://sha256:3b9bc605cc9a8c4f379979fcac5d8b7a9b8024411a42f016c63a72af1b1aefd2', u'state': {u'running': {u'startedAt': u'2017-02-04T08:04:32Z'}}, u'ready': True, u'lastState': {}, u'containerID': u'docker://3d1488dcdd791a3e18c5b5659f267b674f748f8cecadbd3c37b3dead8cd45428'}], u'podIP': u'192.168.112.3', u'startTime': u'2017-02-04T08:04:30Z', u'hostIP': u'10.68.22.168', u'phase': u'Running', u'conditions': [{u'status': u'True', u'lastProbeTime': None, u'type': u'Initialized', u'lastTransitionTime': u'2017-02-04T08:04:30Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'Ready', u'lastTransitionTime': u'2017-02-04T08:04:32Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'PodScheduled', u'lastTransitionTime': u'2017-02-04T08:04:29Z'}]}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'securityContext': {u'fsGroup': 51208, u'runAsUser': 51208}, u'nodeName': u'tools-worker-1026.tools.eqiad.wmflabs', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'hostPath': {u'path': u'/public/dumps/'}, u'name': u'dumps'}, {u'hostPath': {u'path': u'/data/project/'}, u'name': u'home'}, {u'hostPath': {u'path': u'/data/scratch/'}, u'name': u'scratch'}, {u'hostPath': {u'path': u'/var/run/nslcd/socket'}, u'name': u'varrunnslcdsocket-e3lfc'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'webservice', u'volumeMounts': [{u'mountPath': u'/public/dumps/', u'name': u'dumps'}, {u'mountPath': u'/data/project/', u'name': u'home'}, {u'mountPath': u'/data/scratch/', u'name': u'scratch'}, {u'mountPath': u'/var/run/nslcd/socket', u'name': u'varrunnslcdsocket-e3lfc'}], u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'workingDir': u'/data/project/typoscan/', u'command': [u'/usr/bin/webservice-runner', u'--type', u'lighttpd', u'--port', u'8000'], u'env': [{u'name': u'HOME', u'value': u'/data/project/typoscan/'}], u'imagePullPolicy': u'Always', u'securityContext': {u'runAsUser': 51208}, u'ports': [{u'protocol': u'TCP', u'name': u'http', u'containerPort': 8000}], u'resources': {u'requests': {u'cpu': u'125m', u'memory': u'256Mi'}, u'limits': {u'cpu': u'2', u'memory': u'2Gi'}}}]}, u'metadata': {u'name': u'typoscan-3292915034-mzhbw', u'deletionTimestamp': u'2017-02-04T08:05:03Z', u'deletionGracePeriodSeconds': 30, u'labels': {u'pod-template-hash': u'3292915034', u'tools.wmflabs.org/webservice-version': u'1', u'name': u'typoscan', u'tools.wmflabs.org/webservice': u'true'}, u'namespace': u'typoscan', u'resourceVersion': u'79206126', u'generateName': u'typoscan-3292915034-', u'creationTimestamp': u'2017-02-04T08:04:29Z', u'annotations': {u'kubernetes.io/created-by': u'{"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"typoscan","name":"typoscan-3292915034","uid":"8dff911d-eab0-11e6-bab3-fa163edef48a","apiVersion":"extensions","resourceVersion":"79206076"}}\n'}, u'selfLink': u'/api/v1/namespaces/typoscan/pods/typoscan-3292915034-mzhbw', u'uid': u'8e021db1-eab0-11e6-bab3-fa163edef48a'}}
pod.obj.status.phase =  u'Running'
Your webservice is running
tools.typoscan@tools-bastion-03:/tmp/tools-webservice$

but deletionTimestamp seems to indicate that this pod is being deleted. On second thought, this seems irrelevant:

Currently the test for "pod running" is masked by if svc and deployment; i. e., if we would first test for a running pod → STATE_RUNNING, then, if there are no pods but svc and deployment exist → STATE_PENDING, otherwise → STATE_STOPPED, that should work?

AFAICT this patch works:

diff --git a/toollabs/webservice/backends/kubernetesbackend.py b/toollabs/webservice/backends/kubernetesbackend.py
index 39a6bbd..808f43c 100644
--- a/toollabs/webservice/backends/kubernetesbackend.py
+++ b/toollabs/webservice/backends/kubernetesbackend.py
@@ -380,15 +380,18 @@ class KubernetesBackend(Backend):
         self._delete_obj(pykube.Pod, self.webservice_label_selector)
 
     def get_state(self):
+        pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
+        if pod is not None:
+            if pod.obj['status']['phase'] == 'Running':
+                return Backend.STATE_RUNNING
+            elif pod.obj['status']['phase'] == 'Pending':
+                return Backend.STATE_PENDING
         svc = self._find_obj(pykube.Service, self.webservice_label_selector)
         deployment = self._find_obj(pykube.Deployment, self.webservice_label_selector)
         if svc is not None and deployment is not None:
-            pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
-            if pod is not None:
-                if pod.obj['status']['phase'] == 'Running':
-                    return Backend.STATE_RUNNING
             return Backend.STATE_PENDING
-        return Backend.STATE_STOPPED
+        else:
+            return Backend.STATE_STOPPED
 
     def shell(self):
         podSpec = self._get_shell_pod()

The difference in behaviour is that webservice stop will wait until the pod is actually deleted and not just fire off the command and exit. I think this is a feature not only because that is what the user expects but also because it is the same logic as gridenginebackend.py's:

def get_state(self):
    job = self._get_job_xml()
    if job is not None:
        state = job.findtext('.//state').lower()
        if 'r' in state:
            return Backend.STATE_RUNNING
        else:
            return Backend.STATE_PENDING
    return Backend.STATE_STOPPED
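
For reference, the wait_for() helper that appears in every traceback polls a predicate once per second, roughly like this (reconstructed from the tracebacks and the time.sleep(1) diff further down, not the verbatim toollabs/common/utils.py):

import sys
import time

def wait_for(predicate, message, timeout=15):
    # Print the message, then one dot per poll until the predicate
    # holds. The timeout value and failure behaviour here are guesses.
    sys.stdout.write(message)
    for _ in range(timeout):
        if predicate():
            sys.stdout.write('\n')
            return
        sys.stdout.write('.')
        sys.stdout.flush()
        time.sleep(1)
    raise Exception('timed out waiting: ' + message)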

I got the same error by doing:

webservice --backend=kubernetes python2 start

Afterwards, when I did the following...

webservice stop; webservice --backend=kubernetes python2 start

... I got...

Your webservice is not running
Your job is already running

EDIT: Had to do

webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start

to get

Stopping webservice
Starting webservice..

> I got the same error. Afterwards, when I did the following...
>
> webservice stop; webservice --backend=kubernetes python2 start
>
> ... I got...
>
> Your webservice is not running
> Your job is already running

For future reference, that is the behavior when the service.manifest gets confused. The solution is: webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start

(Triaging to High because we are just gonna receive more and more reports of this if we don't fix this early)

@scfc Looks like this bug is a continuing problem. If that patch works, perhaps the review process can begin via Gerrit?

Change 336049 had a related patch set uploaded (by Tim Landscheidt):
Wait for the Kubernetes pod to shut down after "stop"

https://gerrit.wikimedia.org/r/336049

I have installed https://integration.wikimedia.org/ci/job/debian-glue/608/artifact/toollabs-webservice_0.33%7Edev+0%7E20170205034157.608+trusty%7E1.gbpb2f82f_all.deb on tools-bastion-03 and tools-services-02, the current host of webservicemonitor. I have tested some basics, and if in the next few days there are no hiccups, I'll merge the commit.

Puppet wanted to downgrade tools-webservice, so I put the dev version in aptly and ran Puppet on all hosts.

Change 336049 merged by jenkins-bot:
Wait for the Kubernetes pod to shut down after "stop"

https://gerrit.wikimedia.org/r/336049

JustBerry reopened this task as Open. Edited Feb 15 2017, 3:13 AM

webservice stop in ~ ((venv)tools.spiarticleanalyzer@tools-bastion-03:~$) still yields

(venv)tools.spiarticleanalyzer@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 140, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

The patched version still has the problem of failing if pod = self._find_obj(pykube.Pod, self.webservice_label_selector) matches more than one pod.

_find_obj has handling for the possibility of pykube.exceptions.ObjectDoesNotExist, but it does not handle the ValueError that will be raised if pykube.Query.get() matches more than one object.
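
For illustration, _find_obj() catches only the zero-match case; a defensive variant would have to catch the multi-match ValueError as well (a sketch only, not the fix that was eventually merged):

def _find_obj(self, kind, selector):
    """Return the single object of kind matching selector, or None."""
    try:
        return kind.objects(self.api).filter(
            namespace=self.tool.name,
            selector=selector,
        ).get()
    except pykube.exceptions.ObjectDoesNotExist:
        return None
    except ValueError:
        # More than one match: swallowing it here would stop the crash
        # but hide the duplicate-object problem discussed below.
        return None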

How can there be more than one webservice pod? That should be impossible to achieve, and IMNSHO that should require manual intervention.

(In the course of debugging this just now, I ran . ./www/python/venv/bin/activate and webservice stop – successfully. I guessed that I needed to run webservice --backend=kubernetes python2 start to restart and did that; now http://tools.wmflabs.org/spiarticleanalyzer/ gives 404. @JustBerry, if I broke something, sorry.)

@scfc No worries, but http://tools.wmflabs.org/spiarticleanalyzer/ seems to work fine for me... (perhaps you fixed it?). By the way, the error doesn't show up all the time, but shows up from time to time. Thought it worthwhile to at least mention it on this task.

> @scfc No worries, but http://tools.wmflabs.org/spiarticleanalyzer/ seems to work fine for me... (perhaps you fixed it?). By the way, the error doesn't show up all the time, but shows up from time to time. Thought it worthwhile to at least mention it on this task.

Reporting bugs is always good, especially if someone claimed to have fixed one :-). I can reproduce the issue (sometimes), though I don't have any idea why it could happen. I'll give it a bit more thought. For a moment I suspected webservicemonitor of having restarted the service in parallel, but that would have logged a line in service.log. *argl*

scfc removed scfc as the assignee of this task. Edited Feb 28 2017, 7:37 PM

(Unassigning because I don't understand the failure mode; or rather, if my understanding were correct, the bug would have been fixed. Perhaps a fresh pair of eyes will help.)

zhuyifei1999 lowered the priority of this task from High to Medium. Mar 1 2017, 9:51 AM

@JustBerry Do you have the steps with which you reproduced the issue? I.e., what did you execute (and from which host) prior to running webservice stop on tools-bastion-03?

(Lowering priority as few further reports have been made on this issue)

tools.checker@tools-bastion-02:~/www/python/src$ webservice restart
Restarting webservice.Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.51934.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

@Legoktm What's the output of kubectl get pods, kubectl get services, kubectl get deployments? I'll note that unlike the original exception, which happened while starting the service, this one happened while stopping it.

Sorry, I needed to get the tool back up. I'll remember to check those for next time.

Reproduced:

(venv)tools.video2commons-test@tools-bastion-02:~$ for i in {1..5}; do kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 stop; kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 start; kubectl get pods; kubectl get services; kubectl get deployments; done
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Your webservice is not running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Starting webservice...
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Running 0 3s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.243 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 4s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Running 0 4s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.243 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 4s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 21s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 22s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 23s
Stopping webservice............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Starting webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 0/1 ContainerCreating 0 16s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.233 <none> 8000/TCP 16s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 16s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 0/1 ContainerCreating 0 16s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.233 <none> 8000/TCP 16s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 17s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 33s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 33s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 34s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 50s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 51s
(venv)tools.video2commons-test@tools-bastion-02:~$ for i in {1..5}; do kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 restart; kubectl get pods; kubectl get services; kubectl get deployments; done
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Your job is not running, starting.........
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Running 0 10s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.230 <none> 8000/TCP 10s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 10s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Running 0 10s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.230 <none> 8000/TCP 10s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 10s
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 27s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 0s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 0s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 2s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 28s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 1s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 1s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 2s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 29s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 2s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 2s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 3s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 30s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 3s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 3s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 4s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 31s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 4s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 5s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 32s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 5s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 5s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 6s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 33s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 6s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 6s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 7s

Actually, no. This traceback is the original traceback.

I get these while stopping but not while starting, with the same traceback:

tools.yabbr@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 144, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 401, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

@APerson What's the output of kubectl get pods; kubectl get services; kubectl get deployments?

@zhuyifei1999 (sorry for the delayed response!)

NAME                     READY     STATUS    RESTARTS   AGE
interactive              1/1       Running   0          3d
yabbr-1590617135-0euto   1/1       Running   0          3h
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
yabbr     192.168.98.49   <none>        8000/TCP   3h
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
yabbr     1         1         1            1           3h

Yep I'm encountering this all the time.

NAME                                  READY     STATUS              RESTARTS   AGE
interactive                           1/1       Running             0          16h
video2commons-test-3674141639-bb1ho   1/1       Terminating         0          40s
video2commons-test-3674141639-wl0iv   0/1       ContainerCreating   0          0s

Can we just change [[https://github.com/wikimedia/operations-software-tools-webservice/blob/7e8da902d4ac6a7bc05f548ffa7bb3f6267e65ed/toollabs/webservice/backends/kubernetesbackend.py#L405|elif pod.obj['status']['phase'] == 'Pending':]] into else:? A Terminating state should definitely not be Backend.STATE_STOPPED.
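
The suggested one-line change, sketched against the get_state() shape quoted earlier in this task:

def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        else:  # was: elif pod.obj['status']['phase'] == 'Pending':
            # Any non-Running phase is reported as pending instead of
            # falling through to STATE_STOPPED.
            return Backend.STATE_PENDING
    return Backend.STATE_STOPPED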

I found a new way this can be broken today. While doing some CrashLoopBackOff state cleanup I found a tool which had 1 pod, no deployments, and 7 replicasets. Running webservice stop was crashing with the "get() more than one object; use filter" error and I'm very sure it was due to the multiple replicaset objects in the namespace. This looked to be the result of some cluster state corruption from ~1 year ago.

@Nehajha has been looking into this as part of her work on T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command. She has been having difficulty reproducing the problem in her testing environment, so if anyone has tips on what seems to make it happen more often please do report them here.

Mentioned in SAL (#wikimedia-cloud) [2018-07-22T14:28:51Z] <zhuyifei1999_> backed up Neha16's changes to toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs to toollabs.bak in the same dir via cp -a, and re-install webservice code on the bastion to debug T156626

I brute-forced this issue:

Attempting my luck... no good.

toolsbeta.test@toolsbeta-bastion-01:~$ kubectl get po; kubectl get svc; kubectl get deploy
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-slwhl   1/1       Running   0          1d
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.40.242   <none>        8000/TCP   1d
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           1d
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....

Then I applied the patch to make the script check statuses more frequently:

zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo nano common/utils.py
[...]
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-vb02z   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.14.193   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
NAME                    READY     STATUS        RESTARTS   AGE
test-1089897043-vb02z   1/1       Terminating   0          5s

So we hit an unmerged workaround (https://gerrit.wikimedia.org/r/#/c/operations/software/tools-webservice/+/444879/); I have to use unmodified code to see if it's the same error:

zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo cp -a toollabs toollabs.bak
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo apt install --reinstall toollabs-webservice
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  linux-image-3.13.0-151-generic linux-image-extra-3.13.0-151-generic
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 6 not upgraded.
Need to get 0 B/17.6 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 197537 files and directories currently installed.)
Preparing to unpack .../toollabs-webservice_0.39_all.deb ...
Unpacking toollabs-webservice (0.39) over (0.39) ...
Setting up toollabs-webservice (0.39) ...
N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ cd toollabs
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ patch common/utils.py
[...]
File common/utils.py is read-only; trying to patch anyway
patch: **** Can't create temporary file common/utils.py.ouA6Nn0 : Permission denied
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo patch common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)
patching file common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)

And indeed it's the same issue:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
test      192.168.62.70   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   1/1       Terminating         0          7s
test-1089897043-qmza8   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.103.35   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-iwu5t   0/1       ContainerCreating   0          1s
test-1089897043-qmza8   1/1       Terminating         0          45s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.43.160   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s

However, I don't understand how the second command in the above paste worked, considering the output of kubectl get pods is almost the same. Fortunately, the majority of the invocations thereafter produce the same STATE_RUNNING traceback.

I am unable to reproduce the traceback on STATE_STOPPED. My initial theory of restarting webservice too fast doesn't work:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 175, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-722xn   1/1       Terminating         0          1m
test-1089897043-iwcfv   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.96.120   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           0s

I cannot figure out how job.get_state() succeeds in <module> (one match only) but turns into multiple matches after stopping. More playing around:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 5; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-exein   1/1       Running   0          6s
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-ddhtc   0/1       ContainerCreating   0          1s
test-1089897043-exein   1/1       Terminating         0          7s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.17.207   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 1; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-7oo7w   0/1       ContainerCreating   0          2s
Your job is not running, starting.........
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-7oo7w   1/1       Running   0          2s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.58.103   <none>        8000/TCP   2s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           3s

Change 450495 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching

https://gerrit.wikimedia.org/r/450495

I did a bit of hacking on the kubernetesbackend.py file on toolsbeta-bastion-01. I got webservice restart to output more useful information about the state of duplicate objects:

==================================================
<Pod test-1089897043-c1itd>
{   u'conditions': [   {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                           u'status': u'True',
                           u'type': u'Initialized'},
                       {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:49Z',
                           u'status': u'True',
                           u'type': u'Ready'},
                       {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                           u'status': u'True',
                           u'type': u'PodScheduled'}],
    u'containerStatuses': [   {   u'containerID': u'docker://18babad048d50e121065945f68e56d58527ac8915be5cf1aaa7616fc44e2fa3d',
                                  u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest',
                                  u'imageID': u'docker://sha256:b6b7733f0d037e5b8b1d301cceda9331f336934496e28448b93b977fab8f67b9',
                                  u'lastState': {   },
                                  u'name': u'webservice',
                                  u'ready': True,
                                  u'restartCount': 0,
                                  u'state': {   u'running': {   u'startedAt': u'2018-08-06T00:26:48Z'}}}],
    u'hostIP': u'10.68.18.110',
    u'phase': u'Running',
    u'podIP': u'192.168.215.3',
    u'startTime': u'2018-08-06T00:26:47Z'}
==================================================

==================================================
<Pod test-1089897043-ontg3>
{   u'phase': u'Pending'}
==================================================

I'm using the pprint module to print out the Pod.obj['status'] internal state. I guessed this would be where the useful information would be, based on the Pod.ready synthetic property from the upstream library.
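
The dumps above can be reproduced with something like this (a sketch; the api handle is a pykube.HTTPClient built from the tool's kubeconfig, as elsewhere in the backend):

import pprint

import pykube

def dump_pod_statuses(api, namespace, selector):
    # Print Pod.obj['status'] for every pod the label selector matches,
    # mirroring the "=" separators and indent=4 layout shown above.
    for pod in pykube.Pod.objects(api).filter(
            namespace=namespace, selector=selector):
        print('=' * 50)
        print(repr(pod))
        pprint.pprint(pod.obj['status'], indent=4)
        print('=' * 50)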

I did some more hacking after this and wrote:

def get_state(self):
    """
    Returns state of webservice.

    One of:
    - Backend.STATE_STOPPED
    - Backend.STATE_RUNNING
    - Backend.STATE_PENDING
    """
    def pod_cmp(a, b):
        # Pending < Running < [anything else]
        phase_a = a.obj['status']['phase']
        phase_b = b.obj['status']['phase']
        if phase_a == phase_b:
            return 0
        if phase_a == 'Pending':
            return -1
        if phase_b == 'Pending':
            return 1
        if phase_a == 'Running':
            return -1
        if phase_b == 'Running':
            return 1
        return -1 if phase_a < phase_b else 1

    pods = pykube.Pod.objects(self.api).filter(
        namespace=self.tool.name,
        selector=self.webservice_label_selector
    )
    if pods:
        for p in pods:
            # FIXME: debug print to be removed
            print('{} {} {}'.format(
                len(pods), repr(p), p.obj['status']['phase']),
                file=sys.stderr)
        pod = sorted([p for p in pods], pod_cmp)[0]
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        if pod.obj['status']['phase'] == 'Pending':
            return Backend.STATE_PENDING
    svc = self._find_obj(pykube.Service, self.webservice_label_selector)
    deployment = self._find_obj(
        pykube.Deployment, self.webservice_label_selector)
    if svc is not None and deployment is not None:
        return Backend.STATE_PENDING
    else:
        return Backend.STATE_STOPPED

This seemed to work for basic tests, but I never saw anything in the 'Terminating' state to be sure. I decided to change the "test" tool to the Kubernetes python backend and write a WSGI app, with the app.py below, that would be slow to shut down, for further testing:

app.py
from __future__ import print_function

import atexit
import sys
import time

import flask

app = flask.Flask(__name__)


@app.route('/')
def index():
    return 'Hello World!'


def nap():
    print('Shutting down...', file=sys.stderr)
    time.sleep(60)


atexit.register(nap)

With the get_state() changes I made above and this webservice, things almost work as hoped. The initial webservice start is good. webservice stop works as expected as well, except that it returns while there is still a pod in the 'Terminating' state according to kubectl get pods. webservice restart works as long as there is no pod in the 'Terminating' state when you run the command. If there is a pod still terminating, then you get output that looks like this:

$ webservice --backend=kubernetes python restart
2 <Pod test-1203602763-vwvzp> Running
2 <Pod test-1203602763-xvuks> Running
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 178, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 64, in stop
    job.request_stop()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py",
line 417, in request_stop
    self._delete_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py",
line 240, in _delete_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

After adding more debug printing, it appears that Pod.obj['status']['phase'] == 'Running' in many different circumstances, including states where kubectl get pods would say "Terminating". I eventually figured out that "Terminating" is shown when Pod.obj['metadata']['deletionTimestamp'] exists and has a timestamp in it. It turns out that @scfc had figured this out all the way back in T156626#2998399, but then ignored it because he did not quite understand that Kubernetes would start a new pod while there was still one terminating.

Now that I know how to tell if an object is in the process of being deleted, I just needed to figure out how to ignore those objects. I think it is safe to do this on the assumption that Kubernetes will eventually figure out how to kill even runaway processes like the deliberately slow-to-shut-down one I wrote.

I played around a bit with the fieldSelector parameter for the search, but I could not figure out how to use it to ensure that metadata.deletionTimestamp was either missing or empty. The search results can easily be treated as a normal python list however (we've been doing that for debugging above), so I decided to post-process the search results and throw out the objects that have a deletionTimestamp:

objs = [
    o for o in objs
    if o.obj['metadata'].get('deletionTimestamp', None) is None
]

Once I had done this I had to also recreate the logic for checking for no results and multiple results that get() was doing for us before:

found = len(objs)
if found == 0:
    return None
elif found == 1:
    return objs[0]
else:
    raise ValueError("get() more than one object; use filter")

Once I had rolled back other changes and put this logic into _find_obj() things were really, really close to working. It turned out however that _delete_obj() would still fail in some circumstances because it was not using _find_obj() and instead searching itself. I rewrote it to use _find_obj():

def _delete_obj(self, kind, selector):
    """
    Delete object of kind matching selector if it exists
    """
    o = self._find_obj(kind, selector)
    if o is not None:
        o.delete()

I have not been able to get things to break again with these changes applied. Running webservice restart many times in a row with the slow to stop python app will leave kubectl get pods showing a lot of pods stuck in "Terminating" state, but they eventually get cleaned up.

The patch for all this is https://gerrit.wikimedia.org/r/450495

Change 450495 merged by jenkins-bot:
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching

https://gerrit.wikimedia.org/r/450495

Mentioned in SAL (#wikimedia-cloud) [2018-08-08T10:00:59Z] <zhuyifei1999_> building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244

zhuyifei1999 assigned this task to bd808.

Let's close this. If anyone encounters this issue again, please reopen.

I'm getting this error:

$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 176, in <module>
    stop(job, 'Stopping webservice')
  File "/usr/local/bin/webservice", line 69, in stop
    job.request_stop()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 436, in request_stop
    pykube.ReplicaSet, 'name={name}'.format(name=self.tool.name))
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 260, in _delete_obj
    o = self._find_obj(kind, selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 252, in _find_obj
    ', '.join(repr(o) for o in objs)
ValueError: Found 2 objects of type ReplicaSet matching selector name=wd-depicts: <ReplicaSet wd-depicts-1035148579>, <ReplicaSet wd-depicts-1769414028>. See https://phabricator.wikimedia.org/T156626

Any clues?

$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1035148579   0         0         0         7d
wd-depicts-1769414028   1         1         1         7d
$ kubectl delete replicaset/wd-depicts-1035148579
replicaset "wd-depicts-1035148579" deleted
$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1769414028   1         1         1         7d
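
A sketch in the spirit of the patch announced below (not its actual diff): return every live match and delete them all, rather than insisting on exactly one object per kind:

def _find_objs(self, kind, selector):
    # All objects of kind matching selector that are not already
    # being deleted (no metadata.deletionTimestamp).
    return [
        o for o in kind.objects(self.api).filter(
            namespace=self.tool.name, selector=selector)
        if o.obj['metadata'].get('deletionTimestamp') is None
    ]

def _delete_objs(self, kind, selector):
    for o in self._find_objs(kind, selector):
        o.delete()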

Change 566045 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/software/tools-webservice@master] [RFC] kubernetes: add support for multiple objects of any kind

https://gerrit.wikimedia.org/r/566045

Change 566045 merged by jenkins-bot:
[operations/software/tools-webservice@master] kubernetes: add support for multiple objects of any kind

https://gerrit.wikimedia.org/r/566045

Getting this error for mix-n-match tool.

> Getting this error for mix-n-match tool.

$ sudo become mix-n-match
$ kubectl get all
NAME                               READY   STATUS    RESTARTS   AGE
pod/mix-n-match-6c888b79f5-686w4   1/1     Running   0          22d
pod/mix-n-match-6c888b79f5-zlcw2   1/1     Running   0          9d


NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/mix-n-match   ClusterIP   10.100.151.252   <none>        8000/TCP   22d


NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mix-n-match   2/2     2            2           22d

NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/mix-n-match-6c888b79f5   2         2         2       22d

It looks like kubectl scale has been used to increase the Pod count for mix-n-match. I will do some testing with another tool to see if I can figure out why the recent updates to webservice are not handling this as an expected case.
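
If scaled-up deployments are to be tolerated, state detection has to accept several live pods; a hypothetical sketch, reusing the _find_objs() idea from the sketch above:

def get_state(self):
    # With replicas > 1 there is no single webservice pod: report
    # RUNNING if any live pod runs, PENDING if any pod exists at all.
    pods = self._find_objs(pykube.Pod, self.webservice_label_selector)
    if any(p.obj['status']['phase'] == 'Running' for p in pods):
        return Backend.STATE_RUNNING
    if pods:
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED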

Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:18:57Z] <bstorm_> upgraded toollabs-webservice package for stretch toolsbeta to 0.62 T244293 T244289 T234617 T156626

Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:25:10Z] <bstorm_> upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626