
k8s webservice restart failure with `ValueError: get() more than one object; use filter`
Closed, Resolved · Public

Description

tools.precise-tools@tools-bastion-05:~/www/python/src$ git apply ~/169b40bff497eff780df28ba010cdf3564e62cdd.patch; webservice restart
Restarting webserviceTraceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 369, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 185, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
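
For context, pykube's Query.get() enforces exactly-one-match semantics, roughly like this (a paraphrase of the behaviour implied by the traceback, not the verbatim pykube source):

# Paraphrase of pykube's Query.get() semantics (not the verbatim source):
# zero matches raise ObjectDoesNotExist (which webservice's _find_obj()
# catches), but two or more matches raise the ValueError seen above.
def get(self, *args, **kwargs):
    clone = self.filter(*args, **kwargs)  # narrow the query by selector
    objects = list(clone)                 # evaluate the API call
    if len(objects) == 1:
        return objects[0]
    if not objects:
        raise ObjectDoesNotExist('get() returned zero objects')
    raise ValueError("get() more than one object; use filter")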

Related Objects

Mentioned In
T234617: Toolforge. introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T158244: Improve `webservice status` output
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T190696: [Gsoc 2018] Proposal for Toolforge webservice command Improvement
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
T140415: `webservice restart` does not always wait for service to stop before trying to start again
T177603: Proposal: Improvements for the Toolforge 'webservice' command
T175768: Improvements for the Toolforge 'webservice' command
T160353: Still issues with node.js webservice
T158152: tool.spiarticleanalyzer: webservice stop yields ValueError: get() more than one object
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)
Mentioned Here
T234617: Toolforge. introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T158244: Improve `webservice status` output
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
P5016 k8s for i in {1..5} webservice stop; webservice start and webservice restart with kubectl get {pods,services,deployments} for debugging T156626
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)

Event Timeline


This is an annoying bug for experienced users and can be insurmountable for newbies. Let's not make the solution more complex than it needs to be.

CMIIW: When webservice stop has finished (with a Kubernetes backend), kubectl get pods should not list any webservice pods, so that would be the testable requirement. So maybe add a STATE_STOPPING for webservices when they are Terminating, but not gone yet. The logic in `webservice`'s stop() can then remain the same.
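
A minimal sketch of that idea, assuming the pykube-based backend quoted later in this task (STATE_STOPPING is hypothetical; the real Backend class only defines STOPPED, PENDING and RUNNING):

def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        # Hypothetical: a pod that is shutting down carries a
        # metadata.deletionTimestamp (see the debugging further down).
        if pod.obj['metadata'].get('deletionTimestamp'):
            return Backend.STATE_STOPPING  # does not exist today
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED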

After cloning the repository and adding some prints, the output of PYTHONPATH=. scripts/webservice --backend=kubernetes stop && kubectl get pods is:

svc =  <Service typoscan>
deployment =  <Deployment typoscan>
pod =  <Pod typoscan-3292915034-mokde>
pod.obj.status.phase =  u'Running'
Stopping webservicesvc =  None
deployment =  None

NAME                        READY     STATUS        RESTARTS   AGE
typoscan-3292915034-mokde   1/1       Terminating   0          30s

I. e., "Terminating" pods have svc and deployment set to None = tough luck.

So let's see what webservice start is looking at.

(Fun fact: Starting webservice start from a home directory on NFS is enough to cause a sufficient delay for the bug not to show up. Moving to /tmp works reliably for reproducing.)

> I. e., "Terminating" pods have svc and deployment set to None = tough luck.

FYI T156626#2984863

> So let's see what webservice start is looking at.

It is checking the status of the pod, but it cannot determine that since there are multiple pods running.

AFAICT a "Terminating" pod has pod.obj['status']['phase'] set to Running:

tools.typoscan@tools-bastion-03:/tmp/tools-webservice$ webservice --backend=kubernetes php5.6 start && webservice --backend=kubernetes stop && kubectl get pods && PYTHONPATH=. scripts/webservice --backend=kubernetes php5.6 status
Starting webservice...
Stopping webservice
NAME                        READY     STATUS        RESTARTS   AGE
typoscan-3292915034-mzhbw   1/1       Terminating   0          5s
<Pod typoscan-3292915034-mzhbw>
pod =  <Pod typoscan-3292915034-mzhbw>
pod.obj =  {u'status': {u'containerStatuses': [{u'restartCount': 0, u'name': u'webservice', u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'imageID': u'docker://sha256:3b9bc605cc9a8c4f379979fcac5d8b7a9b8024411a42f016c63a72af1b1aefd2', u'state': {u'running': {u'startedAt': u'2017-02-04T08:04:32Z'}}, u'ready': True, u'lastState': {}, u'containerID': u'docker://3d1488dcdd791a3e18c5b5659f267b674f748f8cecadbd3c37b3dead8cd45428'}], u'podIP': u'192.168.112.3', u'startTime': u'2017-02-04T08:04:30Z', u'hostIP': u'10.68.22.168', u'phase': u'Running', u'conditions': [{u'status': u'True', u'lastProbeTime': None, u'type': u'Initialized', u'lastTransitionTime': u'2017-02-04T08:04:30Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'Ready', u'lastTransitionTime': u'2017-02-04T08:04:32Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'PodScheduled', u'lastTransitionTime': u'2017-02-04T08:04:29Z'}]}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'securityContext': {u'fsGroup': 51208, u'runAsUser': 51208}, u'nodeName': u'tools-worker-1026.tools.eqiad.wmflabs', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'hostPath': {u'path': u'/public/dumps/'}, u'name': u'dumps'}, {u'hostPath': {u'path': u'/data/project/'}, u'name': u'home'}, {u'hostPath': {u'path': u'/data/scratch/'}, u'name': u'scratch'}, {u'hostPath': {u'path': u'/var/run/nslcd/socket'}, u'name': u'varrunnslcdsocket-e3lfc'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'webservice', u'volumeMounts': [{u'mountPath': u'/public/dumps/', u'name': u'dumps'}, {u'mountPath': u'/data/project/', u'name': u'home'}, {u'mountPath': u'/data/scratch/', u'name': u'scratch'}, {u'mountPath': u'/var/run/nslcd/socket', u'name': u'varrunnslcdsocket-e3lfc'}], u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'workingDir': u'/data/project/typoscan/', u'command': [u'/usr/bin/webservice-runner', u'--type', u'lighttpd', u'--port', u'8000'], u'env': [{u'name': u'HOME', u'value': u'/data/project/typoscan/'}], u'imagePullPolicy': u'Always', u'securityContext': {u'runAsUser': 51208}, u'ports': [{u'protocol': u'TCP', u'name': u'http', u'containerPort': 8000}], u'resources': {u'requests': {u'cpu': u'125m', u'memory': u'256Mi'}, u'limits': {u'cpu': u'2', u'memory': u'2Gi'}}}]}, u'metadata': {u'name': u'typoscan-3292915034-mzhbw', u'deletionTimestamp': u'2017-02-04T08:05:03Z', u'deletionGracePeriodSeconds': 30, u'labels': {u'pod-template-hash': u'3292915034', u'tools.wmflabs.org/webservice-version': u'1', u'name': u'typoscan', u'tools.wmflabs.org/webservice': u'true'}, u'namespace': u'typoscan', u'resourceVersion': u'79206126', u'generateName': u'typoscan-3292915034-', u'creationTimestamp': u'2017-02-04T08:04:29Z', u'annotations': {u'kubernetes.io/created-by': u'{"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"typoscan","name":"typoscan-3292915034","uid":"8dff911d-eab0-11e6-bab3-fa163edef48a","apiVersion":"extensions","resourceVersion":"79206076"}}\n'}, u'selfLink': u'/api/v1/namespaces/typoscan/pods/typoscan-3292915034-mzhbw', u'uid': u'8e021db1-eab0-11e6-bab3-fa163edef48a'}}
pod.obj.status.phase =  u'Running'
Your webservice is running
tools.typoscan@tools-bastion-03:/tmp/tools-webservice$

but deletionTimestamp seems to indicate that this pod is being deleted. On second thought, this seems irrelevant:

Currently the test for "pod running" is masked by if svc and deployment; i. e., if we would first test for a running pod → STATE_RUNNING, then, if there are no pods but svc and deployment exist → STATE_PENDING, otherwise → STATE_STOPPED, that should work?

AFAICT this patch works:

diff --git a/toollabs/webservice/backends/kubernetesbackend.py b/toollabs/webservice/backends/kubernetesbackend.py
index 39a6bbd..808f43c 100644
--- a/toollabs/webservice/backends/kubernetesbackend.py
+++ b/toollabs/webservice/backends/kubernetesbackend.py
@@ -380,15 +380,18 @@ class KubernetesBackend(Backend):
         self._delete_obj(pykube.Pod, self.webservice_label_selector)
 
     def get_state(self):
+        pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
+        if pod is not None:
+            if pod.obj['status']['phase'] == 'Running':
+                return Backend.STATE_RUNNING
+            elif pod.obj['status']['phase'] == 'Pending':
+                return Backend.STATE_PENDING
         svc = self._find_obj(pykube.Service, self.webservice_label_selector)
         deployment = self._find_obj(pykube.Deployment, self.webservice_label_selector)
         if svc is not None and deployment is not None:
-            pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
-            if pod is not None:
-                if pod.obj['status']['phase'] == 'Running':
-                    return Backend.STATE_RUNNING
             return Backend.STATE_PENDING
-        return Backend.STATE_STOPPED
+        else:
+            return Backend.STATE_STOPPED
 
     def shell(self):
         podSpec = self._get_shell_pod()

The difference in behaviour is that webservice stop will wait until the pod is actually deleted and not just fire off the command and exit. I think this is a feature not only because that is what the user expects but also because it is the same logic as gridenginebackend.py's:

def get_state(self):
    job = self._get_job_xml()
    if job is not None:
        state = job.findtext('.//state').lower()
        if 'r' in state:
            return Backend.STATE_RUNNING
        else:
            return Backend.STATE_PENDING
    return Backend.STATE_STOPPED
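
For reference, the wait_for() helper that appears in every traceback polls a predicate once per second, roughly like this (reconstructed from the tracebacks and the time.sleep(1) diff further down, not the verbatim toollabs/common/utils.py):

import sys
import time

def wait_for(predicate, message, timeout=15):
    # Print the message, then one dot per poll until the predicate
    # holds. The timeout value and failure behaviour here are guesses.
    sys.stdout.write(message)
    for _ in range(timeout):
        if predicate():
            sys.stdout.write('\n')
            return
        sys.stdout.write('.')
        sys.stdout.flush()
        time.sleep(1)
    raise Exception('timed out waiting: ' + message)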

I got the same error by doing:

webservice --backend=kubernetes python2 start

Afterwards, when I did the following...

webservice stop; webservice --backend=kubernetes python2 start

... I got...

Your webservice is not running
Your job is already running

EDIT: Had to do

webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start

to get

Stopping webservice
Starting webservice..

> I got the same error. Afterwards, when I did the following...
>
> webservice stop; webservice --backend=kubernetes python2 start
>
> ... I got...
>
> Your webservice is not running
> Your job is already running

For future reference, that is the behavior when the service.manifest gets confused. The solution is: webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start

(Triaging to High because we are just gonna receive more and more reports of this if we don't fix this early)

@scfc Looks like this bug is a continuing problem. If that patch works, perhaps the review process can begin via Gerrit?

Change 336049 had a related patch set uploaded (by Tim Landscheidt):
Wait for the Kubernetes pod to shut down after "stop"

https://gerrit.wikimedia.org/r/336049

I have installed https://integration.wikimedia.org/ci/job/debian-glue/608/artifact/toollabs-webservice_0.33%7Edev+0%7E20170205034157.608+trusty%7E1.gbpb2f82f_all.deb on tools-bastion-03 and tools-services-02, the current host of webservicemonitor. I have tested some basics, and if in the next few days there are no hiccups, I'll merge the commit.

Puppet wanted to downgrade tools-webservice, so I put the dev version in aptly and ran Puppet on all hosts.

Change 336049 merged by jenkins-bot:
Wait for the Kubernetes pod to shut down after "stop"

https://gerrit.wikimedia.org/r/336049

JustBerry reopened this task as Open. Edited Feb 15 2017, 3:13 AM

webservice stop in ~ ((venv)tools.spiarticleanalyzer@tools-bastion-03:~$) still yields

(venv)tools.spiarticleanalyzer@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 140, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

The patched version still has the problem of failing if pod = self._find_obj(pykube.Pod, self.webservice_label_selector) matches more than one pod.

_find_obj has handling for the possibility of pykube.exceptions.ObjectDoesNotExist, but it does not handle the ValueError that will be raised if pykube.Query.get() matches more than one object.
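
For illustration, _find_obj() catches only the zero-match case; a defensive variant would have to catch the multi-match ValueError as well (a sketch only, not the fix that was eventually merged):

def _find_obj(self, kind, selector):
    """Return the single object of kind matching selector, or None."""
    try:
        return kind.objects(self.api).filter(
            namespace=self.tool.name,
            selector=selector,
        ).get()
    except pykube.exceptions.ObjectDoesNotExist:
        return None
    except ValueError:
        # More than one match: swallowing it here would stop the crash
        # but hide the duplicate-object problem discussed below.
        return None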

How can there be more than one webservice pod? That should be impossible to achieve, and IMNSHO that should require manual intervention.

(In the course of debugging this just now, I ran . ./www/python/venv/bin/activate and webservice stop – successfully. I guessed that I needed to run webservice --backend=kubernetes python2 start to restart and did that; now http://tools.wmflabs.org/spiarticleanalyzer/ gives 404. @JustBerry, if I broke something, sorry.)

@scfc No worries, but http://tools.wmflabs.org/spiarticleanalyzer/ seems to work fine for me... (perhaps you fixed it?). By the way, the error doesn't show up all the time, but shows up from time to time. Thought it worthwhile to at least mention it on this task.

> @scfc No worries, but http://tools.wmflabs.org/spiarticleanalyzer/ seems to work fine for me... (perhaps you fixed it?). By the way, the error doesn't show up all the time, but shows up from time to time. Thought it worthwhile to at least mention it on this task.

Reporting bugs is always good, especially if someone claimed to have fixed one :-). I can reproduce the issue (sometimes), though I don't have any idea why it could happen. I'll give it a bit more thought. For a moment I suspected webservicemonitor of having restarted the service in parallel, but that would have logged a line in service.log. *argl*

scfc removed scfc as the assignee of this task. Edited Feb 28 2017, 7:37 PM

(Unassigning because I don't understand the failure mode; or rather, if my understanding were correct, the bug would have been fixed. Perhaps a fresh pair of eyes will help.)

zhuyifei1999 lowered the priority of this task from High to Medium. Mar 1 2017, 9:51 AM

@JustBerry Do you have the steps with which you reproduced the issue? I.e., what did you execute (and from which host) prior to running webservice stop on tools-bastion-03?

(Lowering priority as few further reports have been made on this issue)

tools.checker@tools-bastion-02:~/www/python/src$ webservice restart
Restarting webservice.Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.51934.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

@Legoktm What's the output of kubectl get pods, kubectl get services, kubectl get deployments? I'll note that unlike the original exception, which happened while starting the service, this one happened while stopping it.

Sorry, I needed to get the tool back up. I'll remember to check those for next time.

Reproduced:

(venv)tools.video2commons-test@tools-bastion-02:~$ for i in {1..5}; do kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 stop; kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 start; kubectl get pods; kubectl get services; kubectl get deployments; done
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Your webservice is not running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Starting webservice...
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Running 0 3s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.243 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 4s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Running 0 4s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.243 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 4s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 21s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 22s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-6dlc6 1/1 Terminating 0 23s
Stopping webservice............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Starting webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 0/1 ContainerCreating 0 16s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.233 <none> 8000/TCP 16s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 16s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 0/1 ContainerCreating 0 16s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.233 <none> 8000/TCP 16s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 17s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 33s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 33s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 34s
Stopping webservice...............
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 50s
Your job is already running
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-tfcq8 1/1 Terminating 0 51s
(venv)tools.video2commons-test@tools-bastion-02:~$ for i in {1..5}; do kubectl get pods; kubectl get services; kubectl get deployments; webservice --backend=kubernetes python2 restart; kubectl get pods; kubectl get services; kubectl get deployments; done
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
Your job is not running, starting.........
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Running 0 10s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.230 <none> 8000/TCP 10s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 10s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Running 0 10s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.230 <none> 8000/TCP 10s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 1 10s
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 27s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 0s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 0s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 2s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 28s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 1s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 1s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 2s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 29s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 2s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 2s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 3s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 30s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 3s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 3s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 4s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 31s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 4s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 4s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 5s
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 32s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 5s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 5s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 6s
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.52996.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 148, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME READY STATUS RESTARTS AGE
interactive 1/1 Running 0 19d
video2commons-test-3674141639-0u1mo 1/1 Terminating 0 33s
video2commons-test-3674141639-9gck6 0/1 ContainerCreating 0 6s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
video2commons-test 192.168.0.194 <none> 8000/TCP 6s
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
video2commons-test 1 1 1 0 7s

Actually, no. This traceback is the original traceback.

I get these while stopping but not while starting, with the same traceback:

tools.yabbr@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 144, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 401, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

@APerson What's the output of kubectl get pods; kubectl get services; kubectl get deployments?

@zhuyifei1999 (sorry for the delayed response!)

NAME                     READY     STATUS    RESTARTS   AGE
interactive              1/1       Running   0          3d
yabbr-1590617135-0euto   1/1       Running   0          3h
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
yabbr     192.168.98.49   <none>        8000/TCP   3h
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
yabbr     1         1         1            1           3h

Yep I'm encountering this all the time.

NAME                                  READY     STATUS              RESTARTS   AGE
interactive                           1/1       Running             0          16h
video2commons-test-3674141639-bb1ho   1/1       Terminating         0          40s
video2commons-test-3674141639-wl0iv   0/1       ContainerCreating   0          0s

Can we just change [[https://github.com/wikimedia/operations-software-tools-webservice/blob/7e8da902d4ac6a7bc05f548ffa7bb3f6267e65ed/toollabs/webservice/backends/kubernetesbackend.py#L405|elif pod.obj['status']['phase'] == 'Pending':]] into else:? A Terminating state should definitely not be Backend.STATE_STOPPED.
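
The suggested one-line change, sketched against the get_state() shape quoted earlier in this task:

def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        else:  # was: elif pod.obj['status']['phase'] == 'Pending':
            # Any non-Running phase is reported as pending instead of
            # falling through to STATE_STOPPED.
            return Backend.STATE_PENDING
    return Backend.STATE_STOPPED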

I found a new way this can be broken today. While doing some CrashLoopBackOff state cleanup I found a tool which had 1 pod, no deployments, and 7 replicasets. Running webservice stop was crashing with the "get() more than one object; use filter" error and I'm very sure it was due to the multiple replicaset objects in the namespace. This looked to be the result of some cluster state corruption from ~1 year ago.

@Nehajha has been looking into this as part of her work on T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command. She has been having difficulty reproducing the problem in her testing environment, so if anyone has tips on what seems to make it happen more often please do report them here.

Mentioned in SAL (#wikimedia-cloud) [2018-07-22T14:28:51Z] <zhuyifei1999_> backed up Neha16's changes to toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs to toollabs.bak in the same dir via cp -a, and re-install webservice code on the bastion to debug T156626

I brute-forced this issue:

Attempting my luck... no good.

toolsbeta.test@toolsbeta-bastion-01:~$ kubectl get po; kubectl get svc; kubectl get deploy
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-slwhl   1/1       Running   0          1d
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.40.242   <none>        8000/TCP   1d
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           1d
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....

Then I applied the patch to make the script check statuses more frequently:

zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo nano common/utils.py
[...]
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-vb02z   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.14.193   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
NAME                    READY     STATUS        RESTARTS   AGE
test-1089897043-vb02z   1/1       Terminating   0          5s

So we hit an unmerged workaround (https://gerrit.wikimedia.org/r/#/c/operations/software/tools-webservice/+/444879/); I have to use unmodified code to see if it's the same error:

zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo cp -a toollabs toollabs.bak
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo apt install --reinstall toollabs-webservice
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  linux-image-3.13.0-151-generic linux-image-extra-3.13.0-151-generic
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 6 not upgraded.
Need to get 0 B/17.6 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 197537 files and directories currently installed.)
Preparing to unpack .../toollabs-webservice_0.39_all.deb ...
Unpacking toollabs-webservice (0.39) over (0.39) ...
Setting up toollabs-webservice (0.39) ...
N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ cd toollabs
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ patch common/utils.py
[...]
File common/utils.py is read-only; trying to patch anyway
patch: **** Can't create temporary file common/utils.py.ouA6Nn0 : Permission denied
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo patch common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)
patching file common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
<         time.sleep(1)
---
>         time.sleep(0.01)

And indeed it's the same issue:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
test      192.168.62.70   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   1/1       Terminating         0          7s
test-1089897043-qmza8   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.103.35   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-iwu5t   0/1       ContainerCreating   0          1s
test-1089897043-qmza8   1/1       Terminating         0          45s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.43.160   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s

However, I don't understand how the second command in the above paste worked, considering the output of kubectl get pods is almost the same. Fortunately, the majority of the invocations thereafter produce the same STATE_RUNNING traceback.

I am unable to reproduce the traceback on STATE_STOPPED. My initial theory of restarting webservice too fast doesn't work:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 175, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-722xn   1/1       Terminating         0          1m
test-1089897043-iwcfv   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.96.120   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           0s

I cannot figure out how job.get_state() succeeds in <module> (one match only) but turns into multiple matches after stopping. More playing around:

toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 5; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-exein   1/1       Running   0          6s
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-ddhtc   0/1       ContainerCreating   0          1s
test-1089897043-exein   1/1       Terminating         0          7s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.17.207   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 1; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy; 
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-7oo7w   0/1       ContainerCreating   0          2s
Your job is not running, starting.........
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-7oo7w   1/1       Running   0          2s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.58.103   <none>        8000/TCP   2s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           3s

Change 450495 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching

https://gerrit.wikimedia.org/r/450495

I did a bit of hacking on the kubernetesbackend.py file on toolsbeta-bastion-01. I got webservice restart to output more useful information about the state of duplicate objects:

==================================================
<Pod test-1089897043-c1itd>
{   u'conditions': [   {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                           u'status': u'True',
                           u'type': u'Initialized'},
                       {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:49Z',
                           u'status': u'True',
                           u'type': u'Ready'},
                       {   u'lastProbeTime': None,
                           u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                           u'status': u'True',
                           u'type': u'PodScheduled'}],
    u'containerStatuses': [   {   u'containerID': u'docker://18babad048d50e121065945f68e56d58527ac8915be5cf1aaa7616fc44e2fa3d',
                                  u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest',
                                  u'imageID': u'docker://sha256:b6b7733f0d037e5b8b1d301cceda9331f336934496e28448b93b977fab8f67b9',
                                  u'lastState': {   },
                                  u'name': u'webservice',
                                  u'ready': True,
                                  u'restartCount': 0,
                                  u'state': {   u'running': {   u'startedAt': u'2018-08-06T00:26:48Z'}}}],
    u'hostIP': u'10.68.18.110',
    u'phase': u'Running',
    u'podIP': u'192.168.215.3',
    u'startTime': u'2018-08-06T00:26:47Z'}
==================================================

==================================================
<Pod test-1089897043-ontg3>
{   u'phase': u'Pending'}
==================================================

I'm using the pprint module to print out the Pod.obj['status'] internal state. I guessed this would be where the useful information would be, based on the Pod.ready synthetic property from the upstream library.
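
The dumps above can be reproduced with something like this (a sketch; the api handle is a pykube.HTTPClient built from the tool's kubeconfig, as elsewhere in the backend):

import pprint

import pykube

def dump_pod_statuses(api, namespace, selector):
    # Print Pod.obj['status'] for every pod the label selector matches,
    # mirroring the "=" separators and indent=4 layout shown above.
    for pod in pykube.Pod.objects(api).filter(
            namespace=namespace, selector=selector):
        print('=' * 50)
        print(repr(pod))
        pprint.pprint(pod.obj['status'], indent=4)
        print('=' * 50)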

I did some more hacking after this and wrote:

def get_state(self):
    """
    Returns state of webservice.

    One of:
    - Backend.STATE_STOPPED
    - Backend.STATE_RUNNING
    - Backend.STATE_PENDING
    """
    def pod_cmp(a, b):
        # Pending < Running < [anything else]
        phase_a = a.obj['status']['phase']
        phase_b = b.obj['status']['phase']
        if phase_a == phase_b:
            return 0
        if phase_a == 'Pending':
            return -1
        if phase_b == 'Pending':
            return 1
        if phase_a == 'Running':
            return -1
        if phase_b == 'Running':
            return 1
        return -1 if phase_a < phase_b else 1

    pods = pykube.Pod.objects(self.api).filter(
        namespace=self.tool.name,
        selector=self.webservice_label_selector
    )
    if pods:
        for p in pods:
            # FIXME: debug print to be removed
            print('{} {} {}'.format(
                len(pods), repr(p), p.obj['status']['phase']),
                file=sys.stderr)
        pod = sorted([p for p in pods], pod_cmp)[0]
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        if pod.obj['status']['phase'] == 'Pending':
            return Backend.STATE_PENDING
    svc = self._find_obj(pykube.Service, self.webservice_label_selector)
    deployment = self._find_obj(
        pykube.Deployment, self.webservice_label_selector)
    if svc is not None and deployment is not None:
        return Backend.STATE_PENDING
    else:
        return Backend.STATE_STOPPED

This seemed to work for basic tests, but I never saw anything in the 'Terminating' state to be sure. I decided to change the "test" tool to the Kubernetes python backend and write a WSGI app, with the app.py below, that would be slow to shut down, for further testing:

app.py
from __future__ import print_function

import atexit
import sys
import time

import flask

app = flask.Flask(__name__)


@app.route('/')
def index():
    return 'Hello World!'


def nap():
    print('Shutting down...', file=sys.stderr)
    time.sleep(60)


atexit.register(nap)

With the get_state() changes I made above and this webservice, things almost work as hoped. The initial webservice start is good. webservice stop works as expected as well, except that it returns while there is still a pod in the 'Terminating' state according to kubectl get pods. webservice restart works as long as there is no pod in the 'Terminating' state when you run the command. If there is a pod still terminating, then you get output that looks like this:

$ webservice --backend=kubernetes python restart
2 <Pod test-1203602763-vwvzp> Running
2 <Pod test-1203602763-xvuks> Running
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 178, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 64, in stop
    job.request_stop()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py",
line 417, in request_stop
    self._delete_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py",
line 240, in _delete_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter

After adding more debug printing, it appears that Pod.obj['status']['phase'] == 'Running' in many different circumstances, including states where kubectl get pods would say "Terminating". I eventually figured out that "Terminating" is shown when Pod.obj['metadata']['deletionTimestamp'] exists and has a timestamp in it. It turns out that @scfc had figured this out all the way back in T156626#2998399, but then ignored it because he did not quite understand that Kubernetes would start a new pod while there was still one terminating.

Now that I know how to tell if an object is in the process of being deleted, I just needed to figure out how to ignore those objects. I think it is safe to do this on the assumption that Kubernetes will eventually figure out how to kill even runaway processes like the deliberately slow-to-shut-down one I wrote.

I played around a bit with the fieldSelector parameter for the search, but I could not figure out how to use it to ensure that metadata.deletionTimestamp was either missing or empty. The search results can easily be treated as a normal python list however (we've been doing that for debugging above), so I decided to post-process the search results and throw out the objects that have a deletionTimestamp:

objs = [
    o for o in objs
    if o.obj['metadata'].get('deletionTimestamp', None) is None
]

Once I had done this I had to also recreate the logic for checking for no results and multiple results that get() was doing for us before:

found = len(objs)
if found == 0:
    return None
elif found == 1:
    return objs[0]
else:
    raise ValueError("get() more than one object; use filter")

Once I had rolled back other changes and put this logic into _find_obj() things were really, really close to working. It turned out however that _delete_obj() would still fail in some circumstances because it was not using _find_obj() and instead searching itself. I rewrote it to use _find_obj():

def _delete_obj(self, kind, selector):
    """
    Delete object of kind matching selector if it exists
    """
    o = self._find_obj(kind, selector)
    if o is not None:
        o.delete()

I have not been able to get things to break again with these changes applied. Running webservice restart many times in a row with the slow to stop python app will leave kubectl get pods showing a lot of pods stuck in "Terminating" state, but they eventually get cleaned up.

The patch for all this is https://gerrit.wikimedia.org/r/450495

Change 450495 merged by jenkins-bot:
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching

https://gerrit.wikimedia.org/r/450495

Mentioned in SAL (#wikimedia-cloud) [2018-08-08T10:00:59Z] <zhuyifei1999_> building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244

zhuyifei1999 assigned this task to bd808.

Let's close this. If anyone encounters this issue again, please reopen.

I'm getting this error:

$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 176, in <module>
    stop(job, 'Stopping webservice')
  File "/usr/local/bin/webservice", line 69, in stop
    job.request_stop()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 436, in request_stop
    pykube.ReplicaSet, 'name={name}'.format(name=self.tool.name))
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 260, in _delete_obj
    o = self._find_obj(kind, selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 252, in _find_obj
    ', '.join(repr(o) for o in objs)
ValueError: Found 2 objects of type ReplicaSet matching selector name=wd-depicts: <ReplicaSet wd-depicts-1035148579>, <ReplicaSet wd-depicts-1769414028>. See https://phabricator.wikimedia.org/T156626

Any clues?

$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1035148579   0         0         0         7d
wd-depicts-1769414028   1         1         1         7d
$ kubectl delete replicaset/wd-depicts-1035148579
replicaset "wd-depicts-1035148579" deleted
$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1769414028   1         1         1         7d
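
A sketch in the spirit of the patch announced below (not its actual diff): return every live match and delete them all, rather than insisting on exactly one object per kind:

def _find_objs(self, kind, selector):
    # All objects of kind matching selector that are not already
    # being deleted (no metadata.deletionTimestamp).
    return [
        o for o in kind.objects(self.api).filter(
            namespace=self.tool.name, selector=selector)
        if o.obj['metadata'].get('deletionTimestamp') is None
    ]

def _delete_objs(self, kind, selector):
    for o in self._find_objs(kind, selector):
        o.delete()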

Change 566045 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/software/tools-webservice@master] [RFC] kubernetes: add support for multiple objects of any kind

https://gerrit.wikimedia.org/r/566045

Change 566045 merged by jenkins-bot:
[operations/software/tools-webservice@master] kubernetes: add support for multiple objects of any kind

https://gerrit.wikimedia.org/r/566045

Getting this error for mix-n-match tool.

> Getting this error for mix-n-match tool.

$ sudo become mix-n-match
$ kubectl get all
NAME                               READY   STATUS    RESTARTS   AGE
pod/mix-n-match-6c888b79f5-686w4   1/1     Running   0          22d
pod/mix-n-match-6c888b79f5-zlcw2   1/1     Running   0          9d


NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/mix-n-match   ClusterIP   10.100.151.252   <none>        8000/TCP   22d


NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mix-n-match   2/2     2            2           22d

NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/mix-n-match-6c888b79f5   2         2         2       22d

It looks like kubectl scale has been used to increase the Pod count for mix-n-match. I will do some testing with another tool to see if I can figure out why the recent updates to webservice are not handling this as an expected case.
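
If scaled-up deployments are to be tolerated, state detection has to accept several live pods; a hypothetical sketch, reusing the _find_objs() idea from the sketch above:

def get_state(self):
    # With replicas > 1 there is no single webservice pod: report
    # RUNNING if any live pod runs, PENDING if any pod exists at all.
    pods = self._find_objs(pykube.Pod, self.webservice_label_selector)
    if any(p.obj['status']['phase'] == 'Running' for p in pods):
        return Backend.STATE_RUNNING
    if pods:
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED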

Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:18:57Z] <bstorm_> upgraded toollabs-webservice package for stretch toolsbeta to 0.62 T244293 T244289 T234617 T156626

Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:25:10Z] <bstorm_> upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626