tools.precise-tools@tools-bastion-05:~/www/python/src$ git apply ~/169b40bff497eff780df28ba010cdf3564e62cdd.patch; webservice restart
Restarting webserviceTraceback (most recent call last):
  File "/usr/local/bin/webservice", line 152, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 34, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 34, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 369, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 185, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Description
Details
Related Objects
- Mentioned In
T234617: Toolforge: introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T158244: Improve `webservice status` output
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T190696: [Gsoc 2018] Proposal for Toolforge webservice command Improvement
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
T140415: `webservice restart` does not always wait for service to stop before trying to start again
T177603: Proposal: Improvements for the Toolforge 'webservice' command
T175768: Improvements for the Toolforge 'webservice' command
T160353: Still issues with node.js webservice
T158152: tool.spiarticleanalyzer: webservice stop yields ValueError: get() more than one object
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)
- Mentioned Here
T234617: Toolforge: introduce new domain toolforge.org
T244289: Improve limit range management in webservice for Kubernetes
T244293: Add a function to webservice called "migrate" that will push a tool from the old cluster on Kubernetes to the new one
T148872: Make webservice command read default cli arguments from ~/.webservicerc
T158244: Improve `webservice status` output
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
P5016 k8s for i in {1..5} webservice stop; webservice start and webservice restart with kubectl get {pods,services,deployments} for debugging T156626
T156605: "webservice shell" fails with "No such file or directory" (with php5.6)
Event Timeline
This is an annoying bug for experienced users and can be insurmountable for newbies. Let's make the solution to this not more complex than it needs to be.
CMIIW: When webservice stop has finished (with a Kubernetes backend), kubectl get pods should not list any webservice pods, so that would be the testable requirement. So maybe add a STATE_STOPPING for webservices when they are Terminating, but not gone yet. The logic in `webservice`'s stop() can then remain the same.
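For illustration, a rough sketch of what that could look like (Backend.STATE_STOPPING is hypothetical and does not exist in the current code; the deletionTimestamp check is simply how kubectl decides to display "Terminating"):

# Hypothetical sketch only: a distinct "stopping" state for pods that are
# terminating but not yet gone. STATE_STOPPING would have to be added to Backend.
def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        if pod.obj['metadata'].get('deletionTimestamp'):
            # kubectl shows this pod as "Terminating"
            return Backend.STATE_STOPPING
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED

webservice stop could then keep polling until get_state() returns STATE_STOPPED, i.e. until the terminating pod is really gone.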
After cloning the repository and adding some prints, the output of PYTHONPATH=. scripts/webservice --backend=kubernetes stop && kubectl get pods is:
svc = <Service typoscan>
deployment = <Deployment typoscan>
pod = <Pod typoscan-3292915034-mokde>
pod.obj.status.phase = u'Running'
Stopping webservicesvc = None
deployment = None
NAME                        READY     STATUS        RESTARTS   AGE
typoscan-3292915034-mokde   1/1       Terminating   0          30s
I.e., "Terminating" pods already have svc and deployment set to None = tough luck.
(Fun fact: Running webservice start from a home directory on NFS causes enough delay for the bug not to show up. Moving to /tmp reproduces it reliably.)
FYI T156626#2984863
So let's see what webservice start is looking at.
It is checking the status of the pod, but it cannot determine that when more than one pod matches.
AFAICT a "Terminating" pod has pod.obj['status']['phase'] set to Running:
tools.typoscan@tools-bastion-03:/tmp/tools-webservice$ webservice --backend=kubernetes php5.6 start && webservice --backend=kubernetes stop && kubectl get pods && PYTHONPATH=. scripts/webservice --backend=kubernetes php5.6 status Starting webservice... Stopping webservice NAME READY STATUS RESTARTS AGE typoscan-3292915034-mzhbw 1/1 Terminating 0 5s <Pod typoscan-3292915034-mzhbw> pod = <Pod typoscan-3292915034-mzhbw> pod.obj = {u'status': {u'containerStatuses': [{u'restartCount': 0, u'name': u'webservice', u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'imageID': u'docker://sha256:3b9bc605cc9a8c4f379979fcac5d8b7a9b8024411a42f016c63a72af1b1aefd2', u'state': {u'running': {u'startedAt': u'2017-02-04T08:04:32Z'}}, u'ready': True, u'lastState': {}, u'containerID': u'docker://3d1488dcdd791a3e18c5b5659f267b674f748f8cecadbd3c37b3dead8cd45428'}], u'podIP': u'192.168.112.3', u'startTime': u'2017-02-04T08:04:30Z', u'hostIP': u'10.68.22.168', u'phase': u'Running', u'conditions': [{u'status': u'True', u'lastProbeTime': None, u'type': u'Initialized', u'lastTransitionTime': u'2017-02-04T08:04:30Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'Ready', u'lastTransitionTime': u'2017-02-04T08:04:32Z'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'PodScheduled', u'lastTransitionTime': u'2017-02-04T08:04:29Z'}]}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'securityContext': {u'fsGroup': 51208, u'runAsUser': 51208}, u'nodeName': u'tools-worker-1026.tools.eqiad.wmflabs', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'hostPath': {u'path': u'/public/dumps/'}, u'name': u'dumps'}, {u'hostPath': {u'path': u'/data/project/'}, u'name': u'home'}, {u'hostPath': {u'path': u'/data/scratch/'}, u'name': u'scratch'}, {u'hostPath': {u'path': u'/var/run/nslcd/socket'}, u'name': u'varrunnslcdsocket-e3lfc'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'webservice', u'volumeMounts': [{u'mountPath': u'/public/dumps/', u'name': u'dumps'}, {u'mountPath': u'/data/project/', u'name': u'home'}, {u'mountPath': u'/data/scratch/', u'name': u'scratch'}, {u'mountPath': u'/var/run/nslcd/socket', u'name': u'varrunnslcdsocket-e3lfc'}], u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest', u'workingDir': u'/data/project/typoscan/', u'command': [u'/usr/bin/webservice-runner', u'--type', u'lighttpd', u'--port', u'8000'], u'env': [{u'name': u'HOME', u'value': u'/data/project/typoscan/'}], u'imagePullPolicy': u'Always', u'securityContext': {u'runAsUser': 51208}, u'ports': [{u'protocol': u'TCP', u'name': u'http', u'containerPort': 8000}], u'resources': {u'requests': {u'cpu': u'125m', u'memory': u'256Mi'}, u'limits': {u'cpu': u'2', u'memory': u'2Gi'}}}]}, u'metadata': {u'name': u'typoscan-3292915034-mzhbw', u'deletionTimestamp': u'2017-02-04T08:05:03Z', u'deletionGracePeriodSeconds': 30, u'labels': {u'pod-template-hash': u'3292915034', u'tools.wmflabs.org/webservice-version': u'1', u'name': u'typoscan', u'tools.wmflabs.org/webservice': u'true'}, u'namespace': u'typoscan', u'resourceVersion': u'79206126', u'generateName': u'typoscan-3292915034-', u'creationTimestamp': u'2017-02-04T08:04:29Z', u'annotations': {u'kubernetes.io/created-by': u'{"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"typoscan","name":"typoscan-3292915034","uid":"8dff911d-eab0-11e6-bab3-fa163edef48a","apiVersion":"extensions","resourceVersion":"79206076"}}\n'}, u'selfLink': 
u'/api/v1/namespaces/typoscan/pods/typoscan-3292915034-mzhbw', u'uid': u'8e021db1-eab0-11e6-bab3-fa163edef48a'}} pod.obj.status.phase = u'Running' Your webservice is running tools.typoscan@tools-bastion-03:/tmp/tools-webservice$
but deletionTimestamp seems to indicate that this pod is being deleted. On second thought, this seems irrelevant:
Currently the test for "pod running" is masked by if svc and deployment; i.e., if we instead first tested for a running pod → STATE_RUNNING, then for no pods but svc and deployment present → STATE_PENDING, otherwise → STATE_STOPPED, that should work?
AFAICT this patch works:
diff --git a/toollabs/webservice/backends/kubernetesbackend.py b/toollabs/webservice/backends/kubernetesbackend.py
index 39a6bbd..808f43c 100644
--- a/toollabs/webservice/backends/kubernetesbackend.py
+++ b/toollabs/webservice/backends/kubernetesbackend.py
@@ -380,15 +380,18 @@ class KubernetesBackend(Backend):
         self._delete_obj(pykube.Pod, self.webservice_label_selector)
 
     def get_state(self):
+        pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
+        if pod is not None:
+            if pod.obj['status']['phase'] == 'Running':
+                return Backend.STATE_RUNNING
+            elif pod.obj['status']['phase'] == 'Pending':
+                return Backend.STATE_PENDING
         svc = self._find_obj(pykube.Service, self.webservice_label_selector)
         deployment = self._find_obj(pykube.Deployment, self.webservice_label_selector)
         if svc is not None and deployment is not None:
-            pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
-            if pod is not None:
-                if pod.obj['status']['phase'] == 'Running':
-                    return Backend.STATE_RUNNING
             return Backend.STATE_PENDING
-        return Backend.STATE_STOPPED
+        else:
+            return Backend.STATE_STOPPED
 
     def shell(self):
         podSpec = self._get_shell_pod()
The difference in behaviour is that webservice stop will wait until the pod is actually deleted rather than just fire off the command and exit. I think this is a feature, not only because it is what the user expects, but also because it is the same logic as gridenginebackend.py's:
def get_state(self):
    job = self._get_job_xml()
    if job is not None:
        state = job.findtext('.//state').lower()
        if 'r' in state:
            return Backend.STATE_RUNNING
        else:
            return Backend.STATE_PENDING
    return Backend.STATE_STOPPED
I got the same error by doing:
webservice --backend=kubernetes python2 start
Afterwards, when I did the following...
webservice stop; webservice --backend=kubernetes python2 start
... I got...
Your webservice is not running
Your job is already running
EDIT: Had to do
webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start
to get
Stopping webservice
Starting webservice..
For future reference, that is the behavior when the service.manifest gets confused. The solution is: webservice --backend=kubernetes python2 stop; sleep 2; webservice --backend=kubernetes python2 start
(Triaging to High because we are just gonna receive more and more reports of this if we don't fix this early)
@scfc Looks like this bug is a continuing problem. If that patch works, perhaps the review process can begin via Gerrit?
Change 336049 had a related patch set uploaded (by Tim Landscheidt):
Wait for the Kubernetes pod to shut down after "stop"
I have installed https://integration.wikimedia.org/ci/job/debian-glue/608/artifact/toollabs-webservice_0.33%7Edev+0%7E20170205034157.608+trusty%7E1.gbpb2f82f_all.deb on tools-bastion-03 and tools-services-02, the current host of webservicemonitor. I have tested some basics, and if in the next few days there are no hiccups, I'll merge the commit.
Puppet wanted to downgrade tools-webservice, so I put the dev version in aptly and ran Puppet on all hosts.
Change 336049 merged by jenkins-bot:
Wait for the Kubernetes pod to shut down after "stop"
Mentioned in SAL (#wikimedia-labs) [2017-02-07T13:49:32Z] <scfc_de> Deployed toollabs-webservice_0.33_all.deb (T156605, T156626).
webservice stop in ~ ((venv)tools.spiarticleanalyzer@tools-bastion-03:~$) still yields
(venv)tools.spiarticleanalyzer@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 140, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
The patched version still has the problem of failing if pod = self._find_obj(pykube.Pod, self.webservice_label_selector) matches more than one pod.
_find_obj has handling for the possibility of pykube.exceptions.ObjectDoesNotExist, but it does not handle the ValueError that will be raised if pykube.Query.get() matches more than one object.
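For illustration, one way _find_obj could be made to tolerate that (a sketch only, based on the query pattern quoted later in this task; this is not the fix that was eventually merged):

def _find_obj(self, kind, selector):
    """Return the single object of `kind` matching `selector`, or None."""
    try:
        return kind.objects(self.api).filter(
            namespace=self.tool.name,
            selector=selector,
        ).get()
    except pykube.exceptions.ObjectDoesNotExist:
        return None
    except ValueError:
        # pykube raises ValueError when get() matches more than one object;
        # swallowing it here at least spares users the raw traceback.
        return None

Simply returning None is debatable, of course, since more than one match usually means something else is already wrong.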
How can there be more than one webservice pod? That should be impossible to achieve and, IMNSHO, should require manual intervention.
(In the course of debugging this just now, I ran . ./www/python/venv/bin/activate and webservice stop – successfully. I guessed that I needed to run webservice --backend=kubernetes python2 start to restart and did that; now http://tools.wmflabs.org/spiarticleanalyzer/ gives 404. @JustBerry, if I broke something, sorry.)
@scfc No worries, but http://tools.wmflabs.org/spiarticleanalyzer/ seems to work fine for me... (perhaps you fixed it?). By the way, the error doesn't show up all the time, but shows up from time to time. Thought it worthwhile to at least mention it on this task.
Reporting bugs is always good, especially if someone claimed to have fixed one :-). I can reproduce the issue (sometimes), though I don't have any idea why it could happen. I'll give it a bit more thought. For a moment I suspected that WebServiceMonitor had restarted the service in parallel, but that would have logged a line in service.log. *argl*
(Unassigning because I don't understand the failure mode; or rather, if my understanding were correct, the bug would have been fixed. Perhaps a fresh pair of eyes will help.)
@JustBerry Do you have the steps with which you reproduced the issue? I.e., what did you execute (and from which host) prior to running webservice stop on tools-bastion-03?
(Lowering priority as few further reports have been made on this issue)
tools.checker@tools-bastion-02:~/www/python/src$ webservice restart
Restarting webservice.Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 145, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice.51934.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 151, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 39, in stop
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 39, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_STOPPED, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 372, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 191, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
@Legoktm What's the output of kubectl get pods, kubectl get services, kubectl get deployments? I'll note that unlike the original exception, which happened while starting the service, this one happened while stopping it.
I get these while stopping but not while starting, with the same traceback:
tools.yabbr@tools-bastion-03:~$ webservice stop
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 144, in <module>
    if job.get_state() == Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 401, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
@APerson What's the output of kubectl get pods; kubectl get services; kubectl get deployments?
@zhuyifei1999 (sorry for the delayed response!)
NAME                     READY     STATUS    RESTARTS   AGE
interactive              1/1       Running   0          3d
yabbr-1590617135-0euto   1/1       Running   0          3h
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
yabbr     192.168.98.49   <none>        8000/TCP   3h
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
yabbr     1         1         1            1           3h
Yep I'm encountering this all the time.
NAME                                  READY     STATUS              RESTARTS   AGE
interactive                           1/1       Running             0          16h
video2commons-test-3674141639-bb1ho   1/1       Terminating         0          40s
video2commons-test-3674141639-wl0iv   0/1       ContainerCreating   0          0s
Can we just change [[https://github.com/wikimedia/operations-software-tools-webservice/blob/7e8da902d4ac6a7bc05f548ffa7bb3f6267e65ed/toollabs/webservice/backends/kubernetesbackend.py#L405|elif pod.obj['status']['phase'] == 'Pending':]] into else:? A Terminating state should definitely not be Backend.STATE_STOPPED
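For reference, a sketch of get_state() with that change applied (based on the patched version quoted earlier in this task, not a verbatim copy of the linked file):

def get_state(self):
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
    if pod is not None:
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        else:
            # Pending, Terminating, ...: a pod still exists, so the
            # webservice cannot be treated as stopped yet.
            return Backend.STATE_PENDING
    svc = self._find_obj(pykube.Service, self.webservice_label_selector)
    deployment = self._find_obj(pykube.Deployment, self.webservice_label_selector)
    if svc is not None and deployment is not None:
        return Backend.STATE_PENDING
    return Backend.STATE_STOPPED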
I found a new way this can be broken today. While doing some CrashLoopBackOff state cleanup I found a tool which had 1 pod, no deployments, and 7 replicasets. Running webservice stop was crashing with the "get() more than one object; use filter" error and I'm very sure it was due to the multiple replicaset objects in the namespace. This looked to be the result of some cluster state corruption from ~1 year ago.
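Stale ReplicaSets like that can be inspected and removed by hand, roughly like this (the replicaset name is a placeholder, not taken from that tool):

$ kubectl get replicaset
$ kubectl delete replicaset/<stale-replicaset-name>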
@Nehajha has been looking into this as part of her work on T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command. She has been having difficulty reproducing the problem in her testing environment, so if anyone has tips on what seems to make it happen more often please do report them here.
Mentioned in SAL (#wikimedia-cloud) [2018-07-22T14:28:51Z] <zhuyifei1999_> backed up Neha16's changes to toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs to toollabs.bak in the same dir via cp -a, and re-install webservice code on the bastion to debug T156626
I bruteforced this issue:
Attempting my luck... no good.
toolsbeta.test@toolsbeta-bastion-01:~$ kubectl get po; kubectl get svc; kubectl get deploy
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-slwhl   1/1       Running   0          1d
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.40.242   <none>        8000/TCP   1d
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           1d
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice....
Then I applied the patch to make the script check statuses more frequently:
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo nano common/utils.py
[...]
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
< time.sleep(1)
---
> time.sleep(0.01)
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-vb02z   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.14.193   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Pod resisted shtdown
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 219, in <module>
    sys.exist(1)
AttributeError: 'module' object has no attribute 'exist'
NAME                    READY     STATUS        RESTARTS   AGE
test-1089897043-vb02z   1/1       Terminating   0          5s
So we hit an unmerged workaround (https://gerrit.wikimedia.org/r/#/c/operations/software/tools-webservice/+/444879/); I have to use unmodified code to see if it's the same error:
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo cp -a toollabs toollabs.bak
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ sudo apt install --reinstall toollabs-webservice
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  linux-image-3.13.0-151-generic linux-image-extra-3.13.0-151-generic
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 6 not upgraded.
Need to get 0 B/17.6 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 197537 files and directories currently installed.)
Preparing to unpack .../toollabs-webservice_0.39_all.deb ...
Unpacking toollabs-webservice (0.39) over (0.39) ...
Setting up toollabs-webservice (0.39) ...
N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages$ cd toollabs
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ patch common/utils.py
[...]
File common/utils.py is read-only; trying to patch anyway
patch: **** Can't create temporary file common/utils.py.ouA6Nn0 : Permission denied
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ sudo patch common/utils.py
24c24
< time.sleep(1)
---
> time.sleep(0.01)
patching file common/utils.py
zhuyifei1999@toolsbeta-bastion-01:/usr/lib/python2.7/dist-packages/toollabs$ curl -sL https://github.com/wikimedia/operations-software-tools-webservice/raw/master/toollabs/common/utils.py | diff - common/utils.py
24c24
< time.sleep(1)
---
> time.sleep(0.01)
And indeed it's the same issue:
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Your job is not running, starting...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   0/1       ContainerCreating   0          1s
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
test      192.168.62.70   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-1oxls   1/1       Terminating         0          7s
test-1089897043-qmza8   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.103.35   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; kubectl get po; kubectl get svc; kubectl get deploy
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-iwu5t   0/1       ContainerCreating   0          1s
test-1089897043-qmza8   1/1       Terminating         0          45s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.43.160   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
However, I don't understand how the second command in the above paste worked, considering the output of kubectl get pods is almost the same. Fortunately, the majority of the invocations thereafter are able to produce the same STATE_RUNNING traceback.
I am unable to reproduce the traceback on STATE_STOPPED. My initial theory of restarting webservice too fast doesn't work:
toolsbeta.test@toolsbeta-bastion-01:~$ webservice restart; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy;
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 175, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-722xn   1/1       Terminating         0          1m
test-1089897043-iwcfv   0/1       ContainerCreating   0          0s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.96.120   <none>        8000/TCP   0s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           0s
I cannot figure out how job.get_state() succeeds in <module> (1 match only) but turns into multiple matches after stopping. More playing around:
toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 5; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy;
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-exein   1/1       Running   0          6s
Restarting webservice...............Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 179, in <module>
    start(job, '')
  File "/usr/local/bin/webservice", line 60, in start
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/common/utils.py", line 20, in wait_for
    if predicate():
  File "/usr/local/bin/webservice", line 60, in <lambda>
    wait_for(lambda: job.get_state() == Backend.STATE_RUNNING, message)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-ddhtc   0/1       ContainerCreating   0          1s
test-1089897043-exein   1/1       Terminating         0          7s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.17.207   <none>        8000/TCP   1s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            0           1s
toolsbeta.test@toolsbeta-bastion-01:~$ webservice stop; sleep 5; webservice --backend kubernetes php5.6 start; sleep 1; kubectl get po; webservice restart; kubectl get po; kubectl get svc; kubectl get deploy;
Stopping webservice...............
Starting webservice...............
NAME                    READY     STATUS              RESTARTS   AGE
test-1089897043-7oo7w   0/1       ContainerCreating   0          2s
Your job is not running, starting.........
NAME                    READY     STATUS    RESTARTS   AGE
test-1089897043-7oo7w   1/1       Running   0          2s
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
test      192.168.58.103   <none>        8000/TCP   2s
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
test      1         1         1            1           3s
Change 450495 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching
I did a bit of hacking on the kubernetesbackend.py file on toolsbeta-bastion-01. I got webservice restart to output more useful information about the state of duplicate objects:
==================================================
<Pod test-1089897043-c1itd>
{ u'conditions': [ { u'lastProbeTime': None,
                     u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                     u'status': u'True',
                     u'type': u'Initialized'},
                   { u'lastProbeTime': None,
                     u'lastTransitionTime': u'2018-08-06T00:26:49Z',
                     u'status': u'True',
                     u'type': u'Ready'},
                   { u'lastProbeTime': None,
                     u'lastTransitionTime': u'2018-08-06T00:26:47Z',
                     u'status': u'True',
                     u'type': u'PodScheduled'}],
  u'containerStatuses': [ { u'containerID': u'docker://18babad048d50e121065945f68e56d58527ac8915be5cf1aaa7616fc44e2fa3d',
                            u'image': u'docker-registry.tools.wmflabs.org/toollabs-php-web:latest',
                            u'imageID': u'docker://sha256:b6b7733f0d037e5b8b1d301cceda9331f336934496e28448b93b977fab8f67b9',
                            u'lastState': { },
                            u'name': u'webservice',
                            u'ready': True,
                            u'restartCount': 0,
                            u'state': { u'running': { u'startedAt': u'2018-08-06T00:26:48Z'}}}],
  u'hostIP': u'10.68.18.110',
  u'phase': u'Running',
  u'podIP': u'192.168.215.3',
  u'startTime': u'2018-08-06T00:26:47Z'}
==================================================
==================================================
<Pod test-1089897043-ontg3>
{ u'phase': u'Pending'}
==================================================
I'm using the pprint module to print out the Pod.obj['status'] internal state. I guessed this would be where the useful information is, based on the Pod.ready synthetic property from the upstream library.
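Roughly, the debug hack amounted to something like this (a reconstruction, not the exact code that was on the bastion):

from __future__ import print_function
import pprint
import sys

# Inside KubernetesBackend: dump the status of every pod matching the
# webservice label selector, mirroring the output shown above.
def _dump_matching_pods(self):
    pods = pykube.Pod.objects(self.api).filter(
        namespace=self.tool.name,
        selector=self.webservice_label_selector,
    )
    for pod in pods:
        print('=' * 50, file=sys.stderr)
        print(repr(pod), file=sys.stderr)
        pprint.pprint(pod.obj['status'], indent=2, stream=sys.stderr)
        print('=' * 50, file=sys.stderr)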
I did some more hacking after this and wrote:
def get_state(self):
    """
    Returns state of webservice. One of:
    - Backend.STATE_STOPPED
    - Backend.STATE_RUNNING
    - Backend.STATE_PENDING
    """
    def pod_cmp(a, b):
        # Pending < Running < [anything else]
        phase_a = a.obj['status']['phase']
        phase_b = b.obj['status']['phase']
        if phase_a == phase_b:
            return 0
        if phase_a == 'Pending':
            return -1
        if phase_b == 'Pending':
            return 1
        if phase_a == 'Running':
            return -1
        if phase_b == 'Running':
            return 1
        return -1 if phase_a < phase_b else 1

    pods = pykube.Pod.objects(self.api).filter(
        namespace=self.tool.name,
        selector=self.webservice_label_selector
    )
    if pods:
        for p in pods:
            # FIXME: debug print to be removed
            print('{} {} {}'.format(
                len(pods), repr(p), p.obj['status']['phase']),
                file=sys.stderr)
        pod = sorted([p for p in pods], pod_cmp)[0]
        if pod.obj['status']['phase'] == 'Running':
            return Backend.STATE_RUNNING
        if pod.obj['status']['phase'] == 'Pending':
            return Backend.STATE_PENDING

    svc = self._find_obj(pykube.Service, self.webservice_label_selector)
    deployment = self._find_obj(
        pykube.Deployment, self.webservice_label_selector)
    if svc is not None and deployment is not None:
        return Backend.STATE_PENDING
    else:
        return Backend.STATE_STOPPED
This seemed to work for basic tests, but I never saw anything in the 'Terminating' state to be sure. I decided to switch the "test" tool to the Kubernetes python backend and write a WSGI app (this app.py) that is deliberately slow to shut down, to test further:
import atexit
import sys
import time

import flask

app = flask.Flask(__name__)


@app.route('/')
def index():
    return 'Hello World!'


def nap():
    print('Shutting down...', file=sys.stderr)
    time.sleep(60)


atexit.register(nap)
With the get_state() changes I made above and this webservice, things almost work as hoped. The initial webservice start is good.
webservice stop works as expected as well, except that it returns while there is still a pod in the 'Terminating' state according to kubectl get pods. webservice restart works as long as there is no pod in the 'Terminating' state when you run the command. If there is a pod still terminating, then you get output that looks like this:
$ webservice --backend=kubernetes python restart
2 <Pod test-1203602763-vwvzp> Running
2 <Pod test-1203602763-xvuks> Running
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 178, in <module>
    stop(job, 'Restarting webservice')
  File "/usr/local/bin/webservice", line 64, in stop
    job.request_stop()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 417, in request_stop
    self._delete_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 240, in _delete_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 75, in get
    raise ValueError("get() more than one object; use filter")
ValueError: get() more than one object; use filter
After adding more debug printing it appears that Pod.obj['status']['phase'] == 'Running' in many different circumstances including states where kubectl get pods would say "Terminating". I eventually figured out that "Terminating" is shown when Pod.obj['metadata']['deletionTimestamp'] exists and has a timestamp in it. It turns out that @scfc had figured this out all the way back in T156626#2998399, but then ignored it because he did not quite understand that Kubernetes would start a new pod while there was still one terminating.
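In other words, "is this pod terminating" boils down to a check like this (a hypothetical helper, not part of the actual patch):

def _is_terminating(obj):
    # kubectl displays "Terminating" when a deletionTimestamp is present,
    # even though obj['status']['phase'] may still say 'Running'.
    return obj.obj['metadata'].get('deletionTimestamp') is not None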
Now that I know how to tell if an object is in the process of being deleted, I just needed to figure out how to ignore those objects. I think it is safe to do this on the assumption that Kubernetes will eventually figure out how to kill even runaway processes like the deliberately slow to shutdown one I wrote.
I played around a bit with the fieldSelector parameter for the search, but I could not figure out how to use it to ensure that metadata.deletionTimestamp was either missing or empty. The search results can easily be treated as a normal python list however (we've been doing that for debugging above), so I decided to post-process the search results and throw out the objects that have a deletionTimestamp:
objs = [
    o for o in objs
    if o.obj['metadata'].get('deletionTimestamp', None) is None
]
Once I had done this I had to also recreate the logic for checking for no results and multiple results that get() was doing for us before:
found = len(objs)
if found == 0:
    return None
elif found == 1:
    return objs[0]
else:
    raise ValueError("get() more than one object; use filter")
Once I had rolled back the other changes and put this logic into _find_obj(), things were really, really close to working. It turned out, however, that _delete_obj() would still fail in some circumstances because it was not using _find_obj() but doing its own search. I rewrote it to use _find_obj():
def _delete_obj(self, kind, selector):
    """
    Delete object of kind matching selector if it exists
    """
    o = self._find_obj(kind, selector)
    if o is not None:
        o.delete()
I have not been able to get things to break again with these changes applied. Running webservice restart many times in a row with the slow to stop python app will leave kubectl get pods showing a lot of pods stuck in "Terminating" state, but they eventually get cleaned up.
The patch for all this is https://gerrit.wikimedia.org/r/450495
Change 450495 merged by jenkins-bot:
[operations/software/tools-webservice@master] Kubernetes: ignore terminating objects when searching
Mentioned in SAL (#wikimedia-cloud) [2018-08-08T10:00:59Z] <zhuyifei1999_> building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244
I'm getting this error:
$ webservice stop
Traceback (most recent call last):
File "/usr/local/bin/webservice", line 176, in <module> stop(job, 'Stopping webservice') File "/usr/local/bin/webservice", line 69, in stop job.request_stop() File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 436, in request_stop pykube.ReplicaSet, 'name={name}'.format(name=self.tool.name)) File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 260, in _delete_obj o = self._find_obj(kind, selector) File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 252, in _find_obj ', '.join(repr(o) for o in objs)
ValueError: Found 2 objects of type ReplicaSet matching selector name=wd-depicts: <ReplicaSet wd-depicts-1035148579>, <ReplicaSet wd-depicts-1769414028>. See https://phabricator.wikimedia.org/T156626
Any clues?
$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1035148579   0         0         0         7d
wd-depicts-1769414028   1         1         1         7d
$ kubectl delete replicaset/wd-depicts-1035148579
replicaset "wd-depicts-1035148579" deleted
$ kubectl get replicaset
NAME                    DESIRED   CURRENT   READY     AGE
wd-depicts-1769414028   1         1         1         7d
Change 566045 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/software/tools-webservice@master] [RFC] kubernetes: add support for multiple objects of any kind
Change 566045 merged by jenkins-bot:
[operations/software/tools-webservice@master] kubernetes: add support for multiple objects of any kind
$ sudo become mix-n-match
$ kubectl get all
NAME                               READY   STATUS    RESTARTS   AGE
pod/mix-n-match-6c888b79f5-686w4   1/1     Running   0          22d
pod/mix-n-match-6c888b79f5-zlcw2   1/1     Running   0          9d
NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/mix-n-match   ClusterIP   10.100.151.252   <none>        8000/TCP   22d
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mix-n-match   2/2     2            2           22d
NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/mix-n-match-6c888b79f5   2         2         2       22d
It looks like kubectl scale has been used to increase the Pod count for mix-n-match. I will do some testing with another tool to see if I can figure out why the recent updates to webservice are not handling this as an expected case.
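For reference, the scaling presumably happened (and can presumably be undone) along these lines; the exact commands are an assumption, not taken from the tool's history:

$ kubectl scale deployment/mix-n-match --replicas=2   # likely how the second pod appeared
$ kubectl scale deployment/mix-n-match --replicas=1   # back to the single replica webservice expects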
Mentioned in SAL (#wikimedia-cloud) [2020-02-07T21:09:19Z] <bstorm_> upgraded toollabs-webservice package for stretch toolsbeta to 0.59 T244293 T244289 T234617 T156626
Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:18:57Z] <bstorm_> upgraded toollabs-webservice package for stretch toolsbeta to 0.62 T244293 T244289 T234617 T156626
Mentioned in SAL (#wikimedia-cloud) [2020-02-10T21:25:10Z] <bstorm_> upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626