Page MenuHomePhabricator

Webservice shell repeatedly times out
Closed, ResolvedPublic

Description

ningu in #wikimedia-cloud reported that webservice --backend=kubernetes node10 shell was failing with "Pod is not ready in time". I asked them to try it again, and it timed out another two times. I logged in to Toolforge and tried to open a node10 shell, a nodejs shell, and a python3.7 shell, all of which failed the same way. I tried the node10 shell and the python3.7 shell from a tool I have migrated to the new k8s cluster (anticompositetools) and one I have not (anticompositetest), and it failed in both cases.

webservice --backend=kubernetes [container] {start|stop|restart|status} is unaffected and works normally.

Event Timeline

bd808 triaged this task as High priority.

This seems to be a regression that was introduced in the latest build of webservice. I will try to figure out what is going wrong.

This made me wonder if there's a debug logging option in webservice...

This made me wonder if there's a debug logging option in webservice...

There is not currently, but it would not be a bad thing to add.

My initial hunch here was that we somehow ended up making a nonsense Pod spec. To test that theory I learned how to use pdb to step through a live script execution:

$ python -m pdb /usr/bin/webservice --backend=kubernetes php7.3 shell
> /usr/bin/webservice(2)<module>()
-> from __future__ import print_function
(Pdb) break toolforge/webservice/backends/kubernetesbackend.py:614
Breakpoint 1 at /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py:614
(Pdb) continue
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(614)shell()                                                               
-> podSpec = self._get_shell_pod()
(Pdb) list
609                 return Backend.STATE_PENDING
610             else:                                                           
611                 return Backend.STATE_STOPPED
612
613         def shell(self):
614 B->         podSpec = self._get_shell_pod()                                 
615             if len(self._find_objs(pykube.Pod, self.shell_label_selector)) != 0:
616                 pykube.Pod(self.api, podSpec).create()
617                                                                             
618             if not self._wait_for_pods(self.shell_label_selector):
619                 print("Pod is not ready in time", file=sys.stderr)
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(615)shell()
-> if len(self._find_objs(pykube.Pod, self.shell_label_selector)) != 0:
(Pdb) pp podSpec                                                                
{'apiVersion': 'v1',
 'kind': 'Pod',
 'metadata': {'labels': {'name': 'interactive',
                         'toolforge': 'tool',
                         'tools.wmflabs.org/webservice-interactive': 'true',
                         'tools.wmflabs.org/webservice-interactive-version': '1'},
              'name': 'interactive',
              'namespace': 'tool-bd808-test'},
 'spec': {'containers': [{'command': ['/bin/bash', '-il'],
                          'image': 'docker-registry.tools.wmflabs.org/toolforge-php73-sssd-web:latest',
                          'name': 'interactive',
                          'ports': None,
                          'resources': None,
                          'stdin': True,
                          'tty': True,
                          'workingDir': '/data/project/bd808-test/'}]}}

The podSpec data looks fine, so lets keep going and figure out what happens in the test for an existing Pod:

(Pdb) step
--Call--
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(266)_find_objs()
-> def _find_objs(self, kind, selector):
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(274)_find_objs()
-> objs = kind.objects(self.api).filter(
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(275)_find_objs()
-> namespace=self._get_ns(), selector=selector
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(279)_find_objs()
-> o
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(280)_find_objs()
-> for o in objs
(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(284)_find_objs()
-> return objs
(Pdb) pp objs
[]
(Pdb) next
--Return--
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(284)_find_objs()->[]
-> return objs
(Pdb) pp objs
[]

The selector based search does not match any running Pods which is what I expected.

(Pdb) next
> /usr/lib/python2.7/dist-packages/toolforge/webservice/backends/kubernetesbackend.py(618)shell()
-> if not self._wait_for_pods(self.shell_label_selector):
(Pdb) quit

Why are we suddenly waiting for a Pod to start without requesting creation of a Pod first? This is unexpected!! Let's look at the listing again:

(Pdb) list
609                 return Backend.STATE_PENDING
610             else:                                                           
611                 return Backend.STATE_STOPPED
612
613         def shell(self):
614 B->         podSpec = self._get_shell_pod()                                 
615             if len(self._find_objs(pykube.Pod, self.shell_label_selector)) != 0:
616                 pykube.Pod(self.api, podSpec).create()
617                                                                             
618             if not self._wait_for_pods(self.shell_label_selector):
619                 print("Pod is not ready in time", file=sys.stderr)

Logic bug! rOSTWa702e02f2980: kubernetes: add support for multiple objects of any kind inverted the logic of the if len(self._find_objs(pykube.Pod, self.shell_label_selector)) != 0 test unintentionally.

Change 571778 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] k8s shell: fix logic bug in guard for new Pod creation

https://gerrit.wikimedia.org/r/571778

Change 571778 merged by jenkins-bot:
[operations/software/tools-webservice@master] k8s shell: fix logic bug in guard for new Pod creation

https://gerrit.wikimedia.org/r/571778

Mentioned in SAL (#wikimedia-cloud) [2020-02-12T19:15:10Z] <bd808> Deployed toollabs-webservice (0.63) on bastions (T244954)

Mentioned in SAL (#wikimedia-cloud) [2020-02-12T19:29:21Z] <bd808> Rebuilding all Docker images to pick up toollabs-webservice (0.63) (T244954)

$ webservice --backend=kubernetes php7.3 shell
Defaulting container name to interactive.
Use 'kubectl describe pod/interactive -n tool-bd808-test' to see all of the containers in this pod.
If you don't see a command prompt, try pressing enter.
tools.bd808-test@interactive:~$