
Cull Idle does not work on PAWS
Closed, ResolvedPublic

Description

It fails with:

Traceback (most recent call last):
  File "/usr/local/bin/cull_idle_servers.py", line 88, in <module>
    loop.run_sync(cull)
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 55, in cull_idle
    resp = yield client.fetch(req)
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 316, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
    raise HTTPError(599, error_message)
tornado.httpclient.HTTPError: HTTP 599: Timeout while connecting
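
For context, the fetch that times out (cull_idle_servers.py line 55 in the traceback) is the culler's request to the JupyterHub REST API. A rough sketch of that call, patterned on the upstream cull_idle_servers.py; the environment variable names and URL handling here are illustrative stand-ins, not the PAWS values:

# Sketch of the call that raises HTTP 599; illustrative, not the PAWS copy.
import os

from tornado import gen, ioloop
from tornado.httpclient import AsyncHTTPClient, HTTPRequest


@gen.coroutine
def cull_idle(url, api_token):
    """List users via the hub REST API; this is the fetch that never connects."""
    req = HTTPRequest(url=url + '/users',
                      headers={'Authorization': 'token %s' % api_token})
    client = AsyncHTTPClient()
    # Tornado raises HTTP 599 when the client gives up before connecting,
    # i.e. the hub API endpoint is unreachable from where this runs.
    resp = yield client.fetch(req)
    raise gen.Return(resp)


if __name__ == '__main__':
    # HUB_API_URL and JPY_API_TOKEN are stand-ins for however the real script
    # receives its --url option and its API token.
    url = os.environ['HUB_API_URL']
    api_token = os.environ['JPY_API_TOKEN']
    ioloop.IOLoop.current().run_sync(lambda: cull_idle(url, api_token))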

Event Timeline

@yuvipanda had a hunch that DNS was not working as hoped inside the containers, but:

bdavis_(wmf)@PAWS:~$ dig paws.wmflabs.org

; <<>> DiG 9.10.3-P4-Ubuntu <<>> paws.wmflabs.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9928
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;paws.wmflabs.org.              IN      A

;; ANSWER SECTION:
paws.wmflabs.org.       86400   IN      A       10.68.21.68

;; Query time: 3 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Wed Sep 06 19:48:29 UTC 2017
;; MSG SIZE  rcvd: 50
bd808 triaged this task as High priority.

Raising priority because the lack of an automatic shutdown method for idle pods leads the PAWS k8s cluster to fill up and then block new pods from spawning.

I'm also un-licking this cookie for @yuvipanda. If anyone has time to look into the problem and narrow down the issue, it would be appreciated. The addition of the cloud-services-team (Kanban) tag doesn't guarantee that we will get to it soon, but it does at least put it in a place where we look on a regular basis.

Inside the "hub" pod, the script is running as /usr/bin/python3 /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull_every=600.

$ ssh tools-paws-master-01.tools.eqiad.wmflabs
$ sudo su yuvipanda
$ kubectl --namespace=prod get pod
NAME                               READY     STATUS        RESTARTS   AGE
db-proxy-277156289-xrb8j           1/1       Running       0          30m
deploy-hook-2549193673-1kp03       1/1       Running       0          30m
hub-deployment-2356949582-3dpc5    1/1       Running       0          30m
jupyter-kolossos                   1/1       Running       0          17h
jupyter-liridon                    1/1       Terminating   0          33d
jupyter-mattho69                   1/1       Running       0          17h
jupyter-msdsedemo                  1/1       Running       0          27m
jupyter-sarilho1                   1/1       Running       1          5h
proxy-deployment-581340475-bmqqd   1/1       Running       0          29m
query-killer-3262373550-5tqr0      1/1       Running       0          29m
$ kubectl --namespace=prod exec -it hub-deployment-2356949582-3dpc5 -- /bin/bash
$ ps auxww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.p+     1  0.0  0.0   1024     4 ?        Ss   15:32   0:00 /pause
tools.p+     7  0.0  0.0   4496   760 ?        Ss   15:32   0:00 /bin/sh -c jupyterhub --config /srv/jupyterhub_config.py --no-ssl
tools.p+    13  8.3  1.1 3652628 90740 ?       Dl   15:32   2:45 /usr/bin/python3 /usr/local/bin/jupyterhub --config /srv/jupyterhub_config.py --no-ssl
tools.p+   248  0.0  0.0  18420  3496 pts/0    Ss   16:04   0:00 /bin/bash
tools.p+   262  1.7  0.2  79340 22684 ?        Ss   16:04   0:00 /usr/bin/python3 /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull_every=600
tools.p+   263  0.0  0.0  36836  2916 pts/0    R+   16:05   0:00 ps auxww
$ 
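
For reference, those --timeout and --cull_every flags map onto the culler's main loop: one pass right away (the loop.run_sync seen in the traceback), then a repeat every --cull_every seconds, with --timeout as the idleness threshold each pass applies. A minimal sketch of that plumbing, patterned after the upstream script; the option names match the flags above and the body of the pass is elided:

# Sketch of how --timeout and --cull_every drive the culler; pass body elided.
from functools import partial

from tornado.ioloop import IOLoop, PeriodicCallback
from tornado.options import define, options, parse_command_line

define('timeout', default=3600, help='cull servers idle for more than this many seconds')
define('cull_every', default=600, help='seconds between cull passes')


def cull_idle(timeout):
    # Placeholder: the real pass lists users via the hub API and deletes
    # servers whose last_activity is older than `timeout` seconds.
    pass


if __name__ == '__main__':
    parse_command_line()
    loop = IOLoop.current()
    cull = partial(cull_idle, options.timeout)
    loop.run_sync(cull)  # one pass immediately
    PeriodicCallback(cull, 1000 * options.cull_every).start()  # then periodically
    loop.start()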

I'm now mostly convinced this is because of hairpin mode problems: a pod that tries to talk to a service IP which routes back to itself can't reach itself, which is what produces the HTTP 599 timeouts.
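
If that hunch is right, it should be reproducible from a shell in the hub pod with a check like the one below: the routed name should time out while loopback answers. The host/port pairs are assumptions (paws.wmflabs.org on 80 as the route the culler currently takes, 8081 as JupyterHub's default internal hub API port), not confirmed PAWS settings.

# Quick hairpin check to run inside the hub pod; host/port pairs are assumptions.
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == '__main__':
    # If hairpin mode is the problem, the routed name fails while loopback works.
    for host, port in (('paws.wmflabs.org', 80), ('127.0.0.1', 8081)):
        print('%s:%d -> %s' % (host, port, 'reachable' if can_connect(host, port) else 'timeout/refused'))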

One possibility is to just change https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/images/hub/cull_idle_servers.py to use 127.0.0.1 to talk to the hub, rather than the service IP. This should work...
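
Concretely, that would mean handing the culler a loopback --url instead of letting it resolve the public name. A sketch of how that might look if the culler is wired up as a hub-managed service in jupyterhub_config.py; 8081 is JupyterHub's default internal hub API port, and the PAWS config may launch the culler differently:

# Sketch of the proposed change in jupyterhub_config.py; not the actual PAWS config.
# `c` is the config object JupyterHub hands to jupyterhub_config.py.
c.JupyterHub.services = [
    {
        'name': 'cull-idle',
        'admin': True,  # the culler needs admin rights to stop other users' servers
        'command': [
            '/usr/bin/python3', '/usr/local/bin/cull_idle_servers.py',
            '--timeout=3600', '--cull_every=600',
            # Loopback side-steps the hairpin problem: the hub API is reached
            # in-pod instead of going out through the service/public IP.
            '--url=http://127.0.0.1:8081/hub/api',
        ],
    },
]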

It seems it is now hitting a 404 because it is missing the + base_url + fix that was already merged upstream.
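
The upstream fix being referred to splices the hub's base_url into that --url so the API path survives a non-root base URL; roughly along these lines, with base_url standing in for whatever the deployment configures:

# Sketch of the "+ base_url +" fix; illustrative, not the exact upstream diff.
# JupyterHub keeps base_url with leading and trailing slashes, so the REST API
# lives at <base_url>hub/api -- without the prefix the culler gets a 404.
base_url = '/'  # stand-in for the deployment's configured c.JupyterHub.base_url
cull_command = [
    '/usr/bin/python3', '/usr/local/bin/cull_idle_servers.py',
    '--timeout=3600', '--cull_every=600',
    '--url=http://127.0.0.1:8081' + base_url + 'hub/api',
]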

What is the process for pulling a new hub image? How much downtime does it entail?

Mentioned in SAL (#wikimedia-cloud) [2018-02-22T21:13:41Z] <chicocvenancio> jupyterhub updated to fix culler (T175202) culler already ran without 404

Mentioned in SAL (#wikimedia-cloud) [2018-02-22T22:11:16Z] <chicocvenancio> (T175202) culler is running and killing pods as designed!