User pods are staying idle far longer than the configured one hour. This led to the cluster's full capacity being used and users suffering degraded service.
Event Timeline
The traceback in the logs relating to the culler seems to be:
[E 180504 02:43:11 ioloop:638] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f0daad5cea0>, <tornado.concurrent.Future object at 0x7f0daad77898>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 605, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 626, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/local/bin/cull_idle_servers.py", line 58, in cull_idle
        resp = yield client.fetch(req)
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
        value = future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 316, in wrapped
        ret = fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
        raise HTTPError(599, error_message)
    tornado.httpclient.HTTPError: HTTP 599: Timeout during request
[I 2018-05-04 02:43:18.733 JupyterHub log:122] 200 GET /paws/hub/api/users (cull-idle@127.0.0.1) 23901.27ms
AFAICT, /paws/hub/api/users took about 24 seconds to respond (23901.27ms in the log), longer than the culler's HTTP request timeout (presumably tornado's 20-second default), so the culler's request failed with HTTP 599.
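If raising the timeout turns out to be the right mitigation, the culler's request to /users could carry an explicit request_timeout. A minimal sketch, not the real cull_idle_servers.py code; hub_api_url and api_token are placeholder names:

# Sketch only: build the /users request with an explicit timeout so a slow
# hub response no longer trips HTTP 599 after tornado's 20 s default.
from tornado.httpclient import AsyncHTTPClient, HTTPRequest


def fetch_users(hub_api_url, api_token):
    req = HTTPRequest(
        url=hub_api_url + "/users",
        headers={"Authorization": "token %s" % api_token},
        connect_timeout=10,    # seconds allowed to open the connection
        request_timeout=60,    # seconds for the whole request; default is 20
    )
    return AsyncHTTPClient().fetch(req)  # returns a Future, as in the culler

# Usage (needs a real hub URL and token), e.g.:
#   tornado.ioloop.IOLoop.current().run_sync(
#       lambda: fetch_users("http://127.0.0.1:8081/paws/hub/api", "SECRET"))

A longer timeout only papers over the slow /users response, so it would be a stopgap on top of fixing the underlying database latency.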
There might also be an encoding issue:
[I 2018-05-04 03:24:18.850 JupyterHub service:109] Spawning /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/paws/hub/api
[I 180504 03:24:24 cull_idle_servers:68] Culling server for matěj_suchánek (inactive since 2018-03-26 06:58:47.157000)
[I 2018-05-04 03:24:27.291 JupyterHub log:122] 200 GET /paws/hub/api/users (cull-idle@127.0.0.1) 4943.87ms
Traceback (most recent call last):
  File "/usr/local/bin/cull_idle_servers.py", line 114, in <module>
    loop.run_sync(cull)
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 93, in cull_idle
    yield f
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 73, in cull_one
    yield client.fetch(req)
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.httpclient.HTTPError: HTTP 404: Not Found
[W 2018-05-04 03:24:27.370 JupyterHub log:122] 404 DELETE /paws/hub/api/users/matÄj_suchánek/server (cull-idle@127.0.0.1) 53.82ms
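The 404 comes from the garbled username in the DELETE path. One plausible guard, shown here only as a sketch and not as a claim about what the upstream fix actually does, is to percent-encode the username when building the API URL:

# Sketch only: percent-encode a non-ASCII username before building the
# DELETE /users/<name>/server URL, so the request path stays plain ASCII.
# `hub_api_url` is a placeholder, not taken from the real script.
from urllib.parse import quote


def server_delete_url(hub_api_url, username):
    # quote() turns "matěj_suchánek" into "mat%C4%9Bj_such%C3%A1nek"
    return "%s/users/%s/server" % (hub_api_url, quote(username, safe=""))


print(server_delete_url("http://127.0.0.1:8081/paws/hub/api", "matěj_suchánek"))
# -> http://127.0.0.1:8081/paws/hub/api/users/mat%C4%9Bj_such%C3%A1nek/server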
That is T189534; we're waiting on an upstream release for that one, which should land soon.
Chico said on IRC that the excessive time consumption may be caused by the sqlite database being on NFS. Indeed, while running

time python3 -c 'print(__import__("requests").get("http://localhost:8081/paws/hub/api/users", headers={"Authorization":"token 3[REDACTED, see /proc/47/environ JUPYTERHUB_API_TOKEN]8"}).text)'

the hub process goes into D-state (uninterruptible sleep, typically waiting on I/O).
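If it helps, the same check can be scripted for repeated timing runs. A sketch; it assumes the API token has been exported as JUPYTERHUB_API_TOKEN in the shell rather than read from /proc:

# Sketch: time the /users endpoint a few times in a row to see how latency
# varies while the sqlite database sits on NFS.
import os
import time

import requests

url = "http://localhost:8081/paws/hub/api/users"
headers = {"Authorization": "token %s" % os.environ["JUPYTERHUB_API_TOKEN"]}

for _ in range(5):
    start = time.perf_counter()
    resp = requests.get(url, headers=headers)
    elapsed = time.perf_counter() - start
    print("%s -> %d users in %.1f s" % (resp.status_code, len(resp.json()), elapsed))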
Can the database be switched to use toolsdb instead of the sqlite setup? The upstream documentation seems to suggest that this is optimal.
Yes, it might be ideal to migrate the DB, but in fact we can start fresh without data loss (beyond active sessions), since persistence is handled through the users' home folders and the identifier is the global user ID.
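For reference, pointing the hub at toolsdb instead of the sqlite file would be a small jupyterhub_config.py change along these lines. A sketch: the hostname, database name, and credentials are placeholders, not the real PAWS values:

# jupyterhub_config.py sketch (the `c` object is provided by JupyterHub when it
# loads this file). Host, database name, and credentials are placeholders.
c.JupyterHub.db_url = (
    "mysql+pymysql://paws_hub:REDACTED@tools.db.svc.eqiad.wmflabs/paws__jupyterhub"
)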
Mentioned in SAL (#wikimedia-cloud) [2018-05-05T18:25:07Z] <bd808> Added Chico Venancio as a maintainer (T193828)
There is an upstream issue with our use of the utf8mb4 collation; I sent a PR to attempt to fix it: https://github.com/jupyterhub/jupyterhub/pull/1866
PAWS-beta is running with that change now and working against a toolsdb database. I locally built a wheel and pushed it to GitHub for paws-beta; we should probably do this in a slightly more elegant way if we need to do it in production.
This is a pretty serious bug: without the culler, PAWS will overload and stop working every few days unless someone intervenes manually. Still, I think we can manage and wait a few days before deciding whether to go the patched-wheel route.