
PAWS culler stopped working
Closed, ResolvedPublic

Description

User pods are staying idle far longer than the configured one-hour limit. This led to the cluster reaching full capacity and users suffering a degraded service.

Event Timeline

The traceback in the logs relating to the culler seems to be:

[E 180504 02:43:11 ioloop:638] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f0daad5cea0>, <tornado.concurrent.Future object at 0x7f0daad77898>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 605, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 626, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/local/bin/cull_idle_servers.py", line 58, in cull_idle
        resp = yield client.fetch(req)
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
        value = future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 316, in wrapped
        ret = fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
        raise HTTPError(599, error_message)
    tornado.httpclient.HTTPError: HTTP 599: Timeout during request
[I 2018-05-04 02:43:18.733 JupyterHub log:122] 200 GET /paws/hub/api/users (cull-idle@127.0.0.1) 23901.27ms

AFAICT, it took about 24 seconds (23901.27 ms in the log above) for /paws/hub/api/users to respond, and the culler timed out on the request.
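Tornado's SimpleAsyncHTTPClient defaults to a 20-second `request_timeout`, which is why a ~24 s response surfaces as "HTTP 599: Timeout during request". A stdlib-only sketch for probing the endpoint with a more generous timeout (the URL and token below are illustrative, not the real values):

```python
import time
import urllib.request

def probe_users_endpoint(url, token, timeout=60):
    """Fetch the hub's /users endpoint and return the elapsed time in seconds.

    Illustrative helper; the real culler uses tornado's AsyncHTTPClient,
    whose default 20 s request_timeout is what produced the HTTP 599.
    """
    req = urllib.request.Request(url, headers={"Authorization": "token " + token})
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start
```

The equivalent change inside cull_idle_servers.py itself would be passing `request_timeout=` to the tornado `HTTPRequest` it builds, though that only papers over the underlying slowness.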

There might also be an encoding issue:

[I 2018-05-04 03:24:18.850 JupyterHub service:109] Spawning /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/paws/hub/api
[I 180504 03:24:24 cull_idle_servers:68] Culling server for matěj_suchánek (inactive since 2018-03-26 06:58:47.157000)
[I 2018-05-04 03:24:27.291 JupyterHub log:122] 200 GET /paws/hub/api/users (cull-idle@127.0.0.1) 4943.87ms
Traceback (most recent call last):
  File "/usr/local/bin/cull_idle_servers.py", line 114, in <module>
    loop.run_sync(cull)
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 93, in cull_idle
    yield f
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 73, in cull_one
    yield client.fetch(req)
  File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.6/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.httpclient.HTTPError: HTTP 404: Not Found
[W 2018-05-04 03:24:27.370 JupyterHub log:122] 404 DELETE /paws/hub/api/users/matěj_suchánek/server (cull-idle@127.0.0.1) 53.82ms
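The 404 above suggests the hub and the culler disagree on how the non-ASCII username is encoded in the DELETE URL. For illustration only (this is not the culler's actual code), percent-encoding the name yields an unambiguous path segment:

```python
from urllib.parse import quote

def server_delete_path(api_url, username):
    """Build the DELETE path for a user's server, percent-encoding the
    username so non-ASCII names like 'matěj_suchánek' form a valid URL.
    Hypothetical helper for illustration, not the culler's real code."""
    return "%s/users/%s/server" % (api_url, quote(username, safe=""))

print(server_delete_path("http://127.0.0.1:8081/paws/hub/api", "matěj_suchánek"))
# e.g. .../users/mat%C4%9Bj_such%C3%A1nek/server
```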

> There might also be an encoding issue:

That is T189534, we're waiting for an upstream release that should come soon on that one.

Chico said on IRC that the excessive time consumption may be caused by the sqlite database living on NFS. Indeed, while running time python3 -c 'print(__import__("requests").get("http://localhost:8081/paws/hub/api/users", headers={"Authorization":"token 3[REDACTED, see /proc/47/environ JUPYTERHUB_API_TOKEN]8"}).text)', the hub process goes into D-state (uninterruptible sleep, typically blocked on I/O).

bd808 added a subscriber: bd808.May 4 2018, 4:32 AM

Can the database be switched to use toolsdb instead of the sqlite setup? The upstream documentation seems to suggest that this is optimal.

> Can the database be switched to use toolsdb instead of the sqlite setup? The upstream documentation seems to suggest that this is optimal.

Yes, it would be ideal to migrate the DB, but we can in fact start fresh without data loss (beyond sessions), since data persistence is handled through the home folders and the identifier is the global user ID.
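As a sketch of what pointing the hub at toolsdb would look like (the host, environment variable names, and database name below are placeholders, not the actual PAWS configuration):

```python
# jupyterhub_config.py -- sketch of using a MySQL-backed ToolsDB database
# instead of SQLite on NFS. Credentials and host are illustrative.
import os

c.JupyterHub.db_url = "mysql+pymysql://{user}:{password}@{host}/{user}__jupyterhub".format(
    user=os.environ["TOOLSDB_USER"],          # placeholder env var
    password=os.environ["TOOLSDB_PASSWORD"],  # placeholder env var
    host="tools.db.svc.eqiad.wmflabs",        # placeholder host
)
```

`c.JupyterHub.db_url` accepts any SQLAlchemy connection URL; the MySQL driver (e.g. PyMySQL) would need to be installed in the hub image.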

Chicocvenancio changed the task status from Open to Stalled.May 5 2018, 5:07 PM
Chicocvenancio claimed this task.
Chicocvenancio added a subscriber: Bstorm.

@bd808 (pinging @Bstorm as well, since they wrote the docs): to start implementing this in PAWS I need access to an account in toolsdb. I think this could live in the tools.paws tool. Should I ping Yuvi about adding me there, or create another tool?

Mentioned in SAL (#wikimedia-cloud) [2018-05-05T18:25:07Z] <bd808> Added Chico Venancio as a maintainer (T193828)

Chicocvenancio triaged this task as High priority.May 6 2018, 4:58 AM
Chicocvenancio added a project: Upstream.

There is an upstream issue with our use of the utf8mb4 collation; I sent a PR to attempt to fix it: https://github.com/jupyterhub/jupyterhub/pull/1866

PAWS-beta is now running with that change and working against a toolsdb database. I built a wheel locally and pushed it to GitHub for paws-beta; we should probably do this in a slightly more elegant way if we need it in production.

This is a pretty serious bug: without the culler, PAWS will overload and stop working every few days without manual intervention. Still, I think we can manage and wait a few days before deciding to go the patched-wheel route.

Chicocvenancio changed the task status from Stalled to Open.May 7 2018, 2:04 PM

The PR was merged upstream and released in beta one hour ago.

bd808 awarded a token.May 19 2018, 7:42 PM
Vvjjkkii renamed this task from PAWS culler stopped working to 7mdaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Chicocvenancio as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)