
PAWS userhomes permission error
Closed, ResolvedPublic

Description

Task about the PAWS userhome permission error: new user homes are created owned by root and must be chowned to tools.paws before users can use them.

As it currently stands, we have a workaround in place set up by @madhuvishy (a bash script, run as root on the grid, that chowns userhomes to tools.paws).
The issue is that the hub image has been running as root since January, in an attempt to get the culler to work. (See T175202)

Event Timeline

Chicocvenancio created this task.

Can we test if culler now works without being root?

> Can we test if culler now works without being root?

Since T188428 we can indeed.

It seems to be a solvable issue, but setting the uid to tools.paws does not solve the problem.

This is the full error we get:

[I 2018-03-02 01:06:57.534 JupyterHub service:266] Starting service 'cull-idle': ['/usr/local/bin/cull_idle_servers.py', '--timeout=3600', '--cull-every=600', '--url=http://127.0.0.1:8081/paws/hub/api']
[I 2018-03-02 01:06:57.538 JupyterHub service:109] Spawning /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/paws/hub/api
Failed to import the site module
Traceback (most recent call last):

File "/usr/lib/python3.6/site.py", line 561, in <module>
  main()
File "/usr/lib/python3.6/site.py", line 547, in main
  known_paths = addusersitepackages(known_paths)
File "/usr/lib/python3.6/site.py", line 288, in addusersitepackages
  user_site = getusersitepackages()
File "/usr/lib/python3.6/site.py", line 264, in getusersitepackages
  user_base = getuserbase() # this will also set USER_BASE
File "/usr/lib/python3.6/site.py", line 254, in getuserbase
  USER_BASE = get_config_var('userbase')
File "/usr/lib/python3.6/sysconfig.py", line 607, in get_config_var
  return get_config_vars().get(name)
File "/usr/lib/python3.6/sysconfig.py", line 558, in get_config_vars
  _CONFIG_VARS['userbase'] = _getuserbase()
File "/usr/lib/python3.6/sysconfig.py", line 205, in _getuserbase
  return joinuser("~", ".local")
File "/usr/lib/python3.6/sysconfig.py", line 184, in joinuser
  return os.path.expanduser(os.path.join(*args))
File "/usr/lib/python3.6/posixpath.py", line 247, in expanduser
  userhome = pwd.getpwuid(os.getuid()).pw_dir

KeyError: 'getpwuid(): uid not found: 52771'
[E 2018-03-02 01:07:27.546 JupyterHub service:296] Service cull-idle exited with status 1

That next-to-last line tells me the image has no clue about tools.paws, and for some reason Python cares about that very deeply.
I think this is as simple as adding a tools.paws user with the same uid as we have outside the container. To do that, however, we must stop using the vanilla z2jh hub image and create our own, either inheriting from it or otherwise. This is a needed step for several things we want in PAWS; I'll create a task for it soon.
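To illustrate the failure mode: Python's site initialization resolves the home directory through the passwd database when $HOME is unset, and an NFS uid that has no entry inside the image raises KeyError. A minimal sketch (nothing PAWS-specific; the helper name is mine):

```python
import pwd

def has_passwd_entry(uid: int) -> bool:
    """True if `uid` resolves via this image's passwd database."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        # This is exactly what blows up in site.py: uid 52771 exists on
        # NFS but has no entry inside the vanilla z2jh hub image.
        return False

print(has_passwd_entry(0))  # root exists in any sane image: True
```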

In the meantime, you might want to test the changes you made to extraConfig, which would be a better workaround for this than the root chown grid job.

The os.path.expanduser function, copied from my pod's /usr/lib/python3.6/posixpath.py, is:

def expanduser(path):
    """Expand ~ and ~user constructions.  If user or $HOME is unknown,
    do nothing."""
    path = os.fspath(path)
    if isinstance(path, bytes):
        tilde = b'~'
    else:
        tilde = '~'
    if not path.startswith(tilde):
        return path
    sep = _get_sep(path)
    i = path.find(sep, 1)
    if i < 0:
        i = len(path)
    if i == 1:
        if 'HOME' not in os.environ:
            import pwd
            userhome = pwd.getpwuid(os.getuid()).pw_dir
        else:
            userhome = os.environ['HOME']
    else:
        import pwd
        name = path[1:i]
        if isinstance(name, bytes):
            name = str(name, 'ASCII')
        try:
            pwent = pwd.getpwnam(name)
        except KeyError:
            return path
        userhome = pwent.pw_dir
    if isinstance(path, bytes):
        userhome = os.fsencode(userhome)
        root = b'/'
    else:
        root = '/'
    userhome = userhome.rstrip(root)
    return (userhome + path[i:]) or root

I think we can easily work around this by setting a $HOME, like we do for singleuser pods.
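A quick sanity check of that branch (a minimal sketch, independent of PAWS): with $HOME set, expanduser takes the environment branch and never calls pwd.getpwuid, so the unknown uid no longer matters.

```python
import os
import posixpath

# Force a known HOME, as a config-level workaround would;
# the passwd/uid lookup is then skipped entirely.
os.environ['HOME'] = '/home/paws'
print(posixpath.expanduser('~/.local'))  # /home/paws/.local
```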

> I think we can easily work around this by setting a $HOME, like we do for singleuser pods.

Interesting. If we can set that in the extra-config (or if there is a "normal" config value for it), that might be faster than T188754.

> or if there is a "normal" config value for that

I thought that is what extraEnv is for, and it's also referenced in the upstream docs. Weird if that doesn't work.

I agree this is weird.
hub.extraEnv does work to place HOME in the hub, but the culler still fails in the same spot.
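This is consistent with JupyterHub giving managed services a constructed environment rather than a copy of the hub's, so a HOME injected into the hub process never reaches the culler. A minimal simulation of that isolation (values are illustrative):

```python
import os
import subprocess
import sys

# Start a child with an explicitly constructed environment that lacks
# HOME, the way a service process can end up without the hub's HOME.
child_env = {'PATH': os.environ.get('PATH', ''), 'LANG': 'C.UTF-8'}  # no HOME
result = subprocess.run(
    [sys.executable, '-c', "import os; print(os.environ.get('HOME'))"],
    env=child_env, capture_output=True, text=True,
)
print(result.stdout.strip())  # None: the child never saw the parent's HOME
```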

chicocvenancio@toolsbeta-paws-master-01:~$ kubectl -n prod exec -it hub-6dcf4cc59-2j64g python3
Python 3.6.3 (default, Oct 3 2017, 21:45:48)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['HOME']
'/home/paws'

chicocvenancio@toolsbeta-paws-master-01:~$ kubectl -n prod logs hub-6dcf4cc59-2j64g -f
[I 2018-03-05 13:21:15.247 JupyterHub service:266] Starting service 'cull-idle': ['/usr/local/bin/cull_idle_servers.py', '--timeout=3600', '--cull-every=600', '--url=http://127.0.0.1:8081/paws/hub/api']
[I 2018-03-05 13:21:15.276 JupyterHub service:109] Spawning /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/paws/hub/api
Failed to import the site module
Traceback (most recent call last):

File "/usr/lib/python3.6/site.py", line 561, in <module>
  main()
File "/usr/lib/python3.6/site.py", line 547, in main
  known_paths = addusersitepackages(known_paths)
File "/usr/lib/python3.6/site.py", line 288, in addusersitepackages
  user_site = getusersitepackages()
File "/usr/lib/python3.6/site.py", line 264, in getusersitepackages
  user_base = getuserbase() # this will also set USER_BASE
File "/usr/lib/python3.6/site.py", line 254, in getuserbase
  USER_BASE = get_config_var('userbase')
File "/usr/lib/python3.6/sysconfig.py", line 607, in get_config_var
  return get_config_vars().get(name)
File "/usr/lib/python3.6/sysconfig.py", line 558, in get_config_vars
  _CONFIG_VARS['userbase'] = _getuserbase()
File "/usr/lib/python3.6/sysconfig.py", line 205, in _getuserbase
  return joinuser("~", ".local")
File "/usr/lib/python3.6/sysconfig.py", line 184, in joinuser
  return os.path.expanduser(os.path.join(*args))
File "/usr/lib/python3.6/posixpath.py", line 247, in expanduser
  userhome = pwd.getpwuid(os.getuid()).pw_dir

KeyError: 'getpwuid(): uid not found: 52771'

The culler has a very different environ set than the hub:

root@hub-7d68647bf4-pqm74:/proc# ps auxfww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 58 0.1 0.0 20280 4040 ? Ss 16:05 0:00 /bin/bash
root 69 0.0 0.0 40724 3384 ? R+ 16:06 0:00 \_ ps auxfww
root 1 0.6 1.2 2604240 105564 ? Ssl Mar02 29:56 /usr/bin/python3 /usr/local/bin/jupyterhub --config /srv/jupyterhub_config.py --upgrade-db
root 23 0.0 0.3 91260 31380 ? Ss Mar02 0:12 python3 /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/paws/hub/api
root@hub-7d68647bf4-pqm74:/proc# cat /proc/1/environ | tr '\0' '\n'
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=hub-7d68647bf4-pqm74
SINGLEUSER_IMAGE=quay.io/wikimedia-paws/singleuser:c24ab1e
JPY_COOKIE_SECRET=REDACTED
POD_NAMESPACE=prod
CONFIGPROXY_AUTH_TOKEN=REDACTED
JUPYTERHUB_CRYPT_KEY=REDACTED
MYSQL_HMAC_KEY=REDACTED
USER=tools.paws
PROXY_PUBLIC_PORT_80_TCP=tcp://10.104.118.81:80
PROXY_PUBLIC_PORT_443_TCP_PROTO=tcp
PROXY_HTTP_PORT_8000_TCP_ADDR=10.110.41.203
DEPLOY_HOOK_SERVICE_HOST=10.106.22.165
MYSQL_PORT_3306_TCP=tcp://10.97.130.38:3306
PROXY_HTTP_SERVICE_PORT=8000
KUBERNETES_SERVICE_PORT_HTTPS=443
DEPLOY_HOOK_SERVICE_PORT=8888
PROXY_API_PORT_8001_TCP_PORT=8001
PROXY_PUBLIC_PORT=tcp://10.104.118.81:80
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
PROXY_API_SERVICE_HOST=10.102.227.179
PROXY_API_PORT_8001_TCP=tcp://10.102.227.179:8001
PROXY_PUBLIC_SERVICE_PORT=80
MYSQL_PORT_3306_TCP_PROTO=tcp
PROXY_HTTP_PORT_8000_TCP_PORT=8000
KUBERNETES_SERVICE_HOST=10.96.0.1
MYSQL_SERVICE_HOST=10.97.130.38
MYSQL_PORT=tcp://10.97.130.38:3306
KUBERNETES_SERVICE_PORT=443
PROXY_PUBLIC_PORT_80_TCP_ADDR=10.104.118.81
HUB_PORT_8081_TCP_PORT=8081
PROXY_PUBLIC_SERVICE_HOST=10.104.118.81
PROXY_PUBLIC_PORT_443_TCP_ADDR=10.104.118.81
MYSQL_SERVICE_PORT=3306
PROXY_API_PORT_8001_TCP_ADDR=10.102.227.179
DEPLOY_HOOK_PORT_8888_TCP=tcp://10.106.22.165:8888
PROXY_PUBLIC_SERVICE_PORT_HTTP=80
PROXY_PUBLIC_SERVICE_PORT_HTTPS=443
PROXY_HTTP_PORT=tcp://10.110.41.203:8000
DEPLOY_HOOK_PORT=tcp://10.106.22.165:8888
DEPLOY_HOOK_PORT_8888_TCP_PROTO=tcp
PROXY_PUBLIC_PORT_443_TCP=tcp://10.104.118.81:443
PROXY_API_SERVICE_PORT=8001
PROXY_API_PORT=tcp://10.102.227.179:8001
PROXY_API_PORT_8001_TCP_PROTO=tcp
MYSQL_PORT_3306_TCP_PORT=3306
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
DEPLOY_HOOK_PORT_8888_TCP_ADDR=10.106.22.165
HUB_PORT_8081_TCP=tcp://10.110.164.126:8081
PROXY_PUBLIC_PORT_443_TCP_PORT=443
HUB_PORT=tcp://10.110.164.126:8081
PROXY_HTTP_PORT_8000_TCP=tcp://10.110.41.203:8000
HUB_SERVICE_PORT=8081
HUB_PORT_8081_TCP_ADDR=10.110.164.126
PROXY_HTTP_PORT_8000_TCP_PROTO=tcp
PROXY_PUBLIC_PORT_80_TCP_PROTO=tcp
MYSQL_PORT_3306_TCP_ADDR=10.97.130.38
HUB_SERVICE_HOST=10.110.164.126
PROXY_HTTP_SERVICE_HOST=10.110.41.203
KUBERNETES_PORT=tcp://10.96.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
DEPLOY_HOOK_PORT_8888_TCP_PORT=8888
PROXY_PUBLIC_PORT_80_TCP_PORT=80
HUB_PORT_8081_TCP_PROTO=tcp
LANG=C.UTF-8
HOME=/root
root@hub-7d68647bf4-pqm74:/proc# cat /proc/23/environ | tr '\0' '\n'
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LANG=C.UTF-8
JUPYTERHUB_SERVICE_NAME=cull-idle
JUPYTERHUB_API_TOKEN=REDACTED
JPY_API_TOKEN=REDACTED
JUPYTERHUB_CLIENT_ID=service-cull-idle
JUPYTERHUB_HOST=
JUPYTERHUB_OAUTH_CALLBACK_URL=oauth_callback
JUPYTERHUB_USER=
JUPYTERHUB_API_URL=http://10.110.164.126:8081/paws/hub/api
JUPYTERHUB_BASE_URL=/paws/

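The /proc/&lt;pid&gt;/environ files read above are NUL-separated KEY=value records; the tr '\0' '\n' trick just makes them line-oriented. The same parsing in Python, as a small standalone helper (not part of PAWS):

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the /proc/<pid>/environ format: NUL-separated KEY=value entries."""
    return {
        key.decode(): value.decode()
        for key, value in (
            entry.split(b'=', 1) for entry in raw.split(b'\0') if entry
        )
    }

sample = b'USER=tools.paws\x00HOME=/root\x00'
print(parse_environ(sample))  # {'USER': 'tools.paws', 'HOME': '/root'}
```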
I did manage to get past that error by creating the user inside the hub (see T188754); alas, that does not help with the userhome permission error.
Using a postStart k8s hook, or the modify_pod hook from kubespawner, to launch chown in the singleuser image also does not work due to lack of permissions (taking ownership away from root is usually frowned upon). Running a (root owned) chown on the user home likewise fails because the container lacks the permissions to do so.

Reading the upstream bug, it seems we have two workaround options. We can either:

  • Use the pv.beta.kubernetes.io/gid annotation on the PersistentVolume to set the GID for new directories (we could then use a postStart hook to set the owner as well, though that is probably not needed).
  • Create an init container in the singleuser pod that chowns the home as root and only allows the notebook container to start after it has run.

I will attempt to use the first solution.
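For reference, the second option would look roughly like this in kubespawner configuration. This is only a sketch: the container name, image, uid, and mount path are assumptions rather than the actual PAWS values, and it presumes kubespawner's init_containers setting is available in the deployed version.

```python
# Hypothetical jupyterhub_config.py fragment: an init container that runs
# as root, fixes ownership of the NFS-backed home, and must complete
# before the (unprivileged) notebook container starts.
c.KubeSpawner.init_containers = [{
    'name': 'chown-home',                      # assumed name
    'image': 'busybox',                        # any image that ships chown
    'command': ['sh', '-c', 'chown 52771:52771 /home/paws'],
    'securityContext': {'runAsUser': 0},       # root only inside the init container
    'volumeMounts': [{'name': 'home', 'mountPath': '/home/paws'}],
}]
```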

Chicocvenancio changed the task status from Open to Stalled.May 25 2018, 8:27 PM

AHA! Finally made the initContainer definition, https://github.com/yuvipanda/paws/pull/28.

Changing to stalled until T195217 is fixed (at least the https://paws-deploy-hook.wmflabs.org parts)

Chicocvenancio raised the priority of this task from Low to Unbreak Now!.

The paws-userhomes-hack.bash script seems to have died; new users are unable to log in to PAWS at the moment. Raising priority and assigning to the on-call person from WMCS.

To fix this, a tools root needs to run the /data/project/paws/paws-userhomes-hack.bash script as root, preferably on the grid engine. sudo jsub /data/project/paws/paws-userhomes-hack.bash should do it.

Currently affected users are:

tools.paws@tools-bastion-03:~$ find /data/project/paws/userhomes/ -maxdepth 1 -user root
/data/project/paws/userhomes/51990302
/data/project/paws/userhomes/8631757
/data/project/paws/userhomes/54479680
/data/project/paws/userhomes/1281
/data/project/paws/userhomes/54533986
/data/project/paws/userhomes/54469129
/data/project/paws/userhomes/54485747
/data/project/paws/userhomes/54475927
/data/project/paws/userhomes/52360419
/data/project/paws/userhomes/53338127
/data/project/paws/userhomes/46482467
/data/project/paws/userhomes/54491825
/data/project/paws/userhomes/5424947
/data/project/paws/userhomes/54476398
/data/project/paws/userhomes/54485761

For the record, this is the contents of the script:

tools.paws@tools-bastion-03:~$ cat paws-userhomes-hack.bash
#!/bin/bash
while true
do
    find /data/project/paws/userhomes/ -maxdepth 1 -user root | xargs -L1 chown -v tools.paws:tools.paws
    sleep 1
done

@Chicocvenancio

root@tools-bastion-03:~# test -e /data/project/paws-userhomes-hack.bash; echo $?
1

timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
+ true
+ find /data/project/paws/userhomes/ -maxdepth 1 -user root
+ xargs -L1 chown -v tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/51990302’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/8631757’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54479680’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/1281’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54533986’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54469129’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54485747’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54475927’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/52360419’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/53338127’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/46482467’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54491825’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/5424947’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54476398’ from root:root to tools.paws:tools.paws
changed ownership of ‘/data/project/paws/userhomes/54485761’ from root:root to tools.paws:tools.paws
+ sleep 1
+ true
+ find /data/project/paws/userhomes/ -maxdepth 1 -user root
+ xargs -L1 chown -v tools.paws:tools.paws
chown: missing operand after ‘tools.paws:tools.paws’
Try 'chown --help' for more information.

Note the chown: missing operand after ‘tools.paws:tools.paws’ error after the first run; it is ignorable as far as I know, since xargs still invokes chown when find returns nothing (GNU xargs' --no-run-if-empty flag would silence it).

Chicocvenancio lowered the priority of this task from Unbreak Now! to Low.
Chicocvenancio added a subscriber: Andrew.

Reducing priority as the workaround is back in place. Reassigning to myself for a permanent solution. Thanks for the fix @chasemp!

Fix deployed to PAWS. @chasemp, we can stop the hack if it decides to stay working this time.