
jupyterhub-conda.service failure after hadooptest client bullseye upgrade
Closed, Resolved · Public · 5 Estimated Story Points

Authored By
Stevemunene
Mar 30 2023, 12:20 AM

Description

The jupyterhub-conda.service on our recently refreshed bullseye Hadoop test client an-test-client1002 is currently failing with an error similar to the one faced with systemd-timesyncd.service in T310643.
The error on an-test-client1002 looks like this:

Mar 29 09:33:32 an-test-client1002 systemd[391070]: jupyterhub-conda.service: Failed to set up mount namespacing: /run/systemd/unit-root/: Input/output error
Mar 29 09:33:32 an-test-client1002 systemd[391070]: jupyterhub-conda.service: Failed at step NAMESPACE spawning /opt/conda-analytics/bin/jupyterhub: Input/output error
Mar 29 09:33:32 an-test-client1002 systemd[1]: jupyterhub-conda.service: Main process exited, code=exited, status=226/NAMESPACE
Mar 29 09:33:32 an-test-client1002 systemd[1]: jupyterhub-conda.service: Failed with result 'exit-code'.

Event Timeline

We've seen this kind of error a couple of times before, related to bullseye upgrades, so I can give you a couple of pointers.

It seems to be caused by an interaction between systemd and the FUSE mount for HDFS we have beneath /mnt/hdfs.
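
A quick way to confirm that the FUSE mount is present on the host and to see the unit failure is the following (a sketch; the output lines shown as comments are illustrative only):

findmnt /mnt/hdfs
# TARGET     SOURCE    FSTYPE  OPTIONS
# /mnt/hdfs  fuse_dfs  fuse    ro,...
systemctl status jupyterhub-conda.service --no-pager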

What we have had to do in the past is add the InaccessiblePaths option to the systemd unit that is failing to start.
Specifically, this is done by adding a systemd override file.

Here are a couple of places where this has been done already.
https://codesearch.wmcloud.org/search/?q=InaccessiblePaths&files=&excludeFiles=&repos=

So if you look at the filesystem of a bullseye server that has /mnt/hdfs present, you will see how these override files get generated.

btullis@clouddumps1001:/etc/systemd$ grep -r Inacc *
system/systemd-logind.service.d/puppet-override.conf:InaccessiblePaths=-/mnt
system/systemd-timesyncd.service.d/puppet-override.conf:InaccessiblePaths=-/mnt
system/systemd-timedated.service.d/puppet-override.conf:InaccessiblePaths=-/mnt

The override files get merged in with the systemd unit files that come with the relevant package, so we don't have to modify the files themselves.

e.g. the following two files are merged when you want to control the systemd-timesyncd.service unit.

btullis@clouddumps1001:/etc/systemd$ ls -l /lib/systemd/system/systemd-timesyncd.service 
-rw-r--r-- 1 root root 1548 Aug  7  2022 /lib/systemd/system/systemd-timesyncd.service

btullis@clouddumps1001:/etc/systemd$ ls -l /etc/systemd/system/systemd-timesyncd.service.d/puppet-override.conf 
-r--r--r-- 1 root root 34 Aug 22  2022 /etc/systemd/system/systemd-timesyncd.service.d/puppet-override.conf

So I think that in this case we either:

  • want to add a similar override file for the jupyterhub-conda.service, which is found at /lib/systemd/system/jupyterhub-conda.service (a minimal example of such an override is sketched below)

or

  • modify the service file itself at the point of creation (we create it from a template here).

Whichever you think would be neater/better is fine by me.
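
For illustration, a minimal override drop-in along the lines of the first option might look like the following. This is a hand-written sketch mirroring the puppet-override.conf files shown above; in practice the file would be generated by Puppet rather than created like this, and the file name here is hypothetical.

sudo mkdir -p /etc/systemd/system/jupyterhub-conda.service.d
sudo tee /etc/systemd/system/jupyterhub-conda.service.d/override.conf <<'EOF' >/dev/null
[Service]
# Hide the HDFS FUSE mount from this unit's mount namespace.
# The leading '-' tells systemd not to fail if the path does not exist.
InaccessiblePaths=-/mnt
EOF
sudo systemctl daemon-reload
sudo systemctl restart jupyterhub-conda.service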

Change 904617 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Jupyterhub-conda exclude /mnt from accessible paths

https://gerrit.wikimedia.org/r/904617

Hm, I'd expect use of /mnt/hdfs to be something that Jupyter users would expect to work? Maybe they don't use it and don't need it, but we should check first?


It's a great call, but I don't think it works at the moment. (Unless I'm missing something, which is easily possible.)

From the stat servers, in a normal SSH shell, one can only access /mnt/hdfs as the analytics-privatedata user.

btullis@stat1005:~$ ls -l /mnt/hdfs
ls: cannot access '/mnt/hdfs': Input/output error
btullis@stat1005:~$ sudo -u analytics-privatedata ls -l /mnt/hdfs
total 20
drwxr-xr-x   2 hdfs hadoop 4096 Mar 31 06:00 system
drwxrwxrwt  48 hdfs hdfs   4096 Mar 31 13:19 tmp
drwxrwxr-x 321 hdfs hadoop 4096 Mar 31 11:43 user
drwxr-xr-x   4 hdfs hdfs   4096 Jul 11  2014 var
drwxr-xr-x   8 hdfs hadoop 4096 Feb 10  2022 wmf

Trying this same operation from JupyterHub gives me an error, like this:

image.png (235×1 px, 28 KB)

So I'd be surprised if anyone is using the FUSE mount from JupyterHub at the moment. We can certainly ask though.

Oh whoops, I am missing something. What an idiot.

btullis@stat1005:~$ kinit
Password for btullis@WIKIMEDIA: 
btullis@stat1005:~$ ls -l /mnt/hdfs
total 20
drwxr-xr-x   2 hdfs hadoop 4096 Mar 31 06:00 system
drwxrwxrwt  49 hdfs hdfs   4096 Mar 31 13:20 tmp
drwxrwxr-x 321 hdfs hadoop 4096 Mar 31 11:43 user
drwxr-xr-x   4 hdfs hdfs   4096 Jul 11  2014 var
drwxr-xr-x   8 hdfs hadoop 4096 Feb 10  2022 wmf

(screenshot: F36935360)
So yes, removing access to /mnt/hdfs might well have a greater impact than first thought. Thanks @Ottomata.

I've been digging at this; so far it seems that JupyterHub can't locate the custom WMF spawners class. Manually running the exec script /opt/conda-analytics/bin/jupyterhub troubleshoot --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl has this as the result.

stevemunene@an-test-client1002:/opt/conda-analytics/bin$ sudo ./jupyterhub troubleshoot --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl
[E 2023-04-20 08:15:44.617 JupyterHub app:2989]
    Traceback (most recent call last):
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2986, in launch_instance_async
        await self.initialize(argv)
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2477, in initialize
        self.load_config_file(self.config_file)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
        return method(app, *args, **kwargs)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 889, in load_config_file
        for (config, fname) in self._load_config_files(
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 848, in _load_config_files
        config = loader.load_config()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 625, in load_config
        self._read_file_as_dict()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 658, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)
      File "/etc/jupyterhub-conda/jupyterhub_config.py", line 14, in <module>
        from spawners import CondaEnvProfilesSpawner
    ModuleNotFoundError: No module named 'spawners'

Checking the installed modules with ./jupyter troubleshoot (run from /opt/conda-analytics/bin on an-test-client1002) shows that all modules are installed as expected, and the output is similar across an-test-client1001 and an-test-client1002.
I tried manually specifying the sudospawner path in two ways, with both exiting as before.

stevemunene@an-test-client1002:/opt/conda-analytics/bin$ ./jupyterhub troubleshoot --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl --SudoSpawner.sudospawner_path='/etc/jupyterhub-conda/spawnwers.py'
[E 2023-04-20 11:44:29.092 JupyterHub app:2989]
    Traceback (most recent call last):
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2986, in launch_instance_async
        await self.initialize(argv)
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2477, in initialize
        self.load_config_file(self.config_file)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
        return method(app, *args, **kwargs)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 889, in load_config_file
        for (config, fname) in self._load_config_files(
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 848, in _load_config_files
        config = loader.load_config()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 625, in load_config
        self._read_file_as_dict()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 658, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)
      File "/etc/jupyterhub-conda/jupyterhub_config.py", line 14, in <module>
        from spawners import CondaEnvProfilesSpawner
    ModuleNotFoundError: No module named 'spawners'

With the line c.SudoSpawner.sudospawner_path = '/etc/jupyterhub-conda/spawnwers.py' in a custom config file, the result is the same.

stevemunene@an-test-client1002:/opt/conda-analytics/bin$ ./jupyterhub --config=/home/stevemunene/jupyterhub_config.py --no-ssl
[E 2023-04-20 11:55:59.125 JupyterHub app:2989]
    Traceback (most recent call last):
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2986, in launch_instance_async
        await self.initialize(argv)
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2477, in initialize
        self.load_config_file(self.config_file)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
        return method(app, *args, **kwargs)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 889, in load_config_file
        for (config, fname) in self._load_config_files(
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 848, in _load_config_files
        config = loader.load_config()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 625, in load_config
        self._read_file_as_dict()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 658, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)
      File "/home/stevemunene/jupyterhub_config.py", line 14, in <module>
        from spawners import CondaEnvProfilesSpawner
    ModuleNotFoundError: No module named 'spawners'
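
One way to narrow down a ModuleNotFoundError like this is to check where the module file lives and whether the interpreter can import it when that directory is on the path. A sketch, assuming the custom module is /etc/jupyterhub-conda/spawners.py and that the conda-analytics environment provides /opt/conda-analytics/bin/python:

ls -l /etc/jupyterhub-conda/spawners.py
# Without PYTHONPATH the import fails, matching the tracebacks above:
/opt/conda-analytics/bin/python -c 'import spawners'
# With /etc/jupyterhub-conda on PYTHONPATH it should succeed:
PYTHONPATH=/etc/jupyterhub-conda /opt/conda-analytics/bin/python -c 'import spawners; print("ok")'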

Change 904617 abandoned by Stevemunene:

[operations/puppet@production] Jupyterhub-conda exclude /mnt from accessible paths

Reason:

We established that removing access to /mnt/hdfs might have a greater impact than first thought.

https://gerrit.wikimedia.org/r/904617

If you look at the jupyterhub systemd unit on an-test-client1001 at /lib/systemd/system/jupyterhub-conda.service, you'll see:

# Our custom CondaEnvProfilesSpawner class is here.
# Make sure it can be loaded.
Environment=PYTHONPATH=/etc/jupyterhub-conda

So, if you want to run our jupyterhub on the CLI, you need to make sure this is set:

PYTHONPATH=/etc/jupyterhub-conda /opt/conda-analytics/bin/jupyterhub --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl
JArguello-WMF set the point value for this task to 5. Apr 25 2023, 2:22 PM


Thanks @Ottomata. However, this still results in the same error both on an-test-client1001, where the jupyterhub-conda.service is working as expected, and on an-test-client1002, where the service is unable to start.

stevemunene@an-test-client1002:~$ export PYTHONPATH=/etc/jupyterhub-conda
stevemunene@an-test-client1002:~$ /opt/conda-analytics/bin/jupyterhub --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl
[I 2023-04-27 12:00:24.008 JupyterHub app:2479] Running JupyterHub version 1.5.0
[I 2023-04-27 12:00:24.008 JupyterHub app:2509] Using Authenticator: builtins.PosixGroupCheckingAuthenticator
[I 2023-04-27 12:00:24.008 JupyterHub app:2509] Using Spawner: spawners.CondaEnvProfilesSpawner
[I 2023-04-27 12:00:24.008 JupyterHub app:2509] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.5.0
[I 2023-04-27 12:00:24.017 JupyterHub app:1554] Loading cookie_secret from /srv/jupyterhub-conda/data/jupyterhub_cookie_secret
[E 2023-04-27 12:00:24.017 JupyterHub app:1584] Refusing to run JupyterHub with invalid cookie_secret_file. /srv/jupyterhub-conda/data/jupyterhub_cookie_secret error was: [Errno 13] Permission denied: '/srv/jupyterhub-conda/data/jupyterhub_cookie_secret'
[D 2023-04-27 12:00:24.017 JupyterHub application:967] Exiting application: jupyterhub
stevemunene@an-test-client1002:~$ sudo /opt/conda-analytics/bin/jupyterhub --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl
[E 2023-04-27 12:00:40.484 JupyterHub app:2989]
    Traceback (most recent call last):
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2986, in launch_instance_async
        await self.initialize(argv)
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2477, in initialize
        self.load_config_file(self.config_file)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
        return method(app, *args, **kwargs)
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 889, in load_config_file
        for (config, fname) in self._load_config_files(
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/application.py", line 848, in _load_config_files
        config = loader.load_config()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 625, in load_config
        self._read_file_as_dict()
      File "/opt/conda-analytics/lib/python3.10/site-packages/traitlets/config/loader.py", line 658, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)
      File "/etc/jupyterhub-conda/jupyterhub_config.py", line 14, in <module>
        from spawners import CondaEnvProfilesSpawner
    ModuleNotFoundError: No module named 'spawners'

Hm, I don't get that error when I run as my user on either client. I get:

Refusing to run JupyterHub with invalid cookie_secret_file. /srv/jupyterhub-conda/data/jupyterhub_cookie_secret error was: [Errno 13] Permission denied: '/srv/jupyterhub-conda/data/jupyterhub_cookie_secret'

Which makes sense, because my user can't read that.

If I sudo -i, and then try (with PYTHONPATH set), it starts up fine.
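
For reference, the sequence that works is roughly this (a sketch combining the PYTHONPATH hint above with a root shell, so that the cookie secret file is readable):

sudo -i
export PYTHONPATH=/etc/jupyterhub-conda
/opt/conda-analytics/bin/jupyterhub --config=/etc/jupyterhub-conda/jupyterhub_config.py --no-ssl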

Just confirmed, the CLI works for me as well with sudo -i.
Still trying to figure out why the service keeps failing.

I noticed some failed Puppet runs on the host an-test-client1002, despite some successful runs a while back.
These were:

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230428-3729498-1hgjjau.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230428-3729498-1hgjjau.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)
Wrapped exception:
No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230428-3729498-1hgjjau.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Jupyterhub::Server/File[/etc/jupyter/jupyter_notebook_config.py]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230428-3729498-1hgjjau.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230428-3729498-1hs3jq7.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230428-3729498-1hs3jq7.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164)
Wrapped exception:
No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230428-3729498-1hs3jq7.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Bigtop::Hive/File[/usr/lib/hive/bin/ext/hiveserver2.sh]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230428-3729498-1hs3jq7.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164) (corrective)

These occur when Puppet tries to create a file but the parent directory is not available (ref PUP-10545), which we might need to adjust.

I manually created the /etc/jupyter and /usr/lib/hive/bin/ext/ directories on an-test-client1002 to test whether, once they were available, they might be the missing link to get the jupyterhub-conda.service running. However, there was no change in behavior for the jupyterhub-conda.service. Moreover, the Hive server error still persists and the created directories no longer exist.

I accessed the hub from my machine via ssh -N an-test-client1002.eqiad.wmnet -L 8880:127.0.0.1:8880, then in my browser went to 127.0.0.1:8880 and successfully logged in as detailed here.
I have access to most features per my account privileges, but creating a conda env fails with the following error:

Running as unit: jupyter-stevemunene-singleuser-conda-analytics.service
[D 2023-05-02 08:20:00.874 JupyterHub spawner:1179] Polling subprocess every 30s
[I 2023-05-02 08:20:01.569 JupyterHub pages:402] stevemunene is pending spawn
[I 2023-05-02 08:20:01.593 JupyterHub log:189] 200 GET /hub/spawn-pending/stevemunene (stevemunene@127.0.0.1) 33.32ms
Task exception was never retrieved
future: <Task finished name='Task-110' coro=<BaseHandler.spawn_single_user() done, defined at /opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/handlers/base.py:796> exception=HTTPError()>
Traceback (most recent call last):
  File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/handlers/base.py", line 996, in spawn_single_user
    await gen.with_timeout(
asyncio.exceptions.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/handlers/base.py", line 1030, in spawn_single_user
    raise web.HTTPError(
tornado.web.HTTPError: HTTP 500: Internal Server Error (Spawner failed to start [status=1]. The logs for stevemunene may contain details.)
[W 2023-05-02 08:21:49.084 JupyterHub user:767] stevemunene's server never showed up at http://127.0.0.1:37569/user/stevemunene/ after 120 seconds. Giving up
[D 2023-05-02 08:21:49.086 JupyterHub user:819] Stopping stevemunene
[D 2023-05-02 08:21:49.168 JupyterHub user:845] Deleting oauth client jupyterhub-user-stevemunene
[D 2023-05-02 08:21:49.188 JupyterHub user:848] Finished stopping stevemunene
[E 2023-05-02 08:21:49.208 JupyterHub gen:630] Exception in Future <Task finished name='Task-111' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/handlers/base.py:900> exception=TimeoutError("Server at http://127.0.0.1:37569/user/stevemunene/ didn't respond in 120 seconds")> after timeout
    Traceback (most recent call last):
      File "/opt/conda-analytics/lib/python3.10/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
        await spawn_future
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/user.py", line 748, in spawn
        await self._wait_up(spawner)
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/user.py", line 795, in _wait_up
        raise e
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/user.py", line 762, in _wait_up
        resp = await server.wait_up(
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/utils.py", line 236, in wait_for_http_server
        re = await exponential_backoff(
      File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Server at http://127.0.0.1:37569/user/stevemunene/ didn't respond in 120 seconds

image.png (634×1 px, 86 KB)

Still looking into this; investigation ongoing.

TimeoutError("Server at http://127.0.0.1:37569/user/stevemunene/ didn't respond in 120 seconds")

Those timeouts are rare, but do happen; I have seen them on stat1007, for example. Perhaps we should increase the timeout a bit?

120 is pretty long, but perhaps conda env creation does take longer than that? It shouldn't though, IIRC it just has to clone the /opt/conda-analytics env, it doesn't have to download anything.

Did the conda env actually get created? You can look in ~/.conda/environments.txt and ~/.conda/envs.

Did the jupyter singleuser server actually get spawned?
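
A quick way to check both (a sketch; the per-user unit name follows the jupyter-<user>-singleuser-conda-analytics pattern seen elsewhere in this task):

# Was a conda environment created?
cat ~/.conda/environments.txt
ls ~/.conda/envs/
# Did the singleuser server get spawned? Check its (transient) unit:
systemctl status "jupyter-${USER}-singleuser-conda-analytics.service" --no-pager
journalctl -u "jupyter-${USER}-singleuser-conda-analytics.service" --since today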

The env was not created while running the process, and neither was the singleuser service.

However, I did try to manually create the env, but got the standard failure and a recommendation to have it created via JupyterHub.

recommendation to have it created via jupyterhub

I didn't know of such recommendation. How did you manually create it?

By running exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh __NEW__ --port=45325 --SingleUserNotebookApp.default_url=/lab
Creation starts and the env details are copied, but it cannot start since it was not launched from JupyterHub.

I've also been doing a little digging. Unsuccessfully so far, but I'll report what I've seen.

From my research T333511#8741011 above, adding the InaccessiblePaths=-/mnt option to /lib/systemd/system/jupyterhub-conda.service does cause the jupyterhub-conda service to start successfully.

However, I/we made two assumptions at this point:

  1. The jupyter-btullis-singleuser-conda-analytics.service unit (replace btullis with your username) would start correctly.
  2. The /mnt/hdfs path would indeed be inaccessible to users, from within their spawned jupyterhub singleuser environment.

I went about trying to test these assumptions, so the first thing I did was to:

  • Disabled puppet on an-test-client1002
  • Added the InaccessiblePaths=-/mnt option to /lib/systemd/system/jupyterhub-conda.service, then executed sudo systemctl daemon-reload followed by sudo systemctl restart jupyterhub-conda.service
  • Verified that the jupyterhub-conda service had started with: sudo systemctl status jupyterhub-conda.service
  • Created an SSH tunnel with: ssh -N an-test-client1002.eqiad.wmnet -L 8880:127.0.0.1:8880

Then I logged in and went to create a new conda environment from the web interface.

I got the same type of failure as seen by @Stevemunene in T333511#8819254

At this point, I checked the logs of the jupyter-btullis-singleuser-conda-analytics.service, which had attempted to start but failed.

btullis@an-test-client1002:~$ journalctl -u jupyter-btullis-singleuser-conda-analytics.service 
-- Journal begins at Wed 2023-05-03 06:16:23 UTC, ends at Wed 2023-05-03 08:49:09 UTC. --
May 03 08:46:56 an-test-client1002 systemd[1]: Started /bin/bash -c cd /home/btullis && exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh __NEW__ --port=36745 --SingleUserNotebookApp.default_url=/la>
May 03 08:46:56 an-test-client1002 systemd[1519774]: jupyter-btullis-singleuser-conda-analytics.service: Failed to set up mount namespacing: /run/systemd/unit-root/srv/spark-tmp: No such file or directory
May 03 08:46:56 an-test-client1002 systemd[1519774]: jupyter-btullis-singleuser-conda-analytics.service: Failed at step NAMESPACE spawning /bin/bash: No such file or directory
May 03 08:46:56 an-test-client1002 systemd[1]: jupyter-btullis-singleuser-conda-analytics.service: Main process exited, code=exited, status=226/NAMESPACE
May 03 08:46:56 an-test-client1002 systemd[1]: jupyter-btullis-singleuser-conda-analytics.service: Failed with result 'exit-code'.

This is interesting, because it says /bin/bash: No such file or directory.

I think that we should go back to what @Ottomata said in T333511#8820493

Did the conda env actually get created? You can look in ~/.conda/environments.txt and ~/.conda/envs.

Let's see if we can create one manually.

To start with, we definitely don't have a conda environment, because there is no .conda directory on this host.

btullis@an-test-client1002:~$ ls -l .conda
ls: cannot access '.conda': No such file or directory

I then refer to https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/conda-analytics to check how to create one manually.

I check for any available conda environments, which just shows the base.

btullis@an-test-client1002:~$ conda-analytics-list
# conda environments:
#
base                  *  /opt/conda-analytics

I then create one manually:

btullis@an-test-client1002:~$ conda-analytics-clone btullis-test
Creating new cloned conda env btullis-test...
Source:      /opt/conda-analytics
Destination: /home/btullis/.conda/envs/btullis-test
The following packages cannot be cloned out of the root environment:
 - conda-forge/linux-64::conda-4.13.0-py310hff52083_2
Packages: 198
Files: 2592
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate btullis-test
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/btullis/.conda/envs/btullis-test

  added / updated specs:
    - conda=4.13.0


The following NEW packages will be INSTALLED:

  conda              conda-forge/linux-64::conda-4.13.0-py310hff52083_2


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Wed 03 May 2023 09:14:29 AM UTC Created user conda environment btullis-test

To activate this environment with vanilla conda run:
  source /opt/conda-analytics/etc/profile.d/conda.sh
  conda activate btullis-test

Alternatively, you can use the conda-analytic helper script:
  source conda-analytics-activate btullis-test

This shows up in the list:

btullis@an-test-client1002:~$ conda-analytics-list
# conda environments:
#
btullis-test             /home/btullis/.conda/envs/btullis-test
base                  *  /opt/conda-analytics

...and we can now see it from the terminal, as expected.

btullis@an-test-client1002:~$ ls -l .conda/envs/
total 4
drwxr-sr-x 16 btullis wikidev 4096 May  3 09:14 btullis-test

So we have verified that creation of a conda environment worked. I didn't time it, but it felt like it took less than 120 seconds.
I'll now try launching a jupyterhub singleuser with this existing environment.

It took a restart of the jupyterhub-conda service before it appeared in the list, but this conda environment is now available for selection.

image.png (403×1 px, 34 KB)

Immediately upon selecting the pre-existing conda environment and clicking Start, we see the same errors in my personal jupyter-btullis-singleuser-conda-analytics.service log:

May 03 09:23:16 an-test-client1002 systemd[1]: Started /bin/bash -c cd /home/btullis && exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh /home/btullis/.conda/envs/btullis-test --port=36677 --SingleUserNotebookApp.default_url=/lab.
May 03 09:23:16 an-test-client1002 systemd[1522605]: jupyter-btullis-singleuser-conda-analytics.service: Failed to set up mount namespacing: /run/systemd/unit-root/srv/spark-tmp: No such file or directory
May 03 09:23:16 an-test-client1002 systemd[1522605]: jupyter-btullis-singleuser-conda-analytics.service: Failed at step NAMESPACE spawning /bin/bash: No such file or directory
May 03 09:23:16 an-test-client1002 systemd[1]: jupyter-btullis-singleuser-conda-analytics.service: Main process exited, code=exited, status=226/NAMESPACE
May 03 09:23:16 an-test-client1002 systemd[1]: jupyter-btullis-singleuser-conda-analytics.service: Failed with result 'exit-code'.

This makes me think that the failure is somewhere before the creation of the conda environment even happens.

We looked at the error message and checked the name of the first directory mentioned:

Failed to set up mount namespacing: /run/systemd/unit-root/srv/spark-tmp: No such file or directory

Checking on an-test-client1001, the directory /srv/spark-tmp is present with permissions 1777,
so we created it on an-test-client1002:

btullis@an-test-client1002:~$ sudo mkdir /srv/spark-tmp
btullis@an-test-client1002:~$ sudo chmod 1777 /srv/spark-tmp/

Then when we tried to spawn the server, the error message was slightly different.

May 03 09:34:38 an-test-client1002 systemd[1523403]: jupyter-btullis-singleuser-conda-analytics.service: Failed to set up mount namespacing: /run/systemd/unit-root/: Input/output error
May 03 09:34:38 an-test-client1002 systemd[1523403]: jupyter-btullis-singleuser-conda-analytics.service: Failed at step NAMESPACE spawning /bin/bash: Input/output error

At this point we went back to the very first troubleshooting step, which was to umount /mnt/hdfs.

This caused jupyter-btullis-singleuser-conda-analytics.service to start successfully.

So we know that it's still inextricably linked to the HDFS-FUSE mount. However, instead of being configured by the options in the parent service unit file (/lib/systemd/system/jupyterhub-conda.service), each of these singleuser services is a transient systemd service, so it doesn't have a unit file other than the one created dynamically in /run/systemd/transient/.

btullis@an-test-client1002:/run/systemd/transient$ ls -l
total 24
-rw-r--r-- 1 root root 1105 May  3 09:39 jupyter-btullis-singleuser-conda-analytics.service
-rw-r--r-- 1 root root 1175 May  2 08:20 jupyter-stevemunene-singleuser-conda-analytics.service
-rw-r--r-- 1 root root  436 May  3 08:44 session-10814.scope
-rw-r--r-- 1 root root  436 May  3 08:46 session-10816.scope
-rw-r--r-- 1 root root  436 May  3 08:49 session-10817.scope
-rw-r--r-- 1 root root  444 May  3 09:09 session-10820.scope

I think we need to figure out how to modify this at creation time, in order to support the HDFS-FUSE mount safely.
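
To see exactly which sandboxing directives the spawner passed into the transient unit, we can read the generated file directly or query systemd for the relevant properties (a sketch; the property names are standard systemd exec directives):

cat /run/systemd/transient/jupyter-btullis-singleuser-conda-analytics.service
systemctl show jupyter-btullis-singleuser-conda-analytics.service \
    -p ReadOnlyPaths -p ReadWritePaths -p InaccessiblePaths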

We have found a workaround. Not 100% sure it's the best option yet, but at least it is an option.

The systemdspawner has the following configuration option in /etc/jupyterhub-conda/jupyterhub_config.py

c.SystemdSpawner.readonly_paths = ['/']

We configure this option here: https://github.com/wikimedia/operations-puppet/blob/production/modules/jupyterhub/templates/config/jupyterhub_config.py.erb#L272

The documentation for the feature is here: https://github.com/jupyterhub/systemdspawner#readonly_paths

Commenting out the option results in the default value of None being configured.

This allows the service to be spawned and provides access to /mnt/hdfs for the resulting singleuser service:

image.png (460×988 px, 65 KB)
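
Verifying the workaround amounts to restarting the hub with the option removed, spawning a server from the web UI, and checking the mount from the spawned session (a sketch):

sudo systemctl restart jupyterhub-conda.service
# then spawn a notebook server from the web UI and, from a notebook terminal:
ls /mnt/hdfs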

The main question is: is there any significant security implication to disabling this feature and relying on the native file system permissions of the host?
I don't see any immediate threat by doing so.

A follow-up question is, could there be any other solutions, perhaps something along the lines of:

  1. Using the unit_extra_properties option.
  2. Fixing the systemdspawner itself so that it doesn't throw an error if it encounters a FUSE mount that it cannot access.

Change 904617 restored by Btullis:

[operations/puppet@production] Jupyterhub-conda exclude /mnt from accessible paths

https://gerrit.wikimedia.org/r/904617

I've done a little more testing and I don't believe that there's any other realistic workaround for us, so I've marked the change as ready for review.

  • I tried setting InaccessiblePaths=-/mnt using the unit_extra_properties option of the systemd spawner.
  • I tried adding /mnt/hdfs to the readwrite_paths option, just in case it avoided the error. It's mounted read-only anyway.
  • I examined the source of the systemdspawner to see if there was any way to change the behaviour at this level, but it seems there is not. It simply passes systemd.exec directives through to systemd.

Therefore, I believe that disabling the readonly_paths option for bullseye and above is the best option.

I modified the patch so that it:

  • doesn't affect buster
  • explicitly sets readonly_paths = None on bullseye and above
  • adds some documentation explaining the change

Change 904617 merged by Btullis:

[operations/puppet@production] jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount

https://gerrit.wikimedia.org/r/904617

This all seems to work on an-test-client1002 now. Resolving this ticket.