Kerberos credential cache location
Closed, Declined (Public)

Description

By default, the Kerberos credential cache lives in /tmp/krb5cc_$userid, where $userid is the numeric uid of each user. The location can be changed, and we tried to do so in:

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594516/
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594519/
etc..

The idea was to move the default ccache location to /run/user/$userid/krb_cred, since it is a better location in a systemd environment (/run/user is on tmpfs, so it is flushed at every reboot, and it is also a more standard place for per-user state than /tmp), but for a variety of reasons it didn't work.

By default Java uses /tmp, and it doesn't read /etc/krb5.conf or similar to pick up different values; it seems to honor only the KRB5CCNAME environment variable. We thought that simply exporting it by default from somewhere like /etc/profile.d would be enough, but it turned out not to work out of the box with tools like JupyterHub.
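To make the lookup order described above concrete, here is a minimal sketch of how a client that only honors KRB5CCNAME would pick its cache path. The function name is hypothetical, and the fallback simply mirrors the /tmp/krb5cc_$userid default mentioned in this task; it is not taken from the Java sources.

```python
def resolve_ccache_path(environ, uid):
    """Return the ccache path a KRB5CCNAME-only client would end up using.

    `environ` is any mapping of environment variables, `uid` the numeric
    user id. krb5.conf is deliberately not consulted, to mirror the
    behaviour described in this task.
    """
    name = environ.get("KRB5CCNAME")
    if name:
        # The environment variable wins whenever it is set and non-empty.
        return name
    # Fallback: the conventional per-user file under /tmp.
    return "/tmp/krb5cc_{}".format(uid)
```

If the variable never reaches the process environment (no login shell involved, for example), the fallback branch is taken and the client keeps looking under /tmp, regardless of what krb5.conf says.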

One of the major benefits of not having the credential cache under /tmp is that it wouldn't cause consistency issues with PrivateTmp settings for daemons running via systemd, most notably nagios NRPE and Jupyter notebooks. In the nagios use case, our /mnt/hdfs readability checks cannot run easily, since kerberos-run-command populates the system /tmp while the nagios server (which forks the check) uses a private /tmp. In the Jupyter use case, the unit representing a user's notebook is not able to pick up the user's "system tmp" krb credentials, so the user has to kinit from the Jupyter terminal (creating multiple TGTs with different lifetimes).

Event Timeline

I presume the KRB5CCNAME in /etc/profile approach doesn't work due to there being no shell involved. Do you think using /etc/environment instead might work better? Or are there further problems beyond the env var not being set for some environments?

I re-analyzed the problem with a fresh mind, and I found this very useful GitHub issue: https://github.com/jupyter-incubator/sparkmagic/issues/466

What I did was:

  1. change krb5.conf on an-test-client1001, adding default_ccache_name = /run/user/%{uid}/krb_cred (the default is /tmp/krb5cc_%{uid})
  2. add the following bit to jupyterhub_config.py:
import pwd  # needed for the uid lookup below

import systemdspawner


class CustomSystemdSpawnerEnv(systemdspawner.SystemdSpawner):
    # Spawner that points every notebook unit at the user's ccache under /run/user.

    def get_env(self):
        env = super().get_env()
        # Resolve the numeric uid of the notebook owner and build the ccache path.
        uid = pwd.getpwnam(self.user.name).pw_uid
        env['KRB5CCNAME'] = "/run/user/{}/krb_cred".format(uid)
        return env


c.JupyterHub.spawner_class = CustomSystemdSpawnerEnv
  3. re-create my jupyter unit (systemctl stop etc., then log in again)

Result:

elukey@an-test-client1001:~$ sudo systemctl cat jupyter-elukey-singleuser.service  | grep Environment
Environment="PATH=/home/elukey/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" "LANG=en_US.UTF-8"....."KRB5CCNAME=/run/user/13926/krb_cred" "SHELL=/bin/bash"

I was able to klist on a jupyter terminal and see my "global" kerberos token, without doing any kinit. I was also able to create a spark kernel and spark.sql worked fine!
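Since the same path has to appear both in krb5.conf (as a template) and in the spawner environment (already expanded), one way to keep the two in sync is to derive the environment value from the template. This is only a sketch: both helper names are hypothetical, and only the %{uid} token is handled.

```python
def expand_ccache_template(template, uid):
    """Expand the %{uid} token of a default_ccache_name-style template."""
    return template.replace("%{uid}", str(uid))


def with_ccache(base_env, template, uid):
    """Return a copy of base_env with KRB5CCNAME derived from the template."""
    env = dict(base_env)  # never mutate the caller's mapping
    env["KRB5CCNAME"] = expand_ccache_template(template, uid)
    return env
```

With the template from step 1 and uid 13926, with_ccache() yields the same KRB5CCNAME value that shows up in the unit's Environment line.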

Change 649289 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::swap: allow to override the kerberos credential cache location

https://gerrit.wikimedia.org/r/649289

Change 649305 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: move credentials cache under /run for an-test-client1001

https://gerrit.wikimedia.org/r/649305

Change 649289 merged by Elukey:
[operations/puppet@production] profile::swap: allow to override the kerberos credential cache location

https://gerrit.wikimedia.org/r/649289

Change 649305 merged by Elukey:
[operations/puppet@production] kerberos: move credentials cache under /run for an-test-client1001

https://gerrit.wikimedia.org/r/649305

Change 649348 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] jupyterhub: add missing import to jupyterhub_config.py

https://gerrit.wikimedia.org/r/649348

Change 649348 merged by Elukey:
[operations/puppet@production] jupyterhub: add missing import to jupyterhub_config.py

https://gerrit.wikimedia.org/r/649348

Change 649355 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: move credential cache under /run on stat1004

https://gerrit.wikimedia.org/r/649355

Change 649355 merged by Elukey:
[operations/puppet@production] kerberos: move credential cache under /run on stat1004

https://gerrit.wikimedia.org/r/649355

Mentioned in SAL (#wikimedia-analytics) [2020-12-14T14:58:48Z] <elukey> stat1004's krb credential cache moved under /run (shared between notebooks and ssh/bash) - T255262

elukey triaged this task as Medium priority.
elukey edited projects, added Analytics-Kanban; removed Patch-For-Review.

Seems to work fine. Isaac is going to test the new settings on stat1004 over the next day and report any weirdness that comes up. Volunteers for testing on stat1004 are welcome!

Change 649415 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: explicitly set KRB5CCNAME

https://gerrit.wikimedia.org/r/649415

Change 649415 merged by Elukey:
[operations/puppet@production] kerberos: explicitly set KRB5CCNAME

https://gerrit.wikimedia.org/r/649415

Change 649495 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::presto::client: allow to customize the kerberos ccache path

https://gerrit.wikimedia.org/r/649495

Change 649495 merged by Elukey:
[operations/puppet@production] profile::presto::client: allow to customize the kerberos ccache path

https://gerrit.wikimedia.org/r/649495

I just fixed a couple of bugs:

  • kerberos-run-command wasn't working properly.
  • presto (client) was not picking up the new location of the credential cache.

The current state of the deployment is the following:

  • stat1004 is the only host with the new credential cache location under /run/user/$uid/krb_cred
  • kerberos-run-command was patched to always store credentials under /tmp/krb5cc_$uid, since /run/user/$uid directories are not created for system users like analytics, analytics-privatedata, etc. (or at least, this is my understanding after reading systemd's logind and pam configs).
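For illustration only, the constraint described in the second bullet can be expressed as a fallback: use the per-user runtime directory when it exists, and /tmp otherwise. This is not what the patch does (it always uses /tmp); the function name and the run_root/tmp_root parameters are hypothetical and exist mainly to make the logic testable.

```python
import os


def pick_ccache(uid, run_root="/run/user", tmp_root="/tmp"):
    """Pick a ccache path depending on whether /run/user/<uid> exists."""
    runtime_dir = os.path.join(run_root, str(uid))
    if os.path.isdir(runtime_dir):
        # A runtime directory exists for this uid (typically a logged-in user).
        return os.path.join(runtime_dir, "krb_cred")
    # No runtime directory (e.g. a system user): fall back to /tmp.
    return os.path.join(tmp_root, "krb5cc_{}".format(uid))
```

The downside of such a fallback is that the location then depends on whether the user currently has a runtime directory, which is harder to reason about than a single fixed path.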

The above works great for notebooks, but I feel it is a half-baked solution. I have also found an interesting problem, namely that kerberos-run-command may sometimes store credentials in an unexpected path or with unexpected ownership. For example, before this change, a command like the following forced credentials to be stored under /tmp/krb5cc_0 (for root):

# missing -u analytics after sudo
sudo kerberos-run-command hdfs hdfs dfs -ls

Now after my hack to set KRB5CCNAME to /tmp/krb5cc_$run_as_user, the above correctly stores credentials for uid=analytics, but with root-only read perms.

elukey@an-test-client1001:~$ id analytics
uid=498(analytics) gid=498(analytics) groups=498(analytics),731(analytics-privatedata-users)

elukey@an-test-client1001:~$ ls -l /tmp/krb5cc_498
-rw------- 1 analytics analytics 1053 Dec 18 09:15 /tmp/krb5cc_498
elukey@an-test-client1001:~$ sudo kerberos-run-command analytics hdfs dfs -ls
[..]
elukey@an-test-client1001:~$ ls -l /tmp/krb5cc_498
-rw------- 1 root root 1053 Dec 18 10:46 /tmp/krb5cc_498

After some thinking, using /tmp/krb5cc_0 is probably the sounder approach: if root is able to read the keytab, it is also able to get credentials in its own ccache.
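The ownership flip shown in the listing above (the file for uid 498 ending up owned by root) is easy to miss. A small check along these lines could flag it; the function name and the messages are hypothetical.

```python
import os
import stat


def ccache_problems(path, expected_uid):
    """Return a list of ownership/permission problems for a ccache file."""
    problems = []
    st = os.stat(path)
    if st.st_uid != expected_uid:
        problems.append(
            "owned by uid {} instead of {}".format(st.st_uid, expected_uid)
        )
    # Any group/other permission bit means the cache is not private.
    if stat.S_IMODE(st.st_mode) & 0o077:
        problems.append("accessible by group/other")
    return problems
```

Run against /tmp/krb5cc_498 after the sudo invocation above, with 498 as the expected uid, this would report the file as owned by uid 0.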

As a follow-up, what I'd do is something like the following:

  1. create a dir (via puppet) called /run/kerberos (on tmpfs)
  2. set the krb5.conf default ccache location to /run/kerberos/krb5cc_$uid
  3. set KRB5CCNAME to /run/kerberos/krb5cc_$uid as well

This way we'd have a location on tmpfs (cleaned up by the OS) that works for both real users and system users, without requiring any complex PAM setup. @MoritzMuehlenhoff any thoughts about this?

Change 650480 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::client: change alternative ccache location

https://gerrit.wikimedia.org/r/650480

Change 650499 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: rollback settings for ccache in /run/user/$uid

https://gerrit.wikimedia.org/r/650499

Change 650499 merged by Elukey:
[operations/puppet@production] kerberos: rollback settings for ccache in /run/user/$uid

https://gerrit.wikimedia.org/r/650499

I just rolled back the state on stat1004 and kept the settings only for an-test-client1001. The kerberos-run-command script has also been reverted to its original behavior.

One good piece of feedback that Isaac gave me is that the credential cache expired for him way quicker than 48h, so I suspect that the /run/user/$uid directories are periodically dropped/cleaned up. This is something to keep in mind and test!
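One way to test this suspicion is to sample whether the ccache file still exists at regular intervals and see when it disappears. The sketch below is only a starting point: the function name is hypothetical and the injectable sleep parameter exists to make it easy to test.

```python
import os
import time


def sample_ccache(path, samples, interval_s, sleep=time.sleep):
    """Record (timestamp, exists) pairs for a ccache path over time."""
    seen = []
    for i in range(samples):
        seen.append((time.time(), os.path.exists(path)))
        if i < samples - 1:
            # Wait between samples, but not after the last one.
            sleep(interval_s)
    return seen
```

For example, sample_ccache("/run/user/13926/krb_cred", samples=48, interval_s=3600) would cover two days at hourly resolution; comparing the first missing sample with the user's login/logout times would show whether the file vanishes together with the session.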

Change 655033 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: drop custom kerberos ccache settings

https://gerrit.wikimedia.org/r/655033

Change 655033 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: drop custom kerberos ccache settings

https://gerrit.wikimedia.org/r/655033

Change 655085 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] jupyterhub: avoid PrivateTmp for ephemeral systemd units for notebooks

https://gerrit.wikimedia.org/r/655085

Change 655085 merged by Elukey:
[operations/puppet@production] jupyterhub: avoid PrivateTmp for ephemeral systemd units for notebooks

https://gerrit.wikimedia.org/r/655085

Change 655090 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow jupyter notebooks to write to tmp

https://gerrit.wikimedia.org/r/655090

Change 655090 merged by Ottomata:
[operations/puppet@production] Allow jupyter notebooks to write to tmp

https://gerrit.wikimedia.org/r/655090

Change 655092 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow jupyter notebooks to write to tmp for newpyter too

https://gerrit.wikimedia.org/r/655092

Change 655092 merged by Ottomata:
[operations/puppet@production] Allow jupyter notebooks to write to tmp for newpyter too

https://gerrit.wikimedia.org/r/655092

Change 650480 abandoned by Elukey:
[operations/puppet@production] profile::kerberos::client: change alternative ccache location

Reason:
Not needed anymore, needs more work

https://gerrit.wikimedia.org/r/650480

I had a chat with Moritz about this: the change may have worked, but on /run/kerberos we wouldn't have had all the nice kernel hardening features that /tmp offers. A new location for the Kerberos credential cache will need to take this into account, but we can postpone it (until after Bigtop, for example).

elukey removed elukey as the assignee of this task.

For a lot of reasons, this was more complicated than expected, and a lot of clients had problems with it. We solved the main issue that brought us here, namely avoiding the need to kinit again in Jupyter notebooks (by sharing krb sessions), so I would just decline the task for the moment and re-open it in case we have time to work on it.