Improve user experience for Kerberos by creating automatic token renewal service
Closed, Resolved (Public)

Description

Update - August 2021

Having tested the functionality of automatically renewing kerberos tickets on login, the consensus is that:

  • users would appreciate the ability to keep their ticket renewed by the use of a background service
  • we should look at developing and deploying this service on all kerberos enabled hosts by default, in order to improve the user experience

Renaming this ticket accordingly.

</update>

Not sure why I haven't realized this before, but the "renew" part of the kerberos credentials is something really handy for every user. For example:

elukey@stat1004:~$ klist
Ticket cache: FILE:/tmp/krb5cc_13926
Default principal: elukey@WIKIMEDIA

Valid starting       Expires              Service principal
11/30/2020 08:58:08  12/02/2020 08:58:05  krbtgt/WIKIMEDIA@WIKIMEDIA
	renew until 12/07/2020 08:58:05

Two dates are highlighted:

  • 12/02/2020 08:58:05 - my ticket expires in approx two days
  • 12/07/2020 08:58:05 - max renew date for my ticket

The latter means that until that date, I can simply run kinit -R (without a password) and get a refreshed ticket (that will last for 48h).
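As a rough illustration of how the two dates could be pulled out of the klist output for display, here is a minimal sketch. The function name klist_timings is hypothetical, and it assumes the default MIT klist text layout shown above, which is not a stable interface and may vary with locale or version.

```shell
# Print the "Expires" and "renew until" timestamps of the krbtgt entry,
# reading klist-style text from standard input.
# Assumption: the layout matches the sample output quoted in this task.
klist_timings() {
    awk '
        # The TGT line has five fields: start date, start time,
        # expiry date, expiry time and the service principal.
        $5 ~ /^krbtgt\// { expires = $3 " " $4 }
        # The renewal deadline is printed on its own indented line.
        $1 == "renew" && $2 == "until" { renew = $3 " " $4 }
        END {
            if (expires != "") print "expires: " expires
            if (renew != "")   print "renew until: " renew
        }
    '
}
```

It would be used as `klist | klist_timings`; when no ticket is present the function prints nothing.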

What if we:

  • add to the stat100x's motd (or even more general, every kerberos client) some details about the current expire/renew timings
  • execute kinit -R automatically upon login for every user

We'd need to add some guards of course, like not doing it if the ticket cache is empty, but it should be very simple and in theory allow people to kinit way less often than now.
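A minimal sketch of what such a guarded renewal could look like follows. The function name renew_ticket_on_login, the KLIST and KINIT override variables and the messages are all assumptions made for illustration; this is not the script that was eventually deployed.

```shell
# Commands are overridable so that the logic can be exercised without a KDC.
: "${KLIST:=klist}"
: "${KINIT:=kinit}"

renew_ticket_on_login() {
    # klist -s prints nothing and exits non-zero when the credential
    # cache is missing, unreadable or expired.
    if ! "$KLIST" -s; then
        echo "No valid Kerberos ticket found, remember to kinit."
        return 0
    fi
    # kinit -R renews without a password; it fails once the
    # "renew until" date has passed or the ticket is not renewable.
    if "$KINIT" -R; then
        echo "Kerberos ticket renewed."
    else
        echo "Could not renew the Kerberos ticket, please run kinit."
    fi
}
```

The function always returns 0 so that a failed renewal never interferes with the login itself.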

Event Timeline

execute kinit -R automatically upon login for every user

THIS WOULD BE AWESOME!

Ottomata triaged this task as Medium priority.
Ottomata edited projects, added Analytics-Clusters; removed Analytics.

This would be awesome! Is there a way to do this for Jupyter as well, since kinit within Jupyter is distinct from outside of it?

Change 645320 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::client: improve the user experience with kinit

https://gerrit.wikimedia.org/r/645320

This would be awesome! Is there a way to do this for Jupyter as well, since kinit within Jupyter is distinct from outside of it?

Not sure but I'll add this to the tests :)

Change 645320 merged by Elukey:
[operations/puppet@production] profile::kerberos::client: improve the user experience with kinit

https://gerrit.wikimedia.org/r/645320

Little update: I added a script under /etc/profile.d that informs the user about the need for a kinit or not right after ssh. For example:

ssh stat1004.eqiad.wmnet

[...]
Debian GNU/Linux 10 auto-installed on Tue Oct 13 08:07:24 UTC 2020.
Last login: Fri Dec  4 16:22:39 2020 from 2620:0:861:3:208:80:154:86

Found a valid Kerberos ticket in the credential cache:
Ticket cache: FILE:/tmp/krb5cc_13926
Default principal: elukey@WIKIMEDIA
[..]

elukey@stat1004:~$ kdestroy

ssh stat1004.eqiad.wmnet
[..]
Debian GNU/Linux 10 auto-installed on Tue Oct 13 08:07:24 UTC 2020.
Last login: Fri Dec  4 18:20:40 2020 from 2620:0:861:3:208:80:154:86

You do not have a valid Kerberos ticket in the credential cache, remember to kinit.

The next step is to explore how krenew works (https://linux.die.net/man/1/krenew) to automatically renew Kerberos credentials for stat100x users.

elukey removed elukey as the assignee of this task. Jun 1 2021, 8:13 AM

Change 705356 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the kstart package to all kerberos clients

https://gerrit.wikimedia.org/r/705356

Having looked in detail at the kstart package we can see that it does not install any daemon, nor run any pre/post install scripts. (https://packages.debian.org/buster/amd64/kstart/filelist)

Therefore I suggest that we install this package on all kerberos clients so that we can start testing the krenew functionality. We can start by adding it to user owned profile config files (e.g. .bashrc) before making another patch to add it to the system-wide configuration.

I have updated the patch with comments from @elukey so that a parameter enable_autorenew is present with a default of false.
The parameter is overridden to be true for the profile::analytics_test_cluster::client role.

Rather than merging this now and testing manually, I will mark the patch as WIP and work on the autorenew process itself.

This first patch is ready for merging, which will install the kstart package and enable auto-renew functionality on an-test-client1001.
We can then deploy this as-is across all kerberized servers, or look instead at keeping tickets renewed with user daemons.

Change 705356 merged by Btullis:

[operations/puppet@production] Enable kerberos ticket auto-renewal for a test client

https://gerrit.wikimedia.org/r/705356

This appears to work as expected. The only question I have is whether the users would prefer more feedback about the renewed ticket lifespan, or whether we can promote this to production as-is.

The motd now looks like the following when logging into an-test-client1001.eqiad.wmnet

btullis@marlin:~$ ssh an-test-client1001.eqiad.wmnet 
Linux an-test-client1001 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64
Debian GNU/Linux 10 (buster)

  _  __         _               _             _   _               _   
 | |/ /        | |             (_)           | | | |             | |  
 | ' / ___ _ __| |__   ___ _ __ _ _______  __| | | |__   ___  ___| |_ 
 |  < / _ \ '__| '_ \ / _ \ '__| |_  / _ \/ _` | | '_ \ / _ \/ __| __|
 | . \  __/ |  | |_) |  __/ |  | |/ /  __/ (_| | | | | | (_) \__ \ |_ 
 |_|\_\___|_|  |_.__/ \___|_|  |_/___\___|\__,_| |_| |_|\___/|___/\__|


This host is capable of Kerberos authentication in the WIKIMEDIA realm.

For more info: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide

an-test-client1001 is a Analytics Hadoop test client (analytics_test_cluster::client)
The last Puppet run was at Thu Jul 29 15:55:49 UTC 2021 (27 minutes ago). 
Last puppet commit: (14a795004c) Btullis - Update hive to log4j version2 configuration files
Debian GNU/Linux 10 auto-installed on Wed Oct 21 22:23:14 UTC 2020.
Last login: Thu Jul 29 16:22:37 2021 from 2620:0:861:4:208:80:155:110

Renewing existing Kerberos ticket in the credential cache:
krenew: renewing credentials for btullis@WIKIMEDIA
btullis@an-test-client1001:~$

Would people prefer to have the output from klist as well, or is this OK to deploy to all kerberized systems?

I think that the two lines

Renewing existing Kerberos ticket in the credential cache:
krenew: renewing credentials for btullis@WIKIMEDIA

are a little bit confusing to me, people might wonder what they mean. I would keep the klist output as well, people seemed to like it (but it may have changed, I'd ask product-analytics/research to confirm).

The confusing part for me is Renewing existing Kerberos ticket in the credential cache:, it seems as if something is missing, I don't see it visually connecting to the krenew output below (but it could be only me). Maybe removing the line + klist output after krenew could be good?

Before deploying to all kerberized systems, I'd explore the possibility of having something that does not rely on users ssh-ing to the host (like a systemd unit, etc.).

No blockers, good work, feel free to proceed with anything :)

Thanks @elukey. Yes, I totally get what you mean about the lack of clarity. I just couldn't see (without looking at the source) what the command's output was going to be until I got the package onto one kerberized server.

I'll have a think about the other options, such as systemd units. It's just a bit more tricky thinking of things like:

  • would we create a systemd user unit for each kerberos enabled user account?
  • would we have to enable lingering for each account, to allow systemd user services to run without the users' being logged in?
  • would users start/stop the service themselves on each kerberized host, or would this be done automatically?
  • if it's done automatically, what happens before they first run kinit, will the services retry up to a certain limit, then fail?
  • what happens to a user's renewal service if someone runs kdestroy to remove their existing ticket?
  • where should logging of renewals go? Syslog?
  • at what point would we be reducing security too much with automated, persistent renewal of users' tickets?

Lots to think about.

Some random thoughts here, some of those are wild guesses/wishful thinking since I haven't looked at krenew in detail yet :-)

  • would we create a systemd user unit for each kerberos enabled user account?

Probably yes, it could be spawned from the logind session I guess?

  • would we have to enable lingering for each account, to allow systemd user services to run without the users' being logged in?
  • would users start/stop the service themselves on each kerberized host, or would this be done automatically?

I think spawning them automatically is what's typically wanted by a user (and if there are complaints, which I doubt, we could still add a config option later).

  • if it's done automatically, what happens before they first run kinit, will the services retry up to a certain limit, then fail?

Maybe we could check whether a ticket exists for a user, so it only attempts that after an initial, successful kinit?

  • what happens to a user's renewal service if someone runs kdestroy to remove their existing ticket?

If someone explicitly revokes the ticket we should also stop the renewal, since typically this will be done for a good reason (revoking an account or so).

  • where should logging of renewals go? Syslog?

Seems fine to me, yes.

  • at what point would we be reducing security too much with automated, persistent renewal of users' tickets?

A long renewal phase (like a week or two) should be entirely fine. The identity of the user is validated via their SSH login/key (and we're planning to add U2F two-factor auth for this as well). The only risk of a longer renewal period that I can see is that keeping the Kerberos secrets in memory for the renewal mechanism makes them subject to vulnerabilities in Kerberos/systemd (like some attack which leaks memory contents or so). But it seems like an acceptable tradeoff here.

BTullis renamed this task from Reduce manual kinit frequency on stat100x hosts to Improve user experience for Kerberos by creating automatic token renewal service. Aug 2 2021, 5:09 PM
BTullis updated the task description.

I think I need to seek some input from SRE on this, as to what is the best way to proceed. I'm trying to create a user-level scheduled task to renew Kerberos tickets automatically, as described in the previous comment. I think that my preferred way to do this would be to use systemd-run --user to create a transient systemd timer for each user, for example by executing the following function upon login.

function create_autorenew_timer {
    echo -e "\nCreating automatic Kerberos ticket renewal service"
    /usr/bin/systemd-run --quiet --user --unit krenew-$USER.timer \
        --on-calendar=daily \
        --description="Kerberos ticket renewal timer for for $USER" \
        --property=StandardOutput=null \
        --property=StandardError=null \
        /usr/bin/sh -c "/usr/bin/klist -s && /usr/bin/krenew -v -L 2> /dev/null"
}

This works as expected.

btullis@an-test-client1001:~$ systemctl --user -q list-units krenew-btullis.timer
UNIT                 LOAD   ACTIVE SUB     DESCRIPTION                                  
krenew-btullis.timer loaded active waiting Kerberos ticket renewal timer for for btullis

However, this method requires lingering to be enabled for each user account that wants to use it. Without lingering enabled the timer is removed when the last user session is closed. The same is true if it's a persistent timer loaded from a unit file under $HOME/.local/share/systemd/user/.

I think that lingering is generally a useful feature, allowing users to create long-running services, including those which can start at boot.

All that is required to enable lingering for username is a zero-length file named /var/lib/systemd/linger/username
This can be created with sudo loginctl enable-linger username or just with touch.
In fact, if we had the policykit-1 package installed, users would be able to enable lingering themselves. See this bug for context.
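To illustrate, a login script could check whether lingering is already enabled before relying on a user timer. The sketch below is an assumption-laden example: the function name linger_enabled and the LINGER_DIR override are hypothetical, and it assumes the marker files live under /var/lib/systemd/linger, which is where loginctl enable-linger places them on current systemd versions.

```shell
# Directory where systemd-logind records lingering users
# (one empty file per user). Overridable for testing.
: "${LINGER_DIR:=/var/lib/systemd/linger}"

linger_enabled() {
    # $1: user name; defaults to the current user.
    [ -e "$LINGER_DIR/${1:-$USER}" ]
}
```

For example, `linger_enabled "$USER" || echo "Lingering is not enabled for $USER"` would warn a user whose timer will disappear when their last session closes.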

The scope that we're talking about for this ticket is:

  • users with kerberos principals
  • on hosts with profile::kerberos::client applied

...although it might be useful more widely.

My questions are:

  1. would I get permission to enable lingering for these user accounts on these hosts? (if so, what's the best way to do this in puppet?)
  2. or should I make static systemd timers for each user in the --system service manager? (bit messy in my opinion)
  3. or should I just use each user's crontab, which is probably the simpler option? (bit old-fashioned in my opinion)
  4. or something else?

Any other ideas or pointers to prior art gratefully received.

Change 711482 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Improve the Kerberos automatic renewal service

https://gerrit.wikimedia.org/r/711482

This might be a daft question, given the work that has already gone into the change, but why don't we simply increase the maximum ticket lifetime that the KDC can issue instead of automating ticket renewals?

Currently the maximum length of time for which a ticket can be issued is 2 days.
The ticket can be renewed for up to 7 days but no longer than that.

Therefore, all of the effort that we're going to for this automatic renewal mechanism means that users are only going to have to authenticate using their Kerberos passwords once every 7 days on each host they use, instead of every 2 days.
Why don't we just increase max_life from 2d to 7d on the KDC here? https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/kerberos/kdc.conf.erb$13
(There's a corresponding change in the client's /etc/krb5.conf as well.)

To reduce the frequency with which kinit is required even further, why not just increase max_life and max_renewable_life to 14 days each?

I think it is good to have something that needs to auto-renew periodically after 2d (or similar), so that when we need to revoke a kerberos principal we have a limited time to wait before everything expires correctly. Maybe we could start with this one week period, and then figure out if max_renewable_life could be moved to 14d. After all the work that you did I'd be in favor of having a stricter check on valid kerberos tickets rather than going for max_life=7d :) I would even reduce max_life to one day after your change, since we bumped it to 2d only as trade-off between security and user experience.

Thanks for that explanation. It makes sense, I just didn't want to think that we hadn't considered the alternatives, or for us to think that the user experience would be vastly different.
I agree that perhaps after we have deployed it to production and checked that there are no unintended effects, we could likely increase max_renewable_life and decrease max_life as you suggest.

Change 711482 merged by Btullis:

[operations/puppet@production] Improve the Kerberos automatic renewal service

https://gerrit.wikimedia.org/r/711482

The change has now been deployed and the auto-renew service is enabled for users on an-test-client1001.eqiad.wmnet.
I'd encourage anyone interested to log in to that host and kinit.
If you log in again within 2 days, the transient systemd timer will be created and your ticket will be renewed each day for up to 7 days.

I'm going to leave it for a week or so to make sure that there is no unexpected logspam, cronspam, any other kind of spam, or unexpected consequences.

If all goes well, then I'll prepare another change to enable the feature on stat100[4-8].eqiad.wmnet.

Change 722352 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable the kerberos auto-renew service for stat nodes

https://gerrit.wikimedia.org/r/722352

My ticket expired yesterday, but the ticket for @elukey is still valid.
Last night's syslog shows this:

Sep 21 00:00:21 an-test-client1001 systemd[28726]: Started Kerberos ticket renewal timer for for elukey.
Sep 21 00:00:21 an-test-client1001 systemd[12810]: Started Kerberos ticket renewal timer for for btullis.
Sep 21 00:00:22 an-test-client1001 systemd[12810]: krenew-btullis.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 00:00:22 an-test-client1001 systemd[12810]: krenew-btullis.service: Failed with result 'exit-code'.
Sep 21 00:00:22 an-test-client1001 krenew[28748]: renewing credentials for elukey@WIKIMEDIA
Sep 21 00:00:22 an-test-client1001 systemd[28726]: krenew-elukey.service: Succeeded.

If I run systemctl --user status krenew-btullis.service it shows that the service has failed, which is expected.

[image.png: screenshot of the failed service status]

We didn't get any more log messages than that and no cron mail on success or failure, so I think that this is fine. I suppose I could create some housekeeping timers to delete any failing user auto-renew services, but I'm not sure that it's really necessary. I've created a CR to implement this change as it is on the stat100x boxes and I'm happy for that to go ahead. Any other comments or concerns from anyone?
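If such housekeeping did turn out to be wanted, one possible shape for it is sketched below. The function name clear_failed_krenew_units and the SYSTEMCTL override are hypothetical; the sketch assumes the transient services keep the krenew-USER.service naming seen in the syslog excerpt above, and it was not part of the deployed change.

```shell
# The systemctl binary is overridable so the call can be inspected in tests.
: "${SYSTEMCTL:=systemctl}"

clear_failed_krenew_units() {
    # reset-failed accepts unit name patterns and only touches units
    # that are currently in the failed state.
    "$SYSTEMCTL" --user reset-failed 'krenew-*.service'
}
```

Run from a user session, this would clear the failed state of that user's renewal service and leave everything else alone.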

Tested as well on an-test-client1001, and I was able to use the spark-shell without kinit. All good!

Change 722352 merged by Btullis:

[operations/puppet@production] Enable the kerberos auto-renew service for stat nodes

https://gerrit.wikimedia.org/r/722352

Merged and deployed.
First login on stat1004.eqiad.wmnet after a puppet run.

stat1004 is a Statistics & Analytics cluster explorer (private data access, no local compute) (statistics::explorer)
The last Puppet run was at Fri Sep 24 12:46:38 UTC 2021 (1 minutes ago). 
Last puppet commit: (4f936700a5) Btullis - Enable the kerberos auto-renew service for stat nodes
Debian GNU/Linux 10 auto-installed on Tue Oct 13 08:07:24 UTC 2020.
Last login: Fri Sep 24 12:46:17 2021 from 2620:0:861:4:208:80:155:110

You have a valid Kerberos ticket.

Creating automatic Kerberos ticket renewal service

Second login after running puppet.

stat1004 is a Statistics & Analytics cluster explorer (private data access, no local compute) (statistics::explorer)
The last Puppet run was at Fri Sep 24 12:46:38 UTC 2021 (1 minutes ago). 
Last puppet commit: (4f936700a5) Btullis - Enable the kerberos auto-renew service for stat nodes
Debian GNU/Linux 10 auto-installed on Tue Oct 13 08:07:24 UTC 2020.
Last login: Fri Sep 24 12:48:14 2021 from 2620:0:861:4:208:80:155:110

You have a valid Kerberos ticket.
Your automatic Kerberos ticket renewal service is also active on this host

I will update the wiki with this information.
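The two login messages above suggest logic roughly like the following. This is a reconstruction for illustration only, not the contents of the deployed profile script: the function name ensure_autorenew_timer and the KLIST and SYSTEMCTL overrides are assumptions, and it expects the create_autorenew_timer function shown earlier in this task to be defined already.

```shell
# Commands are overridable so the branches can be exercised without
# Kerberos or a running systemd user instance.
: "${KLIST:=klist}"
: "${SYSTEMCTL:=systemctl}"

ensure_autorenew_timer() {
    # Without a valid ticket there is nothing to renew.
    if ! "$KLIST" -s; then
        echo "You do not have a valid Kerberos ticket in the credential cache, remember to kinit."
        return 0
    fi
    echo "You have a valid Kerberos ticket."
    # is-active exits 0 when the transient timer is loaded and waiting.
    if "$SYSTEMCTL" --user --quiet is-active "krenew-$USER.timer"; then
        echo "Your automatic Kerberos ticket renewal service is also active on this host"
    else
        # Assumed to be defined elsewhere (see the systemd-run function above).
        create_autorenew_timer
    fi
}
```

Checking for the timer first keeps repeated logins from trying to create a unit that already exists.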

@BTullis Unfortunately, this does not work at stat1005:

[urbanecm@stat1005 ~]$ bash /etc/profile.d/kerberos_autorenew.sh

You have a valid Kerberos ticket.
/etc/profile.d/kerberos_autorenew.sh: line 8: /usr/bin/systemctl: No such file or directory

Creating automatic Kerberos ticket renewal service
Failed to find executable /usr/bin/sh: No such file or directory
[urbanecm@stat1005 ~]$

It appears /usr/bin/sh and /usr/bin/systemctl are not a thing at stat1005 -- only /bin/sh is. However, those files do exist at stat1006 or an-test-client1001. I'm leaving it up to you whether you want to create them at stat1005 too (perhaps via puppet), or change the paths to /bin/xxx instead.

urbanecm@notebook  ~
$ ssh an-test-client1001.eqiad.wmnet
[urbanecm@an-test-client1001 ~]$ ls -l /usr/bin/systemctl
-rwxr-xr-x 1 root root 868696 Jul  8 13:03 /usr/bin/systemctl
[urbanecm@an-test-client1001 ~]$ ls -l /bin/systemctl
-rwxr-xr-x 1 root root 868696 Jul  8 13:03 /bin/systemctl
[urbanecm@an-test-client1001 ~]$ ls -lh /usr/bin/sh
lrwxrwxrwx 1 root root 4 Oct 21  2020 /usr/bin/sh -> dash
[urbanecm@an-test-client1001 ~]$ ls -lh /bin/sh
lrwxrwxrwx 1 root root 4 Oct 21  2020 /bin/sh -> dash
[urbanecm@an-test-client1001 ~]$ logout
Connection to an-test-client1001.eqiad.wmnet closed.
urbanecm@notebook  ~
$ ssh stat1006.eqiad.wmnet
[urbanecm@stat1006 ~]$ ls -l /usr/bin/systemctl
-rwxr-xr-x 1 root root 868696 Jul  8 13:03 /usr/bin/systemctl
[urbanecm@stat1006 ~]$ ls -l /bin/systemctl
-rwxr-xr-x 1 root root 868696 Jul  8 13:03 /bin/systemctl
[urbanecm@stat1006 ~]$ ls -l /usr/bin/sh
lrwxrwxrwx 1 root root 4 Oct  8  2020 /usr/bin/sh -> dash
[urbanecm@stat1006 ~]$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Oct  8  2020 /bin/sh -> dash
[urbanecm@stat1006 ~]$ logout
Connection to stat1006.eqiad.wmnet closed.
urbanecm@notebook  ~
$

Thanks for your feedback @Urbanecm - I'm not 100% clear on the history of stat1005 but it looks like there's some tidy-up needed from previous work on the host.

The /etc/motd.tail file seems to think that this is a Debian Stretch host:

btullis@stat1005:~$ cat /etc/motd.tail 
Debian GNU/Linux 9 auto-installed on Mon Feb 18 10:51:35 UTC 2019.

...whereas the package sources and everything else suggests that it is a Debian Buster host.

btullis@stat1005:~$ cat /etc/debian_version 
10.10

However, the location of systemctl and sh suggests that it was originally a Stretch host, because that is where they were installed at the time: under /bin, with no corresponding symlinks created in /usr/bin/.

In terms of fixing the issue with the location, the simplest solution is probably for us to install the usrmerge package, which was created to effect precisely this change and create symlinks in /usr/bin/ for various files. However, I'd rather spend a little while understanding why we are in this situation, before attempting to fix it. It could be that a clean reinstall of stat1005 would be the preferred outcome, once the cause of the discrepancy has been identified.

For reference, there is some context around the cross-distribution UsrMerge changes here:

https://www.freedesktop.org/wiki/Software/systemd/TheCaseForTheUsrMerge/
https://wiki.debian.org/UsrMerge
https://salsa.debian.org/md/usrmerge/raw/master/debian/README.Debian

I'll update the ticket with more information as soon as I have worked out more. In the meantime you'll still have to run kinit every two days.
You can use kinit -R or krenew to renew your existing ticket for up to 7 days though, if you wish.

I've researched the history of stat1005 as much as I think I need to in order to make a decision.
It looks like it might well have been installed with Jessie and upgraded to Buster, owing to the difficulties with the ROCm packaging and the need for a backported kernel.
The usrmerge package would have been recommended automatically during the upgrade, but we do not apply recommended packages by default.

In this case I think that a reimage would be too disruptive to users.
Therefore a manual install of the usrmerge package should fix the issue.
The package can then be removed afterwards, as conversion is a one-way operation and there is nothing else useful in the package.

These two scripts will be run on package installation, which will create the required symlinks and edit the /etc/shells file:

https://sources.debian.org/src/usrmerge/21/convert-usrmerge/
https://sources.debian.org/src/usrmerge/21/convert-etc-shells/

Therefore I propose to run the following on stat1005.

sudo apt install usrmerge
sudo apt purge usrmerge
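Before and after the conversion, a quick way to tell whether a host has a merged /usr is to check whether /bin is a symlink. The helper below is a hypothetical sketch (the name is_usrmerged and the optional root argument are assumptions); it only inspects /bin, whereas usrmerge also converts other top-level directories.

```shell
is_usrmerged() {
    # $1: optional alternative root directory, useful for testing
    #     (defaults to the real root).
    # On a merged-/usr system, /bin is a symlink pointing into /usr.
    [ -L "${1:-}/bin" ]
}
```

For example, `is_usrmerged && echo merged || echo "not merged"` could be run on each stat100x host to find any others in the same state.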

@elukey, @Ottomata, @razzi - Any comments or are you happy with this approach?

Looking at /var/log/installer it seems stat1005 was installed in 2019 with Stretch and then later on dist-upgraded to Buster (something we rarely do since we prefer reimages, but it happens). Installing usrmerge in this case (and let's check whether other stat* hosts have the same issue) sounds good to me.

Starting with Buster, d-i creates usr-merged paths, so generally speaking, for any Puppet code or script which potentially also runs on Stretch, it makes sense to use the /bin/foo paths (since those are equally available on usr-merged systems via symlinks). We don't seem to use Kerberos on the remaining Stretch hosts, so that seems fine.
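An alternative to hard-coding either location is to resolve the path at run time. The following is a minimal sketch under that assumption; the function name resolve_bin is hypothetical and it only looks in the two directories discussed in this task.

```shell
resolve_bin() {
    # $1: command name. Prints the first executable match and returns 0,
    # or returns 1 when the command is in neither directory.
    for _dir in /usr/bin /bin; do
        if [ -x "$_dir/$1" ]; then
            printf '%s\n' "$_dir/$1"
            return 0
        fi
    done
    return 1
}
```

A script could then call `"$(resolve_bin sh)" -c '...'` and behave the same on hosts with and without a merged /usr.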

Installing this now. The following debconf question was displayed.

[image.png: screenshot of the debconf question]

Answered yes.

One unexpected package removal: molly-guard

The molly-guard package was reinstalled automatically during the next puppet agent run.

Info: Applying configuration version '(edf1e07088) Jbond - spec tests: drop pre_conditions as its not needed'
Notice: /Stage[main]/Base::Standard_packages/Package[molly-guard]/ensure: created (corrective)

Ticket auto-renewal now works for me:

You have a valid Kerberos ticket.

Creating automatic Kerberos ticket renewal service
btullis@stat1005:~$

All of the other stat100x servers have been checked and they are fine, so I think it was only stat1005 which required this change.

Change 727349 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the maximum renewable lifetime of a Kerberos ticket

https://gerrit.wikimedia.org/r/727349

Change 727349 merged by Btullis:

[operations/puppet@production] Increase the maximum renewable lifetime of a Kerberos ticket

https://gerrit.wikimedia.org/r/727349

I have merged that change to increase the maximum renewable lifetime to 14 days.

The related update to Wikitech is here: https://wikitech.wikimedia.org/w/index.php?title=Analytics/Systems/Kerberos/UserGuide&diff=1928827&oldid=1926503