Make the Kerberos infrastructure production ready
Closed, Resolved | Public | 21 Estimated Story Points

Authored By

	elukey
	Jun 19 2019, 10:37 AM

Description

In T212257 a simple KDC + kadmin service was set up on kerberos1001, with minimal puppet automation to:

create principals and keytabs
copy them securely to the puppetmaster's private puppet repo and deploy them via puppet when requested (by hiera variables)

The above unblocked testing Kerberos in the Hadoop test cluster, but it is surely not enough. A few things need to be done:

order hardware for the two hosts that will run Kerberos KDC(s) and kadmin daemons (two misc nodes)
add puppet automation to bootstrap a KDC service from scratch on a node (caveat: this might mean only partial automation since currently the kdc packages, when installing, require manual inputs)
add puppet automation to allow a proper KDC/kadmin failover in case the primary kerberos node goes down.
puppetise basic config properties like a default password policy

Details

Subject	Repo	Branch	Lines +/-
profile::kerberos::kadminserver: fix typo in monitoring	operations/puppet	production	+1 -1
kerberos: enable monitoring	operations/puppet	production	+4 -1
kerberos: test failover (part 2)	operations/puppet	production	+2 -2
kerberos: ensure resources that might change during failover	operations/puppet	production	+32 -24
kerberos: test kadmin failover/swap	operations/puppet	production	+1 -1
kerberos: add nagios process monitoring for kpropd	operations/puppet	production	+10 -0
kerberos: ensure kadmind and rsync only on the master node	operations/puppet	production	+20 -12
kerberos: add nagios process monitors to kdc/kadmind daemons	operations/puppet	production	+21 -0
profile::kerberos::kdc: add support for bacula backups	operations/puppet	production	+6 -0
Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6	operations/homer/public	master	+17 -0
Switch the Hadoop test cluster to krb1001/krb2001	operations/puppet	production	+10 -16
profile::kerberos::replication: fix replicate_krb_database script	operations/puppet	production	+3 -1
Enable kerberos replication on krb[12]001	operations/puppet	production	+4 -2
site.pp: add role::kerberos::kdc to kdc2001	operations/puppet	production	+9 -2
Add role::kerberos::kdc to krb1001	operations/puppet	production	+8 -2
profile::kerberos::kdc: add debconf settings	operations/puppet	production	+75 -0
profile::kerberos::kadminserver: add support for replication	operations/puppet	production	+97 -0
profile::kerberos::kdc: add daily backup for the KDC database	operations/puppet	production	+50 -0

Related Objects

Status	Assigned	Task
		Restricted Task
Resolved	elukey	T211836 Enable Security (stronger authentication and data encryption) for the Analytics Hadoop cluster and its dependent services
Resolved	elukey	T226089 Make the Kerberos infrastructure production ready
Resolved	RobH	T227288 eqiad: 1 misc node for the Kerberos KDC service
Resolved	elukey	T233141 setup/install krb1001/WMF5173
Resolved	• Cmjohnson	T233642 apply hostname labels for krb1001/WMF5173
Resolved	None	T227425 codfw: 1 misc node for the Kerberos KDC service
Resolved	Papaul	T233142 setup/install krb2001/WMF6577
Resolved	Papaul	T233962 apply hostname labels for krb2001/WMF6577
Resolved	elukey	T234600 Decommission kerberos1001

Event Timeline


elukey moved this task from Backlog to In Progress on the User-Elukey board.Jul 4 2019, 2:51 PM

elukey moved this task from In Progress to Backlog on the User-Elukey board.

elukey moved this task from Backlog to Kerberos on the User-Elukey board.Jul 5 2019, 6:57 AM

elukey added a subtask: T227425: codfw: 1 misc node for the Kerberos KDC service.Jul 8 2019, 12:59 PM

Very interesting reading: https://www.tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/server-replication.html

My understanding is that:

kdb5_util dump could be used to periodically dump the status of the master KDC's database to a file. Maybe that could be saved in Bacula or similar?
kprop can be used to take a dump of the master database and propagate it to the slave KDCs. It is a good use case for a systemd timer with Icinga alarms, to monitor whether things fail.

In https://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-install/Switching-Master-and-Slave-KDCs.html there is a simple procedure to swap master/slave in case one fails. Needs to be expanded though..

I tried kdb5_util dump on kerberos1001; the resulting file was 24K. It might be worth avoiding Bacula and having a simple rsync on the slave KDC that copies dumps periodically. As far as I understand, replicating from master to slave via kprop is not sufficient on its own: if the master's database gets corrupted or inconsistent, the problem might be propagated before an admin can act. Keeping dumps of the database gives us a periodic (hopefully working) backup.
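
A rough sketch of the periodic dump idea, in Python. It only builds the `kdb5_util dump` command line and decides which old dumps fall outside a retention window. The backup directory, file naming and 14-day retention are assumptions for illustration and do not come from the puppet change.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

# Hypothetical values, chosen only for this example.
BACKUP_DIR = PurePosixPath("/srv/backup")
RETENTION_DAYS = 14


def dump_command(day: date) -> list[str]:
    """Build the kdb5_util invocation that dumps the KDC database to a dated file."""
    target = BACKUP_DIR / f"kdc_database_{day.isoformat()}"
    return ["kdb5_util", "dump", str(target)]


def expired(dump_days: list[date], today: date) -> list[date]:
    """Return the dump dates that are older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(d for d in dump_days if d < cutoff)
```

A daily timer could run the command from `dump_command(date.today())` and then delete the files for the dates returned by `expired(...)`. Keeping several dated dumps is what protects against a corrupted database being replicated everywhere.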

Change 528775 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::kdc: add daily backup for the KDC database

https://gerrit.wikimedia.org/r/528775

gerritbot added a project: Patch-For-Review.Aug 7 2019, 12:11 PM

elukey added a project: Analytics-Kanban.Aug 7 2019, 12:41 PM

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 528775 merged by Elukey:
[operations/puppet@production] profile::kerberos::kdc: add daily backup for the KDC database

https://gerrit.wikimedia.org/r/528775

Maintenance_bot removed a project: Patch-For-Review.Aug 9 2019, 8:10 AM

Change 529733 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kerberos::kdc: add support for replication

https://gerrit.wikimedia.org/r/529733

gerritbot added a project: Patch-For-Review.Aug 12 2019, 10:34 AM

Change 529786 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::kdc: add debconf settings

https://gerrit.wikimedia.org/r/529786

Change 529733 merged by Elukey:
[operations/puppet@production] profile::kerberos::kadminserver: add support for replication

https://gerrit.wikimedia.org/r/529733

Change 529786 merged by Elukey:
[operations/puppet@production] profile::kerberos::kdc: add debconf settings

https://gerrit.wikimedia.org/r/529786

Maintenance_bot removed a project: Patch-For-Review.Sep 13 2019, 4:10 PM

RobH closed subtask T227288: eqiad: 1 misc node for the Kerberos KDC service as Resolved.Sep 17 2019, 6:40 PM

RobH closed subtask T227425: codfw: 1 misc node for the Kerberos KDC service as Resolved.

Change 539338 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add role::kerberos::kdc to krb1001

https://gerrit.wikimedia.org/r/539338

gerritbot added a project: Patch-For-Review.Sep 26 2019, 2:08 PM

Change 539338 merged by Elukey:
[operations/puppet@production] Add role::kerberos::kdc to krb1001

https://gerrit.wikimedia.org/r/539338

Maintenance_bot removed a project: Patch-For-Review.Sep 26 2019, 3:10 PM

Change 539524 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] site.pp: add role::kerberos::kdc to kdc2001

https://gerrit.wikimedia.org/r/539524

gerritbot added a project: Patch-For-Review.Sep 27 2019, 12:47 PM

Change 539524 merged by Elukey:
[operations/puppet@production] site.pp: add role::kerberos::kdc to kdc2001

https://gerrit.wikimedia.org/r/539524

Maintenance_bot removed a project: Patch-For-Review.Sep 27 2019, 2:10 PM

Change 539546 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable kerberos replication on krb[12]001

https://gerrit.wikimedia.org/r/539546

Change 539546 merged by Elukey:
[operations/puppet@production] Enable kerberos replication on krb[12]001

https://gerrit.wikimedia.org/r/539546

Maintenance_bot removed a project: Patch-For-Review.Sep 27 2019, 3:10 PM

Change 539580 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::replication: fix replicate_krb_database script

https://gerrit.wikimedia.org/r/539580

gerritbot added a project: Patch-For-Review.Sep 27 2019, 5:03 PM

Change 539580 merged by Elukey:
[operations/puppet@production] profile::kerberos::replication: fix replicate_krb_database script

https://gerrit.wikimedia.org/r/539580

Maintenance_bot removed a project: Patch-For-Review.Sep 30 2019, 2:10 PM

Summary of progress:

set up krb1001 in eqiad and krb2001 in codfw
set up a basic replication between krb1001 and krb2001 via kprop/kpropd
documented everything in https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos (will be moved at some point to a more generic location)
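
For readers unfamiliar with kprop/kpropd, the replication step in the summary above can be sketched as a dump followed by one push per replica. This is a minimal illustration, not the replicate_krb_database script from the puppet repo. The dump path is hypothetical and the replica host names are supplied by the caller.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical location for the intermediate dump file.
DUMP_FILE = "/srv/kerberos/replica_dump"


def replication_commands(replicas: Sequence[str], dump_file: str = DUMP_FILE) -> list[list[str]]:
    """Return the commands to run on the master: one dump, then one kprop per replica."""
    commands = [["kdb5_util", "dump", dump_file]]
    commands += [["kprop", "-f", dump_file, host] for host in replicas]
    return commands


def replicate(replicas: Sequence[str], run: Callable = subprocess.run) -> None:
    """Run the commands in order; check=True makes any failure raise,
    so a timer or monitoring wrapper would see a non-zero exit."""
    for command in replication_commands(replicas):
        run(command, check=True)
```

On the replica side, kpropd listens for the connection from kprop and loads the received dump into the local database.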

Left to do:

add bacula backups to every host of the kerberos cluster to save the snapshots of the database
move the Hadoop test cluster to the new cluster and test failover (kdc on 1001 going down, etc..)

Change 540610 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Switch the Hadoop test cluster to krb1001/krb2001

https://gerrit.wikimedia.org/r/540610

gerritbot added a project: Patch-For-Review.Oct 3 2019, 2:40 PM

Change 540610 merged by Elukey:
[operations/puppet@production] Switch the Hadoop test cluster to krb1001/krb2001

https://gerrit.wikimedia.org/r/540610

Maintenance_bot removed a project: Patch-For-Review.Oct 3 2019, 3:10 PM

elukey@re0.cr1-eqiad# show | compare
[edit firewall family inet filter analytics-in4 term kerberos from destination-address]
         10.64.0.182/32 { ... }
+        /* krb1001 */
+        10.64.0.112/32;
+        /* krb2001 */
+        10.192.48.135/32;
[edit firewall family inet6 filter analytics-in6]
       term scap { ... }
+      term kerberos {
+          from {
+              destination-address {
+                  /* krb1001 */
+                  2620::861:101:10:64:0:112/128;
+                  /* krb2001 */
+                  2620::860:104:10:192:48:135/128;
+              }
+              next-header [ tcp udp ];
+              destination-port [ 88 464 ];
+          }
+          then accept;
+      }
       term default { ... }

elukey@re0.cr2-eqiad# show | compare
[edit firewall family inet filter analytics-in4 term kerberos from destination-address]
         10.64.0.182/32 { ... }
+        /* krb1001 */
+        10.64.0.112/32;
+        /* krb2001 */
+        10.192.48.135/32;
[edit firewall family inet6 filter analytics-in6]
       term scap { ... }
+      term kerberos {
+          from {
+              destination-address {
+                  /* krb1001 */
+                  2620::861:101:10:64:0:112/128;
+                  /* krb2001 */
+                  2620::860:104:10:192:48:135/128;
+              }
+              next-header [ udp tcp ];
+              destination-port [ 88 464 ];
+          }
+          then accept;
+      }
       term default { ... }
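
As a sanity check on the rules above, here is a small Python model of what the new kerberos term accepts. The addresses, protocols and ports are copied from the diff. The sketch assumes that the IPv4 term uses the same protocols and ports as the IPv6 one (the diff does not show that part), and it leaves out the pre-existing 10.64.0.182 entry.

```python
import ipaddress

# Destinations added by the diff: krb1001 and krb2001, IPv4 and IPv6.
KERBEROS_DESTINATIONS = [
    ipaddress.ip_network(prefix)
    for prefix in (
        "10.64.0.112/32",
        "10.192.48.135/32",
        "2620::861:101:10:64:0:112/128",
        "2620::860:104:10:192:48:135/128",
    )
]
PROTOCOLS = {"tcp", "udp"}
PORTS = {88, 464}  # 88 = KDC, 464 = kpasswd


def kerberos_term_accepts(destination: str, protocol: str, port: int) -> bool:
    """Return True if a packet with this destination, protocol and port matches the term."""
    address = ipaddress.ip_address(destination)
    to_kdc = any(
        address.version == network.version and address in network
        for network in KERBEROS_DESTINATIONS
    )
    return to_kdc and protocol in PROTOCOLS and port in PORTS
```

For example, `kerberos_term_accepts("10.64.0.112", "udp", 88)` is true, while the same host on port 749 is not matched by this term.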

CC: @ayounsi

elukey closed subtask T234600: Decommission kerberos1001 as Resolved.Oct 4 2019, 7:32 AM

Updates:

re-created all principals and keytabs for the Hadoop test cluster and moved it to krb1001/krb2001
verified that replication works between krb1001 and krb2001 (kadmin.local on krb2001 -> getprincs)
decommissioned kerberos1001
added network firewall rules in the Analytics VLAN to allow IPv4/IPv6 addresses of krb1001 and krb2001

Remaining to do:

test failover of the KDCs and the Hadoop cluster
add bacula backups for the kerberos database

Did a quick test:

kdestroy on an-tool1006
stop kdc on krb1001
kinit on an-tool1006
check kdc logs for my username on krb2001

And everything worked smoothly without any client error. Next step is to test Hadoop daemons.
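
The behaviour seen in this test (the client falling back to krb2001 when krb1001 is down) can be illustrated with a small Python sketch. This is not how the Kerberos client library is implemented; it only shows the idea of trying an ordered list of KDCs. The probe here is a plain TCP connect to port 88, and the function names are made up for the example.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_probe(host: str, port: int = 88, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the KDC port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable_kdc(
    kdcs: Sequence[str], probe: Callable[[str], bool] = tcp_probe
) -> Optional[str]:
    """Walk the KDC list in order and return the first one that answers, if any."""
    for host in kdcs:
        if probe(host):
            return host
    return None
```

With krb1001 stopped, `first_reachable_kdc(["krb1001", "krb2001"])` would return the second entry, which matches the kinit request showing up in the krb2001 logs.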

Change 540832 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::kdc: add support for bacula backups

https://gerrit.wikimedia.org/r/540832

gerritbot added a project: Patch-For-Review.Oct 4 2019, 10:12 AM

Change 541370 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6

https://gerrit.wikimedia.org/r/541370

Change 541370 merged by Ayounsi:
[operations/homer/public@master] Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6

https://gerrit.wikimedia.org/r/541370

ayounsi mentioned this in rOHPUf10738beb2b8: Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6.Oct 7 2019, 8:57 PM

Change 540832 merged by Elukey:
[operations/puppet@production] profile::kerberos::kdc: add support for bacula backups

https://gerrit.wikimedia.org/r/540832

Maintenance_bot removed a project: Patch-For-Review.Oct 8 2019, 8:10 AM

I have stopped the kdc on krb1001 to simulate a host down scenario. I am able to renew my krb ticket but I want to leave it down for hours to see what happens to hadoop daemons (hopefully nothing).

elukey set the point value for this task to 21.Oct 9 2019, 9:48 AM

Another thing we need to do: Add a new flag to data.yaml to annotate that a user is kerberos-enabled (as we need to ensure to also drop Kerberos user principals when offboarding users).

MoritzMuehlenhoff added a project: SRE.Oct 9 2019, 10:29 AM

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

In T226089#5559190, @MoritzMuehlenhoff wrote:

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

In T226089#5559492, @elukey wrote:

In T226089#5559190, @MoritzMuehlenhoff wrote:

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

Interesting:

Oct 10 03:00:01 krb2001 kpropd[26599]: Connection from krb1001.eqiad.wmnet
Oct 10 03:26:24 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 03:26:24 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 03:26:24 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 03:26:24 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 03:26:25 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 03:26:25 krb2001 kpropd[30814]: ready
Oct 10 03:56:19 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 03:56:19 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 03:56:19 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 03:56:19 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 03:56:20 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 03:56:20 krb2001 kpropd[3704]: ready
Oct 10 04:00:01 krb2001 kpropd[4837]: Connection from krb1001.eqiad.wmnet
Oct 10 04:26:38 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 04:26:38 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 04:26:38 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 04:26:38 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 04:26:39 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 04:26:39 krb2001 kpropd[9058]: ready
Oct 10 04:56:55 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 04:56:55 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 04:56:55 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 04:56:55 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 04:56:55 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 04:56:55 krb2001 kpropd[14269]: ready
Oct 10 05:00:01 krb2001 kpropd[15324]: Connection from krb1001.eqiad.wmnet
Oct 10 05:27:07 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 05:27:07 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 05:27:07 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 05:27:07 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 05:27:07 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 05:27:07 krb2001 kpropd[19125]: ready
Oct 10 05:56:54 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 05:56:54 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 05:56:54 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 05:56:54 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 05:56:54 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 05:56:54 krb2001 kpropd[23310]: ready
Oct 10 06:00:01 krb2001 kpropd[24370]: Connection from krb1001.eqiad.wmnet
Oct 10 06:26:48 krb2001 systemd[1]: Stopping Kerberos 5 slave KDC update server...
Oct 10 06:26:48 krb2001 systemd[1]: krb5-kpropd.service: Main process exited, code=killed, status=15/TERM
Oct 10 06:26:48 krb2001 systemd[1]: krb5-kpropd.service: Succeeded.
Oct 10 06:26:48 krb2001 systemd[1]: Stopped Kerberos 5 slave KDC update server.
Oct 10 06:26:48 krb2001 systemd[1]: Started Kerberos 5 slave KDC update server.
Oct 10 06:26:49 krb2001 kpropd[28830]: ready

Puppet restarts kadmind every 30 mins, but why does it shut down?
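
To confirm the 30 minute pattern without reading timestamps by eye, the journal lines can be parsed with a few lines of Python. This is a throwaway sketch: it assumes the syslog-style timestamp format shown above and that all lines belong to the same year.

```python
from datetime import datetime, timedelta

STOP_MARKER = "Stopping Kerberos 5 slave KDC update server"


def stop_times(lines: list[str], year: int = 2019) -> list[datetime]:
    """Extract the timestamp of every kpropd stop event from journal lines."""
    times = []
    for line in lines:
        if STOP_MARKER in line:
            stamp = " ".join(line.split()[:3])  # e.g. "Oct 10 03:26:24"
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def intervals(times: list[datetime]) -> list[timedelta]:
    """Return the gaps between consecutive stop events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Applied to the log above, the gaps between stops are all within a minute of 30 minutes, which points at the periodic Puppet run rather than at kpropd crashing.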

Change 542014 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: add nagios process monitors to kdc/kadmind daemons

https://gerrit.wikimedia.org/r/542014

gerritbot added a project: Patch-For-Review.Oct 10 2019, 6:58 AM

elukey@krb2001:~$ sudo systemctl cat krb5-kpropd.service
# /lib/systemd/system/krb5-kpropd.service
[Unit]
Description=Kerberos 5 slave KDC update server
Conflicts=krb5-admin-server.service

[Service]
ExecReload=/bin/kill -HUP $MAINPID
EnvironmentFile=-/etc/default/krb5-kpropd
ExecStart=/usr/sbin/kpropd -D $DAEMON_ARGS
InaccessibleDirectories=-/etc/ssh -/etc/ssl/private  /root
ReadOnlyDirectories=/
ReadWriteDirectories=/var/tmp /tmp /var/lib/krb5kdc /var/run /run
CapabilityBoundingSet=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target
elukey@krb2001:~$ cat /etc/default/krb5-kpropd
cat: /etc/default/krb5-kpropd: No such file or directory
elukey@krb2001:~$ sudo less /usr/sbin/kpropd
elukey@krb2001:~$ sudo /usr/sbin/kpropd --help
/usr/sbin/kpropd: unrecognized option '--help'

Usage: /usr/sbin/kpropd [-r realm] [-s srvtab] [-dS] [-f replica_file]
	[-F kerberos_db_file ] [-p kdb5_util_pathname]
	[-x db_args]* [-P port] [-a acl_file]
	[-A admin_server] [--pid-file=pid_file]

kpropd can also run as a standalone daemon, backgrounding itself and waiting for connections on port 754 (or the port specified with the -P option if given). Standalone mode is required for incremental propagation. Starting in release 1.11, kpropd automatically detects whether it was run from inetd and runs in standalone mode if it is not. Prior to release 1.11, the -S option is required to run kpropd in standalone mode; this option is now accepted for backward compatibility but does nothing.

As a test, I just added the -P 754 option and restarted kpropd.

Change 542014 merged by Elukey:
[operations/puppet@production] kerberos: add nagios process monitors to kdc/kadmind daemons

https://gerrit.wikimedia.org/r/542014

Nope, it seems that puppet is causing the stop/start of kpropd, rsync and kadmind on 2001:

Notice: /Stage[main]/Profile::Kerberos::Kadminserver/Service[krb5-admin-server]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Profile::Kerberos::Kadminserver/Service[krb5-admin-server]: Unscheduling refresh on Service[krb5-admin-server]
Notice: /Stage[main]/Profile::Kerberos::Replication/Service[krb5-kpropd]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Profile::Kerberos::Replication/Service[krb5-kpropd]: Unscheduling refresh on Service[krb5-kpropd]
Notice: /Stage[main]/Rsync::Server/Service[rsync]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Rsync::Server/Service[rsync]: Unscheduling refresh on Service[rsync]

Two daemons make sense:

elukey@krb2001:~$ sudo systemctl status rsync
● rsync.service - fast remote file copy program daemon
   Loaded: loaded (/lib/systemd/system/rsync.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Thu 2019-10-10 07:57:38 UTC; 1min 54s ago
           └─ ConditionPathExists=/etc/rsyncd.conf was not met
     Docs: man:rsync(1)
           man:rsyncd.conf(5)

Oct 10 05:56:56 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 06:26:51 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 06:57:06 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:09:21 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:27:10 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:51:32 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:52:11 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:52:57 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:56:20 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.
Oct 10 07:57:38 krb2001 systemd[1]: Condition check resulted in fast remote file copy program daemon being skipped.

elukey@krb2001:~$ sudo systemctl status krb5-admin-server.service
● krb5-admin-server.service - Kerberos 5 Admin Server
   Loaded: loaded (/lib/systemd/system/krb5-admin-server.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-10-10 07:57:36 UTC; 2min 40s ago
 Main PID: 16372 (code=exited, status=0/SUCCESS)

Oct 10 07:57:36 krb2001 systemd[1]: Stopping Kerberos 5 Admin Server...
Oct 10 07:57:36 krb2001 kadmind[16372]: finished, exiting
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 13
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 12
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 11
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 10
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 9
Oct 10 07:57:36 krb2001 kadmind[16372]: closing down fd 8
Oct 10 07:57:36 krb2001 systemd[1]: krb5-admin-server.service: Succeeded.
Oct 10 07:57:36 krb2001 systemd[1]: Stopped Kerberos 5 Admin Server.

The kadmin server seems to be stopping by itself, but kadmin.local works on 2001..

ayounsi unsubscribed.Oct 10 2019, 8:03 AM

In T226089#5562297, @elukey wrote:

The kadmin server seems to be stopping by itself, but kadmin.local works on 2001..

Self answer:

kadmin and kadmin.local are command-line interfaces to the Kerberos V5 administration system. They provide nearly identical functionalities; the difference is that kadmin.local directly accesses the KDC database, while kadmin performs operations using kadmind.

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 8:10 AM

Change 542062 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: ensure kadmind and rsync only on the master node

https://gerrit.wikimedia.org/r/542062

gerritbot added a project: Patch-For-Review.Oct 10 2019, 8:22 AM

Change 542062 merged by Elukey:
[operations/puppet@production] kerberos: ensure kadmind and rsync only on the master node

https://gerrit.wikimedia.org/r/542062

Seems fixed now. I believe the culprit was:

elukey@krb2001:~$ sudo systemctl cat krb5-kpropd.service
# /lib/systemd/system/krb5-kpropd.service
[Unit]
Description=Kerberos 5 slave KDC update server
Conflicts=krb5-admin-server.service   <==================
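
Why this line explains the flapping: with Conflicts=, starting one unit stops the other, and the relation works in both directions. If Puppet ensures that both krb5-kpropd and krb5-admin-server are running on the same host, every run starts one and thereby stops the other. A small Python model of that behaviour, using a shortened copy of the unit file above (the helper names are made up for the example):

```python
import configparser

UNIT_TEXT = """\
[Unit]
Description=Kerberos 5 slave KDC update server
Conflicts=krb5-admin-server.service

[Service]
ExecStart=/usr/sbin/kpropd -D $DAEMON_ARGS
"""


def conflicting_units(unit_text: str) -> list[str]:
    """Read the Conflicts= setting of a unit file (simple, single-line case only)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    return parser.get("Unit", "Conflicts", fallback="").split()


def start(unit: str, running: set, conflict_pairs: set) -> set:
    """Model of Conflicts=: starting `unit` stops every unit paired with it."""
    stopped = set()
    for pair in conflict_pairs:
        if unit in pair:
            stopped |= pair - {unit}
    return (running - stopped) | {unit}
```

Starting krb5-admin-server while krb5-kpropd is running leaves only krb5-admin-server, and the next start of krb5-kpropd flips it back. That is the stop/start cycle seen in the logs, and it goes away once kadmind is only ensured on the master node.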

Change 542067 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: add nagios process monitoring for kpropd

https://gerrit.wikimedia.org/r/542067

Change 542067 merged by Elukey:
[operations/puppet@production] kerberos: add nagios process monitoring for kpropd

https://gerrit.wikimedia.org/r/542067

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 10:10 AM

Change 542092 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: test kadmin failover/swap

https://gerrit.wikimedia.org/r/542092

gerritbot added a project: Patch-For-Review.Oct 10 2019, 12:02 PM

Change 542092 merged by Elukey:
[operations/puppet@production] kerberos: test kadmin failover/swap

https://gerrit.wikimedia.org/r/542092

In T226089#5559672, @MoritzMuehlenhoff wrote:

In T226089#5559492, @elukey wrote:

In T226089#5559190, @MoritzMuehlenhoff wrote:

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

And we should also have an Icinga check to ensure the replica is up-to-date.

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 1:10 PM

Change 542102 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: ensure resources that might change during failover

https://gerrit.wikimedia.org/r/542102

gerritbot added a project: Patch-For-Review.Oct 10 2019, 1:25 PM

Change 542102 merged by Elukey:
[operations/puppet@production] kerberos: ensure resources that might change during failover

https://gerrit.wikimedia.org/r/542102

In T226089#5562842, @MoritzMuehlenhoff wrote:

In T226089#5559672, @MoritzMuehlenhoff wrote:

In T226089#5559492, @elukey wrote:

In T226089#5559190, @MoritzMuehlenhoff wrote:

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

And we should also have an Icinga check to ensure the replica is up-to-date.

In theory this should be ensured by the replication script exiting with a zero return code, no?

In T226089#5562958, @elukey wrote:

In T226089#5562842, @MoritzMuehlenhoff wrote:

In T226089#5559672, @MoritzMuehlenhoff wrote:

In T226089#5559492, @elukey wrote:

In T226089#5559190, @MoritzMuehlenhoff wrote:

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

And we should also have an Icinga check to ensure the replica is up-to-date.

In theory this should be ensured by the replication script exiting with a zero return code, no?

Probably, needs a closer look. I was wondering about a case, where it runs, but garbles data or does an outdated copy. But OTOH we can still retrieve backups, so maybe that's actually not that much of a concern in practice.
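
One cheap way to approach the "is the replica up to date" question is to compare the principal lists of the two hosts. The Python sketch below assumes each listing is the text output of getprincs in kadmin.local, one principal per line. It only compares names, so it would catch a stale or truncated copy but not garbled keys.

```python
def principals(listing: str) -> set[str]:
    """Turn a one-principal-per-line listing into a set, ignoring blank lines."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def replica_drift(master_listing: str, replica_listing: str) -> dict[str, list[str]]:
    """Report principals that exist on only one side."""
    master = principals(master_listing)
    replica = principals(replica_listing)
    return {
        "missing_on_replica": sorted(master - replica),
        "unexpected_on_replica": sorted(replica - master),
    }
```

A monitoring check could alert whenever either list is non-empty for longer than one replication interval.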

Change 542112 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: test failover (part 2)

https://gerrit.wikimedia.org/r/542112

Change 542112 merged by Elukey:
[operations/puppet@production] kerberos: test failover (part 2)

https://gerrit.wikimedia.org/r/542112

Change 542133 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] kerberos: enable monitoring

https://gerrit.wikimedia.org/r/542133

Change 542133 merged by Elukey:
[operations/puppet@production] kerberos: enable monitoring

https://gerrit.wikimedia.org/r/542133

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 3:10 PM

Tested the failover and improved the puppet code to do proper clean ups when failing back to the original state. Tested a change in password, all worked as expected.

Finally added monitoring.

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Oct 10 2019, 3:15 PM

Change 542159 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kerberos::kadminserver: fix typo in monitoring

https://gerrit.wikimedia.org/r/542159

gerritbot added a project: Patch-For-Review.Oct 10 2019, 4:03 PM

Change 542159 merged by Elukey:
[operations/puppet@production] profile::kerberos::kadminserver: fix typo in monitoring

https://gerrit.wikimedia.org/r/542159

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 5:11 PM

elukey moved this task from Kerberos to Done on the User-Elukey board.Oct 11 2019, 7:04 AM

• Nuria closed this task as Resolved.Oct 24 2019, 7:08 PM
