
Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade
Closed, Resolved · Public

Description

Right now, most k8s services, including etcd, use puppet certs (which are valid x.509 certs with the puppetmaster itself acting as the CA). The problem with this is that puppet is not really designed to be a CA for other services, so the puppet docs rather cryptically suggest not introducing SANs on certificates for clients (only for the puppetmaster).

As is evident from the subtasks of this task, SANs (Subject Alternative Names) are needed to make k8s consistent between DNS and the certs actually used.

It is also worth noting that client certs don't need to be issued by the same CA as the server certs, as long as the configs for the various services specify which CA cert to validate them against. This suggests we could keep using puppet for client certs even if we chose not to use it for the server-side certs (at the cost of some complexity).
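For illustration, a minimal sketch of how that split looks on the kube-apiserver command line (the paths are hypothetical, not our actual layout):

# The serving cert/key and the CA used to validate client certs are configured
# independently, so they do not have to come from the same CA. The etcd client
# credentials likewise point at their own CA (the puppet CA in this sketch).
kube-apiserver \
  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key \
  --client-ca-file=/var/lib/puppet/ssl/certs/ca.pem \
  --etcd-cafile=/var/lib/puppet/ssl/certs/ca.pem \
  --etcd-certfile=/var/lib/puppet/ssl/certs/HOSTNAME.pem \
  --etcd-keyfile=/var/lib/puppet/ssl/private_keys/HOSTNAME.pem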

Using puppet is somewhat attractive because we already have that infrastructure in place, and it distributes certs for us (or at least we already have to deal with its quirks). Puppet certs do not help with user certs, though: x.509 certs could be distributed for user groups, and that seems like the kind of thing maintain-kubeusers would be well suited for, with a fair number of changes. Turnkey PKI solutions for Linux are interesting, but it may be just as well to use a script and maintain a "CA" on one instance or another. The kubernetes api-server can act as a CA, but we are aiming for HA there.
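For a sense of scale, a script-maintained "CA" can be quite small. A rough openssl sketch (filenames, subjects, and validity periods are purely illustrative):

# One-time: create a CA key and a self-signed CA cert on the chosen instance.
openssl genrsa -out ca.key 4096
openssl req -x509 -new -key ca.key -sha256 -days 1825 -subj "/CN=toolforge-k8s-ca" -out ca.crt

# Per service/user: generate a key and CSR, then sign the CSR with the CA.
openssl genrsa -out service.key 2048
openssl req -new -key service.key -subj "/CN=some-service" -out service.csr
openssl x509 -req -in service.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 365 -sha256 -out service.crt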

There's enough here to initially gather tasks under and then merge them as we figure out what we are changing/fixing/solving/rejecting.

Related Objects

Event Timeline

Bstorm renamed this task from Figure out cert management for kubernetes and make it clear in documents, etc. for the upgrade to Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade. Feb 7 2019, 9:03 PM
Bstorm triaged this task as High priority.
Bstorm created this task.
aborrero subscribed.

I will give this a try soon.

I've been testing generating puppet certs with SANs, with no issues so far. It's just a hiera override plus re-generating the certs. Steps, for the record (a consolidated recap follows the list):

  1. puppet agent running clean in the affected VM
  2. hiera config (horizon):
profile::base::puppet::dns_alt_names: myAltName1,myAltName2
  3. run puppet agent to see the change in /etc/puppet/puppet.conf, you will see something like:
[agent]
server = openstack-puppetmaster-01.openstack.eqiad.wmflabs
dns_alt_names = myAltName1,myAltName2
  4. drop the SSL config in the puppet client, the usual rm -rf /var/lib/puppet/ssl
  5. clean up the old certificates in the puppetmaster: sudo puppet cert clean $FQDN
  6. ask again for the client certs, in the VM: sudo puppet agent -tv
  7. accept the certificate in the master, noting that we now allow the SANs: sudo puppet cert --allow-dns-alt-names sign $FQDN
  8. run puppet agent again in the VM: sudo puppet agent -tv
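The same dance collected into a single, untested sketch (the puppet cert commands run on the puppetmaster, the rest on the VM; $FQDN is the VM's FQDN):

# On the VM: wipe the old client SSL state (after setting the hiera override).
sudo rm -rf /var/lib/puppet/ssl
# On the puppetmaster: clean the old certificate for this VM.
sudo puppet cert clean $FQDN
# On the VM: request new certs, which now carry the SANs.
sudo puppet agent -tv
# On the puppetmaster: sign the new request, allowing the SANs.
sudo puppet cert --allow-dns-alt-names sign $FQDN
# On the VM: run the agent again to pick up the signed cert.
sudo puppet agent -tv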

I even wrote a quick & dirty cumin-based script to automate most of the dance, and you can test it in labpuppetmaster (the cumin server) and my testing VMs in the openstack CloudVPS project.
https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#switch_puppetmaster_for_CloudVPS_VMs

The result is:

root@arturo-k8s-test-2:~# openssl x509 -in /var/lib/puppet/ssl/certs/arturo-k8s-test-2.openstack.eqiad.wmflabs.pem -text -noout
Certificate:
[..]
            X509v3 Subject Alternative Name: 
                DNS:arturo-k8s-test-2.openstack.eqiad.wmflabs, DNS:myAltName1, DNS:myAltName2
[..]

Looking at the required certificates for K8s (https://kubernetes.io/docs/setup/certificates/#all-certificates) it seems we already have most of them covered by means of puppet?
The only weird thing I see is that system:masters requirement for the O= in the Subject.
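For reference, O= is just the Organization field of the client cert subject, which k8s maps to a group; system:masters is the group bound to cluster-admin. A hypothetical openssl sketch of producing such a CSR (names are made up):

# Key + CSR for an admin client cert: CN becomes the k8s username, O becomes the group.
openssl genrsa -out admin.key 2048
openssl req -new -key admin.key -subj "/O=system:masters/CN=kubernetes-admin" -out admin.csr
# The CSR would then be signed by the cluster CA (e.g. via kubeadm or the certificates API).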

So I had this question: How is the prod k8s doing? modules/k8s/manifests/ssl.pp seems to suggest that it just uses puppet certs, confirmed by cumin:

aborrero@cumin1001:~ $ sudo cumin --force "P{R:Class = k8s::apiserver}" 
5 hosts will be targeted:
acrab.codfw.wmnet,acrux.codfw.wmnet,argon.eqiad.wmnet,chlorine.eqiad.wmnet,neon.eqiad.wmnet

This is all for the basic k8s infra. I haven't started investigating user certificates yet, and that seems to be a different challenge in the case of Toolforge.

Since we now have a proposal for this, let's write it up in https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals

I think what we have now in toolsbeta is going to be great!

@Bstorm I'm assigning this task to you since it's mostly you working on this right now (the maintain-kubeusers work).

certdesign4k8s.png (399×733 px, 56 KB)

This describes essentially what we are now doing. Etcd client and server certs are simply the puppet certs (which should keep etcd flexible in case we need to set up routing into calico somewhere), while certs for users are x.509s generated using the certificates API of k8s. Node certs are generated by k8s as well using kubeadm (which interacts with the certs API using tokens). The certs to manage the CA and PKI are copied between k8s control plane nodes at build time. A new cluster will have a new CA, which honestly prevents leakage nicely.
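As an illustration of the user-cert flow through the certificates API, a hand-rolled sketch (not the actual maintain-kubeusers code; the subject and CSR name are made up):

# Generate a key and CSR for a tool account (CN = user, O = group; values are examples).
openssl genrsa -out tool.key 2048
openssl req -new -key tool.key -subj "/O=toolforge/CN=tool-example" -out tool.csr

# Submit the CSR to the k8s certificates API, approve it, and fetch the signed cert.
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: tool-example
spec:
  request: $(base64 -w0 < tool.csr)
  usages: ["digital signature", "key encipherment", "client auth"]
EOF
kubectl certificate approve tool-example
kubectl get csr tool-example -o jsonpath='{.status.certificate}' | base64 -d > tool.crt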

The admin users mentioned in that diagram are still theoretical, but there is no reason to require root to interact with a k8s API. It should be straightforward to add a service or a manually run script that maps the <project>.admin group to admin user accounts and places credentials in the appropriate locations (see the sketch below). That would allow Toolforge admins, and nobody else (everyone else needs to go through tool accounts), to interact with k8s as easily as they can with Grid Engine. This should simplify playbooks and procedures for dealing with misbehaving jobs and services, etc.
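A purely hypothetical sketch of what such a script could look like (the group name, paths, and the issue_client_cert helper are all assumptions, not existing tooling):

#!/bin/bash
# Hypothetical: enumerate members of the <project>.admin group and give each
# one a client cert / kubeconfig. issue_client_cert is a placeholder for
# whatever mechanism we settle on (e.g. the certificates API sketched above).
PROJECT=tools
for admin in $(getent group "${PROJECT}.admin" | cut -d: -f4 | tr ',' ' '); do
    issue_client_cert "$admin"   # placeholder, assumed to write /tmp/${admin}.kubeconfig
    install -d -o "$admin" -m 0700 "/home/${admin}/.kube"
    install -o "$admin" -m 0600 "/tmp/${admin}.kubeconfig" "/home/${admin}/.kube/config"
done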

Adding this to the discussion in order to raise the proposal for admin users, since that is a change in behavior from the original system, and to open the design proposal for comments/questions/rejection/redo in general.

Another proposal is enabling automatic rotation for kubelet certs so we don't have to manually re-issue them if we don't upgrade during the course of a year. Since upgrading via kubeadm does rotate the certs for all nodes, as long as there is at least one upgrade during a year, we'll be ok, but why chance it? https://kubernetes.io/docs/tasks/tls/certificate-rotation/#enabling-client-certificate-rotation

It will require reconfiguring the kubelets and restarting them. It's best if this is included in the kubeadm init config (though it might actually be necessary to do it via puppet or something instead, since some settings don't propagate on join, if I recall).
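For reference, a minimal, untested sketch of what enabling rotation could look like as a kubeadm config fragment (whether we set it there or via puppet is the open question above):

# Append a KubeletConfiguration block to the kubeadm init config enabling
# client cert rotation, so kubelets request new certs as expiry approaches.
cat <<EOF >> kubeadm-init.yaml
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true
EOF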

Ideas from 2019-08-27 team meeting:

  • Make sure we have monitoring and alerting for cert expiration
  • stick with default 1 year or bump up to 2 year default
  • plan on 6 month upgrade cycle for Kubernetes itself

As of k8s 1.8, I think there's a prometheus metric for cert expiry: https://github.com/kubernetes/kubernetes/pull/51031

We'll get to see what that looks like when a more recent kubelet has prometheus monitoring.
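If that metric pans out, the alerting side could look roughly like the following (an assumption-heavy sketch: apiserver_client_certificate_expiration_seconds is the histogram that PR appears to add, and the 30-day threshold is arbitrary):

# Hypothetical Prometheus rule: warn when client certs presented to the apiserver
# are on track to expire within 30 days.
cat <<EOF > k8s-cert-expiry-rules.yaml
groups:
  - name: k8s-cert-expiry
    rules:
      - alert: ClientCertExpiringSoon
        expr: histogram_quantile(0.01, sum by (le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "A client certificate used against the apiserver expires in under 30 days"
EOF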

Bstorm changed the task status from Open to Stalled. Nov 8 2019, 11:34 PM

So this is on hold waiting for monitoring to show the new kubelets (which it should soon if it doesn't already): T237643

Change 550673 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: add wmcs-k8s-get-cert.sh script

https://gerrit.wikimedia.org/r/550673

Change 550673 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: add wmcs-k8s-get-cert.sh script

https://gerrit.wikimedia.org/r/550673

@aborrero Do you feel we are done with this at this point? We may have more to do like recovery procedures, but it feels pretty good for now.

The docs are more or less in shape. We may want to create a section about how to re-encrypt everything (which almost became necessary the other day when we lost labs/private), but other than that, yeah.