During T329633 (Upgrade the aux-eqiad cluster to k8s 1.23), @elukey discovered that the calico and coredns services were having issues authenticating against the kubernetes API.
From calico-typha:
Failed to determine migration requirements error=unable to query ClusterInformation to determine Calico version: connection is unauthorized: Unauthorized
From calico-node:
[WARNING][9] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized
Some coredns pods were looping on:
[INFO] plugin/ready: Still waiting on: "kubernetes"
kube-apiserver was logging:
authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
The investigation revealed that with the switch to PKI certificates, each control-plane node now has its own private key to sign (in-cluster) service account (JWT) tokens with (kube-apiserver parameter --service-account-signing-key-file, see command-line-tools-reference). As a consequence, these tokens can only be validated with the matching public key, i.e. only by the control-plane that signed the token in the first place. Requests to the other control-planes are rejected with the above error.
Unfortunately it is not possible to use the intermediate CA to validate the tokens, which means that all control-planes need access to the public keys of all other control-planes in order to be able to validate tokens (or they have to share a private key, of course). The only documentation regarding this I could come up with (unfortunately the best-practice guide does not provide any details on multi-control-plane setups) is https://v1-23.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#manual-certs which suggests that the keys for the general and front-proxy intermediates need to be shared as well. I don't think that is required; those keys are not referenced/used anywhere. What is referenced/used is the public key of both intermediates for validating requests (but that is obviously already shared).
In manual tests we provided all of the public keys used for service account signing to each kube-apiserver (by specifying --service-account-key-file multiple times), which seemed to work but absolutely needs more testing.
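For illustration, a minimal sketch of the flag combination from those tests, assuming three control-planes and hypothetical file names (the real paths differ):

# Each control-plane signs with its own private key but trusts all public keys
kube-apiserver \
  --service-account-signing-key-file=/etc/kubernetes/pki/sa-cp1.key \
  --service-account-key-file=/etc/kubernetes/pki/sa-cp1.pub \
  --service-account-key-file=/etc/kubernetes/pki/sa-cp2.pub \
  --service-account-key-file=/etc/kubernetes/pki/sa-cp3.pub
  # (remaining flags omitted)

A token signed by any of the three private keys would then be accepted by every kube-apiserver.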
The reason for this being discovered so late in the process is (besides me not realizing it, obviously ;)) that the wikikube staging clusters run a single master - which we should definitely change now: T329827
Problem
We need to either use only one certificate for service account signing, or find a way to sync the public keys of all certificates in use to all masters.
Workaround
The current workaround to be able to continue the cluster upgrades, especially wikikube-codfw next week, is to keep using the cergen-generated certificate (which is shared between control-planes) for service account signing for now (see attached patches). This is already set up for the aux cluster (T329633).
Requirements
- No manual certificate management (like with cergen) / Cert(s) should be generated by PKI
- Refreshed certificate(s) should be rolled out automatically
- Ensure that not all kube-apiservers are restarted at the same time (for a cert refresh, as hot-reloading is not supported)
- The process should not take too long (TBD). In case a certificate key changes, the new public key should be known to all kube-apiservers before a service account token is signed with the new key. This is an edge case though (for example when reimaging a control-plane), as keys don't usually change.
Solution 1: Store public keys as facts in puppet
Have all control-planes generate a certificate and store the public key as a puppet fact. All control-planes can then fetch the public keys on the next puppet run and restart kube-apiserver on change.
Making the keys an exported resource is not possible because they are generated on the node's filesystem via a shell-out, so they never really "reach" puppet.
We would need to create a fact that stores the public key from a particular, hardcoded path like:
Facter.add(:kubernetes_service_account_signing_key) do
  confine { File.exist?("/etc/kubernetes/service_account_key.pem") }
  setcode do
    File.read("/etc/kubernetes/service_account_key.pem")
  end
end
Then use puppetdb::query_facts to collect these facts from all control-planes and pass all public keys to kube-apiserver.
New public keys would be made available after at most 30 minutes (control-plane1's puppet run refreshes the cert and updates the fact; control-plane2's next puppet run can then fetch the new public key).
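The hardcoded path the fact reads could, for example, be populated from the PKI-issued certificate on each refresh. A sketch using openssl (paths are hypothetical):

# Extract the PEM-encoded public key from the signing certificate
# so the fact can pick it up on the next puppet run
openssl x509 -in /etc/kubernetes/pki/sa-signing.pem -pubkey -noout \
  > /etc/kubernetes/service_account_key.pem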
Solution 2: Sync via kubernetes etcd and confd (CHOSEN)
Have all control-planes generate a certificate and make puppet on the control-planes upload the public key to the kubernetes etcd (control-planes already depend on etcd). A confd setup on all control-planes could then pull all those public keys onto each control-plane and restart kube-apiserver (with an etcd lock) if a key has been changed/updated.
New public keys would be made available almost immediately (depending on the random jitter).
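A minimal sketch of the mechanics, assuming a hypothetical etcd prefix and file paths (not the actual setup):

# Puppet on each control-plane uploads its public key under a per-host etcd key
etcdctl put "/kubernetes/service-account-keys/$(hostname -f)" \
  "$(cat /etc/kubernetes/service_account_key.pem)"

# confd on every control-plane watches that prefix and concatenates all values
# into a single file (--service-account-key-file accepts a file containing
# multiple PEM keys), e.g. with a template like
#   {{range gets "/kubernetes/service-account-keys/*"}}{{.Value}}
#   {{end}}
# and a reload_cmd triggering the coordinated restart unit described below.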
Solution 3: Use only one cert
Have a different entity (a cumin host, for example) issue a certificate, copy the private and public key to all control-planes (via ssh) and issue kube-apiserver restarts (with an etcd lock), as sketched below.
Or add a cookbook that:
- checks the validity of the cert; if it's more than N days away from expiration, exit
- if it's about to expire, disable puppet on all the k8s masters
- regenerate the cert with the pki, commit it to puppet
- reenable and run puppet on the k8s masters, one at a time
New public keys would be made available almost immediately.
Downside: kubernetes clusters will depend on an external system to operate.
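A rough sketch of the push variant, with hypothetical hostnames and file names (the cookbook variant would wrap the same steps):

# Copy the shared signing key pair to every control-plane and restart serially
for host in kubemaster1001.eqiad.wmnet kubemaster1002.eqiad.wmnet; do
  scp sa-signing.key sa-signing.pub "${host}:/etc/kubernetes/pki/"
  ssh "${host}" etcdctl lock apiserver-restart systemctl restart kube-apiserver
done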
Orchestrate kube-apiserver restarts
To ensure that only one apiserver restarts at a given time, we could use an etcd lock (in the kubernetes cluster's etcd):
etcdctl lock apiserver-restart systemctl restart kube-apiserver
This is implemented using a second systemd unit, kube-apiserver-safe-restart.service, which is called by confd as well as notified by puppet on relevant changes. This means that kube-apiserver restarts issued by any automation will be coordinated.
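A minimal sketch of what such a unit could look like (the actual unit may additionally set etcd endpoints and credentials):

[Unit]
Description=Restart kube-apiserver while holding an etcd lock

[Service]
Type=oneshot
ExecStart=/usr/bin/etcdctl lock apiserver-restart /usr/bin/systemctl restart kube-apiserver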
Todo:
- Configure all clusters to sign new tokens with their own PKI certs
- Drop all service account token secrets from the API that are still signed by cergen certs (see the sketch after this list)
- There is a list with the names of all deleted secrets in deploy1002:/home/jayme/kube-apiserver-sa/deleted_service_account_tokens.log
- Configure all clusters to no longer trust cergen signed tokens
- Remove cergen certs from (private) puppet, documentation etc
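For reference, a sketch of one blunt way to drop the token secrets: deleting them all and letting the token controller recreate them, signed with the new key (namespaces and names are whatever the query returns):

# Delete all service-account token secrets so they get recreated and re-signed
kubectl get secrets --all-namespaces --no-headers \
  --field-selector type=kubernetes.io/service-account-token \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
while read -r ns name; do
  kubectl -n "${ns}" delete secret "${name}"
done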