T379968 revealed a big flaw in the system we use to distribute the public keys used to validate service-account tokens of "other" control planes (T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen)).
The way this currently works is:
- A service-account (sa) cert and key are issued via PKI (puppet) on a control-plane (wikikube-ctrl1001.eqiad.wmnet)
- Puppet triggers the kube-publish-sa-cert systemd service which stores the public key of the sa certificate in etcd (as /kube-apiserver-sa-certs/wikikube-ctrl1001.eqiad.wmnet)
- All control-planes watch for changes at /kube-apiserver-sa-certs/ and, on change, dump all public keys which don't match their own FQDN into /etc/kubernetes/pki/kube-apiserver-sa-certs.pem, then restart kube-apiserver via systemctl restart kube-apiserver-safe-restart.service
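The filtering step in the last bullet can be sketched roughly as below. This is only an illustration of the described behaviour, not the code that actually runs on the control-planes: the function names foreign_sa_keys and dump_foreign_sa_certs are made up here, and the endpoint is assumed to follow the same https://FQDN:2379 pattern used elsewhere in this task.

```shell
#!/bin/bash
# Prefix taken from the description above.
SA_CERT_PREFIX="/kube-apiserver-sa-certs"

# Hypothetical helper: read etcd keys on stdin, drop empty lines and the
# key that belongs to the given FQDN, print the remaining ("foreign") keys.
foreign_sa_keys() {
  local own_key="${SA_CERT_PREFIX}/$1"
  # grep exits 1 when nothing is left over; that is not an error here.
  sed '/^$/d' | grep -v -x -F -- "$own_key" || true
}

# Hypothetical helper: write the values of all foreign keys into one file.
# Needs etcdctl with ETCDCTL_API=3; the endpoint is an assumption.
dump_foreign_sa_certs() {
  local fqdn="$1" out="$2" endpoint="https://$1:2379" tmp
  tmp="$(mktemp)"
  etcdctl --endpoints "$endpoint" get --prefix "${SA_CERT_PREFIX}/" --keys-only |
    foreign_sa_keys "$fqdn" |
    while read -r key; do
      etcdctl --endpoints "$endpoint" get "$key" --print-value-only
    done > "$tmp"
  # Replace the target file only once the dump is complete.
  mv "$tmp" "$out"
}
```

Because the etcd key is the FQDN, each host has exactly one slot under the prefix, and this is what makes a reimage replace the previous public key.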
Reimaging a control-plane leads to its service-account cert and key being created from scratch (new filesystem, new key). That in turn overwrites the sa public key stored in etcd with the new one, which practically renders all service-account tokens that were issued with the old key (pre reimage) unusable, as they can no longer be validated.
A (not very glamorous) idea to circumvent this would be to no longer use the FQDN as the key in etcd but the fingerprint instead. That would ensure the old public key sticks around (as it will not be overwritten). I don't think filtering out the "own" public key is strictly required.
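A minimal sketch of how such a fingerprint-based etcd key could be derived. It assumes the SHA-256 fingerprint of the certificate is used (a digest of the public key alone would be an alternative), and the function names fingerprint_to_etcd_key and sa_cert_etcd_key are hypothetical:

```shell
#!/bin/bash
# Prefix taken from the description above.
SA_CERT_PREFIX="/kube-apiserver-sa-certs"

# Hypothetical helper: turn the output of
#   openssl x509 -noout -fingerprint -sha256
# (for example "sha256 Fingerprint=AB:CD:...") read from stdin into an
# etcd key by stripping the label and colons and lowercasing the hex digits.
fingerprint_to_etcd_key() {
  local fp
  fp="$(cut -d= -f2 | tr -d ':\n' | tr 'A-F' 'a-f')"
  printf '%s/%s\n' "$SA_CERT_PREFIX" "$fp"
}

# Hypothetical helper: compute the etcd key for a PEM certificate file.
sa_cert_etcd_key() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 | fingerprint_to_etcd_key
}
```

With this scheme a reimaged host would publish its new certificate under a new key, so the entry for the old certificate stays in etcd until it is removed explicitly.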
In addition we would have to make sure that old certificates are cleaned up from etcd. This could be done by extending kube-publish-sa-cert to iterate over all certificates and remove those that are expired:
export ETCDCTL_API=3
etcdctl --endpoints "https://$(hostname -f):2379" get --prefix /kube-apiserver-sa-certs/ --keys-only | sed '/^$/d' | \
while read key; do
  etcdctl --endpoints "https://$(hostname -f):2379" get "$key" | \
  openssl x509 -in /dev/stdin -checkend 1 -noout >/dev/null || etcdctl --endpoints "https://$(hostname -f):2379" del "$key";
done