Description

It seems the k8s certificate for the PAWS cluster has expired. Immediate symptoms are that spawns are not working (timing out) and k8s access returns an error: error: You must be logged in to the server (the server has asked for the client to provide credentials).

Event Timeline
This has basically made it so the cluster is inaccessible for anyone, including toolforge admins.
The cert needs to be reset, but the apiserver's cert is likely also expired, which is a bigger problem and could turn the whole recovery into a mess.
Verified with openssl that the cert *is* expired:
Validity
    Not Before: Jan  9 16:22:21 2019 GMT
    Not After : Jan  9 16:22:25 2020 GMT
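The expiry check above can be reproduced with openssl; the sketch below generates a throwaway self-signed cert as a stand-in for the real apiserver cert (the actual path on the master would be something like /etc/kubernetes/pki/apiserver.crt):

```shell
# Create a short-lived throwaway cert to demonstrate the check
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=paws-demo" \
    -keyout /tmp/demo.key -out /tmp/demo.crt -days 1 2>/dev/null

# Print the expiry date, as was done against the real apiserver cert
openssl x509 -in /tmp/demo.crt -noout -enddate

# -checkend N exits non-zero if the cert expires within N seconds,
# so -checkend 0 is a simple "is it expired right now" test
if openssl x509 -in /tmp/demo.crt -noout -checkend 0; then
    echo "cert still valid"
else
    echo "cert EXPIRED"
fi
```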
For anyone who runs into this in the future, on this version of k8s (which is 1.9-ish), you are looking at doing something like this: https://github.com/kubernetes/kubeadm/issues/581#issuecomment-421477139 with edits for the environment. We had to do something like this once already when we changed the IP addresses of everything to move to the new region.
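A hedged sketch of that linked procedure, adapted for a kubeadm ~1.9 control plane: back up the pki directory, delete the expired leaf certs (keeping the CA), and regenerate them with the `kubeadm alpha phase` subcommands of that era. Subcommand names and flags should be verified against the exact kubeadm version; the advertise address is the master IP from this cluster. The block is guarded so it is a no-op on a host without kubeadm:

```shell
# Sketch only -- verify against your kubeadm version before running for real.
if command -v kubeadm >/dev/null 2>&1 && [ -d /etc/kubernetes/pki ]; then
    # Back up everything first
    cp -a /etc/kubernetes/pki /root/pki-backup

    # Remove the expired serving/client certs, but keep the CA pair
    cd /etc/kubernetes/pki
    rm -f apiserver.crt apiserver.key \
          apiserver-kubelet-client.crt apiserver-kubelet-client.key \
          front-proxy-client.crt front-proxy-client.key

    # Regenerate them from the existing CA
    kubeadm alpha phase certs apiserver --apiserver-advertise-address 172.16.2.205
    kubeadm alpha phase certs apiserver-kubelet-client
    kubeadm alpha phase certs front-proxy-client

    # The kubeconfigs embed client certs, so regenerate those too,
    # then restart the control-plane components / kubelet
    kubeadm alpha phase kubeconfig all --apiserver-advertise-address 172.16.2.205
else
    echo "kubeadm/pki not present on this host; see the linked issue for the full procedure"
fi
```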
1.15 provides much better tooling for this (even if it is in alpha). More importantly, upgrading the cluster recycles the certs, which is the real miss here.
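For reference, the 1.15-era alpha tooling mentioned above looks roughly like this (guarded so it only runs where kubeadm is installed):

```shell
if command -v kubeadm >/dev/null 2>&1; then
    # Print a table of every control-plane cert and when it expires
    kubeadm alpha certs check-expiration
    # Renew everything under /etc/kubernetes/pki in one shot
    kubeadm alpha certs renew all
else
    echo "kubeadm not available here; commands shown for reference"
fi
```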
Mentioned in SAL (#wikimedia-cloud) [2020-01-09T17:46:28Z] <bstorm_> refreshing the paws cluster's entire x509 environment T242353
root@tools-paws-master-01:~# kubectl get nodes
NAME                     STATUS    ROLES     AGE       VERSION
tools-paws-master-01     Ready     master    1y        v1.9.4
tools-paws-worker-1001   Ready     <none>    1y        v1.9.4
tools-paws-worker-1002   Ready     <none>    1y        v1.9.4
tools-paws-worker-1003   Ready     <none>    1y        v1.9.4
tools-paws-worker-1005   Ready     <none>    1y        v1.9.4
tools-paws-worker-1006   Ready     <none>    1y        v1.9.4
tools-paws-worker-1007   Ready     <none>    1y        v1.9.4
tools-paws-worker-1010   Ready     <none>    364d      v1.9.4
tools-paws-worker-1013   Ready     <none>    1y        v1.9.4
tools-paws-worker-1016   Ready     <none>    1y        v1.9.4
tools-paws-worker-1017   Ready     <none>    1y        v1.9.4
tools-paws-worker-1019   Ready     <none>    1y        v1.9.4
Looking much better. Per the github comment, a reboot or reconnect of the nodes shouldn't be necessary on 1.9.4.
Aaaand we cannot receive logs from the kube-proxy on the nodes. This suggests we do, in fact, need to reconnect it all.
Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:06:24Z] <bstorm_> rebooting tools-paws-master-01 T242353
After reboot, I can get the appropriate logs and paws is working.
I'm going to validate kubelet client certs and hold off on rejoining them if they are working right. Some versions rotate them automatically.
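The manual check can be done per node with openssl against the kubelet's client cert; the path below is the kubeadm default for this era of k8s (an assumption — newer versions use kubelet-client-current.pem instead), and the block is guarded so it is a no-op where the file doesn't exist:

```shell
# Default kubeadm location for the kubelet client cert on ~1.9 (assumption)
CERT=/var/lib/kubelet/pki/kubelet-client.crt
if [ -f "$CERT" ]; then
    openssl x509 -in "$CERT" -noout -enddate
    # Exits non-zero if already expired
    openssl x509 -in "$CERT" -noout -checkend 0 || echo "kubelet client cert EXPIRED"
else
    echo "no kubelet client cert at $CERT on this host"
fi
```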
Nope, the certs are expired on manual checks, which is likely to cause random issues. Fixing that now.
Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:25:36Z] <bstorm_> re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353
Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:26:01Z] <bstorm_> re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353
For posterity: on 1.9.4 at least, the best rejoin command seems to be:
systemctl stop kubelet && mv /etc/kubernetes/kubelet.conf{,.old} && mv /etc/kubernetes/pki/ca.crt{,.old} && kubeadm join --token=<seecret> 172.16.2.205:6443 --discovery-token-unsafe-skip-ca-verification
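The one-at-a-time rejoin could be wrapped per node like the hypothetical sketch below. The drain/uncordon steps are an added precaution, not something the log says was done, and NODE is a placeholder; the token stays redacted as in the log:

```shell
# Hypothetical per-node wrapper around the rejoin command above.
NODE=tools-paws-worker-1001   # placeholder; iterate over the real node list
if command -v kubectl >/dev/null 2>&1; then
    # Evict workloads before touching the kubelet (1.9-era flag names)
    kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
    # Run the rejoin on the node itself (remote shell must support brace expansion)
    ssh "$NODE" 'systemctl stop kubelet &&
        mv /etc/kubernetes/kubelet.conf{,.old} &&
        mv /etc/kubernetes/pki/ca.crt{,.old} &&
        kubeadm join --token=<seecret> 172.16.2.205:6443 --discovery-token-unsafe-skip-ca-verification'
    kubectl uncordon "$NODE"
else
    echo "kubectl not available here; shown for reference"
fi
```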
The certs are renewed across the cluster now. Hopefully, we'll have it on a supported model before we have to do this again.