
Create new k8s certificate for PAWS cluster
Closed, Resolved · Public

Description

It seems the k8s certificate for the PAWS cluster has expired. Immediate symptoms are that spawns are not working (timing out) and k8s access returns an error: error: You must be logged in to the server (the server has asked for the client to provide credentials).
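(For context, the failure mode when hitting the apiserver with the expired credentials looks like the error quoted above; the command below is illustrative, using the admin kubeconfig on the master, not copied from the task log:)

kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
error: You must be logged in to the server (the server has asked for the client to provide credentials)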

Event Timeline

Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

This has basically made it so the cluster is inaccessible for anyone, including toolforge admins.
The cert needs to be reset, but the apiserver's cert has likely expired as well, which is a bigger problem that can turn the whole thing into a real mess.

Verified with openssl that the cert *is* expired:

Validity
    Not Before: Jan  9 16:22:21 2019 GMT
    Not After : Jan  9 16:22:25 2020 GMT
Bstorm raised the priority of this task from High to Unbreak Now!. Jan 9 2020, 5:34 PM

This means the apiserver cert is also invalid.
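For reference, a sketch of that kind of manual check, assuming the default kubeadm pki layout under /etc/kubernetes/pki (paths and the s_client variant are assumptions, not commands copied from this task):

# Check the on-disk certs
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates
# Or inspect the cert the apiserver actually serves
echo | openssl s_client -connect 172.16.2.205:6443 2>/dev/null | openssl x509 -noout -dates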

For anyone who runs into this in the future, on this version of k8s (which is 1.9-ish), you are looking at doing something like this: https://github.com/kubernetes/kubeadm/issues/581#issuecomment-421477139 with edits for the environment. We had to do something like this once already when we changed the IP addresses of everything to move to the new region.
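Roughly, the procedure from that comment looks like the sketch below on the master, assuming kubeadm's pre-1.15 alpha phase subcommands and default file locations (an outline with assumed paths, not the exact commands run here):

# Back everything up before touching it
cp -a /etc/kubernetes /root/kubernetes.bak && mkdir -p /root/old-pki
# Move the expired leaf certs aside (keep ca.crt/ca.key) so kubeadm will regenerate them
cd /etc/kubernetes/pki
mv apiserver.crt apiserver.key apiserver-kubelet-client.crt apiserver-kubelet-client.key \
   front-proxy-client.crt front-proxy-client.key /root/old-pki/
kubeadm alpha phase certs all --apiserver-advertise-address 172.16.2.205
# Regenerate the kubeconfigs that embed client certs, then restart kubelet
mv /etc/kubernetes/{admin.conf,kubelet.conf,controller-manager.conf,scheduler.conf} /root/old-pki/
kubeadm alpha phase kubeconfig all --apiserver-advertise-address 172.16.2.205
systemctl restart kubelet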

1.15 provides much better tooling for this (even if it is in alpha). More importantly, upgrading the cluster recycles the certs, which is the real miss here.
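For comparison, the 1.15-era tooling mentioned above amounts to something like this (still under the alpha subcommand in that release; shown for reference, not run on this cluster):

# Report expiry for all kubeadm-managed certificates
kubeadm alpha certs check-expiration
# Renew them all in place, reusing the existing CA
kubeadm alpha certs renew all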

Mentioned in SAL (#wikimedia-cloud) [2020-01-09T17:46:28Z] <bstorm_> refreshing the paws cluster's entire x509 environment T242353

root@tools-paws-master-01:~# kubectl get nodes
NAME                     STATUS    ROLES     AGE       VERSION
tools-paws-master-01     Ready     master    1y        v1.9.4
tools-paws-worker-1001   Ready     <none>    1y        v1.9.4
tools-paws-worker-1002   Ready     <none>    1y        v1.9.4
tools-paws-worker-1003   Ready     <none>    1y        v1.9.4
tools-paws-worker-1005   Ready     <none>    1y        v1.9.4
tools-paws-worker-1006   Ready     <none>    1y        v1.9.4
tools-paws-worker-1007   Ready     <none>    1y        v1.9.4
tools-paws-worker-1010   Ready     <none>    364d      v1.9.4
tools-paws-worker-1013   Ready     <none>    1y        v1.9.4
tools-paws-worker-1016   Ready     <none>    1y        v1.9.4
tools-paws-worker-1017   Ready     <none>    1y        v1.9.4
tools-paws-worker-1019   Ready     <none>    1y        v1.9.4

Looking a bit better. Per the github comment, I doubt a reboot or reconnect of the nodes is necessary on version 1.9.4. Things are looking much better.

Aaaand we cannot receive logs from the kube-proxy on the nodes. This suggests we do, in fact, need to reconnect it all.
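The failing check was along these lines; kubectl log retrieval is proxied through the kubelet on each node, so errors here point at node-side certs rather than the apiserver (the pod name below is a placeholder):

kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system logs kube-proxy-xxxxx   # fails/times out against the node's kubelet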

Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:06:24Z] <bstorm_> rebooting tools-paws-master-01 T242353

Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:06:33Z] <bstorm_> rebooting tools-paws-master-01 T242353

After reboot, I can get the appropriate logs and paws is working.
I'm going to validate the kubelet client certs and hold off on rejoining the nodes if those certs are still valid. Some versions rotate them automatically.

Nope, manual checks show the certs are expired, which is likely to cause random issues. Fixing that now.
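A sketch of that manual check, assuming the client cert is embedded in kubelet.conf as it is on a kubeadm 1.9 node (paths assumed):

# Decode the client cert embedded in kubelet.conf and print its validity dates
grep 'client-certificate-data' /etc/kubernetes/kubelet.conf \
  | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
# If client cert rotation were enabled, rotated certs would live here instead
ls /var/lib/kubelet/pki/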

Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:25:36Z] <bstorm_> re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353

Mentioned in SAL (#wikimedia-cloud) [2020-01-09T18:26:01Z] <bstorm_> re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353

For posterity: on 1.9.4 at least, the best rejoin command seems to be:

systemctl stop kubelet && mv /etc/kubernetes/kubelet.conf{,.old} && mv /etc/kubernetes/pki/ca.crt{,.old} && kubeadm join --token=<seecret>  172.16.2.205:6443 --discovery-token-unsafe-skip-ca-verification
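The token value is redacted above; on the master a valid bootstrap token can be listed or minted with the standard kubeadm subcommands before running the join (not copied from this task's log):

kubeadm token list
kubeadm token create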

The certs are renewed across the cluster now. Hopefully, we'll have it on a supported model before we have to do this again.