11:41:16 <Majavah> where does the "k8s.toolsbeta.eqiad1.wikimedia.cloud." record live? it's not in .svc. so I don't see it on horizon
11:45:24 <Majavah> (same thing for tools.)
12:03:21 <arturo> it should be in the parent zone
12:03:47 <arturo> it was a mistake (my mistake), it should have been in the .svc subdomain since the beginning
12:04:06 <arturo> now changing it is complicated because it involves TLS certs
12:25:27 <Majavah> could it be changed to a cname to something on .svc so that it can be changed without cloudinfra access?
12:26:07 <Majavah> I'm working on adding keepalived ha for the kubernetes haproxies, alternatively I could just ask you to change it when it's ready
12:39:42 <arturo> Majavah: about the CNAME yeah I think we can do that. But I prefer not to do such thing on friday. Would you mind opening a phab task so we don't forget next week?
If we try it in toolsbeta and the certs all still validate, sure. My worry is that cert validation will break unless we make sure the name is also a valid altname on the k8s cluster's certs.
We may discover that it doesn't matter, but if it doesn't work out we will be rebuilding the toolsbeta cluster. It's really a question of how much we need to rebuild and change. If we are ok rebuilding the toolsbeta cluster a few times (it's about time, right?), we can experiment and get it right. I'll be surprised if we can just do what this task says and get away with it. I mean, we might. I don't remember what alt names are on the cert :)
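The "what alt names are on the cert" question can be answered without rebuilding anything by reading the SANs off the serving certificate. A sketch: the live endpoint and port in the comment are assumptions, and the runnable part below demonstrates the same inspection offline against a throwaway self-signed cert with made-up SANs.

```shell
# Against the live API server it would be something like (endpoint assumed):
#   echo | openssl s_client -connect k8s.toolsbeta.eqiad1.wikimedia.cloud:6443 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
#
# Same inspection demonstrated offline with a throwaway self-signed cert
# carrying two hypothetical SANs (requires OpenSSL 1.1.1+ for -addext):
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 1 -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:k8s.svc.example.test,DNS:k8s.example.test" 2>/dev/null
# Print only the SAN extension:
openssl x509 -in /tmp/demo-cert.pem -noout -ext subjectAltName
```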
I wonder if we can regenerate the certs via kubeadm with a different altname on the control plane or something. k8s clients validate the server name, and so do the kubelets and the control plane itself. Inside the cluster it's reached via a service name, which will probably still be valid since the service cluster names are on the cert anyway.
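If the experiment goes the kubeadm route, the usual pattern (a sketch, not verified against this cluster) is to add the extra name to `apiServer.certSANs` in the ClusterConfiguration and regenerate only the apiserver serving cert:

```yaml
# Sketch: ClusterConfiguration fragment adding the public name as an
# extra SAN on the apiserver cert. Names are taken from this task.
apiServer:
  certSANs:
    - "k8s.toolsbeta.eqiad1.wikimedia.cloud"      # the name being CNAMEd
    - "k8s.svc.toolsbeta.eqiad1.wikimedia.cloud"  # the CNAME target
```

With that in the kubeadm config, moving the old `/etc/kubernetes/pki/apiserver.crt` and `.key` aside and running `kubeadm init phase certs apiserver --config <file>` on each control node regenerates the serving cert with the extra SANs; the apiserver then needs a restart, since it runs as a static pod.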
It looks like I have to delete the record and recreate it as a CNAME. That means it will briefly cause some chaos on Toolforge when we do it in tools, so we might want to do it quickly via the CLI there. In toolsbeta, I can do it now.
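The delete-and-recreate could be scripted with the OpenStack DNS (designate) CLI so the window stays short. A sketch, with the zone and target written out as they appear in this task; it needs credentials for the project that owns the zone, and designate does not allow changing a recordset's type in place:

```shell
# Sketch only: swap the A record for a CNAME in one quick pass.
ZONE="toolsbeta.eqiad1.wikimedia.cloud."
NAME="k8s.toolsbeta.eqiad1.wikimedia.cloud."
# Older clients may want the recordset ID here; find it with:
#   openstack recordset list "$ZONE"
openstack recordset delete "$ZONE" "$NAME"
openstack recordset create "$ZONE" "$NAME" \
  --type CNAME --record "k8s.svc.toolsbeta.eqiad1.wikimedia.cloud."
```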
The caching is frustratingly strong here. The old A record is still being seen by the host somehow (though not by dig).
bstorm@toolsbeta-sgebastion-04:~$ kubectl get pods --all-namespaces
Unable to connect to the server: dial tcp 172.16.0.146:6443: connect: no route to host
For now, I cannot get the host to drop the cached A record. Restarting nscd did not help, and turning the old proxies off just brought everything down. My plan is to leave the old proxies up until the cache expires. That would also be the best process for tools anyway, and potentially a zero-downtime method.
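The "host sees it, dig doesn't" split is expected: dig queries the DNS server directly, while most programs resolve through NSS, which is where nscd's cache sits. A quick way to see which layer is serving the stale answer (shown against localhost so it runs anywhere; on the bastion you would substitute the k8s name):

```shell
# getent resolves through NSS (glibc), so it reflects nscd's cache;
# dig talks straight to DNS and bypasses nscd entirely. Comparing the
# two shows whether the stale answer comes from the host cache.
getent hosts localhost
# dig +short k8s.toolsbeta.eqiad1.wikimedia.cloud   # DNS view, no nscd
#
# If nscd is the culprit, invalidating just the hosts table is gentler
# than a full restart:
#   sudo nscd -i hosts
```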
IF this works once the cache drops off, that is. I think it will, but we should see it proven here before trying it in tools.
Oh! There's no stale cache on the control nodes.
bstorm@toolsbeta-test-k8s-control-4:~$ host k8s.toolsbeta.eqiad1.wikimedia.cloud
k8s.toolsbeta.eqiad1.wikimedia.cloud is an alias for k8s.svc.toolsbeta.eqiad1.wikimedia.cloud.
k8s.svc.toolsbeta.eqiad1.wikimedia.cloud has address 172.16.2.161
kubectl works there, and everything. I think that means this is a go. When the cache on the bastion drops off, I can clean up the old haproxies.