I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Tue, Jun 2
So far so good. It's still replicating. It hasn't fully caught up yet, though.
I think we are done with this one!
I think we are done then!
Please reopen if this isn't fixed! Thanks
This needs an overhaul before more docs are done.
Mon, Jun 1
That leaves what else? Prometheus? @aborrero
Or did we have something else where we made certs by hand?
Using an operations-pod for maintain-kubeusers (so that I could install the openssl package in the pod):
# echo | openssl s_client -showcerts -servername registry-admission.registry-admission.svc -connect registry-admission.registry-admission.svc:443 2>/dev/null | openssl x509 -noout -dates
notBefore=Jun 1 23:23:00 2020 GMT
notAfter=Jun 1 23:23:00 2021 GMT
That tells me it worked!
I'll do it in tools. Also I'll document the process in the README files of the controllers. I did restart the pods just in case.
Cool thing: I can just re-run the scripts I've got for the controllers. It works great on minikube. I'll run it in toolsbeta and delete the pods so they restart, if needed.
Fri, May 29
I was about to just use my cert scripts, but they won't do. I need to mess with them a bit to get the admission controller certs renewed...ideally with a simple argument or something to say "just renew the certs". A second "create" with the same name will fail. It should update the existing secrets with new certs with all the appropriate alt-names for doing SSL termination.
Seconds_Behind_Master: 213890 sounds a lot better. Let's see if it actually catches up.
So after merging that, I realized that the escapes don't seem to be correct for a table vs a wild_table, so I stopped puppet and made it a wildcard entry. That seems to have got replication moving again.
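For reference, the distinction that bit me, as hypothetical my.cnf lines (a sketch, not the exact entries I merged): replicate-ignore-table takes a literal db.table name with no wildcard handling, while the replicate-wild-* options use LIKE-style patterns, so literal underscores need escaping there:
replicate-ignore-table      = s51245__totoazero.maj_articles_recents
replicate-wild-ignore-table = s51245\_\_totoazero.maj\_articles\_recents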
Finally circled back and added that information! What do folks think now?
Surprisingly, only labsdb1009 and db1141 now remain.
Actually, just succeeded in running on labsdb1010 by chance.
This just needs a run on labsdb1010. When I do the run for T252219 against 1009 and 1010, that will finish this off.
I can! I will do that.
If we end up not merging all of my coming pull request into the paws repo, I'll make a separate PR with just the new image.
Apparently my version works! That makes this stable and I'll close the ticket.
Ok so they work again for now, and I don't plan on restarting the pod for a while. I may try building my newer version of the image (not Debian Jessie-based) and testing it with a tag that will allow quick rollback if it doesn't work.
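If I do try it, I'd build it under a separate tag so rollback is just re-pointing at the previous one (a sketch; the registry, image name, and tag are placeholders):
docker build -t <registry>/<image>:<new-tag> .
docker push <registry>/<image>:<new-tag>
# if it misbehaves, point the deployment back at the previous tag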
@Marostegui I don't think I'll be able to squeeze this in on 1009-11 (or db1141) without depooling, except perhaps for a first attempt on 1009, which is the most likely one to succeed without it. You've been doing a lot of work on 1011 and friends, so I thought I'd check in before I start that rotation on Monday. Ok to proceed?
It likely would have lost its database connection during T253738, and it would have been restarted when we upgraded its k8s nodes yesterday in T246122. If it's working again, I'd be willing to bet that's why and think this can be closed.
So stupid question, @Marostegui: if I do that patch and restart mysqld on the slave without restarting the master, will that work, or do I need to CHANGE REPLICATION FILTER with the whole mess?
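To spell out the two routes I'm weighing (a sketch only; the pattern is just a stand-in, and the online statement is a MySQL 5.7+ thing, so it may not even apply here):
-- route 1: add the filter to the replica's my.cnf and restart mysqld on the replica
-- route 2: set it online on the replica, if the server supports it:
STOP SLAVE SQL_THREAD;
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('s51245\_\_totoazero.%');
START SLAVE SQL_THREAD;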
Thu, May 28
Never mind! The blasted config is the default on this version: RotateKubeletClientCertificate=true|false (BETA - default=true) from https://v1-16.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
I did confirm our control plane certs look right.
Looking deeper into things, I think kubeadm is confusingly documented (we knew that). In order to upgrade the client cert for kubelet, we can simply set the kubelets to do it for us with a feature gate. The settings are here https://kubernetes.io/docs/tasks/tls/certificate-rotation/
This is distinct from *serving certificate rotation*, which we deliberately avoided. I'll make another task and a patch to add the args to our kubelets.
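Roughly what I have in mind for the kubelet args, per those upstream docs (a sketch, not the final puppet change):
--rotate-certificates=true
# or, equivalently on this version, the feature gate:
--feature-gates=RotateKubeletClientCertificate=true
# deliberately NOT adding --rotate-server-certificates, since we avoid serving-cert rotation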
We discovered that there is a bug in kubeadm < 1.17 that sets renew-certs to false on node upgrades. The control plane certs rotated fine, but the kubelet certs of worker nodes did not. https://github.com/kubernetes/kubeadm/issues/1818 This is also referenced in the docs here https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/
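A quick way to see whether a given node is affected (a sketch; paths/commands are the kubeadm defaults for this version):
sudo openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
# and for the control plane certs:
sudo kubeadm alpha certs check-expiration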
The backports repo is gone. It needs to change to the debian archive. I'll get that fixed shortly.
Bad news: the image fails to build.
It's looking good after a short problem:
Turns out that we are planning on restarting that pod today. I'll rebuild the image.
Ahh, yeah, oops. I see Seconds_Behind_Master: NULL and dead replication threads all over the logs. So @Marostegui, were the earlier ones, which I think are timestamped before T253738#6169013, already fixed? If so, then I'm only aware of s51245__totoazero.maj_articles_recents.
I'd forgotten to check for deprecated objects by the end of the day yesterday, but I checked this morning in Toolsbeta...and there may not be any there. As I recall, I already replaced all the PSPs in tools and toolsbeta, and the deployments there are replaced as well.
Wed, May 27
Circling back around to this. It looks like s51245__totoazero.maj_articles_recents is the major source of trouble. Timing-wise, I don't see any other issues.
Got it! I'll close the task since I can then add others if needed. Thank you very much!
@aborrero I presume you'll want access as well?
I made a quay.io user https://quay.io/user/brookestorm, but I'm still definitely interested in the answers to those questions.
@Chicocvenancio What are the terms on that registry? Is it free? It looks like somebody is paying for it.
Overall, it looks like a good place to keep things, and I'd like access. I just don't want to inadvertently cost someone money if the bill is going to someone other than the Foundation. I also don't want someone to suddenly stop paying for it if we keep depending on it :)
@daniel I'm all set to get this closed up, I'm just waiting on a final review of the patch. I think it's good to go. I'll test it locally first because it's really hard to fix sometimes if it goes badly.
Is there a method you recommend for recloning the tables, @Marostegui? I can't say I've done a single-table clone to a MySQL replica this decade 🙂.
Tue, May 26
@Chicocvenancio Good to hear! Overall, I haven't enabled the repo for Travis yet (nothing stops it from working as-is with the current file), because too many balls are in the air to automate it from there; I don't want to automatically break the existing cluster. The Travis config points back at the old repo location right now...and that's just fine until a couple more pieces are done! Thanks!
As this is now being pursued as a quarterly goal in T211096, with the effort to reuse much of the design and testing work done for Toolforge k8s, there will be significant updates here soon. I suspect we will be able to use the haproxy->ingress model of Toolforge (in Gio's model above it was nginx), using the existing front proxy to smooth the transition temporarily. So far, the cluster is all up and ready. We are close to sorting out the last steps of actually deploying a PAWS in parallel there. I'm also hoping that "paws beta" can become simply the beta namespace inside this cluster.
At this point, I'm just keeping this open until we've moved over the cluster. As long as we actually use what we've built so far, this is effectively done.
I think this should be unblocked, and the upgrade might work on the next try. We should probably depool control plane nodes before upgrading and then repool them, per https://v1-16.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/, since that is the newer procedure (in case that fixes anything my fix didn't; the thing I fixed would have stopped the upgrade no matter what). I don't think we need to worry about fussing with haproxy during the upgrade, because the tooling should all be compatible between the two versions. The big thing we must check before the tools upgrade is that all the objects created with old definitions still work on the upgraded cluster...presuming we get the upgrade rolling.
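Roughly the depool/repool dance per that doc (a sketch; the node name is a placeholder):
kubectl drain <control-plane-node> --ignore-daemonsets
# ...run the kubeadm/package upgrade on that node...
kubectl uncordon <control-plane-node>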
Ok, while that waits, PAWS just needs the ingress setup finished, testing that the roles and all that work for the pods it needs (ideally using all the upgraded images), and all of this deployed in the new cluster.
root@paws-k8s-control-3:~# kubectl get nodes
NAME                 STATUS   ROLES    AGE     VERSION
paws-k8s-control-1   Ready    master   28m     v1.16.10
paws-k8s-control-2   Ready    master   21m     v1.16.10
paws-k8s-control-3   Ready    master   12m     v1.16.10
paws-k8s-worker-1    Ready    <none>   3m22s   v1.16.10
paws-k8s-worker-2    Ready    <none>   2m26s   v1.16.10
paws-k8s-worker-3    Ready    <none>   105s    v1.16.10
paws-k8s-worker-4    Ready    <none>   31s     v1.16.10
It works now!
Ah, it was a copy/paste error in the hiera! Fixing.
It doesn't look like it is base64 in the config. I wonder why.
So we have progress! There's a new error that is particular to paws:
Running docker ps -a to get the container ID, then docker logs <hash>.
To be clear, this would prevent the api-server pod from starting after upgrade. I suspect that's exactly what caused the error you saw (partly because it is very similar to my kubeadm init error and because the pod cannot start with that value for a volume name).
@aborrero I think I know what is wrong in Toolsbeta. It is the same thing that I saw just now on paws. There is an error in the kubeadm config (which becomes the kubeadm configmap). The name of the extra volume needed for encryption and some other important config for the apiserver is wrong. I must have done this by mistake somewhere during that very long security eval. I made changes in place instead of rebuilding clusters, so I never saw the discrepancy.
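The fix is just giving that extra volume a proper name in the apiServer section of the kubeadm ClusterConfiguration; roughly this shape (a sketch, not our exact puppetized config, and the volume name here is made up):
apiServer:
  extraVolumes:
  - name: admission-config-dir    # must be a DNS-1123 label, not "/etc/kubernetes/admission"
    hostPath: /etc/kubernetes/admission
    mountPath: /etc/kubernetes/admission
    readOnly: true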
So the problem was:
May 26 15:46:53 paws-k8s-control-1 kubelet: E0526 15:46:53.546760 28450 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/kube-apiserver.yaml": invalid pod: [spec.volumes.name: Invalid value: "/etc/kubernetes/admission": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?') spec.containers.volumeMounts.name: Not found: "/etc/kubernetes/admission"]
So it's the puppetization somewhere. I'll dig that up.
I see there were psp changes around 1.16 https://github.com/kubernetes/kubernetes/pull/77792
That isn't likely to be our issue, but something to be aware of.