Creating this based on what I found while investigating https://lists.wikimedia.org/pipermail/cloud/2020-January/000941.html - key points in bold
We used to have 5 worker nodes in this new k8s cluster. Today, while looking into the issue of pods stuck in ContainerCreating, I found events saying /data/project was missing from the host. The host turned out to be a new one, tools-k8s-worker-6. I quickly determined that profile::wmcs::nfsclient should have created the /data/project symlink to the NFS mount (which did exist), but that puppet had the type of cert issue you see on new hosts in projects which use their own puppetmaster.
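A minimal sketch of the shape profile::wmcs::nfsclient is expected to produce: a /data/project symlink pointing into the NFS mount. The real paths on Toolforge workers may differ, so this demo builds the same layout in a scratch directory and then runs the check that was failing on the new workers.

```shell
# Build a stand-in for the NFS mount and the /data/project symlink in a
# scratch directory, then verify the link exists and resolves.
tmp=$(mktemp -d)
mkdir -p "$tmp/mnt/nfs/project"                   # stand-in for the NFS mount
ln -s "$tmp/mnt/nfs/project" "$tmp/data-project"  # stand-in for /data/project

# The condition the broken workers failed: symlink present AND resolving
if [ -L "$tmp/data-project" ] && [ -d "$tmp/data-project" ]; then
  echo "symlink present and resolves"
else
  echo "symlink missing or dangling" >&2
fi
rm -rf "$tmp"
```

On a host where puppet never ran, the `ln -s` step simply never happened, which is why the kubelet reported /data/project as missing.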
I also found that there was not just one new worker instance, but workers 6 through 14 had been created. I checked the first few and they were all created by novaadmin. I'm not sure it's ever valid for novaadmin to be creating instances.
I went through the instances missing the /data/project symlink (-6, -7, -8, -13, -14) and fixed their connection to tools-puppetmaster. The other 4 instances (-9, -10, -11, -12) somehow already had the symlink but had broken puppet; I left those alone and don't know whether we consider them to be working or not.
While fixing the instances' puppet I saw that `sudo puppet cert list` on tools-puppetmaster-01 is a mess and contains some names that should not even exist: names ending in `{`, names beginning with host-172-16, one that is literally just .tools.eqiad.wmflabs, and one with a double full stop (tools-worker-1005..eqiad.wmflabs).
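The bogus names follow recognizable patterns, so a simple filter over the name column of `puppet cert list --all` output could flag them for cleanup (e.g. with `puppet cert clean <name>`). The input here is a canned sample based on the examples above, not output from a live puppetmaster, and the patterns are assumptions.

```shell
# Flag suspicious puppet cert names: double dots, host-172-16 prefixes,
# names starting with a bare dot, or stray '{' characters.
# On a real puppetmaster, pipe in the name column of `puppet cert list --all`.
printf '%s\n' \
  'tools-worker-1005..eqiad.wmflabs' \
  'host-172-16-1-8.tools.eqiad.wmflabs' \
  '.tools.eqiad.wmflabs' \
  'tools-k8s-worker-6.tools.eqiad.wmflabs' \
| grep -E '\.\.|^host-172-16|^\.|\{'
```

The last (valid) name passes through untouched; only the three malformed ones are printed.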
We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet
Description
| Status | Subtype | Assigned | Task |
| --- | --- | --- | --- |
| Invalid | | Bstorm | T242632 Apparent issues in Toolforge Kubernetes |
| Resolved | | bd808 | T242559 Partialy setup tools-k8s-worker instances created by novaadmin causing problems |
| Resolved | | bd808 | T242642 Cleanup unsigned puppet client certs on tools-puppetmaster-01 |
Event Timeline
These instances were created by @bd808, as reported in the SAL (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL) on 2020-01-07.
They were created using a script that you can find here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Deploying_k8s#worker_nodes
This is just to add some context.
If the VM instances are now fully working, with puppet, NFS, etc. happy, then I don't see anything actionable in this task, other than double-checking / making sure puppet runs next time?
> We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet
Because I quickly checked them before joining them to the cluster, but obviously did not do a very good job of that.
> contains some names that should not even exist
That is likely fallout from the 2 sets of instances I tried to make using Horizon before determining that I would have to write a custom bulk-create script. I was trying to make 9 additional instances using the Horizon feature that lets you pick a flavor, image, etc. and then a count of instances. This was apparently how the initial tools-k8s-worker-[1-5] instances were built, and I was hoping that it was repeatable. It is not.
Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:26:20Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559
Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:31:03Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559
> We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet
This seems perfectly normal to me. A node doesn't need any particular config other than the right packages, certs, and communication with the control plane, which is primarily a kubeadm thing. On the other hand, I might suggest we look at adding a taint to nodes that run webservice, applied only when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without puppetdb (and with puppet in general), but it would be a way we could gate things at the end of a "checklist", if you will, like a message at the end of a puppet run that adds the taint. Unless it can be added via the kubelet API (something to look at).
Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:33:49Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559
Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:42:27Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559
Puppet is happy on tools-k8s-worker-{9,10,11,12} now. They were rebooted after fixing puppet and have been uncordoned as well.
I think this mess is cleaned up. T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config is my attempt at a task that would ideally catch this type of problem automatically instead of waiting for users to stumble across broken nodes and report them.
This feels icky: such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around: nodes are tainted until we know they are able to run webservice and co.
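A rough sketch of the "tainted until ready" idea: run the local checks at the end of a puppet run and only then clear a bootstrap taint. The taint key and the kubectl invocation are invented for illustration (this is not an existing Toolforge mechanism), and the command is printed rather than executed so the sketch runs anywhere.

```shell
#!/bin/sh
# Hypothetical end-of-puppet gate. If local checks pass, emit the kubectl
# command that would clear an assumed bootstrap taint; otherwise leave it.
node="$(hostname -f)"
ok=yes
[ -L /data/project ] || ok=no   # the check that failed in this incident
if [ "$ok" = yes ]; then
  echo "kubectl taint node $node example.org/bootstrapping:NoSchedule-"
else
  echo "checks failed on $node; leaving bootstrap taint in place"
fi
```

Nodes would be created with the taint already set (e.g. via kubelet's --register-with-taints), so a half-built worker could join the cluster without ever being schedulable.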
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services, it seems pretty fundamental to the tools project - shouldn't it be available on all nodes all the time?
Doh! Yeah, I didn't mean "taint", I meant "label", with an affinity affixed to webservice. Yes, a taint would need to be applied when a problem is detected.
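For the label-plus-affinity variant, the shape would be roughly the following. The label key `toolforge.org/webservice-ready` is invented for illustration, not an existing Toolforge convention; this is a sketch of the mechanism, not the actual webservice pod spec.

```yaml
# A node gets the (hypothetical) label once it passes its checks:
#   kubectl label node tools-k8s-worker-6 toolforge.org/webservice-ready="true"
# webservice pods would then require it via node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: example-webservice
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: toolforge.org/webservice-ready
                operator: In
                values: ["true"]
  containers:
    - name: webservice
      image: example/webservice:latest
```

Unlike a taint, a missing label only affects pods that ask for it, so non-webservice pods would still schedule onto unlabeled nodes.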