
Partially set up tools-k8s-worker instances created by novaadmin causing problems
Closed, Resolved · Public

Description

Creating this based on what I found investigating https://lists.wikimedia.org/pipermail/cloud/2020-January/000941.html - key points in bold
We used to have 5 worker nodes for this new k8s cluster. Today when I went to look into this issue of pods stuck in ContainerCreating I found events saying /data/project was missing from the host. I looked at the host and found it was a new one - tools-k8s-worker-6. I quickly determined that profile::wmcs::nfsclient should have created the /data/project symlink to the NFS mount (which did exist), but that puppet had the type of cert issue you see on new hosts in projects which use their own puppetmasters.
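For reference, the quick checks on the broken worker were along these lines (details from memory, so not exact):
  # The symlink that profile::wmcs::nfsclient should manage was absent
  ls -ld /data/project
  # ...while the underlying NFS share itself was mounted
  findmnt -t nfs,nfs4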
I also found that there was not just one new worker instance, but workers 6 through 14 had been created. I checked the first few and they were all created by novaadmin. I'm not sure it's ever valid for novaadmin to be creating instances.
I went through the instances missing the /data/project symlink (-6, -7, -8, -13, -14) and fixed their link to the tools-puppetmaster. Four other instances (-9, -10, -11, -12) somehow already had the symlink but had broken puppet; I left those alone and don't know whether we consider them to be working or not.
While fixing the instances' puppet communication I saw that the output of sudo puppet cert list on tools-puppetmaster-01 is a mess and contains some names that should not even exist (entries ending in {, entries beginning with host-172-16, one literally just .tools.eqiad.wmflabs, and one with a double full stop - tools-worker-1005..eqiad.wmflabs)
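For the record, the fix on each broken worker, plus the eventual cleanup of bogus entries on the puppetmaster, is roughly the following; the ssl directory path depends on how puppet is packaged, and the puppet cert subcommand assumes the pre-Puppet-6 CLI, with the worker FQDN shown only as an example:
  # On the broken worker: discard the stale local cert and re-run the agent
  sudo rm -rf /var/lib/puppet/ssl
  sudo puppet agent --test
  # On tools-puppetmaster-01: sign the fresh request if it is not autosigned
  sudo puppet cert sign tools-k8s-worker-6.tools.eqiad.wmflabs
  # ...and list/clean bogus entries, e.g. the double-dot one mentioned above
  sudo puppet cert list --all
  sudo puppet cert clean tools-worker-1005..eqiad.wmflabs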
We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet

Event Timeline

Krenair updated the task description.
aborrero triaged this task as Medium priority. Jan 13 2020, 10:36 AM
aborrero added subscribers: bd808, aborrero.

These instances were created by @bd808, as reported in the SAL (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL) on 2020-01-07.
They were created using a script that you can find here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Deploying_k8s#worker_nodes
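For anyone not following the link: the script is basically a loop over openstack server create. A rough sketch of its shape, with placeholder flavor/image/network names (the version on the wiki page is the authoritative one):
  for i in $(seq 6 14); do
    openstack server create \
      --flavor "<worker-flavor>" \
      --image "<base-image>" \
      --network "<tools-network>" \
      --wait \
      "tools-k8s-worker-${i}"
  done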

This is to try to add some context.

If the VM instances are now fully working, with puppet and NFS happy etc., then I don't see anything actionable in this task, other than double-checking/making sure puppet runs next time?

bd808 renamed this task from Unpuppetised tools-k8s-worker instances created by novaadmin causing problems to Partially set up tools-k8s-worker instances created by novaadmin causing problems. Jan 13 2020, 3:54 PM

We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet

Because I quickly checked them before joining them to the cluster, but obviously did not do a very good job of that.

contains some names that should not even exist

That seems likely to be fallout from the 2 sets of instances I tried to make using Horizon before determining that I would have to write a custom bulk-create script. I was trying to make 9 additional instances using the feature of Horizon which allows you to pick a flavor, image, etc. and then a count of instances. This was apparently how the initial tools-k8s-worker-[1-5] instances were built and I was hoping that it was repeatable. It is not.

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:26:20Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559

bd808 raised the priority of this task from Medium to High.
bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:31:03Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559

We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet

This seems perfectly normal to me. A node doesn't have to have any particular config other than the right packages, certs, and communication with the control plane, which is primarily a kubeadm thing. On the other hand, I might suggest we look at adding a taint to nodes that run webservice, one that only gets added when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without puppetdb, and with puppet in general, but it would be a way we could gate things at the end of a "checklist", if you will, like a final step at the end of the puppet run to add the taint. Unless it can be added via the kubelet API (something to look at).
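To make the mechanics concrete (the taint key here is invented purely for illustration):
  # Taint a node so that pods without a matching toleration will not schedule there
  kubectl taint nodes tools-k8s-worker-9 example.com/webservice-ready=false:NoSchedule
  # Remove the taint again once we are happy with the node (trailing "-" removes it)
  kubectl taint nodes tools-k8s-worker-9 example.com/webservice-ready:NoSchedule-
  # The kubelet can also register itself with taints already set, via its
  # --register-with-taints flag, which might be the cleanest place to gate this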

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:33:49Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:42:27Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559

Puppet is happy on tools-k8s-worker-{9,10,11,12} now. They were rebooted after fixing puppet and have been uncordoned as well.
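For the record, the per-node sequence logged above was roughly:
  # Stop new pods from being scheduled onto the node
  kubectl cordon tools-k8s-worker-9
  # On the node itself: fix the cert situation, re-run puppet, then reboot
  sudo puppet agent --test
  sudo reboot
  # Once the node is back up and healthy, let the scheduler use it again
  kubectl uncordon tools-k8s-worker-9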

I think this mess is cleaned up. T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config is my attempt at a task that would ideally catch this type of problem automatically instead of waiting for users to stumble across broken nodes and report them.

On the other hand, I might suggest we look at adding a taint to nodes that run webservice that only gets added when we are sure a node is ready to run a webservice process

This feels icky - such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around - nodes are tainted until we know they are able to run webservice and co.
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services; it seems pretty fundamental to the tools project - shouldn't it be available on all nodes all the time?

This feels icky - such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around - nodes are tainted until we know they are able to run webservice and co.
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services; it seems pretty fundamental to the tools project - shouldn't it be available on all nodes all the time?

Doh! Yeah, I didn't mean "taint", I meant "label", with an affinity affixed to webservice. Yes, a taint would need to be applied when a problem is detected.
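In label terms the gate would look roughly like this; the label/taint keys are invented for illustration, and webservice-generated pods would need the matching selector added on the Toolforge side:
  # Mark a node as verified ready for webservice pods
  kubectl label nodes tools-k8s-worker-9 example.com/webservice-ready=true
  # webservice pods would then carry a matching nodeSelector (or nodeAffinity), e.g.
  #   nodeSelector:
  #     example.com/webservice-ready: "true"
  # And a taint would still be applied when a problem is detected:
  kubectl taint nodes tools-k8s-worker-9 example.com/broken=true:NoSchedule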