
Partially set up tools-k8s-worker instances created by novaadmin causing problems
Closed, Resolved · Public

Description

Creating this based on what I found while investigating https://lists.wikimedia.org/pipermail/cloud/2020-January/000941.html - key points in bold
We used to have 5 worker nodes for this new k8s cluster. Today when I went to look into this issue of pods stuck in ContainerCreating I found events saying /data/project was missing from the host. I looked at the host and found it was a new one - tools-k8s-worker-6. I quickly determined that profile::wmcs::nfsclient should have created the /data/project symlink to the NFS mount (which did exist), but that puppet had the type of cert issue you see on new hosts in projects which use their own puppetmasters.
I also found that there was not just one new worker instance, but workers 6 through 14 had been created. I checked the first few and they were all created by novaadmin. I'm not sure it's ever valid for novaadmin to be creating instances.
I went through the instances missing the /data/project symlink (-6, -7, -8, -13, -14) and fixed their link to the tools-puppetmaster. Four other instances (-9, -10, -11, -12) somehow already had the symlink but had broken puppet; I left those alone and don't know whether we consider them to be working or not.
While fixing the instances' puppet setup I saw that sudo puppet cert list on tools-puppetmaster-01 is a mess and contains some names that should not even exist (entries ending in {, entries beginning with host-172-16, one literally just for .tools.eqiad.wmflabs, and one with a double full stop - tools-worker-1005..eqiad.wmflabs).
We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet
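For illustration, a minimal sketch (plain Python, not the actual profile::wmcs::nfsclient logic) of the kind of node-local check that would have caught the missing symlink before pods were scheduled:

```python
#!/usr/bin/env python3
# Hypothetical node-local check, not existing Toolforge code: verify that
# /data/project exists, is a symlink, and resolves to a mounted directory.
import os
import sys

SYMLINK = "/data/project"


def check_project_symlink(path: str = SYMLINK) -> bool:
    if not os.path.islink(path):
        print(f"{path} is missing or not a symlink", file=sys.stderr)
        return False
    target = os.path.realpath(path)
    if not os.path.isdir(target):
        print(f"{path} -> {target} is not a directory", file=sys.stderr)
        return False
    # Assumes the symlink points directly at the NFS mount point; if it
    # points at a subdirectory this mount check would need adjusting.
    if not os.path.ismount(target):
        print(f"{target} is not a mount point (NFS missing?)", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    sys.exit(0 if check_project_symlink() else 1)
```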

Event Timeline

Krenair created this task. · Sun, Jan 12, 11:23 PM
Restricted Application added a subscriber: Aklapper. · Sun, Jan 12, 11:23 PM
Krenair updated the task description. · Sun, Jan 12, 11:32 PM
Krenair updated the task description.
aborrero triaged this task as Medium priority. · Mon, Jan 13, 10:36 AM
aborrero added subscribers: bd808, aborrero.

These instances were created by @bd808, as reported in the SAL (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL), on 2020-01-07.
They were created using a script that you can find here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Deploying_k8s#worker_nodes

This is just to add some context.

If the VM instances are now fully working, with puppet, NFS etc. happy, then I don't see anything actionable in this task, other than double-checking/making sure puppet runs next time?

bd808 renamed this task from "Unpuppetised tools-k8s-worker instances created by novaadmin causing problems" to "Partially set up tools-k8s-worker instances created by novaadmin causing problems". · Mon, Jan 13, 3:54 PM
bd808 added a comment. · Mon, Jan 13, 4:00 PM

We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet

Because I only quickly checked them before joining them to the cluster, and obviously did not do a very good job of that.

contains some names that should not even exist

That seems likely to be fallout from the two sets of instances I tried to make using Horizon before determining that I would have to write a custom bulk-create script. I was trying to make 9 additional instances using the Horizon feature that lets you pick a flavor, image, etc. and then a count of instances. This was apparently how the initial tools-k8s-worker-[1-5] instances were built, and I was hoping it was repeatable. It is not.
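For context, a rough sketch of what such a bulk-create loop can look like with openstacksdk; this is not the wikitech script linked above, and the cloud entry, image, flavor, and network names below are invented examples:

```python
# Rough sketch (not the actual wikitech script) of a bulk-create loop using
# openstacksdk; the cloud entry, image, flavor, and network names are made up.
import openstack

conn = openstack.connect(cloud="tools")  # hypothetical clouds.yaml entry

image = conn.compute.find_image("debian-10.0-buster")            # assumed image name
flavor = conn.compute.find_flavor("g2.cores4.ram8.disk80")       # assumed flavor name
network = conn.network.find_network("lan-flat-cloudinstances2b")  # assumed network name

for i in range(6, 15):  # tools-k8s-worker-6 .. tools-k8s-worker-14
    name = f"tools-k8s-worker-{i}"
    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # Block until the instance is ACTIVE before creating the next one, so a
    # failure is noticed immediately rather than after nine broken VMs exist.
    conn.compute.wait_for_server(server)
    print(f"created {name}")
```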

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:26:20Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559

bd808 claimed this task. · Mon, Jan 13, 4:27 PM
bd808 raised the priority of this task from Medium to High.
bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:31:03Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559

Bstorm added a subscriber: Bstorm. · Mon, Jan 13, 4:33 PM

We should also determine how it was possible for an instance to get as far as registering itself as a node in k8s, and being considered healthy for pods to be scheduled, without even having working puppet

This seems perfectly normal to me. A node doesn't need any particular config other than the right packages, certs, and communication with the control plane, which is primarily a kubeadm thing. On the other hand, I might suggest we look at adding a taint to nodes that run webservice, one that only gets added once we are sure a node is ready to run a webservice process. Automating such a taint is tricky without puppetdb, and with puppet in general, but it would be a way we could gate things at the end of a "checklist", if you will, for example a step at the end of the puppet run that adds the taint. Unless it can be added via the kubelet API (something to look at).
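For illustration, a rough sketch of what setting such a taint could look like with the Kubernetes Python client; the taint key is invented and this is not an existing Toolforge mechanism:

```python
# Sketch only: the taint key "example.org/not-ready" is invented, and the
# idea of running this from the end of a puppet run is the suggestion above,
# not existing code.
from kubernetes import client, config

TAINT_KEY = "example.org/not-ready"


def set_not_ready_taint(node_name: str, tainted: bool) -> None:
    config.load_kube_config()  # or load_incluster_config() when run on-cluster
    v1 = client.CoreV1Api()

    node = v1.read_node(node_name)
    # Keep every taint except ours, then re-add ours if requested.
    taints = [t for t in (node.spec.taints or []) if t.key != TAINT_KEY]
    if tainted:
        taints.append(client.V1Taint(key=TAINT_KEY, value="true", effect="NoSchedule"))
    # Patch the node's taint list; pods without a matching toleration will
    # not be scheduled onto the node while the taint is present.
    v1.patch_node(node_name, {"spec": {"taints": taints}})


# e.g. clear the taint once the checks at the end of a puppet run pass:
# set_not_ready_taint("tools-k8s-worker-6", tainted=False)
```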

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:33:49Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T16:42:27Z] <bd808> Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559

bd808 added a comment. · Mon, Jan 13, 4:56 PM

Puppet is happy on tools-k8s-worker-{9,10,11,12} now. They were rebooted after fixing puppet and have been uncordoned as well.
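(For reference, the cordon/uncordon steps in the SAL entries above are just kubectl cordon / kubectl uncordon; a rough Python-client equivalent, using the node names from this task, might look like this.)

```python
# Rough equivalent of `kubectl cordon` / `kubectl uncordon` via the
# Kubernetes Python client; node names are the ones from this task.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()


def set_cordoned(node_name: str, cordoned: bool) -> None:
    # Cordoning only flips spec.unschedulable: running pods stay where they
    # are, but no new pods get scheduled onto the node while it is cordoned.
    v1.patch_node(node_name, {"spec": {"unschedulable": cordoned}})


for n in (9, 10, 11, 12):
    set_cordoned(f"tools-k8s-worker-{n}", False)  # uncordon after the fix and reboot
```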

bd808 closed this task as Resolved. · Mon, Jan 13, 5:54 PM

I think this mess is cleaned up. T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config is my attempt at a task that would ideally catch this type of problem automatically instead of waiting for users to stumble across broken nodes and report them.

On the other hand, I might suggest we look at adding a taint to nodes that run webservice, one that only gets added once we are sure a node is ready to run a webservice process

This feels icky - such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around: nodes are tainted until we know they are able to run webservice and co.
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services; it seems pretty fundamental to the tools project. Shouldn't it be available on all nodes all the time?

This feels icky - such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around: nodes are tainted until we know they are able to run webservice and co.
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services; it seems pretty fundamental to the tools project. Shouldn't it be available on all nodes all the time?

Doh! Yeah, I didn't mean "taint", I meant "label", with an affinity attached to webservice. Yes, a taint would need to be applied when a problem is detected.
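A rough sketch of that label-plus-affinity idea (the label key and container image below are invented for illustration, not existing Toolforge config): the node only gets the label once it is known-good, and webservice pods select for it, so other workloads are unaffected.

```python
# Sketch of the label + affinity idea; "example.org/webservice-ready" and the
# container image are invented for illustration, not existing Toolforge config.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# 1) Label a node only once we are confident it can serve webservice pods.
v1.patch_node(
    "tools-k8s-worker-6",
    {"metadata": {"labels": {"example.org/webservice-ready": "true"}}},
)

# 2) Webservice pods then require the label via a nodeSelector; pods that do
#    not care about the label still schedule anywhere, so nodes are not
#    "reserved" the way a taint/toleration scheme would reserve them.
webservice_pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="webservice", image="example/webservice:latest")],
    node_selector={"example.org/webservice-ready": "true"},
)
```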