Phabricator

Create a "health check" for Kubernetes worker nodes which validates local Toolforge config
Open, Medium, Public

Description

It would be very helpful to have a DaemonSet or other mechanism which deployed a Pod on every Kubernetes worker node which then did some internal health checking of NFS mounts, NSS integration, and network routing from inside that node. This could be considered similar to our OpenStack full-stack checks or the "canary" instances we put on each hypervisor.
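A minimal sketch of what such a DaemonSet could look like. Everything here is illustrative (the image, the NFS path, the `getent` target, and the names are assumptions, not Toolforge's actual configuration); the point is just a pod on every node that exercises the NFS mount and NSS/sssd integration:

```yaml
# Hypothetical health-check DaemonSet: one pod per worker node that
# repeatedly exercises the NFS mount and NSS (sssd) lookups.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-health-check
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-health-check
  template:
    metadata:
      labels:
        app: node-health-check
    spec:
      containers:
      - name: checker
        image: debian:buster-slim   # illustrative base image
        command: ["/bin/sh", "-c"]
        args:
        - |
          while true; do
            # NFS check: fail fast if the mount is absent or hung (illustrative path)
            timeout 10 stat /data/project >/dev/null || echo "NFS check failed"
            # NSS/sssd check: can the node resolve a known service user? (illustrative name)
            getent passwd tools.admin >/dev/null || echo "NSS check failed"
            sleep 60
          done
        volumeMounts:
        - name: nfs
          mountPath: /data/project
          readOnly: true
      volumes:
      - name: nfs
        hostPath:
          path: /data/project
```

A real version would also need a way to act on a failure (taint the node, export a metric) rather than just logging, which is what the taint discussion below is about.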

Ideally it would catch things like T242559: Partialy setup tools-k8s-worker instances created by novaadmin causing problems, automatically mark the node as unschedulable, and alert Toolforge admins of the problem.

Event Timeline

From T242559#5798302

I might suggest we look at adding a taint to nodes that run webservice, one that only gets added when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without PuppetDB, and with Puppet in general, but it would be a way we could gate things at the end of a "checklist", e.g. a final step of the Puppet run that adds the taint. Unless it can be added via the kubelet API (something to look at).
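As a concrete sketch of the "checklist gate" idea (the node name and taint key here are hypothetical, and note that standard taint semantics repel pods, so the taint would mark a node as *not* ready and be removed once verified):

```shell
# Hypothetical: keep webservice pods off a freshly provisioned node
# until the provisioning checklist completes. Taint key is illustrative.
kubectl taint nodes tools-k8s-worker-1 toolforge.org/not-ready=:NoSchedule

# ... run the checklist: NFS mounts, sssd lookups, network routing ...

# Remove the taint once everything verifies (trailing "-" deletes a taint):
kubectl taint nodes tools-k8s-worker-1 toolforge.org/not-ready-
```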

This is basically what taints are for. We just need a way to instrument it...

https://v1-15.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

Got an idea here: --register-with-taints []api.Taint is a CLI option

We could gate webservice deploys on a taint that is added to the cli options via puppet.

That's the mechanism we use already to configure lots about the kubelet
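Sketch of how that flag could be wired up. `--register-with-taints` is a real kubelet option (taints the node at registration time); the taint key itself is illustrative, and Puppet would template it into however the kubelet is launched:

```shell
# Hypothetical kubelet invocation: the node registers already tainted,
# and something (Puppet, a checker) removes the taint once verified.
kubelet --register-with-taints=toolforge.org/unverified=:NoSchedule  # plus the usual flags
```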

TaintNodesByCondition is luckily enabled by default on 1.15, so basic conditions are checked on the new cluster (not on the old, btw). The one thing that "puppet works and contributes a full config" doesn't satisfy here is monitoring the current state of our special needs like sssd. A node-tainting daemonset might still be worth it from that perspective. (A very basic idea is https://github.com/uswitch/nidhogg, and then we'd just need a daemonset, running on all webservice nodes, that mounts things and connects to sssd.)

The other part, alerting Toolforge admins, I'm not so wild about, though. The whole point of clusters like this is that a problem shows up on a dashboard to be fixed when we can, rather than sending alerts that must be handled immediately. Webservice pods should not be scheduled on affected nodes, though, and they should be surfaced in some way (we can view them by taint with Prometheus). Just a brain dump. Running away now.
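For the "surface on a dashboard" side: assuming kube-state-metrics is being scraped, its `kube_node_spec_taint` metric exposes one series per node taint, so a query like this (with our hypothetical taint key) would list affected nodes without paging anyone:

```promql
# Nodes currently carrying the illustrative "unverified" taint
kube_node_spec_taint{key="toolforge.org/unverified"}
```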

Additionally: such a node-tainting daemonset (or any daemonset that notices a problem and applies a taint) would effectively drop the node from the pool without a "cordon", and our monitoring would need a bit more nuance.
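One way the taint-applying half could work, sketched here with assumptions clearly flagged: the Downward API (a real Kubernetes feature) exposes the node name to the pod, and a check script taints its own node on failure. The check command and taint key are hypothetical, and the pod's service account would need RBAC permission to patch nodes:

```yaml
# Hypothetical container fragment for the checker DaemonSet: expose the
# node name so the script can taint the node it runs on.
env:
- name: NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
# ... then in the check script (illustrative):
#   check_nfs_and_sssd || \
#     kubectl taint node "$NODE_NAME" toolforge.org/unhealthy=:NoSchedule --overwrite
```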

NOTE: As @Krenair pointed out on the other ticket this morning, my first blurted-out idea (referenced at T242637#5798354) really ought to use "labels and affinity", not taints. That is, a node becomes "Toolforge ready" when a label is affixed to it, and tool pods are pointed at nodes with the matching selector. Also note that node labels can be attached by Puppet on the kubelet command line in exactly the same way (https://v1-15.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/). I'd confused myself nicely :)
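The labels-and-affinity variant could look like this. The kubelet's `--node-labels` flag is real; the label key and value are illustrative, and webservice would need to emit the matching `nodeSelector` on tool pods:

```yaml
# Hypothetical: Puppet adds the label via the kubelet's CLI flags once the
# node passes its checklist:
#   kubelet --node-labels=toolforge.org/ready=true ...
#
# ...and webservice-generated pod specs select only labeled nodes:
spec:
  nodeSelector:
    toolforge.org/ready: "true"
```

Unlike a taint, a missing label keeps pods off by default, which matches the "opt in when verified" intent better.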