Phabricator

Create a "health check" for Kubernetes worker nodes which validates local Toolforge config
Open, Medium, Public

Description

It would be very helpful to have a DaemonSet or other mechanism which deployed a Pod on every Kubernetes worker node which then did some internal health checking of NFS mounts, NSS integration, and network routing from inside that node. This could be considered similar to our OpenStack full-stack checks or the "canary" instances we put on each hypervisor.
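A minimal sketch of what such a DaemonSet could look like. Everything here is illustrative (the image, the NFS path, the `getent` target, and the names are assumptions, not Toolforge's actual configuration); the point is just a pod on every node that exercises the NFS mount and NSS/sssd integration:

```yaml
# Hypothetical health-check DaemonSet: one pod per worker node that
# repeatedly exercises the NFS mount and NSS (sssd) lookups.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-health-check
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-health-check
  template:
    metadata:
      labels:
        app: node-health-check
    spec:
      containers:
      - name: checker
        image: debian:buster-slim   # illustrative base image
        command: ["/bin/sh", "-c"]
        args:
        - |
          while true; do
            # NFS check: fail fast if the mount is absent or hung (illustrative path)
            timeout 10 stat /data/project >/dev/null || echo "NFS check failed"
            # NSS/sssd check: can the node resolve a known service user? (illustrative name)
            getent passwd tools.admin >/dev/null || echo "NSS check failed"
            sleep 60
          done
        volumeMounts:
        - name: nfs
          mountPath: /data/project
          readOnly: true
      volumes:
      - name: nfs
        hostPath:
          path: /data/project
```

A real version would also need a way to act on a failure (taint the node, export a metric) rather than just logging, which is what the taint discussion below is about.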

Ideally it would catch things like T242559: Partialy setup tools-k8s-worker instances created by novaadmin causing problems, automatically mark the node as unschedulable, and alert Toolforge admins of the problem.

Event Timeline

From T242559#5798302

I might suggest we look at adding a taint to nodes that run webservice, one that only gets added when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without PuppetDB, and with Puppet in general, but it would be a way we could gate things at the end of a "checklist", e.g. a final step of the Puppet run that adds the taint. Unless it can be added via the kubelet API (something to look at).
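As a concrete sketch of the "checklist gate" idea (the node name and taint key here are hypothetical, and note that standard taint semantics repel pods, so the taint would mark a node as *not* ready and be removed once verified):

```shell
# Hypothetical: keep webservice pods off a freshly provisioned node
# until the provisioning checklist completes. Taint key is illustrative.
kubectl taint nodes tools-k8s-worker-1 toolforge.org/not-ready=:NoSchedule

# ... run the checklist: NFS mounts, sssd lookups, network routing ...

# Remove the taint once everything verifies (trailing "-" deletes a taint):
kubectl taint nodes tools-k8s-worker-1 toolforge.org/not-ready-
```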

This is basically what taints are for. We just need a way to instrument it...

https://v1-15.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

Got an idea here: --register-with-taints []api.Taint is a CLI option

We could gate webservice deploys on a taint that is added to the cli options via puppet.

That's the mechanism we use already to configure lots about the kubelet
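Sketch of how that flag could be wired up. `--register-with-taints` is a real kubelet option (taints the node at registration time); the taint key itself is illustrative, and Puppet would template it into however the kubelet is launched:

```shell
# Hypothetical kubelet invocation: the node registers already tainted,
# and something (Puppet, a checker) removes the taint once verified.
kubelet --register-with-taints=toolforge.org/unverified=:NoSchedule  # plus the usual flags
```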

TaintNodesByCondition is luckily enabled by default on 1.15, so basic conditions are checked on the new cluster (not on the old, btw). The one thing that "puppet works and contributes a full config" doesn't satisfy here is monitoring the current state of our special needs like sssd. A node-tainting daemonset might still be worth it from that perspective. (A very basic idea is https://github.com/uswitch/nidhogg, and then we'd just need a daemonset, running on all webservice nodes, that mounts things and connects to sssd.)

The other part, alerting Toolforge admins, I'm not so wild about, though. The whole point of clusters like this is that a problem shows up on a dashboard to be fixed when we can, rather than sending alerts that must be handled immediately. Webservice pods should not be scheduled on affected nodes, though, and they should be surfaced in some way (we can view them by taint with Prometheus). Just a brain dump. Running away now.
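For the "surface on a dashboard" side: assuming kube-state-metrics is being scraped, its `kube_node_spec_taint` metric exposes one series per node taint, so a query like this (with our hypothetical taint key) would list affected nodes without paging anyone:

```promql
# Nodes currently carrying the illustrative "unverified" taint
kube_node_spec_taint{key="toolforge.org/unverified"}
```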

Additionally: such a node-tainting daemonset (or any daemonset that notices a problem and applies a taint) would effectively drop the node from the pool without a "cordon", and our monitoring would need a bit more nuance.
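One way the taint-applying half could work, sketched here with assumptions clearly flagged: the Downward API (a real Kubernetes feature) exposes the node name to the pod, and a check script taints its own node on failure. The check command and taint key are hypothetical, and the pod's service account would need RBAC permission to patch nodes:

```yaml
# Hypothetical container fragment for the checker DaemonSet: expose the
# node name so the script can taint the node it runs on.
env:
- name: NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
# ... then in the check script (illustrative):
#   check_nfs_and_sssd || \
#     kubectl taint node "$NODE_NAME" toolforge.org/unhealthy=:NoSchedule --overwrite
```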

NOTE: As @Krenair pointed out on the other ticket this morning, my first blurted-out idea (referenced at T242637#5798354) really ought to use "labels and affinity", not taints. That is, a node becomes "Toolforge ready" when a label is affixed to it, and tool pods are pointed at nodes with the matching selector. Also note that node labels can be attached by Puppet on the kubelet command line in exactly the same way (https://v1-15.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/). I'd confused myself nicely :)
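The labels-and-affinity variant could look like this. The kubelet's `--node-labels` flag is real; the label key and value are illustrative, and webservice would need to emit the matching `nodeSelector` on tool pods:

```yaml
# Hypothetical: Puppet adds the label via the kubelet's CLI flags once the
# node passes its checklist:
#   kubelet --node-labels=toolforge.org/ready=true ...
#
# ...and webservice-generated pod specs select only labeled nodes:
spec:
  nodeSelector:
    toolforge.org/ready: "true"
```

Unlike a taint, a missing label keeps pods off by default, which matches the "opt in when verified" intent better.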