
tools-worker-1022 k8s duplicate node
Closed, Resolved · Public

Description

The k8s tools checker went critical with

HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string 'OK' not found on 'http://checker.tools.wmflabs.org:80/k8s/nodes/ready'
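
For reference, the endpoint the checker probes can be queried by hand; the check just looks for the string 'OK' in the response body:

$ curl -i http://checker.tools.wmflabs.org/k8s/nodes/ready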

Looks like this is related to a bad puppet agent run that stripped the tools subdomain from the FQDN.

Jul  7 03:42:46 tools-worker-1022 puppet-agent[17280]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) -KUBELET_HOSTNAME="--hostname-override=tools-worker-1022.tools.eqiad.wmflabs"
Jul  7 03:42:46 tools-worker-1022 puppet-agent[17280]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +KUBELET_HOSTNAME="--hostname-override=tools-worker-1022.eqiad.wmflabs"
...
Jul  7 04:11:43 tools-worker-1022 puppet-agent[9226]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) -KUBELET_HOSTNAME="--hostname-override=tools-worker-1022.eqiad.wmflabs"
Jul  7 04:11:43 tools-worker-1022 puppet-agent[9226]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +KUBELET_HOSTNAME="--hostname-override=tools-worker-1022.tools.eqiad.wmflabs"
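
To confirm which name the kubelet is currently configured to register with, the override can be checked directly on the host (the value below is simply what the second puppet run above put back):

$ grep hostname-override /etc/default/kubelet
KUBELET_HOSTNAME="--hostname-override=tools-worker-1022.tools.eqiad.wmflabs"
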
$ kubectl get nodes | grep tools-worker-1022 
tools-worker-1022.eqiad.wmflabs         NotReady                   1h
tools-worker-1022.tools.eqiad.wmflabs   Ready                      2y

The node registered under the bad hostname is marked NotReady with the reason:
kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy
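
The same condition and reason can be pulled up from the stale node object directly, e.g.:

$ kubectl describe node tools-worker-1022.eqiad.wmflabs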

I've acked the alert and I'm leaving this bad node entry in place for further investigation.

Event Timeline

Looks like this was caused by DNS testing that was happening on cloudservices1003. Based on the logs, the only way I can see the FQDN changing is via the sequence below.

The process used to determine the FQDN in Cloud Services looks like this:

  1. facter uses hostname -f to determine the FQDN
  2. hostname -f calls getaddrinfo()
  3. getaddrinfo() returns the first match of the short hostname combined with a search domain, or just the short name if no record is found (a quick shell reproduction is sketched below)
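
A rough way to reproduce that lookup path by hand: getent ahosts goes through getaddrinfo() with the same resolv.conf search list that hostname -f uses (illustrative commands, not a capture from the host):

$ hostname -f
$ getent ahosts "$(hostname -s)"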

resolv.conf is configured with

domain tools.eqiad.wmflabs
search tools.eqiad.wmflabs eqiad.wmflabs

(note that domain is technically a search entry)

This host has A records in both search domains:

tools-worker-1022.eqiad.wmflabs has address 172.16.4.193
tools-worker-1022.tools.eqiad.wmflabs has address 172.16.4.193

Due to PDNS restarts, getaddrinfo() missed the host in the first search domain (tools.eqiad.wmflabs) and found it in the second (eqiad.wmflabs). Puppet then picked up the different FQDN and rewrote /etc/mailname and /etc/default/kubelet.

To prevent this from happening in the future we might want to consider pinning the FQDN in /etc/hosts, or adding a check within puppet that confirms the FQDN is what we expect before applying any configuration.
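
A minimal sketch of the /etc/hosts pinning (address taken from the records above), assuming the usual "hosts: files dns" ordering in /etc/nsswitch.conf so getaddrinfo() is answered from the file before DNS is ever consulted:

172.16.4.193    tools-worker-1022.tools.eqiad.wmflabs tools-worker-1022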

@aborrero @Bstorm Is there anything else we need or want to check before deleting the bad node?

NOTE: I've been lazy and just deleted the bad node entry in the past.
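
For the record, the removal itself is just the standard node delete against the stale name:

$ kubectl delete node tools-worker-1022.eqiad.wmflabs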

That said, I recall the old names without the project subdomain were maintained in /etc/hosts files via puppet, to keep ancient tooling from breaking after the project subdomains were added. I thought we'd eliminated that from puppet after we upgraded the grid, but I wonder whether those files or that puppet code still exist somewhere for the old k8s nodes. Maybe we should check that angle? Otherwise, I tend to think this should have stopped happening.

Same, in the past I just deleted the bogus nodes.