toolforge: k8s alert about nodes not healthy
Closed, ResolvedPublic

Description

Today we got paged by Icinga:

PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.140 second response time

I checked on tools-k8s-master-01.eqiad.wmflabs and some worker nodes had issues:

aborrero@tools-k8s-master-01:~$ sudo kubectl get nodes -o wide
NAME                                    STATUS                        AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1003.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1004.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1007.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1008.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1009.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1010.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1011.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1012.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1013.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1014.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1015.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1016.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1017.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1018.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1019.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1020.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1021.tools.eqiad.wmflabs   NotReady                      1y
tools-worker-1022.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1023.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1025.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1026.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1027.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1028.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y
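A node list like the one above can be narrowed down to just the unhealthy entries. A minimal sketch, shown here against a canned sample; on the master you would pipe `sudo kubectl get nodes --no-headers` into the same awk instead:

```shell
# Canned sample standing in for live `kubectl get nodes --no-headers` output.
sample='tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y'

# Column 2 is STATUS; print the names of nodes whose status contains NotReady.
# (Plain "Ready" and "Ready,SchedulingDisabled" do not match /NotReady/.)
printf '%s\n' "$sample" | awk '$2 ~ /NotReady/ {print $1}'
```

This prints only tools-worker-1006 and tools-worker-1029 from the sample.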

After several minutes, everything was back to a normal state (except tools-worker-1029, which remained NotReady):

aborrero@tools-k8s-master-01:~$ sudo kubectl get nodes -o wide
NAME                                    STATUS                        AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1003.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1004.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1007.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1008.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1009.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1010.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1011.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1012.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1013.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1014.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1015.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1016.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1017.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1018.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1019.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1020.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1021.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1022.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1023.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1025.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1026.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1027.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1028.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y

I'm not sure what's the matter with these SchedulingDisabled nodes, and I can't SSH into tools-worker-1029.tools.eqiad.wmflabs.

When I was testing things, I also restarted the checker service:

aborrero@tools-checker-01:~$ sudo service toolschecker_kubernetes_nodes_ready restart
toolschecker_kubernetes_nodes_ready stop/waiting
toolschecker_kubernetes_nodes_ready start/running, process 23148

But something happened somewhere (the proxy?) and now I can't access the checker:

arturo@endurance:~$ LANG=C wget https://checker.tools.wmflabs.org/k8s/nodes/ready
--2018-06-12 11:10:49--  https://checker.tools.wmflabs.org/k8s/nodes/ready
Resolving checker.tools.wmflabs.org (checker.tools.wmflabs.org)... 208.80.155.229
Connecting to checker.tools.wmflabs.org (checker.tools.wmflabs.org)|208.80.155.229|:443... failed: Connection refused.

aborrero@tools-clushmaster-01:~$ wget https://checker.tools.wmflabs.org/k8s/nodes/ready
--2018-06-12 09:12:19--  https://checker.tools.wmflabs.org/k8s/nodes/ready
Resolving checker.tools.wmflabs.org (checker.tools.wmflabs.org)... 10.68.16.228
Connecting to checker.tools.wmflabs.org (checker.tools.wmflabs.org)|10.68.16.228|:443... failed: Connection refused.

I'm not sure how it's possible that Icinga is seeing this as OK.

aborrero created this task. Jun 12 2018, 9:14 AM
Restricted Application added a subscriber: Aklapper. Jun 12 2018, 9:14 AM
chasemp added a subscriber: chasemp. (Edited) Jun 12 2018, 2:50 PM

I believe tools-worker-1029.tools.eqiad.wmflabs needs to be deleted, but I'm unsure why these three are cordoned. A relic of forgotten maintenance?

tools-worker-1010.tools.eqiad.wmflabs Ready,SchedulingDisabled 1y
tools-worker-1011.tools.eqiad.wmflabs Ready,SchedulingDisabled 1y
tools-worker-1012.tools.eqiad.wmflabs Ready,SchedulingDisabled 1y
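If those cordons do turn out to be leftovers from forgotten maintenance, something along these lines should clear them (a sketch only, not run against the live cluster; the node name is just an example from the list above):

```shell
# Sketch: inspect why a node is cordoned, then re-enable scheduling.
# `kubectl describe node` shows the Unschedulable flag and recent events.
sudo kubectl describe node tools-worker-1010.tools.eqiad.wmflabs | grep -i unschedulable

# `kubectl uncordon` flips Ready,SchedulingDisabled back to plain Ready.
sudo kubectl uncordon tools-worker-1010.tools.eqiad.wmflabs
```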

The check does return OK for me now:

curl http://checker.tools.wmflabs.org:80/k8s/nodes/ready

OK

These checks are very weird in that multiple checks of the same endpoint have been known to clobber each other; that's why each of them was moved to its own service on the checker hosts themselves.


@aborrero - were there any logs to indicate what the trouble was on

tools-worker-1006.tools.eqiad.wmflabs NotReady 2y
tools-worker-1007.tools.eqiad.wmflabs NotReady 2y
tools-worker-1021.tools.eqiad.wmflabs NotReady 1y

?
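(For anyone digging later: the kubelet lines in /var/log/syslog on those nodes turned out to be the giveaway, as shown further down. A tiny sketch of the kind of grep involved, run here against one canned syslog line rather than the real file; on a node you would use `grep kubelet /var/log/syslog` or `journalctl -u kubelet`.)

```shell
# Canned syslog line standing in for /var/log/syslog on a NotReady worker.
line='Jun 12 08:07:36 tools-worker-1006 kubelet[9622]: W0612 08:07:36.247689 9622 server.go:383] No API client: no api servers specified'

# Check whether the kubelet came up without an API server configured.
printf '%s\n' "$line" | grep 'No API client'
```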

By chance was there an update to some global ferm variable?

chasemp triaged this task as Normal priority. Jun 12 2018, 3:43 PM

OK, I realize now that I was testing HTTPS vs. HTTP. That's why I got the connection refused error.

Some interesting logs:

tools-worker-1006.tools.eqiad.wmflabs /var/log/syslog
[...]
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) --- /etc/default/kubelet#0112017-08-30 09:20:55.751518734 +0000
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +++ /tmp/puppet-file20180612-8532-13jypi7#0112018-06-12 08:07:35.414041051 +0000
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) @@ -9,9 +9,6 @@
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # You may leave this blank to use the actual hostname
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  KUBELET_HOSTNAME="--hostname-override=tools-worker-1006.tools.eqiad.wmflabs"
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) -# location of the api-server
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) -KUBELET_API_SERVER="--api-servers=https://k8s-master.tools.wmflabs.org:6443"
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) -
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # Docker endpoint to connect to
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # Default: unix:///var/run/docker.sock
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  #DOCKER_ENDPOINT="--docker-endpoint=unix:///var/run/docker.sock"
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: Computing checksum on file /etc/default/kubelet
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Filebucketed /etc/default/kubelet to puppet with sum 007e0613fada9c3de1b92525679d8c97
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) content changed '{md5}007e0613fada9c3de1b92525679d8c97' to '{md5}0228191aeb765ec7d41b14c6f9948f83'
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Scheduling refresh of Service[kubelet]
Jun 12 08:07:35 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Scheduling refresh of Service[kubelet]
Jun 12 08:07:36 tools-worker-1006 puppet-agent[8532]: (/Stage[main]/K8s::Kubelet/Service[kubelet]) Triggered 'refresh' from 2 events
Jun 12 08:07:36 tools-worker-1006 kubelet[9622]: W0612 08:07:36.247689    9622 server.go:383] No API client: no api servers specified
Jun 12 08:07:36 tools-worker-1006 kubelet[9622]: I0612 08:07:36.248334    9622 docker.go:375] Connecting to docker on unix:///var/run/docker.sock
Jun 12 08:07:36 tools-worker-1006 kubelet[9622]: I0612 08:07:36.248644    9622 docker.go:395] Start docker client with request timeout=2m0s
Jun 12 08:07:36 tools-worker-1006 kubelet[9622]: E0612 08:07:36.249069    9622 cni.go:163] error updating cni config: No networks found in /etc/cni/net.d
[...]
[...]
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) --- /etc/default/kubelet#0112018-06-12 08:07:35.490041483 +0000
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +++ /tmp/puppet-file20180612-29434-1wxxh8e#0112018-06-12 08:37:07.688416156 +0000
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) @@ -9,6 +9,9 @@
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # You may leave this blank to use the actual hostname
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  KUBELET_HOSTNAME="--hostname-override=tools-worker-1006.tools.eqiad.wmflabs"
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +# location of the api-server
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +KUBELET_API_SERVER="--api-servers=https://k8s-master.tools.wmflabs.org:6443"
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) +
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # Docker endpoint to connect to
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  # Default: unix:///var/run/docker.sock
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content)  #DOCKER_ENDPOINT="--docker-endpoint=unix:///var/run/docker.sock"
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: Computing checksum on file /etc/default/kubelet
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Filebucketed /etc/default/kubelet to puppet with sum 0228191aeb765ec7d41b14c6f9948f83
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]/content) content changed '{md5}0228191aeb765ec7d41b14c6f9948f83' to '{md5}007e0613fada9c3de1b92525679d8c97'
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Scheduling refresh of Service[kubelet]
Jun 12 08:37:07 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/File[/etc/default/kubelet]) Scheduling refresh of Service[kubelet]
Jun 12 08:37:08 tools-worker-1006 puppet-agent[29434]: (/Stage[main]/K8s::Kubelet/Service[kubelet]) Triggered 'refresh' from 2 events

The same appears in tools-worker-1007.tools.eqiad.wmflabs and in tools-worker-1021.tools.eqiad.wmflabs.

A puppet patch?

tools-worker-1029 is in Hiera:Tools (k8s::worker_nodes). It was added by Madhuvishy in March 2017. This VM doesn't exist in Horizon, though.

I've reverted that change in Hiera:Tools and removed tools-worker-1029 from the cluster (kubectl delete node tools-worker-1029.tools.eqiad.wmflabs).
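For the record, the usual sequence for removing a worker is drain then delete; a sketch only, and drain is moot here since the VM no longer exists, so `kubectl delete node` alone was enough:

```shell
# Sketch: standard removal of a worker node from the cluster.
# Drain evicts remaining pods first (pointless when the VM is already gone):
sudo kubectl drain tools-worker-1029.tools.eqiad.wmflabs --ignore-daemonsets

# Delete removes the node object from the API server:
sudo kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
```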

Mentioned in SAL (#wikimedia-cloud) [2018-12-05T12:17:21Z] <gtirloni> remoted node tools-worker-1029.tools.eqiad.wmflabs from cluster (T196973)

The original issue reported here seems to be related to this change:

commit c2f456b4e8cab411078111c63ecd74008af8dd0f
Author: Alexandros Kosiaris <akosiaris@wikimedia.org>
Date:   Thu May 31 11:40:38 2018 +0300

    kubernetes: Remove deprecated --api-servers parameter
    
    api-servers has been deprecated for quite some time, and was expected to
    be removed in 1.6 but ended up being removed in 1.8. We populate
    correctly anyway kubeconfig files that are already in used by the
    kubelet so remove this. Add the --require-kubeconfig parameter to make
    sure it is used
    
    Change-Id: I951406f884a3faf1ace74970a6ffbf73f7de6627

Confirmed the API server configuration is coming from /etc/kubernetes/kubeconfig now.
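That can be double-checked by reading the `server:` entry out of the kubeconfig. A sketch against a canned kubeconfig fragment; on a worker you would read /etc/kubernetes/kubeconfig itself:

```shell
# Canned kubeconfig fragment for illustration; the real file lives at
# /etc/kubernetes/kubeconfig on the workers.
kubeconfig='apiVersion: v1
kind: Config
clusters:
- cluster:
    server: https://k8s-master.tools.wmflabs.org:6443
  name: default'

# awk ignores leading whitespace when splitting fields, so $1 is the key
# and $2 is the API server URL.
printf '%s\n' "$kubeconfig" | awk '$1 == "server:" {print $2}'
```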

GTirloni closed this task as Resolved. Wed, Dec 12, 3:52 PM