Today we got a page from icinga:
PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.140 second response time
I checked on tools-k8s-master-01.eqiad.wmflabs and several worker nodes were NotReady (1006, 1007, 1021 and 1029):
aborrero@tools-k8s-master-01:~$ sudo kubectl get nodes -o wide
NAME                                    STATUS                        AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1003.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1004.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1007.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1008.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1009.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1010.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1011.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1012.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1013.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1014.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1015.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1016.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1017.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1018.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1019.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1020.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1021.tools.eqiad.wmflabs   NotReady                      1y
tools-worker-1022.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1023.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1025.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1026.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1027.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1028.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y
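Picking the problem nodes out of a listing this long by eye is error-prone, so a small filter may help. This is only a sketch: it assumes one node per line in the three-column NAME / STATUS / AGE layout of the `kubectl get nodes` output above, and the function name `not_ready_nodes` is made up for illustration.

```python
from typing import List


def not_ready_nodes(listing: str) -> List[str]:
    """Return the names of nodes whose STATUS column includes NotReady."""
    nodes = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the shell prompt line and the NAME/STATUS/AGE header.
        if len(fields) != 3 or fields[0] == "NAME":
            continue
        name, status, _age = fields
        # STATUS can be a comma-separated list, e.g. "NotReady,SchedulingDisabled".
        if "NotReady" in status.split(","):
            nodes.append(name)
    return nodes


# A few rows copied from the listing in this report.
sample = """\
NAME                                    STATUS                        AGE
tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   NotReady                      2y
tools-worker-1028.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y
"""

print(not_ready_nodes(sample))
```

Splitting STATUS on commas rather than doing a substring match keeps "Ready" and "NotReady" from being confused with each other.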
After several minutes, the nodes were Ready again, except tools-worker-1029, which was still NotReady:
aborrero@tools-k8s-master-01:~$ sudo kubectl get nodes -o wide
NAME                                    STATUS                        AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1003.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1004.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1005.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1006.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1007.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1008.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1009.tools.eqiad.wmflabs   Ready                         2y
tools-worker-1010.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1011.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1012.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1013.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1014.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1015.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1016.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1017.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1018.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1019.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1020.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1021.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1022.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1023.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1025.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1026.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1027.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1028.tools.eqiad.wmflabs   Ready,SchedulingDisabled      1y
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   1y
I'm not sure what is going on with the SchedulingDisabled nodes, and I can't log in to tools-worker-1029.tools.eqiad.wmflabs.
While testing, I also restarted the checker service:
aborrero@tools-checker-01:~$ sudo service toolschecker_kubernetes_nodes_ready restart
toolschecker_kubernetes_nodes_ready stop/waiting
toolschecker_kubernetes_nodes_ready start/running, process 23148
But something then broke somewhere (the proxy?), and now I can't reach the checker:
arturo@endurance:~$ LANG=C wget https://checker.tools.wmflabs.org/k8s/nodes/ready
--2018-06-12 11:10:49--  https://checker.tools.wmflabs.org/k8s/nodes/ready
Resolving checker.tools.wmflabs.org (checker.tools.wmflabs.org)... 208.80.155.229
Connecting to checker.tools.wmflabs.org (checker.tools.wmflabs.org)|208.80.155.229|:443... failed: Connection refused.

aborrero@tools-clushmaster-01:~$ wget https://checker.tools.wmflabs.org/k8s/nodes/ready
--2018-06-12 09:12:19--  https://checker.tools.wmflabs.org/k8s/nodes/ready
Resolving checker.tools.wmflabs.org (checker.tools.wmflabs.org)... 10.68.16.228
Connecting to checker.tools.wmflabs.org (checker.tools.wmflabs.org)|10.68.16.228|:443... failed: Connection refused.
I'm not sure how icinga can be seeing this as OK.
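One detail that may be relevant: the icinga alert polled plain HTTP on port 80, while both wget attempts used HTTPS on port 443, so the two might not be hitting the same listener. Below is a rough sketch for probing either URL in the same way. The pass condition (a 2xx status plus the string OK in the body) is an assumption inferred from the alert text, not taken from the actual icinga configuration, and the helper names `looks_ok` and `probe` are hypothetical.

```python
import urllib.error
import urllib.request


def looks_ok(status: int, body: str) -> bool:
    """Assumed pass condition: a 2xx status and the string OK in the body."""
    return 200 <= status < 300 and "OK" in body


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch url and return a one-line summary of what came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # No HTTP response at all, e.g. connection refused or a timeout.
        return f"{url}: no response ({err})"
    verdict = "OK" if looks_ok(status, body) else "NOT OK"
    return f"{url}: HTTP {status}, {verdict}"
```

Calling `probe("http://checker.tools.wmflabs.org/k8s/nodes/ready")` and then the same URL with `https://` should show whether only the port 443 side is refusing connections.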