
Toolforge: k8s nodes aren't healthy
Closed, Resolved (Public)

Description

We got paged:

16:15 <+icinga-wm> PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
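
The check behind this alert fetches the /k8s/nodes/ready endpoint on the checker host and expects the string OK in the response body, so it can be reproduced by hand with something like the following (a sketch, assuming the endpoint is reachable from wherever you run it):

curl -i http://checker.tools.wmflabs.org/k8s/nodes/ready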

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-05-07T14:31:52Z] <arturo> T222718 reboot tools-worker-1009 and 1022 after being drained
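
(For reference, draining the workers before the reboot would be roughly the following; this is a sketch, not the exact invocation that was run, and flags vary by kubectl version:)

kubectl drain tools-worker-1009.tools.eqiad.wmflabs --ignore-daemonsets
kubectl drain tools-worker-1022.tools.eqiad.wmflabs --ignore-daemonsets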

aborrero triaged this task as High priority. May 7 2019, 2:32 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

This is weird: there seem to be duplicate entries for both 1009 and 1022, each appearing once under the usual .tools.eqiad.wmflabs FQDN (Ready) and once under a shorter .eqiad.wmflabs name (NotReady, only 2d old):

root@tools-k8s-master-01:~# kubectl get nodes
NAME                                    STATUS                     AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1003.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1004.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1005.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1006.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1007.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1008.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1009.eqiad.wmflabs         NotReady                   2d
tools-worker-1009.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1010.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1011.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1012.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1013.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1014.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1015.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1016.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1017.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1018.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1019.tools.eqiad.wmflabs   Ready,SchedulingDisabled   2y
tools-worker-1020.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1021.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1022.eqiad.wmflabs         NotReady                   2d
tools-worker-1022.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1023.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1025.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1026.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1027.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1028.tools.eqiad.wmflabs   Ready                      2y
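
Before deleting the stale .eqiad.wmflabs entries it is worth confirming nothing is still scheduled on them; a kubectl describe of each (a sketch, not part of the original transcript) lists the non-terminated pods on the node:

kubectl describe node tools-worker-1009.eqiad.wmflabs
kubectl describe node tools-worker-1022.eqiad.wmflabs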

I manually deleted the duplicate node entries:

root@tools-k8s-master-01:~# kubectl delete node tools-worker-1022.eqiad.wmflabs
node "tools-worker-1022.eqiad.wmflabs" deleted

root@tools-k8s-master-01:~# kubectl delete node tools-worker-1009.eqiad.wmflabs
node "tools-worker-1009.eqiad.wmflabs" deleted

root@tools-k8s-master-01:~# kubectl get nodes
NAME                                    STATUS                     AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1002.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1003.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1004.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1005.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1006.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1007.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1008.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1009.tools.eqiad.wmflabs   Ready                      3y
tools-worker-1010.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1011.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1012.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1013.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1014.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1015.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1016.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1017.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1018.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1019.tools.eqiad.wmflabs   Ready,SchedulingDisabled   2y
tools-worker-1020.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1021.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1022.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1023.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1025.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1026.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1027.tools.eqiad.wmflabs   Ready                      2y
tools-worker-1028.tools.eqiad.wmflabs   Ready                      2y

Mentioned in SAL (#wikimedia-cloud) [2019-05-07T14:38:08Z] <arturo> T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
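
(The uncordon itself is a single command, presumably along the lines of:)

kubectl uncordon tools-worker-1019.tools.eqiad.wmflabs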

aborrero lowered the priority of this task from High to Medium.

Seems stable now. It's a mystery how those duplicate nodes ended up there; they were already present before my reboot.
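
One thing that could still be checked (pure speculation, not verified in this task) is the name the kubelet on those workers registers under: a kubelet that briefly comes up seeing the short hostname would register a second node object under that name. On the worker, for example:

# what FQDN the host currently reports
hostname -f
# and whether the kubelet is started with an explicit --hostname-override
ps aux | grep kubelet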