
[infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods
Closed, Resolved · Public

Description

For an as-yet-unknown reason the worker is failing to serve pod logs to the bastion, timing out instead. Rebooting it and similar interventions did not help.

This blocks deployments, as it makes the functional tests fail.

toolsbeta.test@toolsbeta-bastion-7:~$ toolforge build start https://gitlab.wikimedia.org/toolforge-repos/wm-lol
Waiting for the logs... if the build just started this might take a minute
BuildClientError: Error getting the logs for test-buildpacks-pipelinerun-85l2k: Get "https://172.16.5.174:10250/containerLogs/image-build/test-buildpacks-pipelinerun-85l2k-build-from-git-pod/prepare?follow=true&timestamps=true": dial tcp 172.16.5.174:10250: i/o timeout
Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu
dcaro@toolsbeta-bastion-7:~$ kubectl-sudo logs -n alloy alloy-2js27
Error from server: Get "https://172.16.5.174:10250/containerLogs/alloy/alloy-2js27/alloy": dial tcp 172.16.5.174:10250: i/o timeout
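The i/o timeout on port 10250 suggests the kubelet API on this worker is unreachable, while other workers answer fine. A quick, hypothetical way to confirm this from the bastion is to probe the port directly (`check_port` is an illustrative helper, not an existing tool; the IP is the one from the error above):

```shell
# check_port HOST PORT: succeed if a TCP connection opens within 5 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are needed.
check_port() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Probe the worker's kubelet port (IP taken from the error message above).
check_port 172.16.5.174 10250 && echo "kubelet port reachable" \
                             || echo "kubelet port unreachable"
```

A timeout here, when the same probe succeeds against healthy workers, points at the node itself (kubelet down, host firewall, or broken networking) rather than at the control plane.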

Will remove and recreate.

Event Timeline

dcaro changed the task status from Open to In Progress. Sep 16 2025, 1:40 PM
dcaro triaged this task as High priority.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 24) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T13:41:42Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster (T404721)

Deleted with:

dcaro@acme$ wmcs-cookbooks wmcs.toolforge.k8s.worker.depool_and_remove_node --hostname-to-remove toolsbeta-test-k8s-worker-nfs-5 --role worker_nfs --cluster-name toolsbeta --force

Adding the new node failed the preflight checks:

----- OUTPUT of 'sudo -i kubeadm ...16f541ca6dd18704' -----
[preflight] Running pre-flight checks
        [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileExisting-conntrack]: conntrack not found in system path
        [ERROR KubeletVersion]: couldn't get kubelet version: cannot execute 'kubelet --version': executable file not found in $PATH
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
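The errors indicate the new VM came up without the kubelet and conntrack binaries, i.e. provisioning had not finished when the join was attempted. A small sketch to verify the expected binaries are present before retrying the join (the binary list is an assumption derived from the preflight errors above):

```shell
# Check for the binaries kubeadm's preflight phase complained about.
for bin in conntrack kubelet; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: ok"
  else
    echo "$bin: MISSING"
  fi
done
```

On a Debian-based worker image, conntrack typically comes from the distro archive and kubelet from the Kubernetes apt repository; once both are installed, `systemctl enable --now kubelet.service` clears the remaining warning and the join can be retried.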

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T14:09:58Z] <wmbot~dcaro@acme> START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-test-k8s-worker-nfs-11 (T404721)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T14:11:11Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-test-k8s-worker-nfs-11 (T404721)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T14:11:33Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster (T404721)

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 24) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T15:23:58Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the toolsbeta cluster (T404721)

dcaro reopened this task as In Progress. Sep 16 2025, 3:42 PM
dcaro moved this task from Done to In Progress on the Toolforge (Toolforge iteration 24) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T15:42:08Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node for host toolsbeta-test-k8s-ingress-10 (T404721)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-16T15:43:29Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node (exit_code=0) for host toolsbeta-test-k8s-ingress-10 (T404721)

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 24) board.

Ended up also scrubbing and recreating toolsbeta-test-k8s-ingress-10.