
hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet
Closed, Resolved · Public

Description

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it's been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.)
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: kubernetes1051.eqiad.wmnet
Urgency: Low, server is part of the kubernetes cluster
Failure summary: Management and main network interfaces are down and not coming back up after reboot.

Details

During today's morning backport window, kubernetes1051 consistently failed to pull images, with the docker command simply hanging. I've drained the node and set it to pooled=inactive for now.

There had been a constant rate of packet drops (~900mp/s) since 6:30Z, which returned to normal levels around 8:30 (i.e. since the depool).

from SAL:

08:07 <jayme> draining kubernetes1051.eqiad.wmnet
08:30 <jayme@cumin1002> conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet

Something odd I saw while briefly checking the syslog: the USB hub has been constantly (re-)detected since this morning:

root@kubernetes1051:~# grep 'usb 1-14: New USB device found' /var/log/syslog | grep ^2024-07-02 |head
2024-07-02T06:37:07.457280+00:00 kubernetes1051 kernel: usb 1-14: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.0
...
root@kubernetes1051:~# grep 'usb 1-14: New USB device found' /var/log/syslog | grep ^2024-07-02 -c
350
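As a hedged follow-up (not part of the original ticket): the same log lines can be bucketed by hour to see when the flapping started, assuming the ISO-8601 syslog timestamp format shown in the excerpt above.

```shell
# Sketch: count the USB re-detections per hour (reads syslog lines on stdin).
# Assumes timestamps like "2024-07-02T06:37:07...", as in the excerpt above.
count_usb_redetects() {
  grep 'usb 1-14: New USB device found' \
    | grep '^2024-07-02' \
    | cut -c1-13 \
    | sort | uniq -c
}
# usage: count_usb_redetects < /var/log/syslog
```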

Details

Other Assignee
Jclark-ctr

Event Timeline

Host rebooted by jayme@cumin1002 with reason: None

Hey DC-Ops ops-eqiad,
Could you please check on this node? It does not come up after reboot, and I cannot reach mgmt (ssh and web interface).
The server is depooled, so feel free to powercycle etc. at your convenience.

Host is flapping, setting downtime until tomorrow

Icinga downtime and Alertmanager silence (ID=1d5196ee-59a9-4e12-b2fc-c8c25de6ab16) set by cgoubert@cumin1002 for 20:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet

I've deleted the node from the k8s API as a required istio update would not finish successfully because it was waiting for the daemonset to be scheduled on the broken node. The node should auto-join the cluster (cordoned) when it comes back online.

Icinga downtime and Alertmanager silence (ID=53cf057a-4641-401a-ab84-392d5d8f2444) set by cgoubert@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet


Hi. I notice BGP for this host is still down on the switch side? If it's likely to continue let me know and I will set the Netbox flag for BGP to off and remove the config from the switch with Homer. Thanks.


I've disabled BGP for this node for now.


Ah, cool. It's generally not a problem for a short time, but it can mask the issue if another node fails for some reason. Thanks.

Icinga downtime and Alertmanager silence (ID=3f55d01c-31c3-4e2a-8c13-b4c6da9484f8) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet
Clement_Goubert renamed this task from kubernetes1051.eqiad.wmnet failed to pull mediawiki images to hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet. (Jul 10 2024, 2:04 PM)
Clement_Goubert assigned this task to Jclark-ctr.
Clement_Goubert triaged this task as Medium priority.
Clement_Goubert updated the task description. (Show Details)
VRiley-WMF updated Other Assignee, added: Jclark-ctr.
VRiley-WMF added a subscriber: Jclark-ctr.

I have shut down the server and completed a flea power drain. I booted the server back up, checked whether the ports were active, and it looks good. Please let us know if there is anything else that needs to be done with this server.

Mentioned in SAL (#wikimedia-operations) [2024-07-25T10:00:18Z] <cgoubert@cumin1002> conftool action : set/pooled=yes; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Uncordoning kubernetes1051 - T369011]

Host BGP re-enabled, back in Active status and uncordoned, all looks good. Thanks @VRiley-WMF


Small note about the confusing names of the BGP state machine: "Active" means "actively trying to establish the session", while "Established" means it's managed to connect. Usually things progress through all the states to "Established" in a second or two unless there is a problem. So if you see a state of "Active" when checking, it usually means there is a problem!

In this case all is good:

cmooney@lsw1-e3-eqiad> show bgp summary | match "10.64.132.28|2620:0:861:10b:10:64:132:28" 
10.64.132.28          64601        917        883       0       1     2:17:09 Establ
2620:0:861:10b:10:64:132:28       64601        920        883       0       1     2:17:09 Establ
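As an aside on the point above, the state names come from the BGP finite-state machine (RFC 4271). A minimal sketch of the distinction (the state names and ordering are from the RFC, not from this ticket):

```python
# BGP session states in FSM order (RFC 4271). "Active" is an early state in
# which the speaker is still trying to open the TCP connection; only
# "Established" means the session is actually up and exchanging routes.
BGP_STATES = ["Idle", "Connect", "Active", "OpenSent", "OpenConfirm", "Established"]

def session_is_up(state: str) -> bool:
    # Anything short of Established, including Active, means the session
    # is not (yet) up.
    return state == "Established"

print(session_is_up("Active"))       # False
print(session_is_up("Established"))  # True
```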



I see where the confusion is: I meant Active in Netbox; BGP is Established.

Mentioned in SAL (#wikimedia-operations) [2024-07-25T13:30:30Z] <cgoubert@cumin1002> conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Cordoning kubernetes1051 for missed upgrades - T369011]

Host rebooted by cgoubert@cumin1002 with reason: Missed kernel upgrade

Mentioned in SAL (#wikimedia-operations) [2024-07-25T13:41:22Z] <cgoubert@cumin1002> conftool action : set/pooled=yes; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Uncordoning kubernetes1051 for missed upgrades - T369011]