- - Provide FQDN of system.
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc.).
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
FQDN: kubernetes1051.eqiad.wmnet
Urgency: Low; the server is part of the Kubernetes cluster.
Failure summary: Management and main network interfaces are down and not coming back up after reboot.
Details
During today's morning backport window, kubernetes1051 consistently failed to pull images, with the docker command just hanging. I've drained the node and set it to pooled=inactive for now.
There was a constant rate of packet drops (~900mp/s) from 06:30Z, which returned to normal levels around 08:30 (i.e. since the depool).
From SAL:
08:07 <jayme> draining kubernetes1051.eqiad.wmnet
08:30 <jayme@cumin1002> conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet
Something odd I saw while briefly checking the syslog is that the USB hub has been constantly (re-)detected since this morning:
root@kubernetes1051:~# grep 'usb 1-14: New USB device found' /var/log/syslog | grep ^2024-07-02 | head
2024-07-02T06:37:07.457280+00:00 kubernetes1051 kernel: usb 1-14: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.0
...
root@kubernetes1051:~# grep 'usb 1-14: New USB device found' /var/log/syslog | grep ^2024-07-02 -c
350
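To see whether the re-detections line up with the packet-drop window, the same grep can be bucketed per hour. This is only a rough sketch: usb_events_per_hour is a hypothetical helper name (not an existing tool), and it assumes the ISO 8601 timestamp format shown in the syslog line above, so the first 13 characters of each line are the date plus the hour.

```shell
# Hypothetical helper: count "New USB device found" events per hour.
#   $1: syslog path, $2: date prefix (e.g. 2024-07-02), $3: USB device (e.g. 1-14)
# Assumes each line starts with an ISO 8601 timestamp, as in the syslog above.
usb_events_per_hour() {
    grep -F "usb $3: New USB device found" "$1" \
        | grep "^$2" \
        | cut -c1-13 \
        | sort \
        | uniq -c
}
```

It would be run as: usb_events_per_hour /var/log/syslog 2024-07-02 1-14. Each output line is a count followed by the hour bucket (for example 2024-07-02T06).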
