
cloudvirt1040 primary NIC disconnected
Closed, ResolvedPublic

Description

I just now put cloudvirt1040 into service. It worked for a half hour or so but is now unreachable.

The mgmt console works fine and I can log in with the root password; ip addr shows the primary NIC (eno2np1) as UP, but the host cannot be reached via ssh and cannot talk to the puppet server.
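For reference, a minimal sketch (my own, not from the task) of how the kernel's view of the link can be read from the mgmt console via sysfs; eno2np1 is the interface named above, and the script falls back to lo so it runs on any Linux host:

```shell
#!/bin/sh
# Hypothetical diagnostic sketch: read interface state from sysfs.
# "eno2np1" is the NIC from this task; fall back to "lo" if absent.
IFACE=eno2np1
[ -e "/sys/class/net/$IFACE" ] || IFACE=lo

echo "interface: $IFACE"
# operstate is the kernel's operational state: up / down / unknown
echo "operstate: $(cat /sys/class/net/$IFACE/operstate)"
# carrier is 1 when link is detected; reading it fails (EINVAL) if the
# interface is administratively down, hence the fallback to "n/a"
echo "carrier:   $(cat /sys/class/net/$IFACE/carrier 2>/dev/null || echo n/a)"
```

Note that "operstate: UP" with "carrier: 0" would point at a physical-layer problem (cable, optic, or the NIC itself) rather than a host configuration issue, which matches the symptoms here.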

This host is slightly different from cloudvirt1041-1046 despite being part of the same order -- it was the 'seed server', and John installed the NIC after arrival.

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

This recovered after a reboot. We'll see if it holds...

Mentioned in SAL (#wikimedia-cloud) [2021-04-28T19:40:36Z] <andrewbogott> putting cloudvirt1040 into the maintenance aggregate pending more info about T281399

Icinga downtime set by dcaro@cumin1001 for 2:00:00 1 host(s) and their services with reason: primary nic disconnected

cloudvirt1040.eqiad.wmnet

NIC shows link. The card was installed previously; unsure if it needs to be updated, and not sure why it says Intel?
The packing slip for the NIC reads: Broadcom 57412 2 Port 10Gb SFP+ + 5720 2 Port 1Gb Base-T, rNDC, Customer Install.

This server isn't in use currently; you're welcome to reboot it or shut it down as part of troubleshooting.

'Reset default configurations' is showing up at this time.

It's down again. Whenever we prod it, it seems to recover briefly and then fall off the network after an hour or two.

Opened a Dell support ticket (Service Request Detail: 1060698910). Even though it shows connected now, I will follow up with Dell.

@Andrew: updated the firmware; it shows connected on Icinga. Will monitor.

It's still up! I will re-enable monitoring and if it doesn't flap then we can declare it to be cured.

Hm... mgmt shows "DNS CRITICAL - expected '0.0.0.0' but got '10.65.0.227'" -- I've never seen that before.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106041443_andrew_14472.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106041532_andrew_20422.log.

Completed auto-reimage of hosts:

['cloudvirt1040.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1040.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106041638_andrew_30711.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106041643_andrew_31194.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106041710_andrew_7137.log.

Completed auto-reimage of hosts:

['cloudvirt1040.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-cloud) [2021-06-07T14:26:59Z] <andrewbogott> moving cloudvirt1040 from 'maintenance' aggregate to 'ceph' aggregate T281399

This host is now back in normal service. Thank you @Jclark-ctr !