mw1360's NIC is faulty
Closed, Resolved (Public)

Description

mw1360 was reported down by Icinga. The serial console was not available at first, so I had to run an ipmi mc cold reset from cumin1001. After checking on the console, it seems to me that the NIC on the host is not working as expected:

root@mw1360:~# ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 2  bytes 100 (100.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 100 (100.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo:LVS: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 10.2.2.22  netmask 255.255.255.255
        loop  txqueuelen 1  (Local Loopback)

root@mw1360:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug eno1
iface eno1 inet static
        address 10.64.48.202/22
        gateway 10.64.48.1
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 10.3.0.1
        dns-search eqiad.wmnet
   pre-up /sbin/ip token set ::10:64:48:202 dev eno1
   up ip addr add 2620:0:861:107:10:64:48:202/64 dev eno1

root@mw1360:~# ifup eno1
Error: argument "eno1" is wrong: dev is invalid

ifup: failed to bring up eno1
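
The "dev is invalid" error above is consistent with the kernel not exposing eno1 at all. A minimal sketch of that check, assuming a Linux host where present devices are listed under /sys/class/net; the function names and the trimmed sample config are illustrative and not part of any WMF tooling:

```python
import os
import re


def configured_interfaces(interfaces_text):
    """Return interface names declared by 'iface <name> ...' stanzas, in order."""
    names = []
    for line in interfaces_text.splitlines():
        match = re.match(r"\s*iface\s+(\S+)", line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


def present_interfaces(sysfs_dir="/sys/class/net"):
    """Names the running kernel currently exposes (Linux sysfs)."""
    return set(os.listdir(sysfs_dir))


def missing_interfaces(interfaces_text, present):
    """Configured interfaces that are absent from the 'present' set."""
    return [n for n in configured_interfaces(interfaces_text) if n not in present]


# Trimmed copy of the /etc/network/interfaces content shown above.
SAMPLE = """\
auto lo
iface lo inet loopback

allow-hotplug eno1
iface eno1 inet static
        address 10.64.48.202/22
        gateway 10.64.48.1
"""

if __name__ == "__main__":
    # Assumption: only 'lo' is present, matching the ifconfig output above.
    print(missing_interfaces(SAMPLE, {"lo"}))
```

On a live host, `missing_interfaces(open("/etc/network/interfaces").read(), present_interfaces())` would do the same comparison against the real device list.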

Troubleshooting Summary

  • updated bios and idrac to newest firmware revisions
  • nic is enabled in bios integrated peripherals but doesn't show up in PCI devices of the support report (where it normally would)
  • nic has error message in idrac inventory: RAC1021: NIC objects are not available in the current system configuration. Make sure the NIC devices are correctly installed in the system and retry the operation after the Collect System Inventory On Restart (CSIOR) feature has updated the system inventory. If the issue persists, contact your service provider.
  • support collection report generated after all the above troubleshooting:
  • suggested we fully drain power on-site and then return it, to see if that resets the mainboard issue before opening a self dispatch for a new mainboard.

Event Timeline

wiki_willy added a subscriber: wiki_willy.

This one looks like it's under warranty, just installed a year ago

The device is still active in Netbox, shouldn't it be marked as failed?

Yep, it's not online, so I'm setting it to failed so the reports clear up in Netbox. Since it has a task, swapping its state seems OK to me.

RobH added a subscriber: Jclark-ctr.

Mentioned in SAL (#wikimedia-operations) [2020-09-22T16:00:04Z] <robh> running dell epsa test on down host mw1360 per T262151

Ok, running the troubleshooting steps as follows:

Copy over the SEL, then erase it so it won't throw errors during testing. At this time there are no NIC errors in the SEL, but full testing may turn up issues:

/admin1-> racadm getsel
Record:      1
Date/Time:   09/09/2019 21:46:06
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/17/2020 12:30:39
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/17/2020 12:30:46
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   09/17/2020 12:44:45
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/17/2020 12:44:46
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   09/17/2020 12:53:42
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   09/17/2020 12:53:46
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   09/17/2020 13:00:11
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   09/17/2020 13:00:12
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
  • now booting it into the dell testing suite, this can take an hour or two for the full run.
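
The SEL review above (looking for NIC entries among the power supply events) can be sketched as a small parser. This is only an illustration, assuming the record layout matches the `racadm getsel` output pasted in this task; the function names are hypothetical and the sample holds just the first three records:

```python
import re


def parse_sel(text):
    """Split 'racadm getsel' style output into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            # A blank or all-dash separator line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def records_matching(records, pattern):
    """Records whose Description matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [r for r in records if rx.search(r.get("Description", ""))]


# First three records copied from the SEL dump above.
SAMPLE = """\
Record:      1
Date/Time:   09/09/2019 21:46:06
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/17/2020 12:30:39
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/17/2020 12:30:46
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    records = parse_sel(SAMPLE)
    critical = [r for r in records if r.get("Severity") == "Critical"]
    print(len(records), len(critical), records_matching(records, r"\bnic\b"))
```

An empty result from `records_matching(records, r"\bnic\b")` corresponds to the observation that the SEL carries no NIC errors, only power supply events.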

Quick tests completed with no errors; full testing is continuing, with an on-screen ETA of 3 hours.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T15:57:58Z] <robh> updating firmware on mw1360, troubleshooting nic failure issue T262151

All tests passed with no issues. I've updated the firmware to the newest version of bios, which is the mainboard firmware. The system no longer sees the NIC. I suppose we should try to reimage and see if it re-detects the NIC and, if not, RMA the mainboard.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

mw1360.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009231613_robh_1862_mw1360_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1360.eqiad.wmnet']

Of which those FAILED:

['mw1360.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

mw1360.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009231616_robh_4519_mw1360_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1360.eqiad.wmnet']

Of which those FAILED:

['mw1360.eqiad.wmnet']

Both Papaul and I have looked into this, checking multiple items:

  • updated bios and idrac to newest firmware revisions
  • nic is enabled in bios
  • nic has error message in idrac inventory: RAC1021: NIC objects are not available in the current system configuration. Make sure the NIC devices are correctly installed in the system and retry the operation after the Collect System Inventory On Restart (CSIOR) feature has updated the system inventory. If the issue persists, contact your service provider.
    • nic is permanently installed onto the mainboard, so this isn't really something we can fix.
    • nic won't show in boot order options and won't accept the PXE boot option, since the NIC isn't listed at all.

Papaul suggests that we drain power (unplug since we don't have switched PDUs in normal racks) and let sit for a couple minutes and then plug it all back in. Worth a shot, since it has cleared up other Dell issues in the past.

The support collection report is attached for use in filing a support request if the power drain doesn't fix things:

Chris:

Please remove power entirely from this device for a few minutes before returning it (unplug and plug back in both power cables), and then assign back to me to check for the NIC. If it doesn't show back up, I can open a support dispatch on your behalf and have a new mainboard dispatched to DC6.

RobH triaged this task as Medium priority. Sep 23 2020, 5:25 PM
RobH updated the task description.
Cmjohnson added subscribers: RobH, Cmjohnson.

@RobH power has been pulled and flea power drained

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

mw1360.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009241438_robh_29643_mw1360_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1360.eqiad.wmnet']

Of which those FAILED:

['mw1360.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

mw1360.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009241439_robh_30270_mw1360_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1360.eqiad.wmnet']

Of which those FAILED:

['mw1360.eqiad.wmnet']

Mentioned in SAL (#wikimedia-operations) [2020-09-24T15:15:30Z] <robh> mw1360 scap and repooled post work via T262151

Host is now online (reimaged) and returned to service after the scap pull and repool. Set to active in Netbox. However, it's not in the DSH node groups, and the directions aren't clear on where the file to edit lives.

https://wikitech.wikimedia.org/wiki/Application_servers#Apache_setup_checklist

After checking with @Joe via IRC, it seems the host should be added back into DSH automatically, and the check should clear, after the puppet run and repooling, but that has not happened.

All other checks green, but I'd like to know what I've missed so I can return mw hosts to service in future without assistance.

It looks like it's marked as inactive in conftool:

$ confctl select 'name=mw1360.eqiad.wmnet' get
{"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_appserver,service=apache2"}
{"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
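
The output above is one JSON object per line, so the pooled state of each service can be checked mechanically. A rough sketch, assuming the output format stays as pasted here; the helper names are hypothetical and this is not part of conftool itself:

```python
import json


def pooled_states(confctl_output):
    """Return (host, service, pooled) tuples from confctl 'get' output lines."""
    states = []
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # "tags" is a comma-separated key=value string, e.g. "dc=eqiad,service=nginx".
        tags = dict(
            pair.split("=", 1)
            for pair in obj.get("tags", "").split(",")
            if "=" in pair
        )
        for host, attrs in obj.items():
            if host == "tags":
                continue
            states.append((host, tags.get("service"), attrs.get("pooled")))
    return states


def services_in_state(confctl_output, state):
    """(host, service) pairs whose pooled value equals the given state."""
    return [(h, s) for h, s, p in pooled_states(confctl_output) if p == state]


# The two lines returned by the confctl query above.
SAMPLE = """\
{"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_appserver,service=apache2"}
{"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
"""

if __name__ == "__main__":
    for host, service in services_in_state(SAMPLE, "inactive"):
        print(f"{host} {service} is still inactive")
```

Running a check like this after a repool would flag a host whose services were left inactive, which is the state found here.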

Mentioned in SAL (#wikimedia-operations) [2020-09-24T16:26:22Z] <robh> properly pooled mw1360 this time T262151

Ok, all is now green for the host in icinga and it shows in pooled/in service state.