mw1360 was reported down by Icinga. The serial console was not available at first, so I had to run an IPMI mc cold reset from cumin1001 before I could log in. From the console, it seems to me that the NIC on the host is not working as expected: only the loopback interfaces are present, and ifup cannot find eno1:
root@mw1360:~# ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1 (Local Loopback)
        RX packets 2 bytes 100 (100.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2 bytes 100 (100.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo:LVS: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 10.2.2.22 netmask 255.255.255.255
        loop txqueuelen 1 (Local Loopback)

root@mw1360:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug eno1
iface eno1 inet static
    address 10.64.48.202/22
    gateway 10.64.48.1
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 10.3.0.1
    dns-search eqiad.wmnet
    pre-up /sbin/ip token set ::10:64:48:202 dev eno1
    up ip addr add 2620:0:861:107:10:64:48:202/64 dev eno1

root@mw1360:~# ifup eno1
Error: argument "eno1" is wrong: dev is invalid
ifup: failed to bring up eno1
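The "dev is invalid" error above is consistent with the kernel not exposing an eno1 device at all. As a rough way to confirm that from the console, here is a minimal Python sketch that compares a list of expected interface names against what is listed under /sys/class/net. The helper name `missing_interfaces` and the `sysfs_net` parameter are made up for illustration and are not part of any existing tooling; the sketch only reports which names are absent and says nothing about why.

```python
import os
import tempfile


def missing_interfaces(expected, sysfs_net="/sys/class/net"):
    """Return the expected interface names that are not listed in sysfs.

    `sysfs_net` is a parameter only so the check can be tried against a
    fake directory; on a real host the default path is what you want.
    """
    if os.path.isdir(sysfs_net):
        present = set(os.listdir(sysfs_net))
    else:
        # No sysfs directory at all: treat every interface as missing.
        present = set()
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Simulate the state seen on mw1360: only "lo" exists, "eno1" does not.
    with tempfile.TemporaryDirectory() as fake_sysfs:
        os.mkdir(os.path.join(fake_sysfs, "lo"))
        print(missing_interfaces(["lo", "eno1"], fake_sysfs))
```

On the affected host one would call `missing_interfaces(["eno1"])` with the default path; a non-empty result means the problem is below the network configuration layer (driver, PCI enumeration, or hardware), which matches what the later hardware checks point to.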
- updated BIOS and iDRAC to the newest firmware revisions
- NIC is enabled in the BIOS integrated peripherals settings but does not show up in the PCI devices section of the support report (where it normally would)
- NIC shows an error message in the iDRAC inventory: RAC1021: NIC objects are not available in the current system configuration. Make sure the NIC devices are correctly installed in the system and retry the operation after the Collect System Inventory On Restart (CSIOR) feature has updated the system inventory. If the issue persists, contact your service provider.
- support collection report generated after all the above troubleshooting:
- suggested we fully drain power on-site and then power it back up, to see if that resets the mainboard issue, before opening a self dispatch for a new mainboard.