ms-fe2007 NIC failure
Closed, Resolved · Public

Description

Host is off the network. I've depooled it and got this from dmesg on the console:

[16926444.160930] bnx2x: [bnx2x_attn_int_deasserted0:4160(enp5s0f0)]SPIO5 hw attention
[16926444.169490] bnx2x 0000:05:00.0 enp5s0f0: Fan Failure on Network Controller has caused the driver to shutdown the card to prevent permanent damage.
                  Please contact OEM Support for assistance

@Papaul I haven't rebooted the host or otherwise touched it, in case that's helpful with diagnosis

Details

Related Gerrit Patches:
operations/puppet : production · DHCP: Change MAC address for ms-fe2007

Related Objects

Status: Resolved · Assigned: Papaul

Event Timeline

Restricted Application added a project: Operations. Dec 4 2019, 12:32 PM
Restricted Application added a subscriber: Aklapper.
fgiunchedi renamed this task from ms-fe2007 nic failure to ms-fe2007 NIC failure. Dec 4 2019, 3:11 PM
Papaul added a comment. Dec 4 2019, 5:29 PM

@fgiunchedi the 10G NIC is dead.

Option 1: replace the server with another server
https://netbox.wikimedia.org/dcim/devices/1099/
Option 2: buy another 10G NIC

Follow-up from IRC: we can live with three ms-fe hosts in codfw while the 10G replacement is ordered, so let's go with option 2.

RobH added a subtask: Unknown Object (Task). Dec 4 2019, 5:47 PM
colewhite triaged this task as Medium priority. Dec 5 2019, 5:59 PM
RobH added a subscriber: RobH. Dec 11 2019, 10:25 PM

As this host is down and not calling into Puppet, it's been set to 'failed' in Netbox. Please set it back to active when it's working again.

Papaul closed subtask Unknown Object (Task) as Resolved. Dec 20 2019, 7:25 PM

Change 560409 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address for ms-fe2007

https://gerrit.wikimedia.org/r/560409

@fgiunchedi NIC replaced.
New MAC address: F4:E9:D4:95:61:40

Change 560409 merged by Filippo Giunchedi:
[operations/puppet@production] DHCP: Change MAC address for ms-fe2007

https://gerrit.wikimedia.org/r/560409

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

ms-fe2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202001020923_filippo_73650_ms-fe2007_codfw_wmnet.log.

@Papaul thanks! I tried to reimage the host, but the BIOS shows the NIC's link as offline:

Link Status                                           <Disconnected>         *

Completed auto-reimage of hosts:

['ms-fe2007.codfw.wmnet']

Of which those FAILED:

['ms-fe2007.codfw.wmnet']

However, the host does come back online on the network with the link up, so I'll try reimaging once more.

I configured both ports to use PXE when booting, and the host is now running the reimage correctly:

NIC in Slot 2 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:95:61:40
Main Configuration Page > NIC Configuration

QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:95:61:40
Legacy Boot Protocol                                  <PXE>

fgiunchedi closed this task as Resolved. Thu, Jan 2, 11:10 AM

Host is back in service!