
elastic1017 lost network after reboot
Closed, Resolved (Public)

Description

elastic1017.eqiad.wmnet has lost network after a planned reboot.

When I connect through the management interface, there seems to be no link on eno1 (see below). Could it be a bad cable? @Cmjohnson, could you have a look?

root@elastic1017:~# ethtool eno1
Settings for eno1:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Speed: Unknown!
	Duplex: Unknown! (255)
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: Unknown
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x000000ff (255)
			       drv probe link timer ifdown ifup rx_err tx_err
	Link detected: no
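
For anyone wanting to script this check rather than read the output by eye, here is a minimal sketch that pulls the `Key: value` fields out of `ethtool` output and reports whether a link was detected. The function names (`parse_ethtool`, `link_is_up`, `read_ethtool`) are hypothetical helpers for illustration, and the sample text is a shortened excerpt of the output above.

```python
import subprocess

# Shortened excerpt of the ethtool output pasted in this task.
SAMPLE = """Settings for eno1:
	Supported ports: [ TP ]
	Speed: Unknown!
	Duplex: Unknown! (255)
	Auto-negotiation: on
	Link detected: no
"""


def parse_ethtool(output):
    """Collect 'Key: value' lines; continuation lines without a colon are skipped."""
    settings = {}
    for line in output.splitlines():
        stripped = line.strip()
        if ":" not in stripped or stripped.startswith("Settings for"):
            continue
        key, _, value = stripped.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def link_is_up(output):
    """True only when ethtool reports 'Link detected: yes'."""
    return parse_ethtool(output).get("Link detected") == "yes"


def read_ethtool(interface):
    """Run ethtool on the local host (needs the ethtool binary and usually root)."""
    result = subprocess.run(
        ["ethtool", interface], capture_output=True, text=True, check=True
    )
    return result.stdout


print(link_is_up(SAMPLE))
```

On the host itself, `link_is_up(read_ethtool("eno1"))` would run the same check against live output.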

Event Timeline

Gehel created this task. Aug 14 2019, 10:19 PM
  • I checked the network switch and the port shows up/up, meaning that the link from the server to the network switch is up

ge-3/0/17 up up elastic1017

  • I pulled the power and did a hard reset; that did not resolve the issue
  • I replaced the cable; that did not resolve the issue either
  • I checked the DNS entries and they are correct

At this point, I cannot confirm it's a hardware issue

I will add that this server is out of warranty and would require a motherboard replacement if the NIC is at fault. We typically do not do this after the warranty period; instead, the host should be decommissioned.
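
The notable point in the comment above is the disagreement between the two ends: the switch reports the port as up/up while the host reports no link. A rough sketch of that cross-check follows. It assumes the switch line has the column order interface, admin status, link status, description, which matches the line quoted above; the helper names `parse_switch_port` and `compare_link_state` are made up for illustration.

```python
def parse_switch_port(line):
    """Split a line such as 'ge-3/0/17 up up elastic1017' into named fields.

    Assumed column order: interface, admin status, link status, description.
    """
    interface, admin, link, description = line.split(None, 3)
    return {
        "interface": interface,
        "admin": admin,
        "link": link,
        "description": description.strip(),
    }


def compare_link_state(switch_line, host_link_detected):
    """Describe whether the switch side and the host side agree about the link."""
    port = parse_switch_port(switch_line)
    switch_up = port["admin"] == "up" and port["link"] == "up"
    if switch_up and not host_link_detected:
        return "mismatch: switch reports link up, host reports no link"
    if not switch_up and host_link_detected:
        return "mismatch: host reports link up, switch reports link down"
    if switch_up:
        return "consistent: link up on both sides"
    return "consistent: link down on both sides"


# Values taken from this task: switch shows up/up, ethtool shows no link.
print(compare_link_state("ge-3/0/17 up up elastic1017", False))
```

A mismatch like this does not identify the faulty component by itself; it only narrows the problem to something between the switch port and the host's network stack.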

Gehel closed this task as Resolved. Aug 15 2019, 5:46 PM

@Cmjohnson don't spend more time on it. The server is scheduled for replacement, and the replacement should arrive on August 21. We can live without this server for 2 weeks.

This server is showing up as a stale host in debmonitor and fails in Cumin runs. If the server is dead and won't be fixed, can we start the decom process for it?

Change 538606 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: decommission elastic1017

https://gerrit.wikimedia.org/r/538606

Change 538606 merged by Gehel:
[operations/puppet@production] elasticsearch: decommission elastic1017

https://gerrit.wikimedia.org/r/538606

Gehel added a subscriber: RobH. Mon, Sep 23, 1:07 PM

Steps for decommission of elastic1017:

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove site.pp, replace with role(spare::system)
  • unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps

Remaining steps will be done at the same time as elastic1017-1031