Connect or troubleshoot eth1 on labvirt1019 and labvirt1020
Closed, Resolved, Public

Description

labvirt1019 and labvirt1020 are up and running with the right storage setup now, but the interface eth1 is showing NO-CARRIER.

On labvirt1019:

3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:f1 brd ff:ff:ff:ff:ff:ff

On labvirt1020:

3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:e5 brd ff:ff:ff:ff:ff:ff

Main task T193264

Event Timeline

Bstorm created this task. May 18 2018, 3:59 PM
Restricted Application added a project: Operations. May 18 2018, 3:59 PM
Restricted Application added a subscriber: Aklapper.
StjnVMF renamed this task from Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 to unban reguyla. May 18 2018, 5:24 PM
StjnVMF updated the task description.
JJMC89 renamed this task from unban reguyla to Connect or troubleshoot eth1 on labvirt1019 and labvirt1020. May 18 2018, 5:28 PM
JJMC89 updated the task description.

I tried to bring the eth1 interfaces up and no dice. My thought is they are not connected.

Bstorm assigned this task to Cmjohnson. May 18 2018, 6:04 PM

eth1 on both should be connected and configured to be in the cloud-instance-ports interface-range which makes them trunks that pass the instance network.

Change 434737 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Updating labvirt1019 mac

https://gerrit.wikimedia.org/r/434737

Change 434737 merged by Cmjohnson:
[operations/puppet@production] Updating labvirt1019 mac

https://gerrit.wikimedia.org/r/434737

Change 434957 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] labvirt1019 changing dhcpd MAC

https://gerrit.wikimedia.org/r/434957

Change 434957 merged by Cmjohnson:
[operations/puppet@production] labvirt1019 changing dhcpd MAC

https://gerrit.wikimedia.org/r/434957

I have attempted to get the 10G NICs to work but am having zero luck. I am able to enable them in the BIOS and set them to PXE boot. However, if I leave the 1G NICs enabled, the PXE attempt is made only with them. If I disable the 1G network ports, the 10G card does not show up in the boot order. For now, I am going to move them to the new switch with 1G ports.

Bstorm added a comment. (Edited) May 24 2018, 5:20 PM

Huh. The 19 and 20 are both already imaged fully, if that matters. I don't think we want to be stuck unable to re-image, though.

Cmjohnson added a subscriber: ayounsi. (Edited) May 29 2018, 2:44 PM

labvirt1019 is now connected to the new switch and the second ethernet port is connected. @ayounsi can you help get the 2nd port (ge-4/0/33) trunked? Thanks

@ayounsi that goes for labvirt1020 as well, I connected the 2nd port to asw2-b ge-7/0/14

ayounsi added a comment. (Edited) May 29 2018, 2:51 PM

ge-4/0/33 labvirt1019:eth1 moved to the cloud-instance-ports interface-range.

Edit:
ge-7/0/14 labvirt1020:eth1 as well.

@Cmjohnson I currently see labvirt1020 with both ports live, but labvirt1019 shows NO-CARRIER no matter what I do from my end on eth1, still.

chasemp triaged this task as Normal priority. May 31 2018, 2:17 PM

@Bstorm I replaced the cable and the sfp-t just in case and labvirt1019 is in the installer now

Cmjohnson closed this task as Resolved. Jun 5 2018, 5:31 PM

Resolving this, please open again if you're still having issues.

Bstorm reopened this task as Open. Jun 5 2018, 6:22 PM

labvirt1019 still has a dead eth1.

root@labvirt1019:~# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:f0 brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:f1 brd ff:ff:ff:ff:ff:ff

All actions taken in the OS fail to bring it up. Everything is already installed/imaged. It just behaves as though the one port is disconnected.
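For anyone checking this kind of symptom across several hosts, here is a minimal sketch that picks out interfaces which are administratively UP but report NO-CARRIER, i.e. the kernel sees no link signal on the port. It only parses text in the format of the `ip link show` output pasted above; the function name and regex are illustrative assumptions, not an existing tool.

```python
import re

# Sample copied from the `ip link show` output on labvirt1019 in this task.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:f0 brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:e1:71:6d:5a:f1 brd ff:ff:ff:ff:ff:ff
"""

# Matches header lines such as "3: eth1: <FLAG,FLAG,...>" (also "eth1.1102@eth1").
HEADER = re.compile(r"^\d+:\s+([^:@\s]+)[^:]*:\s+<([^>]*)>")


def interfaces_without_carrier(ip_link_output):
    """Return names of interfaces that are admin-UP but flagged NO-CARRIER."""
    dead = []
    for line in ip_link_output.splitlines():
        match = HEADER.match(line)
        if not match:
            continue  # skip the indented link/ether lines
        name = match.group(1)
        flags = set(match.group(2).split(","))
        if "UP" in flags and "NO-CARRIER" in flags:
            dead.append(name)
    return dead


print(interfaces_without_carrier(SAMPLE))  # ['eth1']
```

On a live host the same function could be fed the output of `ip link show` captured with `subprocess.run`; that part is left out so the sketch runs anywhere.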

@Bstorm Oh, sorry, I think I just ran labvirt1019 back through the installer.

Bstorm added a comment. Jun 8 2018, 4:07 PM

I will add that 1G ethernet is likely fine for these servers, being labvirts, unless I was very out of the loop on something. This server does need eth1 to be working, though.

Cmjohnson reassigned this task from Cmjohnson to faidon. Jun 13 2018, 4:54 PM
Cmjohnson added subscribers: faidon, Cmjohnson.

@faidon I disconnected the 1G cables from the switch and plugged the cables into the 10G ports on the server using the Juniper SFP+ cables. The switch ports will need to be reconfigured to use 10G instead of 1G.

@Cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let's look at the current status: could you describe where each of labvirt1019's and labvirt1020's ports are connected to, and specifically to which ports on the switch and with what kind of cable? Thanks!

@faidon,

I had previously tried to connect to 10G and did not have any luck so I ended up connecting the ethernet ports. After the meeting, I disconnected the ethernet ports and connected them via 10G. I just updated the switch port to include xe-4/0/16 and xe-4/0/33. xe-4/0/16 is vlan-cloud-hosts and xe-4/0/33 is not in a vlan until we figure this out.

xe-4/0/16 up up labvirt1019 eth5

OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("network boot" was set to "disabled" for the 10G ports and only set to "network boot" for the first 1G port), but here are the steps I took to troubleshoot for future reference:

  • Live-hacked install1002 to update the DHCP config with the 10G port's MAC address, as this was still pointing to the 1G interface.
  • Attempted to boot with "network boot" (ESC-@ I think) and verified that I couldn't, as I was getting "media check failed" from the Broadcom PXE menu. I was running tcpdump -i any port 67 or port 68 on install1002 simultaneously to grab DHCP requests, but we didn't get that far, as the PXE option ROM wasn't even attempting to do DHCP. This suggested that either the card or cable wasn't working, or more likely that this was the option ROM for a different interface, e.g. one of the 1G ones.
  • Booted into the previously installed system (running Debian) from the console and verified that the port works in Linux. I did that by setting the interface (eno49) as up, then checking the switch on the other end (asw2-b-eqiad:xe-4/0/16) with show interfaces description and show configuration interfaces xe-4/0/16 | display inheritance and verifying that it sees the link as "up up", and that the config is correct. Then I ran ethtool on the system itself, and verified that it sees the link as negotiated/up and with the right speed. Finally, I ran dhclient eno49 there and it worked and got an IP assigned. By all that I verified that both the card and the cable actually work and that the network configuration is correct, and thus the issues were just about PXE.
  • Rebooted and then entered the system config. In the BIOS/Platform config (RBSU) and the PCI interface, I disabled the 4x1G card (Embedded LOM). This is not actually required, but it made things a bit easier to debug as I could figure out e.g. whether the PXE prompt you get is from the 1G card or the 10G card.
  • In the 10G card's configuration, I disabled "HP Shared Memory", per T167299, although I'm not sure if this is actually required anymore. From that task, it sounds like it would affect the network past the PXE stage and in the installer, but I had verified that it works in Linux, so that was probably not needed (but we also don't use these features as far as I know). I also disabled SR-IOV for good measure since we don't use it, although I doubt it would affect this.
  • In the BIOS/Platform config (RBSU), under Network Options > Network Boot Options, the option "Embedded FlexibleLOM 1 Port 1" was set to "Disabled". I set that to "Network boot". This is certainly related and likely the entire cause of this issue.
  • After enabling, you immediately get a warning that says "Important: When enabling network boot support for an Embedded FlexibleLOM embedded NIC, the NIC boot option does not appear in the UEFI Boot Order or Legacy IPL lists until the next system reboot.". So I just did a server reboot after that (easy).
  • After that, I booted normally, hit ESC-@ for network boot and was presented with a PXE prompt; from there on, network boot worked, d-i started loading and also acquired an IP and the preseed configuration. It stopped with an error at a partman prompt (likely because of a misconfigured partman profile, unrelated to all this).

So, it seems like this works now. Steps remaining:

  • Fix the MAC address of these systems in puppet/dhcpd for both of these systems to point to the 10G interfaces' MAC addresses.
  • Test whether "HP Shared Memory" and SR-IOV make a difference; in any case, make sure that both of those systems have the exact same config, no matter what values we end up choosing.
  • Do the same config (disable 4x1G, enable network boot for 10G, reboot) on labvirt1020.
  • Fix the partman config(?)
  • (Re)install those systems :)
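The first remaining step (DHCP still pointing at the 1G MAC) is easy to miss, so here is a rough sketch of a check that compares the `hardware ethernet` entry of a host stanza against the MAC of the port you intend to PXE boot from. It assumes an ISC dhcpd-style stanza; the stanza text, the `fixed-address` value and the 10G MAC below are placeholders for illustration, not the contents of the real puppet files.

```python
import re


def dhcpd_mac_for(host, dhcpd_conf):
    """Return the 'hardware ethernet' MAC of a host stanza (lowercased), or None."""
    pattern = re.compile(
        r"host\s+" + re.escape(host)
        + r"\s*\{[^}]*?hardware\s+ethernet\s+([0-9A-Fa-f:]{17})\s*;"
    )
    match = pattern.search(dhcpd_conf)
    return match.group(1).lower() if match else None


def dhcp_points_at(host, dhcpd_conf, boot_mac):
    """True if the stanza for `host` already lists the MAC we want to boot from."""
    return dhcpd_mac_for(host, dhcpd_conf) == boot_mac.lower()


# Hypothetical stanza. The MAC is the 1G eth0 address from the `ip link show`
# output earlier in this task; the fixed-address value is a placeholder.
CONF = """
host labvirt1019 {
    hardware ethernet 30:E1:71:6D:5A:F0;
    fixed-address labvirt1019.example.org;
}
"""

# Placeholder for the 10G port's MAC, which is not quoted anywhere in this task.
TEN_GIG_MAC = "02:00:00:00:00:01"

print(dhcpd_mac_for("labvirt1019", CONF))
print(dhcp_points_at("labvirt1019", CONF, TEN_GIG_MAC))
```

If the second call prints `False`, the installer would still answer DHCP only for the old interface, which matches what the live hack on install1002 worked around.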

Change 440899 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] Updating labvirt1019 mac

https://gerrit.wikimedia.org/r/440899

Change 440899 merged by Bstorm:
[operations/puppet@production] Updating labvirt1019 mac

https://gerrit.wikimedia.org/r/440899

Running re-install on labvirt1019 to cover changes. Then I'll rebuild the canary instance.

So, the good, we are on 10G Ethernet

[bstorm@labvirt1019]:~ $ sudo ethtool eth0
Settings for eth0:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseT/Full
	                        10000baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: No
	Advertised link modes:  10000baseT/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Speed: 10000Mb/s
	Duplex: Full
	Port: Direct Attach Copper
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: off
	Supports Wake-on: g
	Wake-on: g
	Current message level: 0x00000000 (0)
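As a small aid for repeating this check on labvirt1020, the sketch below turns ethtool-style "Key: value" text into a dict and extracts the link speed in Mb/s. The sample is a shortened copy of the output above; the helper names are made up for this example, and the handling of a non-numeric speed value (which ethtool typically prints when there is no link) is an assumption about how you would want to treat that case.

```python
import re

# Shortened copy of the `ethtool eth0` output pasted above.
SAMPLE = """\
Settings for eth0:
    Supported ports: [ FIBRE ]
    Speed: 10000Mb/s
    Duplex: Full
    Port: Direct Attach Copper
    Auto-negotiation: off
"""


def parse_ethtool(text):
    """Map each 'Key: value' line to a dict entry; lines without a colon are skipped."""
    settings = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # continuation lines such as extra link modes
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def speed_mbps(settings):
    """Return the negotiated speed as an int, or None if it is not a number."""
    match = re.fullmatch(r"(\d+)Mb/s", settings.get("Speed", ""))
    return int(match.group(1)) if match else None


info = parse_ethtool(SAMPLE)
print(speed_mbps(info), info["Port"])  # 10000 Direct Attach Copper
```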

The bad, for some reason, even though eth1 shows up ok as up, the VM on there has no access to the network and is failing at DHCP. That seems more fixable in this state than it was before, though!

Now to give 1020 the 10G ethernet treatment.

I probably shouldn't actually switch things in the bios for 1020 until we confirm it is cabled for that? @Cmjohnson is that good to go from that end?

eth1 is now working fine.

  1. Switch was configured for ge-4/0/33 and not xe-4/0/33
  2. The test IP was configured on eth1.1102 instead of br1102. Moved it and it can now ping other IPs on the same subnet.
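Point 1 is a naming mismatch: on Junos, `ge-` interfaces are Gigabit Ethernet and `xe-` interfaces are 10-Gigabit Ethernet, so configuration left on `ge-4/0/33` does not apply once the port comes up as `xe-4/0/33`. A minimal sketch of a sanity check for that, assuming you already know the link speed from the server side; the function names are illustrative only.

```python
# Junos media-type prefixes and their nominal speeds in Mb/s.
# Only the two prefixes that appear in this task are listed.
JUNOS_PREFIX_MBPS = {"ge": 1000, "xe": 10000}


def prefix_matches_speed(ifname, link_mbps):
    """True if the interface name's prefix is the one used for this link speed."""
    prefix = ifname.split("-", 1)[0]
    return JUNOS_PREFIX_MBPS.get(prefix) == link_mbps


def expected_name(ifname, link_mbps):
    """Return the same FPC/PIC/port position under the prefix for link_mbps."""
    position = ifname.split("-", 1)[1]
    for prefix, mbps in JUNOS_PREFIX_MBPS.items():
        if mbps == link_mbps:
            return prefix + "-" + position
    raise ValueError("no known prefix for %d Mb/s" % link_mbps)


# The switch config referenced ge-4/0/33 while the server linked at 10G.
print(prefix_matches_speed("ge-4/0/33", 10000))  # False
print(expected_name("ge-4/0/33", 10000))         # xe-4/0/33
```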

It looks to me like it is not?

To be clear, looks to me like labvirt1020 is not connected to 10G Ethernet. Labvirt1019 is working perfectly on both interfaces now. I noticed the order of messages above could have been confusing.

I am pretty sure it’s not connected to 10G. I will take care of it next week when I get back from the off site.

labvirt1020

Cabling to 10G ports completed
Switch cfg completed
Bios changes as per Faidon's instructions completed.

@Bstorm you should be okay to finish the rest of the changes for the re-install.

Cmjohnson moved this task from Blocked to Cloud Tasks on the ops-eqiad board. Jun 26 2018, 3:51 PM

Change 442158 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] Change labvirt1020 MAC

https://gerrit.wikimedia.org/r/442158

Change 442158 merged by Bstorm:
[operations/puppet@production] Change labvirt1020 MAC

https://gerrit.wikimedia.org/r/442158

@ayounsi Now I'm seeing the same behavior on labvirt1020 after re-imaging and having it moved to the 10G ethernet (vm cannot access the network). Could you give it the same treatment?

eth1 was in the wrong vlan:

[edit interfaces interface-range cloud-instance-ports]
     member xe-4/0/33 { ... }
+    member xe-7/0/14;
[edit interfaces interface-range vlan-cloud-instances1-b-eqiad]
-    member xe-7/0/14;

@Bstorm can you give it another try?
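To make the change above easier to read at a glance, here is a small sketch that parses a diff in that format and reports which ports moved from one interface-range to another. The diff text is copied from the comment above; the parser is an illustration written for this exact shape of output and is not a general Junos diff parser.

```python
import re

# Copied from the configuration diff in the comment above.
DIFF = """\
[edit interfaces interface-range cloud-instance-ports]
     member xe-4/0/33 { ... }
+    member xe-7/0/14;
[edit interfaces interface-range vlan-cloud-instances1-b-eqiad]
-    member xe-7/0/14;
"""

RANGE = re.compile(r"^\[edit interfaces interface-range (\S+)\]$")
CHANGE = re.compile(r"^([+-])\s+member\s+(\S+?);$")


def member_moves(diff_text):
    """Return {port: (old_range, new_range)} for ports removed from one range and added to another."""
    added, removed = {}, {}
    current = None
    for raw in diff_text.splitlines():
        line = raw.rstrip()
        match = RANGE.match(line)
        if match:
            current = match.group(1)  # remember which range we are inside
            continue
        match = CHANGE.match(line)
        if match and current:
            target = added if match.group(1) == "+" else removed
            target[match.group(2)] = current
    return {port: (removed[port], added[port]) for port in added if port in removed}


print(member_moves(DIFF))
# {'xe-7/0/14': ('vlan-cloud-instances1-b-eqiad', 'cloud-instance-ports')}
```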

Bstorm closed this task as Resolved. Jun 27 2018, 4:09 PM

Looking good! The VM is doing a puppet run. I think the network is working on these things now.

Vvjjkkii renamed this task from Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 to nrcaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed faidon as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from nrcaaaaaaa to Connect or troubleshoot eth1 on labvirt1019 and labvirt1020. Jul 2 2018, 3:13 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to faidon.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description.
CommunityTechBot edited projects, added Cloud-Services; removed Hashtags.
CommunityTechBot added subscribers: gerritbot, Aklapper.