
labnet1003 and labnet1004 moving and enabling 10G NICs
Closed, ResolvedPublic

Description

These two servers are both routers/gateways for the upcoming Neutron-based topology for Cloud Services. They will essentially mimic labnet100[12]'s functionality, with the added bonus of a connection for overlay networks.

Gist:

These should have 10G NICs in them, but the NICs seem to be disabled(?). Hopefully enabling them is all we need. In addition, labnet1004 is in a 1G rack and needs to be moved to a 10G rack. Thank you @Cmjohnson for helping me figure this out!

Event Timeline

Restricted Application added a subscriber: Aklapper.
chasemp renamed this task from labnet1003 and labnet1004 extra NIC connections to labnet1003 and labnet1004 moving and enabling 10G NICs. Apr 26 2018, 7:07 PM
chasemp updated the task description.

@chasemp Confirmed both are 10G with 2 NICs. labnet1004 can go to B2; I do not currently have any labnet server in that rack. labnet1001 is in B3 and labnet1002 is in B4. FYI, all of these servers will be moving to new switches soon. labnet1003 will need to move to B2/B4 or B7.

I am thinking this:
labnet1001 moves from B3 to B2
labnet1002 stays in B4
labnet1003 stays in B7
labnet1004 moves to B2

Let me know if that works for you and let's schedule the move sooner rather than later. Thanks

labnet1001 moves from B3 to B2

Ack, let me talk to @Andrew about failing active traffic over to labnet1002 so we can move labnet1001. We have known this is coming outside of this task but haven't had time to talk it through.

labnet1004 moves to B2

sounds good. anytime is great.

labnet1003 stays in B7

Great, this can be shutdown/rebooted/whatever to enable 10G here anytime this week or next. A brief heads up is about all I need atm.

@Cmjohnson, I propose to fail over to labnet1002 on May 8th (Tuesday) and switch back to 1001 on May 15th (also a Tuesday). Can you commit to re-racking labnet1001 sometime on Wednesday, Thursday or Friday next week?

@Cmjohnson, sorry, now we're talking about doing this all in one day. Could you be available for a specific appointment (probably around 1PM) to re-rack this box on either Friday May 11th or Tuesday May 15th? The 15th is mildly better but if either one works for you that'd be great.

chasemp added a parent task: Restricted Task. May 2 2018, 2:01 PM

@Andrew @chasemp I am on vacation 5/11, so let's plan for 5/15. I am available anytime.

@Cmjohnson Just to clarify, labnet1003 and 1004 aren't in active service so you can move them anytime; just check in with @chasemp after they're moved to make sure he's happy.

ping :) I know there is much shuffling happening, but it would be useful if this could happen sometime this week.

labnet1003 is already in B7; the 10G ports are connected to the new switch and the port descriptions were updated. @ayounsi will need to configure the VLANs and any other labs-related networking.

asw2-b-eqiad xe-7/0/9
asw2-b-eqiad xe-7/0/19 (second port)

labnet1004 has been moved to B4 and connected
asw2-b-eqiad xe-4/0/3
asw2-b-eqiad xe-4/0/46 (second port)

Also worth noting: I have not been able to get the 10G ports to PXE boot on HPs. These will still need the DHCP file changed, and probably some BIOS manipulation to boot from the 10G ports rather than the GigE ports.

asw2-b-eqiad xe-7/0/9 and xe-4/0/3 moved to group "vlan-cloud-hosts1-b-eqiad"
asw2-b-eqiad xe-7/0/19 and xe-4/0/46 moved to group "vlan-cloud-instances1-b-eqiad"
I did that based on other hosts, let me know if it needs to change.

Once the host is up and the interface names are identified, please update the switch port descriptions.

labnet1003 currently lists eth0 to eth3, which are all 1G copper ports (checked with ip link, sudo ethtool eth3, etc.).

My understanding of the current situation:

  • Currently labnet1003 only shows eth0 connected (should be both eth0 and eth1 if 1G), and labnet1004 is not available (switched ports?)
  • The 10G nics in labnet100[34] are not working when they are hooked up. @Cmjohnson has tried to move over to them a few times and no go.
  • The servers are moved into the appropriate racks so they should not have to be physically moved again based on current switch layout
  • Because the 10G ports here are not working, I asked @Cmjohnson to hook up the 1G ports for now so we can start moving forward with configuration; it should be a small bit of work to move back to 10G from the OpenStack perspective when we can

The cabling has been fixed. Both servers are now connected to asw2-b-eqiad. They are ready for install.

ge-7/0/9   up  up  labnet1003 eth0  cloud-hosts1-b-eqiad
ge-7/0/19  up  up  labnet1003 eth1  cloud-instances1-b-eqiad

ge-4/0/3   up  up  labnet1004 eth0  cloud-hosts1-b-eqiad
ge-4/0/46  up  up  labnet1004 eth1  cloud-instances1-b-eqiad
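
For reference, a rough sketch of how the four port status lines above could be checked from a script rather than by eye. This is only an illustration: the column meanings (interface, admin state, link state, host, NIC, VLAN) are my assumption about the pasted output, and PortStatus, parse_port_line and all_up are made-up names, not part of any existing tooling.

```python
from typing import NamedTuple


class PortStatus(NamedTuple):
    # Assumed column order of the pasted switch output.
    interface: str
    admin: str
    link: str
    host: str
    nic: str
    vlan: str


def parse_port_line(line: str) -> PortStatus:
    """Split one status line into fields, ignoring a stray '.' separator."""
    fields = [tok for tok in line.split() if tok != "."]
    if len(fields) != 6:
        raise ValueError(f"unexpected field count in: {line!r}")
    return PortStatus(*fields)


def all_up(lines) -> bool:
    """Return True when every non-empty line reports admin and link 'up'."""
    ports = [parse_port_line(line) for line in lines if line.strip()]
    return all(p.admin == "up" and p.link == "up" for p in ports)


# The four lines from the comment above.
LINES = """\
ge-7/0/9 up up labnet1003 eth0 cloud-hosts1-b-eqiad
ge-7/0/19 up up labnet1003 eth1 cloud-instances1-b-eqiad
ge-4/0/3 up up labnet1004 eth0 cloud-hosts1-b-eqiad
ge-4/0/46 up up labnet1004 eth1 cloud-instances1-b-eqiad
"""

print(all_up(LINES.splitlines()))
```

Grouping the parsed records by host would also make it easy to confirm that each server has one port in cloud-hosts1-b-eqiad and one in cloud-instances1-b-eqiad.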

labnet1003

cabled
Updated BIOS per Faidon's instructions
Updated the switch cfg
xe-7/0/9 up up labnet1003 eth0
xe-7/0/19 up up labnet1003 eth1

labnet1004

cabled
bios updated
switch cfg updated
xe-4/0/3 up up labnet1004 eth0
xe-4/0/46 up up labnet1004 eth1

Change 442290 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: labnet100[34] 10g nic for install_server

https://gerrit.wikimedia.org/r/442290

Change 442290 merged by Rush:
[operations/puppet@production] openstack: labnet100[34] 10g nic for install_server

https://gerrit.wikimedia.org/r/442290

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271252_rush_26039_labnet1003_eqiad_wmnet.log.

Attempting Boot From NIC

QLogic UNDI PXE-2.1 v7.14.5
Copyright (C) 2016 QLogic Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: E0 07 1B EF F5 98  GUID: 32353537-3835-584D-5137-323130363838
CLIENT IP: 10.64.20.35  MASK: 255.255.255.0  DHCP IP: 208.80.154.22
GATEWAY IP: 10.64.20.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32

That is the correct MAC for labnet1003 as specified in the installer module: hardware ethernet e0:07:1b:ef:f5:98;
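
Since the PXE ROM prints the MAC as space-separated uppercase octets while the DHCP entry uses lowercase with colons, a comparison like the one above is easy to get wrong by eye. A minimal sketch of normalizing both forms before comparing; normalize_mac is a hypothetical helper written for this note, not something from the installer module.

```python
import re


def normalize_mac(raw: str) -> str:
    """Return a 48-bit MAC address as lowercase colon-separated octets."""
    # Drop the usual separators (spaces, colons, dots, dashes).
    digits = re.sub(r"[\s:.\-]", "", raw)
    if not re.fullmatch(r"[0-9A-Fa-f]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {raw!r}")
    digits = digits.lower()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


pxe_mac = normalize_mac("E0 07 1B EF F5 98")    # as printed by the PXE ROM
dhcp_mac = normalize_mac("e0:07:1b:ef:f5:98")   # as written in the DHCP entry
print(pxe_mac == dhcp_mac)
```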

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271320_rush_31996_labnet1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

Of which those FAILED:

['labnet1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271323_rush_32577_labnet1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

Of which those FAILED:

['labnet1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271323_rush_32632_labnet1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

Of which those FAILED:

['labnet1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271326_rush_829_labnet1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

Of which those FAILED:

['labnet1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271328_rush_1233_labnet1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['labnet1003.eqiad.wmnet']

Of which those FAILED:

['labnet1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271450_rush_17283_labnet1004_eqiad_wmnet.log.

@chasemp

labnet1004: the cable in eth4 is connected to the correct port, and according to the BIOS the MAC address is E0:07:1B:EF:15:D0, which is the one attempting to hit the installer.

I verified that the MAC address belongs to the port Embedded FlexibleLOM 1 Port 1 : HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter - CNA
I confirmed the switch interface is connected to LOM1 Port 1:
cmjohnson@asw2-b-eqiad> show ethernet-switching table | match E0:07:1B:EF:15:D0

cloud-hosts1-b-eqiad e0:07:1b:ef:15:d0  D             -   xe-4/0/3.0

This is the boot order in the BIOS:

Standard Boot Order(IPL)

CD ROM/DVD
USB DriveKey
Hard Drive C: (see Boot Controller Order)
Embedded FlexibleLOM 1 Port 1 : HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter - NIC
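
The switching-table lookup in this comment can be sketched in a few lines if the table output is saved to a file or captured some other way. This is an illustration only: find_port is a made-up name, and the column layout (VLAN name, MAC, flags, age, logical interface) is an assumption based on the single row pasted above.

```python
def find_port(table_output: str, mac: str):
    """Return (vlan, logical_interface) for the first row matching mac, else None."""
    wanted = mac.lower()
    for line in table_output.splitlines():
        fields = line.split()
        # Assumed layout: VLAN name first, MAC second, interface last.
        if len(fields) >= 3 and fields[1].lower() == wanted:
            return fields[0], fields[-1]
    return None


# The row pasted from asw2-b-eqiad above.
sample = "cloud-hosts1-b-eqiad e0:07:1b:ef:15:d0  D             -   xe-4/0/3.0"

print(find_port(sample, "E0:07:1B:EF:15:D0"))
```

The match is case-insensitive because the BIOS shows the address in uppercase while the switch prints it in lowercase.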

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271528_rush_25102_labnet1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1004.eqiad.wmnet']

Of which those FAILED:

['labnet1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts:

labnet1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201806271528_rush_25137_labnet1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labnet1004.eqiad.wmnet']

Of which those FAILED:

['labnet1004.eqiad.wmnet']

Completed auto-reimage of hosts:

['labnet1004.eqiad.wmnet']

Of which those FAILED:

['labnet1004.eqiad.wmnet']

We moved past the DHCP/NIC issue and are now failing with:

Loading Linux 4.9.0-0.bpo.6-amd64 ...
Loading initial ramdisk ...
[    0.125572] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Loading, please wait...
mdadm: No devices listed in conf file were found.
Gave up waiting for root device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  /dev/disk/by-uuid/9043aead-964d-4489-ae97-9d154e128577 does not exist.  Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep


BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs)

Dug up an old task that said rootdelay is the way to address this in jessie, with permanent fixes having landed in stretch. So far that seems to have corrected the issue. Both labnet1003 and labnet1004 are now running jessie using 10G NICs. I need to get a bit further to confirm the instance-facing interface works, but I'm considering this closed until I know better ;)
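
For anyone hitting the same "Gave up waiting for root device" prompt: the kernel's rootdelay= parameter makes it pause for the given number of seconds before trying to mount the root filesystem. A small sketch for checking whether a command line carries it; get_rootdelay is a hypothetical helper, and the example command line is assembled for illustration (the kernel version and UUID come from the boot output above, the image path and the value 10 are assumptions, not the values actually used).

```python
def get_rootdelay(cmdline: str):
    """Return the rootdelay value in seconds from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "rootdelay" and sep and value.isdigit():
            return int(value)
    return None


# Hypothetical command line: image path and delay value are examples only.
example = (
    "BOOT_IMAGE=/vmlinuz-4.9.0-0.bpo.6-amd64 "
    "root=UUID=9043aead-964d-4489-ae97-9d154e128577 ro rootdelay=10"
)

print(get_rootdelay(example))
```

On a running host the same function can be fed the contents of /proc/cmdline, which is the first thing the initramfs error message suggests checking.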

Vvjjkkii renamed this task from labnet1003 and labnet1004 moving and enabling 10G NICs to r4daaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: Cmjohnson; removed: gerritbot, Aklapper.
Yann renamed this task from r4daaaaaaa to labnet1003 and labnet1004 moving and enabling 10G NICs. Jul 1 2018, 1:32 PM
Yann closed this task as Resolved.
Yann assigned this task to Cmjohnson.
Yann lowered the priority of this task from High to Medium.
Yann updated the task description.
Yann edited subscribers, added: gerritbot, Aklapper; removed: Cmjohnson.