
rack/setup/install lvs101[3-6]
Open, Normal, Public

Description

This task will track the racking and setup of 4 new lvs systems for eqiad. The hostname sequence will be lvs101[3-6]. These are replacing lvs1001-lvs1006.

Racking proposal: Rack one LVS server per row.

lvs1013:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description only)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run - set to staged in netbox
  • - handoff for service implementation - set to active when service is live

lvs1014:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description only)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run - set to staged in netbox
  • - handoff for service implementation - set to active when service is live

lvs1015:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

lvs1016:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Hostname  Hostport  Switchport               Note
lvs1013   eth0      asw2-a:xe-7/0/34         ready
lvs1013   eth1      asw2-b:xe-4/0/15         vlan only
lvs1013   eth2      asw2-c:xe-4/0/32
lvs1013   eth3      asw2-d:xe-2/0/12         Should be moved to D4 when possible
lvs1014   eth0      asw2-b:xe-7/0/29         ready
lvs1014   eth1      asw2-a:xe-4/0/18         vlan only
lvs1014   eth2      asw2-c:xe-2/0/13
lvs1014   eth3      asw2-d:xe-7/0/4
lvs1015   enp4s0f0  asw2-c-eqiad:xe-7/0/19
lvs1015   enp5s0f0  asw2-a-eqiad:xe-2/0/0
lvs1015   enp4s0f1  asw2-b-eqiad:xe-2/0/3
lvs1015   enp5s0f1  asw2-d-eqiad:xe-2/0/4
lvs1016   eth0      asw2-d:xe-7/0/17
lvs1016   eth1      asw-a:xe-4/1/2
lvs1016   eth2      asw2-b:ge-4/0/34
lvs1016   eth3      asw-c:xe-4/1/0           Need a 10G module in C4
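
One way to sanity-check a port plan like the one in the description is to confirm that every host has a link into each of the four rows. The sketch below is illustrative only: the `PORTS` dictionary is transcribed by hand from the description's table, and the helpers `switch_row` and `missing_rows` are hypothetical names. It assumes the row is the letter that follows `asw-` or `asw2-` in the switch name.

```python
import re

# Port plan transcribed from the task description: host -> {host port: "switch:port"}.
PORTS = {
    "lvs1013": {"eth0": "asw2-a:xe-7/0/34", "eth1": "asw2-b:xe-4/0/15",
                "eth2": "asw2-c:xe-4/0/32", "eth3": "asw2-d:xe-2/0/12"},
    "lvs1014": {"eth0": "asw2-b:xe-7/0/29", "eth1": "asw2-a:xe-4/0/18",
                "eth2": "asw2-c:xe-2/0/13", "eth3": "asw2-d:xe-7/0/4"},
    "lvs1015": {"enp4s0f0": "asw2-c-eqiad:xe-7/0/19", "enp5s0f0": "asw2-a-eqiad:xe-2/0/0",
                "enp4s0f1": "asw2-b-eqiad:xe-2/0/3", "enp5s0f1": "asw2-d-eqiad:xe-2/0/4"},
    "lvs1016": {"eth0": "asw2-d:xe-7/0/17", "eth1": "asw-a:xe-4/1/2",
                "eth2": "asw2-b:ge-4/0/34", "eth3": "asw-c:xe-4/1/0"},
}


def switch_row(switchport):
    """Return the row letter from e.g. 'asw2-c-eqiad:xe-7/0/19' (assumed naming)."""
    match = re.match(r"asw2?-([a-d])(?:-eqiad)?:", switchport)
    if match is None:
        raise ValueError(f"unrecognised switch port: {switchport}")
    return match.group(1)


def missing_rows(host_ports, rows="abcd"):
    """Return the sorted list of rows that none of the host's ports reach."""
    seen = {switch_row(sp) for sp in host_ports.values()}
    return sorted(set(rows) - seen)


for host, host_ports in sorted(PORTS.items()):
    print(host, "missing rows:", missing_rows(host_ports) or "none")
```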

Event Timeline


Are we good for handoff to Traffic for OS-level install/config now on lvs1016?

There is some stuff still missing, like setting the management password; if that's handled, I think I can continue from there.

@Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least.

@Vgutierrez Sorry about that, it was set but I had an extra . in the subnet. Anyway, that is fixed. Also, I am not sure which image you want to install, so I did not set the dhcp file yet. This is the MAC address for eth0: F4:E9:D4:DB:25:40

@Cmjohnson we will go with stretch and raid1-lvm (modules/install_server/files/autoinstall/netboot.cfg). Could you add the production dns entries for lvs1016 as well? Thanks!

Change 429254 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] adding dhcpd and netboot.cfg for lvs1016

https://gerrit.wikimedia.org/r/429254

Change 429254 merged by Cmjohnson:
[operations/puppet@production] adding dhcpd and netboot.cfg for lvs1016

https://gerrit.wikimedia.org/r/429254

@ayounsi Can you create a subnet for LVS for row D, please?


Change 430402 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/dns@master] lvs10[13-16] production DNS entries, all vlans

https://gerrit.wikimedia.org/r/430402

Change 430402 merged by Vgutierrez:
[operations/dns@master] lvs10[13-16] production DNS entries, all vlans

https://gerrit.wikimedia.org/r/430402

Vgutierrez updated the task description. May 3 2018, 6:55 AM
Vgutierrez added a comment (edited). May 3 2018, 4:29 PM

@Cmjohnson, I've been trying to boot lvs1016 with PXE with no luck. After some debugging with @ayounsi, we've seen incoming traffic on eth2 (asw2-b:xe-4/0/34) instead of eth0 (asw2-d:xe-7/0/15) while power cycling the server with PXE boot forced.

Could you check the cables?

From our side it looks like cable #3931 is connected to asw2-d:xe-7/0/15 and cable #4061 to asw2-b:xe-4/0/34, and it should be the other way around.

@Vgutierrez I flipped the cables. I did put the cables into what is on the card labeled port 1 and port 2 but I think the card is inserted upside down on the daughter board. Please let me know if that works.

BBlack added a comment. May 3 2018, 4:44 PM

I don't think it was a flip of the two ports on the same card that was needed, but instead switching all the cables between the two cards (order of cards, not order of ports in each card).

Ok... this is the current picture from what I see:
eth0 is still connected to asw2-b:xe-4/0/34 instead of asw2-d:xe-7/0/15
asw2-c:xe-4/0/5 is showing no link so it must be some problem with cable #3918

Please @Cmjohnson double check that lvs1016 connections match the table posted by @ayounsi above:

lvs1016  eth0/eno1    asw2-d:xe-7/0/15  cable #4061
lvs1016  eth1/eno2    asw2-a:xe-4/0/7   cable #3917
lvs1016  eth2/ens1f0  asw2-b:xe-4/0/34  cable #3931
lvs1016  eth3/ens1f1  asw2-c:xe-4/0/5   cable #3918
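
The comparison being done by hand here can be written down as a small diff between expected and observed attachments. This is only a sketch: `EXPECTED` is copied from the lvs1016 table in this comment, `observed` holds just the one mismatch reported in this comment (eth0 seen on asw2-b:xe-4/0/34), and the function name is made up for illustration.

```python
# Expected attachments for lvs1016, copied from the table in this comment.
EXPECTED = {
    "eth0": "asw2-d:xe-7/0/15",
    "eth1": "asw2-a:xe-4/0/7",
    "eth2": "asw2-b:xe-4/0/34",
    "eth3": "asw2-c:xe-4/0/5",
}


def cabling_mismatches(expected, observed):
    """Return {port: (expected, observed)} for ports seen on the wrong switch port."""
    return {
        port: (expected[port], seen)
        for port, seen in observed.items()
        if port in expected and expected[port] != seen
    }


# Only the observation reported in this comment; other ports are left out.
observed = {"eth0": "asw2-b:xe-4/0/34"}

for port, (want, got) in cabling_mismatches(EXPECTED, observed).items():
    print(f"{port}: expected {want}, observed {got}")
```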
BBlack added a comment. May 3 2018, 5:08 PM

[I still bet if you undo the already-done cable swap, and then switch the two cards' cables (leaving port1/2 ordering the same), this will all magically come out right]

@Vgutierrez I did what bblack suggested and switched the cables to the opposite card. Let's see if the magic works

Awesome, I just confirmed the new interface naming for lvs1016:

  • eth0 -> enp4s0f0
  • eth1 -> enp4s0f1
  • eth2 -> enp5s0f0
  • eth3 -> enp5s0f1

Change 430927 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] lvs: lvs1016.eqiad.wmnet configuration

https://gerrit.wikimedia.org/r/430927

Awesome, I just confirmed the new interface naming for lvs1016:

  • eth0 -> enp4s0f0
  • eth1 -> enp4s0f1
  • eth2 -> enp5s0f0
  • eth3 -> enp5s0f1

Updated switch ports descriptions to reflect those.
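
These names follow the systemd/udev predictable naming scheme, in which `enp<bus>s<slot>f<function>` encodes the PCI location of each port. A minimal sketch of that mapping, assuming PCI domain 0 and only the plain `enpXsYfZ` form (the function name is hypothetical). To my understanding the interface name carries decimal numbers while `lspci` prints hexadecimal, which makes no difference for buses 4 and 5:

```python
import re


def pci_address(ifname):
    """Map a name like 'enp4s0f1' to an lspci-style address like '04:00.1'."""
    match = re.fullmatch(r"enp(\d+)s(\d+)f(\d+)", ifname)
    if match is None:
        raise ValueError(f"not in enp<bus>s<slot>f<function> form: {ifname}")
    bus, slot, function = (int(part) for part in match.groups())
    # lspci prints bus and slot as two hex digits and the function as one.
    return f"{bus:02x}:{slot:02x}.{function:x}"


# Renames confirmed for lvs1016 in this comment.
RENAMES = {
    "eth0": "enp4s0f0",
    "eth1": "enp4s0f1",
    "eth2": "enp5s0f0",
    "eth3": "enp5s0f1",
}

for old, new in RENAMES.items():
    print(f"{old} -> {new} (PCI {pci_address(new)})")
```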

Change 430927 merged by Vgutierrez:
[operations/puppet@production] lvs: lvs1016.eqiad.wmnet configuration

https://gerrit.wikimedia.org/r/430927

Change 430941 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: Set lvs1016 bgp peer address

https://gerrit.wikimedia.org/r/430941

Change 430941 merged by Vgutierrez:
[operations/puppet@production] pybal: Set lvs1016 bgp peer address

https://gerrit.wikimedia.org/r/430941

root@lvs1016:~# ethtool -l enp4s0f0
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	15
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	8

root@lvs1016:~# ethtool -l enp4s0f1
Channel parameters for enp4s0f1:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	15
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	8

root@lvs1016:~# ethtool -l enp5s0f0
Channel parameters for enp5s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16

root@lvs1016:~# ethtool -l enp5s0f1
Channel parameters for enp5s0f1:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16
04:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
	Subsystem: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet
	Flags: bus master, fast devsel, latency 0, IRQ 32, NUMA node 0
	Memory at 93000000 (64-bit, prefetchable) [size=8M]
	Memory at 93800000 (64-bit, prefetchable) [size=8M]
	Memory at 94010000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at 91a00000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Device Serial Number f4-e9-d4-ff-fe-db-25-40
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel
	Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [220] #15
	Capabilities: [300] #19
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x

04:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
	Subsystem: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet
	Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0
	Memory at 92000000 (64-bit, prefetchable) [size=8M]
	Memory at 92800000 (64-bit, prefetchable) [size=8M]
	Memory at 94000000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at 91a80000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Device Serial Number f4-e9-d4-ff-fe-db-25-40
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel
	Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [220] #15
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x

05:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
	Subsystem: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet
	Flags: bus master, fast devsel, latency 0, IRQ 56, NUMA node 0
	Memory at 95800000 (64-bit, prefetchable) [size=8M]
	Memory at 96000000 (64-bit, prefetchable) [size=8M]
	Memory at 96810000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at 91b00000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Device Serial Number f4-e9-d4-ff-fe-cf-26-50
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel
	Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [220] #15
	Capabilities: [300] #19
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x

05:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
	Subsystem: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet
	Flags: bus master, fast devsel, latency 0, IRQ 67, NUMA node 0
	Memory at 94800000 (64-bit, prefetchable) [size=8M]
	Memory at 95000000 (64-bit, prefetchable) [size=8M]
	Memory at 96800000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at 91b80000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Device Serial Number f4-e9-d4-ff-fe-cf-26-50
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel
	Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [220] #15
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x

@BBlack any ideas of what could be causing this difference between enp4s0* and enp5s0*?

BBlack added a comment (edited). May 4 2018, 8:11 PM

The key to the ethtool difference is this in the lspci stuff:
Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
vs
Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-

I rebooted back to BIOS/firmware setup stuff and confirmed the enp4 card ("first" from OS perspective, showing limited MSI-X) has significantly outdated firmware rev numbers vs the enp5 card. Both should be updated to the latest available (or at least, update enp4 to match enp5). Aside from probably fixing the MSI-X IRQ counts issue: there's only one shared driver for the two cards at the OS level, and we've observed before that it doesn't deal well in general with mixed firmware revs like this.
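
To spot this kind of mismatch without reading the full `lspci -v` dump, the MSI-X count can be pulled out per device. The following is a rough sketch that parses text shaped like the lspci output earlier in this task. `msix_counts` is an illustrative helper, not an existing tool, and the sample is trimmed to the relevant lines.

```python
import re

# Trimmed sample in the shape of the `lspci -v` output posted in this task.
SAMPLE = (
    "04:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)\n"
    "\tCapabilities: [a0] MSI-X: Enable+ Count=17 Masked-\n"
    "05:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)\n"
    "\tCapabilities: [a0] MSI-X: Enable+ Count=32 Masked-\n"
)


def msix_counts(lspci_text):
    """Return {pci_address: msix_count} for every device that reports MSI-X."""
    counts = {}
    device = None
    for line in lspci_text.splitlines():
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", line)
        if header:
            device = header.group(1)
            continue
        msix = re.search(r"MSI-X: Enable[+-] Count=(\d+)", line)
        if msix and device is not None:
            counts[device] = int(msix.group(1))
    return counts


counts = msix_counts(SAMPLE)
print(counts)
if len(set(counts.values())) > 1:
    print("MSI-X counts differ between devices")
```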

I left the host back online and out of service, and I put in a downtime in Icinga through Tuesday.

Right, from ethtool:

root@lvs1016:~# ethtool -i enp5s0f0 | grep firmware
firmware-version: bc 7.14.10
root@lvs1016:~# ethtool -i enp4s0f0 |grep firmware
firmware-version: FFV08.07.00 bc 7.13.54

However, checking the lvs2006 interfaces, they're running considerably older firmware but show a proper MSI-X count (32):

vgutierrez@lvs2006:~$ sudo ethtool -i ens1f0 |grep firmware
firmware-version: bc 7.8.79
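
Since the `firmware-version` string has a different shape on each card, comparing the values by eye is error-prone. A small sketch, assuming the bootcode version is the token that follows `bc` (an inference from the three samples above, not from driver documentation). `bootcode_version` is a hypothetical helper.

```python
import re


def bootcode_version(ethtool_line):
    """Extract the version after 'bc' as a tuple of ints, or None if absent."""
    match = re.search(r"\bbc (\d+(?:\.\d+)*)", ethtool_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# The three firmware-version lines quoted above.
SAMPLES = {
    "lvs1016 enp5s0f0": "firmware-version: bc 7.14.10",
    "lvs1016 enp4s0f0": "firmware-version: FFV08.07.00 bc 7.13.54",
    "lvs2006 ens1f0": "firmware-version: bc 7.8.79",
}

# Print oldest first; tuples compare component by component.
for name, line in sorted(SAMPLES.items(), key=lambda item: bootcode_version(item[1])):
    print(name, bootcode_version(line))
```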

Checking the kernel logs, something weird is going on with bnx2x firmware across LVS servers:

root@lvs1016:~# dmesg |grep direct-loading
[   15.744692] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw

Note that the bnx2x @ 0000:05:00.0 is missing here! Same is going on in lvs2006:

vgutierrez@lvs2006:~$ sudo dmesg |grep direct-loading
[   13.879369] bnx2x 0000:03:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw

So, if we believe that bnx2x only loads the firmware on one bnx2x device, then on lvs2006 we have one bnx2x @ 03:00.0 running firmware version 7.13.1.0 and the other @ 04:00.0 running 7.8.79.

This behaviour apparently applies to all our LVS instances running 2 bnx2x adapters:

vgutierrez@neodymium:~$ sudo cumin lvs* 'lspci | grep "Broadcom Limited NetXtreme II BCM57810 10 Gigabit" |wc -l'
30 hosts will be targeted:
(7) lvs[2001-2006].codfw.wmnet,lvs1016.eqiad.wmnet
----- OUTPUT of 'lspci | grep "Br... Gigabit" |wc -l' -----
4
(output omitted)

and in all of them, firmware is direct-loaded once instead of twice:

vgutierrez@neodymium:~$ sudo cumin 'lvs[2001-2006].codfw.wmnet,lvs1016.eqiad.wmnet' 'dmesg | grep direct-loading | wc -l'
7 hosts will be targeted:
lvs[2001-2006].codfw.wmnet,lvs1016.eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(7) lvs[2001-2006].codfw.wmnet,lvs1016.eqiad.wmnet
----- OUTPUT of 'dmesg | grep direct-loading | wc -l' -----
1
================

oh, and it looks like it isn't consistent across reboots:

root@lvs1016:/var/log# grep direct-loading kern.log
May  4 14:34:11 lvs1016 kernel: [    7.013568] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 17:42:24 lvs1016 kernel: [   15.868253] bnx2x 0000:05:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 19:51:54 lvs1016 kernel: [   15.802742] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 20:05:28 lvs1016 kernel: [   15.744692] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw

note that 04:00.0 gets the firmware on three reboots and 05:00.0 only once
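
The same tally can be produced mechanically. A minimal sketch that counts `direct-loading` events per PCI device, using the four kern.log lines above as input (the helper name `firmware_loads` is made up for illustration):

```python
import re
from collections import Counter

# The four kern.log lines quoted above.
KERN_LOG = """\
May  4 14:34:11 lvs1016 kernel: [    7.013568] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 17:42:24 lvs1016 kernel: [   15.868253] bnx2x 0000:05:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 19:51:54 lvs1016 kernel: [   15.802742] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
May  4 20:05:28 lvs1016 kernel: [   15.744692] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
"""


def firmware_loads(log_text):
    """Count bnx2x 'direct-loading firmware' messages per PCI device."""
    pattern = re.compile(r"bnx2x (\S+): firmware: direct-loading firmware")
    return Counter(match.group(1) for match in pattern.finditer(log_text))


loads = firmware_loads(KERN_LOG)
for device, count in sorted(loads.items()):
    print(device, count)
```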

BBlack added a comment (edited). May 5 2018, 10:40 AM

I don't think the firmware versions you see there in ethtool and/or dmesg are the whole story anyways. The ones visible from bios-level setup have like 6 different inter-related version numbers for different things, and even the "main" firmware version number was very different between the two cards. We should probably have @Cmjohnson flash them with latest (or at least, consistent) firmwares.

Note the most-recent boot also has this, on the newer-firmware of the two cards:

[    5.119538] bnx2x 0000:05:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update

Perhaps if runtime is loading 7.13.1.0, our flashing should roughly correspond with that as well (as opposed to using something much-newer, which might require using a much-newer upstream kernel+driver).

I've seen the VPD access failed error in other boxes as well, currently:

===== NODE GROUP =====
(1) lvs2005.codfw.wmnet
----- OUTPUT of 'dmesg | grep VPD' -----
[    2.775622] bnx2x 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
[    2.837438] bnx2x 0000:03:00.0: invalid large VPD tag 10 size at offset 47

Yeah, lvs200x are all HPs as well, so they're "different" in many respects for better or worse, and getting replaced this quarter with something more-like lvs1016.

right, so let's @Cmjohnson update firmware on lvs1016 NICs and I'll check MSI-X status after that :)

@Vgutierrez I updated the firmware on lvs1016

@Cmjohnson I still see the same FW version from ethtool and same MSI-X:

FW version
root@lvs1016:~# ethtool -i enp4s0f0 |grep firmware
firmware-version: FFV08.07.00 bc 7.13.54
MSI-X count
root@lvs1016:~# lspci -v -s 04:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-

Could we have the same FWs in enp4s0f0 and enp5s0f0 @Cmjohnson?

So the MSI-X limit can be changed in the NIC BIOS. It was set to 16 for enp4s0f0; after setting it to 32 and power cycling the server, lspci showed the proper MSI-X count and ethtool -L allowed proper configuration:

MSI-X
vgutierrez@lvs1016:~$ sudo lspci -v -s 04:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
vgutierrez@lvs1016:~$ sudo lspci -v -s 05:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
RPS
vgutierrez@lvs1016:~$ sudo ethtool -l enp4s0f0
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16

vgutierrez@lvs1016:~$ sudo ethtool -l enp5s0f0
Channel parameters for enp5s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16
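
For checking several interfaces at once, the `ethtool -l` text can be reduced to its maximum and current channel counts. This is a sketch under the assumption that the output keeps the two-section layout shown above. `parse_channels` is a hypothetical helper, and the sample is the enp4s0f0 output from this comment.

```python
# `ethtool -l enp4s0f0` output as posted above, after the MSI-X change.
SAMPLE = """\
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       30
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       16
"""


def parse_channels(ethtool_text):
    """Return {'max': {...}, 'current': {...}} from `ethtool -l` style text."""
    result = {"max": {}, "current": {}}
    section = None
    for line in ethtool_text.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            section = "max"
        elif line.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():
                result[section][key.strip()] = int(value)
    return result


parsed = parse_channels(SAMPLE)
print("max combined:", parsed["max"]["Combined"])
print("current combined:", parsed["current"]["Combined"])
```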

Change 432091 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: switch lvs1016 to cr1-eqiad

https://gerrit.wikimedia.org/r/432091

Change 432091 merged by Vgutierrez:
[operations/puppet@production] pybal: switch lvs1016 to cr1-eqiad

https://gerrit.wikimedia.org/r/432091

Mentioned in SAL (#wikimedia-operations) [2018-05-09T15:26:57Z] <vgutierrez> Replacing lvs1003 with lvs1016 - T184293

Change 432102 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: set lvs1016 as primary instead of lvs1003

https://gerrit.wikimedia.org/r/432102

Change 432102 merged by Vgutierrez:
[operations/puppet@production] pybal: set lvs1016 as primary instead of lvs1003

https://gerrit.wikimedia.org/r/432102

Change 432116 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] install_server: Reimage lvs1003 as strech spare system

https://gerrit.wikimedia.org/r/432116

Change 432116 merged by Vgutierrez:
[operations/puppet@production] install_server: Reimage lvs1003 as stretch spare system

https://gerrit.wikimedia.org/r/432116

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs1003.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201805100720_vgutierrez_30370_lvs1003_wikimedia_org.log.

Completed auto-reimage of hosts:

['lvs1003.wikimedia.org']

and were ALL successful.

Vgutierrez updated the task description. May 14 2018, 8:58 AM
Cmjohnson moved this task from Being worked on to Blocked on the ops-eqiad board. May 29 2018, 3:12 PM

For lvs1015, @Cmjohnson can you cable the following?

host     hostport            switch:switchport
lvs1015  enp4s0f0 (primary)  asw2-c-eqiad:xe-7/0/19
lvs1015  enp4s0f1            asw2-b-eqiad:xe-2/0/3
lvs1015  enp5s0f0            asw2-a-eqiad:xe-2/0/0
lvs1015  enp5s0f1            asw2-d-eqiad:xe-2/0/4

RobH closed subtask Unknown Object (Task) as Resolved. May 31 2018, 4:40 PM

@Cmjohnson any updates regarding lvs1015?

238482n375 set Security to Software security bug. Jun 15 2018, 8:04 AM
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
Vgutierrez raised the priority of this task from Lowest to Normal. Jun 15 2018, 9:29 AM
Vgutierrez assigned this task to Cmjohnson.
Vgutierrez changed the visibility from "Custom Policy" to "Public (No Login Required)".
Vgutierrez edited subscribers, added: Aklapper; removed: 238482n375.
Restricted Application added a project: Security. Jun 15 2018, 9:29 AM
Cmjohnson moved this task from Blocked to Being worked on on the ops-eqiad board. Jun 27 2018, 12:48 PM

lvs1015 iDRAC is set up. I think it's cabled correctly, but I am not really sure: enp4s0f1 doesn't translate for me looking at the h/w, but I am pretty sure it matches the port order. I am not sure what you need from here to make it all work. I am attaching the picture of the MAC addresses.

@Cmjohnson take into account that eth0 should be enp4s0f0, not enp4s0f1 :)

BTW, would you mind checking the Ethernet firmware versions and updating them if needed (same as we did with lvs1016)?

RobH added a comment. Jul 10 2018, 4:34 PM

I updated both network cards to the latest firmware. They were very outdated and mismatched (08.07.04 & 14.02.12) in firmware versions. Now both are 14.04.18.
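
A quick way to express "outdated and mismatched" is to compare the dotted versions numerically rather than as strings. The sketch below assumes these firmware versions compare component by component as integers, which is a guess about the vendor's scheme. The version strings are the ones quoted in this comment, and `parse_version` is an illustrative helper.

```python
def parse_version(text):
    """Turn '08.07.04' into (8, 7, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Versions quoted in this comment: the two cards before the update, and the target.
before = ["08.07.04", "14.02.12"]
target = "14.04.18"

print("mismatched before update:", len(set(before)) > 1)
for version in before:
    status = "outdated" if parse_version(version) < parse_version(target) else "current"
    print(version, status)
```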

@ayounsi could you enable lvs1015 network ports? thanks!

Change 445162 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] site: add lvs1015 as spare system

https://gerrit.wikimedia.org/r/445162

mark added a comment. Jul 11 2018, 1:19 PM

@ayounsi could you enable lvs1015 network ports? thanks!

I added lvs1015 to interface-range LVS-balancer on asw2-c-eqiad, and to LVS-cross-row on the other 3 row switches, for the respective ports.

@ayounsi: I did notice two inconsistencies between the switches:

  1. On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on the other switches.
  2. On asw2-c-eqiad, interface-range LVS-balancer explicitly adds the private vlan, whereas on at least asw2-d-eqiad it does not. It probably doesn't matter since it also sets the "native vlan" id for the private vlan, but good to be aware of.
mark added a comment. Jul 11 2018, 1:32 PM
  1. On asw2-c-eqiad, interface-range LVS-balancer explicitly adds the private vlan, whereas on at least asw2-d-eqiad it does not. It probably doesn't matter since it also sets the "native vlan" id for the private vlan, but good to be aware of.

So this didn't seem to work: the "native vlan" setting in the interface-range didn't add the vlan by itself, and I had to add it manually. Now it appears to work:

mark@asw2-c-eqiad# run show ethernet-switching interface xe-7/0/19 
...
Logical          Vlan          TAG     MAC         STP         Logical           Tagging 
interface        members               limit       state       interface flags  
xe-7/0/19.0                            294912                   DN                tagged     
                 public1-c-eqiad 1003  294912      Discarding                     tagged     
                 private1-c-eqiad 1019 294912      Discarding                     untagged

Change 445162 merged by Vgutierrez:
[operations/puppet@production] site: add lvs1015 as spare system

https://gerrit.wikimedia.org/r/445162

From lldpcli everything looks good:

lldpcli show neighbors
root@lvs1015:~# lldpcli show neighbors | egrep "Interface|PortDescr"
Interface:    enp4s0f0, via: LLDP, RID: 1, Time: 0 day, 00:09:21
    PortDescr:    lvs1015:enp4s0f0
Interface:    enp4s0f1, via: LLDP, RID: 2, Time: 0 day, 00:01:00
    PortDescr:    lvs1015:enp4s0f1
Interface:    enp5s0f0, via: LLDP, RID: 3, Time: 0 day, 00:00:48
    PortDescr:    lvs1015:enp5s0f0
Interface:    enp5s0f1, via: LLDP, RID: 4, Time: 0 day, 00:00:38
    PortDescr:    lvs1015:enp5s0f1
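
The "everything looks good" check amounts to each interface seeing a switch port whose description is `<host>:<interface>`. A minimal sketch of that check over the filtered `lldpcli` output above. The two helper names are hypothetical, and the description convention is inferred from this output rather than from any documented standard.

```python
import re

# Filtered lldpcli output as posted above.
LLDP_OUTPUT = """\
Interface:    enp4s0f0, via: LLDP, RID: 1, Time: 0 day, 00:09:21
    PortDescr:    lvs1015:enp4s0f0
Interface:    enp4s0f1, via: LLDP, RID: 2, Time: 0 day, 00:01:00
    PortDescr:    lvs1015:enp4s0f1
Interface:    enp5s0f0, via: LLDP, RID: 3, Time: 0 day, 00:00:48
    PortDescr:    lvs1015:enp5s0f0
Interface:    enp5s0f1, via: LLDP, RID: 4, Time: 0 day, 00:00:38
    PortDescr:    lvs1015:enp5s0f1
"""


def lldp_port_descriptions(text):
    """Return {local interface: neighbour port description}."""
    pairs = {}
    current = None
    for line in text.splitlines():
        iface = re.match(r"\s*Interface:\s+([^,\s]+),", line)
        if iface:
            current = iface.group(1)
            continue
        descr = re.match(r"\s*PortDescr:\s+(\S+)", line)
        if descr and current is not None:
            pairs[current] = descr.group(1)
    return pairs


def mislabelled(pairs, host):
    """Return the interfaces whose neighbour description is not '<host>:<interface>'."""
    return {iface: descr for iface, descr in pairs.items() if descr != f"{host}:{iface}"}


pairs = lldp_port_descriptions(LLDP_OUTPUT)
print("mislabelled ports:", mislabelled(pairs, "lvs1015") or "none")
```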

Thx @Cmjohnson & @mark

Vgutierrez updated the task description. Jul 11 2018, 2:09 PM
ayounsi added a comment (edited). Jul 16 2018, 8:29 PM
  1. On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on the other switches.

When migrating from asw to asw2, the access-ports group made less sense, as it used to apply to <[xg]e-*/0/*> and the uplinks were on xe-*/1/*.
asw2-d kept using the access-ports group by applying it to all the interfaces and then "un-applying" it to the infrastructure group, which contains the uplinks. This makes it more complicated to troubleshoot.
For example, interface-mode access is applied to an LVS interface from the access-ports group, but interface-mode trunk from the interface-range LVS-balancer is also applied and takes precedence.

The way I did it on the other asw2- stacks is to apply the MTU and access-mode directly to the interface-ranges, which means a few duplicated lines but a clearer configuration overall.
I'm planning on making the configuration similar on asw2-d-eqiad when we retrofit that stack with a 3rd 10G member.

Note that codfw has access-port defined but not used, so this should be fixed as well.
EDIT: MTU set for all interface-range

asw2-a/b/c had the proper MTU set for all the interface-ranges except the LVS- ones; I added it, and they are now running with a proper MTU.

Cmjohnson moved this task from Being worked on to Blocked on the ops-eqiad board. Aug 30 2018, 4:52 PM

It looks like the last thing needed for lvs1015 is to connect lvs1015:enp5s0f0 to asw2-a-eqiad:xe-2/0/0 @Cmjohnson

ayounsi updated the task description. Nov 1 2018, 4:56 PM
ayounsi updated the task description. Nov 1 2018, 6:56 PM

In addition, I updated the task's description and included the ports for lvs1013 and lvs1014.

ayounsi updated the task description. Jan 16 2019, 10:13 PM
ayounsi moved this task from Blocked to Backlog on the ops-eqiad board.
Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Jan 30 2019, 9:59 PM

lvs1013 and lvs1014 still need to be connected.

Cmjohnson updated the task description. Tue, Apr 16, 8:43 PM

@ayounsi lvs1013 and lvs1014 on-site work has been completed. I did not add the LVS vlan; I will leave that to you. I still need to run the cross-connects, but the servers can be installed.

ayounsi updated the task description. Tue, Apr 16, 10:18 PM

Vlan is configured for eth0 of those two servers, the port is still showing as down though.
I also configured the vlan for their eth1 port.

RobH updated the task description. Thu, Apr 18, 9:51 PM