
rack/setup/install lvs101[3-6]
Closed, Resolved · Public

Description

This task will track the racking and setup of 4 new lvs systems for eqiad. The hostname sequence will be lvs101[3-6]. These are replacing lvs1001-lvs1006.

Racking proposal: Rack one LVS server per row.

lvs1013:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description only)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run - set to staged in netbox
  • - handoff for service implementation - set to active when service is live

lvs1014:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)

  • - bios/drac/serial setup/testing

  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description only)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run - set to staged in netbox
  • - handoff for service implementation - set to active when service is live

lvs1015:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

lvs1016:

  • - receive in system on procurement task T181419
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

| Hostname | Hostport | Switchport | Note |
| --- | --- | --- | --- |
| lvs1013 | eth0 | asw2-a:xe-7/0/34 | ready |
| lvs1013 | eth1 | asw2-b:xe-4/0/15 | vlan only |
| lvs1013 | eth2 | asw2-c:xe-4/0/32 | |
| lvs1013 | eth3 | asw2-d:xe-2/0/12 | Should be moved to D4 when possible |
| lvs1014 | eth0 | asw2-b:xe-7/0/29 | ready |
| lvs1014 | eth1 | asw2-a:xe-4/0/18 | vlan only |
| lvs1014 | eth2 | asw2-c:xe-2/0/13 | |
| lvs1014 | eth3 | asw2-d:xe-7/0/4 | |
| lvs1015 | enp4s0f0 | asw2-c-eqiad:xe-7/0/19 | |
| lvs1015 | enp5s0f0 | asw2-a-eqiad:xe-2/0/0 | |
| lvs1015 | enp4s0f1 | asw2-b-eqiad:xe-2/0/3 | |
| lvs1015 | enp5s0f1 | asw2-d-eqiad:xe-2/0/4 | |
| lvs1016 | eth0 | asw2-d:xe-7/0/17 | |
| lvs1016 | eth1 | asw-a:xe-4/1/2 | |
| lvs1016 | eth2 | asw2-b:ge-4/0/34 | |
| lvs1016 | eth3 | asw-c:xe-4/1/0 | Need a 10G module in C4 |
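
The port assignments in the description can be sanity-checked mechanically. The following is only a minimal sketch, not tooling used in this task. It assumes the usual Junos `type-fpc/pic/port` interface naming (`xe-` for 10G, `ge-` for 1G) and that the letter after `asw-`/`asw2-` identifies the row. The helper names `parse_switchport`, `rows_for_host` and `non_10g_ports` are made up for illustration.

```python
import re
from collections import defaultdict

# Rows copied from the task description: (host, host port, switch port).
PORTS = [
    ("lvs1013", "eth0", "asw2-a:xe-7/0/34"),
    ("lvs1013", "eth1", "asw2-b:xe-4/0/15"),
    ("lvs1013", "eth2", "asw2-c:xe-4/0/32"),
    ("lvs1013", "eth3", "asw2-d:xe-2/0/12"),
    ("lvs1014", "eth0", "asw2-b:xe-7/0/29"),
    ("lvs1014", "eth1", "asw2-a:xe-4/0/18"),
    ("lvs1014", "eth2", "asw2-c:xe-2/0/13"),
    ("lvs1014", "eth3", "asw2-d:xe-7/0/4"),
    ("lvs1015", "enp4s0f0", "asw2-c-eqiad:xe-7/0/19"),
    ("lvs1015", "enp5s0f0", "asw2-a-eqiad:xe-2/0/0"),
    ("lvs1015", "enp4s0f1", "asw2-b-eqiad:xe-2/0/3"),
    ("lvs1015", "enp5s0f1", "asw2-d-eqiad:xe-2/0/4"),
    ("lvs1016", "eth0", "asw2-d:xe-7/0/17"),
    ("lvs1016", "eth1", "asw-a:xe-4/1/2"),
    ("lvs1016", "eth2", "asw2-b:ge-4/0/34"),
    ("lvs1016", "eth3", "asw-c:xe-4/1/0"),
]

# Assumed layout: "<stack>:<type>-<fpc>/<pic>/<port>", with the row letter
# directly after "asw-" or "asw2-" and an optional "-eqiad" suffix.
SWITCHPORT_RE = re.compile(
    r"^(?P<stack>asw2?-(?P<row>[a-d])(?:-eqiad)?):"
    r"(?P<type>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$"
)


def parse_switchport(text):
    """Split a 'switch:interface' string into its named parts."""
    match = SWITCHPORT_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised switch port: {text!r}")
    return match.groupdict()


def rows_for_host(ports):
    """Map each host to the set of rows its ports land in."""
    rows = defaultdict(set)
    for host, _hostport, switchport in ports:
        rows[host].add(parse_switchport(switchport)["row"])
    return dict(rows)


def non_10g_ports(ports):
    """Return the entries whose switch interface is not an xe- (10G) port."""
    return [p for p in ports if parse_switchport(p[2])["type"] != "xe"]


if __name__ == "__main__":
    for host, rows in sorted(rows_for_host(PORTS).items()):
        print(host, "rows:", "".join(sorted(rows)))
    for entry in non_10g_ports(PORTS):
        print("not 10G:", entry)
```

A check like this only validates what is written down; it says nothing about how the cables are actually patched.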

Event Timeline


Right, so let's have @Cmjohnson update the firmware on the lvs1016 NICs, and I'll check the MSI-X status after that :)

@Vgutierrez I updated the firmware on lvs1016.

@Cmjohnson I still see the same FW version from ethtool and same MSI-X:

FW version
root@lvs1016:~# ethtool -i enp4s0f0 |grep firmware
firmware-version: FFV08.07.00 bc 7.13.54
MSI-X count
root@lvs1016:~# lspci -v -s 04:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-

Could we have the same FWs in enp4s0f0 and enp5s0f0 @Cmjohnson?

So the MSI-X limit can be changed in the NIC BIOS. It was set to 16 for enp4s0f0; after setting it to 32 and power cycling the server, lspci showed the proper MSI-X count and ethtool -L allowed proper configuration:

MSI-X
vgutierrez@lvs1016:~$ sudo lspci -v -s 04:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
vgutierrez@lvs1016:~$ sudo lspci -v -s 05:00.0 |grep MSI-X
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
RPS
vgutierrez@lvs1016:~$ sudo ethtool -l enp4s0f0
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16

vgutierrez@lvs1016:~$ sudo ethtool -l enp5s0f0
Channel parameters for enp5s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	30
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	16
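
For checking more hosts, the two values compared above can be pulled out of captured command output. This is a rough sketch under assumptions: it only handles text shaped like the `lspci -v` and `ethtool -l` output pasted in this task, the function names are hypothetical, and it does not try to model how a given driver maps MSI-X vectors to channels.

```python
import re


def msix_count(lspci_text):
    """Return the MSI-X vector count from `lspci -v` output, or None if absent."""
    match = re.search(r"MSI-X: Enable[+-] Count=(\d+)", lspci_text)
    return int(match.group(1)) if match else None


def channel_settings(ethtool_text):
    """Parse `ethtool -l` output into {"max": {...}, "current": {...}}."""
    result = {"max": {}, "current": {}}
    section = None
    for line in ethtool_text.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            section = "max"
        elif line.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            # Skip anything that is not a plain number.
            if value.isdigit():
                result[section][key.strip()] = int(value)
    return result


# Sample text copied from the output pasted in this task.
LSPCI_SAMPLE = "\tCapabilities: [a0] MSI-X: Enable+ Count=32 Masked-\n"
ETHTOOL_SAMPLE = """Channel parameters for enp4s0f0:
Pre-set maximums:
RX:\t\t0
TX:\t\t0
Other:\t\t0
Combined:\t30
Current hardware settings:
RX:\t\t0
TX:\t\t0
Other:\t\t0
Combined:\t16
"""

if __name__ == "__main__":
    settings = channel_settings(ETHTOOL_SAMPLE)
    print("MSI-X vectors:", msix_count(LSPCI_SAMPLE))
    print("combined channels (max):", settings["max"]["Combined"])
    print("combined channels (current):", settings["current"]["Combined"])
```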

Change 432091 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: switch lvs1016 to cr1-eqiad

https://gerrit.wikimedia.org/r/432091

Change 432091 merged by Vgutierrez:
[operations/puppet@production] pybal: switch lvs1016 to cr1-eqiad

https://gerrit.wikimedia.org/r/432091

Mentioned in SAL (#wikimedia-operations) [2018-05-09T15:26:57Z] <vgutierrez> Replacing lvs1003 with lvs1016 - T184293

Change 432102 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: set lvs1016 as primary instead of lvs1003

https://gerrit.wikimedia.org/r/432102

Change 432102 merged by Vgutierrez:
[operations/puppet@production] pybal: set lvs1016 as primary instead of lvs1003

https://gerrit.wikimedia.org/r/432102

Change 432116 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] install_server: Reimage lvs1003 as stretch spare system

https://gerrit.wikimedia.org/r/432116

Change 432116 merged by Vgutierrez:
[operations/puppet@production] install_server: Reimage lvs1003 as stretch spare system

https://gerrit.wikimedia.org/r/432116

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs1003.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201805100720_vgutierrez_30370_lvs1003_wikimedia_org.log.

Completed auto-reimage of hosts:

['lvs1003.wikimedia.org']

and were ALL successful.

Vgutierrez updated the task description. May 14 2018, 8:58 AM

For lvs1015, @Cmjohnson can you cable the following?

| host | hostport | switch:switchport |
| --- | --- | --- |
| lvs1015 | enp4s0f0 (primary) | asw2-c-eqiad:xe-7/0/19 |
| lvs1015 | enp4s0f1 | asw2-b-eqiad:xe-2/0/3 |
| lvs1015 | enp5s0f0 | asw2-a-eqiad:xe-2/0/0 |
| lvs1015 | enp5s0f1 | asw2-d-eqiad:xe-2/0/4 |

RobH closed subtask Unknown Object (Task) as Resolved. May 31 2018, 4:40 PM

@Cmjohnson any updates regarding lvs1015?

238482n375 set Security to Software security bug. Jun 15 2018, 8:04 AM
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
This comment was removed by Vgutierrez.
Vgutierrez raised the priority of this task from Lowest to Normal.
Vgutierrez changed the visibility from "Custom Policy" to "Public (No Login Required)".
Vgutierrez edited subscribers, added: Aklapper; removed: 238482n375.
Restricted Application added a project: Security. Jun 15 2018, 9:29 AM

lvs1015 iDRAC is set up. I think it's cabled correctly but I am not really sure: enp4s0f1 doesn't translate for me when looking at the hardware, but I am pretty sure it matches the port order. I am not sure what you need from here to make it all work. I am attaching a picture of the MAC addresses.

@Cmjohnson take into account that eth0 should be enp4s0f0, not enp4s0f1 :)

BTW, would you mind checking the Ethernet firmware versions and updating them if needed (same as we did with lvs1016)?

RobH added a comment. Jul 10 2018, 4:34 PM

I updated both network cards to the latest firmware. They were very outdated and mismatched (08.07.04 & 14.02.12) in firmware versions. Now both are 14.04.18.

@ayounsi could you enable lvs1015 network ports? thanks!

Change 445162 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] site: add lvs1015 as spare system

https://gerrit.wikimedia.org/r/445162

mark added a comment. Jul 11 2018, 1:19 PM

> @ayounsi could you enable lvs1015 network ports? thanks!

I added lvs1015 to interface-range LVS-balancer on asw2-c-eqiad, and to LVS-cross-row on the other 3 row switches, for the respective ports.

@ayounsi: I did notice two inconsistencies between the switches:

  1. On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on the other switches.
  2. On asw2-c-eqiad, interface-range LVS-balancer explicitly adds the private vlan, whereas on at least asw2-d-eqiad it does not. It probably doesn't matter since it also sets the "native vlan" id for the private vlan, but good to be aware of.

mark added a comment. Jul 11 2018, 1:32 PM

> 2. On asw2-c-eqiad, interface-range LVS-balancer explicitly adds the private vlan, whereas on at least asw2-d-eqiad it does not. It probably doesn't matter since it also sets the "native vlan" id for the private vlan, but good to be aware of.

So this didn't work: the "native vlan" setting in the interface-range did not implicitly add the vlan, so I added it manually. Now it appears to work:

mark@asw2-c-eqiad# run show ethernet-switching interface xe-7/0/19 
...
Logical          Vlan          TAG     MAC         STP         Logical           Tagging 
interface        members               limit       state       interface flags  
xe-7/0/19.0                            294912                   DN                tagged     
                 public1-c-eqiad 1003  294912      Discarding                     tagged     
                 private1-c-eqiad 1019 294912      Discarding                     untagged

Change 445162 merged by Vgutierrez:
[operations/puppet@production] site: add lvs1015 as spare system

https://gerrit.wikimedia.org/r/445162

From lldpcli everything looks good:

lldpcli show neighbors
root@lvs1015:~# lldpcli show neighbors | egrep "Interface|PortDescr"
Interface:    enp4s0f0, via: LLDP, RID: 1, Time: 0 day, 00:09:21
    PortDescr:    lvs1015:enp4s0f0
Interface:    enp4s0f1, via: LLDP, RID: 2, Time: 0 day, 00:01:00
    PortDescr:    lvs1015:enp4s0f1
Interface:    enp5s0f0, via: LLDP, RID: 3, Time: 0 day, 00:00:48
    PortDescr:    lvs1015:enp5s0f0
Interface:    enp5s0f1, via: LLDP, RID: 4, Time: 0 day, 00:00:38
    PortDescr:    lvs1015:enp5s0f1
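
The same comparison can be automated. The sketch below is an illustration only: it parses text shaped like the filtered `lldpcli show neighbors` output above and assumes the switch port descriptions follow the `<host>:<interface>` convention visible in that output. The function names are hypothetical. A match only shows that the switch-side label agrees with the host interface, not that the port is the one listed in the task description.

```python
import re


def lldp_port_descriptions(text):
    """Map local interface name to the neighbor's PortDescr value."""
    result = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*Interface:\s+([^,\s]+),", line)
        if match:
            current = match.group(1)
            continue
        match = re.match(r"\s*PortDescr:\s+(\S+)", line)
        if match and current is not None:
            result[current] = match.group(1)
    return result


def mismatched_ports(host, text):
    """Return interfaces whose switch-side description is not '<host>:<interface>'."""
    return sorted(
        iface
        for iface, descr in lldp_port_descriptions(text).items()
        if descr != f"{host}:{iface}"
    )


# Sample copied from the lvs1015 output pasted above.
SAMPLE = """Interface:    enp4s0f0, via: LLDP, RID: 1, Time: 0 day, 00:09:21
    PortDescr:    lvs1015:enp4s0f0
Interface:    enp4s0f1, via: LLDP, RID: 2, Time: 0 day, 00:01:00
    PortDescr:    lvs1015:enp4s0f1
Interface:    enp5s0f0, via: LLDP, RID: 3, Time: 0 day, 00:00:48
    PortDescr:    lvs1015:enp5s0f0
Interface:    enp5s0f1, via: LLDP, RID: 4, Time: 0 day, 00:00:38
    PortDescr:    lvs1015:enp5s0f1
"""

if __name__ == "__main__":
    print("mismatched:", mismatched_ports("lvs1015", SAMPLE))
```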

Thx @Cmjohnson & @mark

Vgutierrez updated the task description. Jul 11 2018, 2:09 PM

ayounsi added a comment. Edited Jul 16 2018, 8:29 PM

> 1. On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on the other switches.

When migrating from asw to asw2, the access-ports group made less sense, as it used to apply to <[xg]e-*/0/*> and the uplinks were on xe-*/1/*.
asw2-d kept using the access-ports group by applying it to all the interfaces and then "un-applying" it from the infrastructure group, which contains the uplinks. This makes it more complicated to troubleshoot.
For example, interface-mode access is applied to an LVS interface from the access-ports group, but interface-mode trunk is also applied from the interface-range LVS-balancer and takes precedence.
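
To illustrate why the old pattern covered access ports but spared the uplinks, here is a small sketch. It approximates the group wildcard <[xg]e-*/0/*> with a regular expression, assuming each `*` stands for one numeric field of the interface name. It is not a faithful reimplementation of Junos wildcard matching, and `in_access_ports` is a made-up helper name.

```python
import re

# Approximation of <[xg]e-*/0/*>: xe- or ge- interfaces on PIC 0, any FPC and port.
ACCESS_PORTS_RE = re.compile(r"^[xg]e-\d+/0/\d+$")


def in_access_ports(interface):
    """True if the interface name falls under the approximated pattern."""
    return ACCESS_PORTS_RE.match(interface) is not None


if __name__ == "__main__":
    # Interface names taken from this task; the last two sit on PIC 1.
    for name in ("xe-2/0/4", "ge-4/0/34", "xe-4/1/2", "xe-4/1/0"):
        print(name, in_access_ports(name))
```

Ports on PIC 1 fall outside the pattern, which is how the old layout kept the group away from the uplinks.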

The way I did it on the other asw2-* stacks is to apply the MTU and access-mode directly to the interface-ranges, which means a few duplicated lines but a clearer configuration overall.
I'm planning on making the configuration similar on asw2-d-eqiad when we retrofit that stack with a 3rd 10G member.

Note that codfw has access-port defined but not used, so this should be fixed as well.
EDIT: MTU set for all interface-range

asw2-a/b/c had the proper MTU set for all the interface-ranges except the LVS- ones. I added it there, and they are now running with the proper MTU.

It looks like the last thing needed for lvs1015 is to connect lvs1015:enp5s0f0 to asw2-a-eqiad:xe-2/0/0 @Cmjohnson

ayounsi updated the task description. Nov 1 2018, 4:56 PM
ayounsi updated the task description. Nov 1 2018, 6:56 PM

In addition, I updated the task's description and included the ports for lvs1013 and lvs1014.

ayounsi updated the task description. Jan 16 2019, 10:13 PM
ayounsi moved this task from Blocked to Backlog on the ops-eqiad board.
Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Jan 30 2019, 9:59 PM
This comment was removed by Cmjohnson.

lvs1013 and lvs1014 still need to be connected.

Cmjohnson updated the task description. Apr 16 2019, 8:43 PM

@ayounsi lvs1013 and lvs1014 on-site work has been completed. I did not add the LVS vlan; I will leave that to you. I still need to run the cross-connects but the servers can be installed.

ayounsi updated the task description. Apr 16 2019, 10:18 PM

The vlan is configured for eth0 of those two servers, but the port is still showing as down.
I also configured the vlan for their eth1 ports.

RobH updated the task description. Apr 18 2019, 9:51 PM
BBlack raised the priority of this task from Normal to High. May 16 2019, 2:40 PM

Outside of immediate emergency situations, resolving any blockers to get the remaining two LVSes into service should be a very high priority at this point.

As best I understand the situation, the overall status is:

  • lvs1013 / lvs1014 - Cross-row 10G ports not connected. Primary same-row connections supposedly-connected and vlan-configured, but showing link down.
  • lvs1015 / lvs1016 - Done at the network/dcops level. One is in-service and one is waiting on lvs1013/14 resolution so we can reconfigure the whole cluster correctly.

Context reminders: LVS servers in general, and eqiad LVS servers in particular, are some of the most production-critical machines that we have; virtually all important services both public and internal route traffic through them. They're also the front-line machines that are the first to see the brunt of anything nasty from the Internet once it gets past our hardware routers, and the legacy ones still in use only have 1G interfaces which can't cope very well, especially under denial-of-service attack conditions. We've been trying to replace those legacy machines since mid-2015 now and still haven't gotten there for various reasons, but there shouldn't be any major blockers at this point as far as I'm aware.

Cmjohnson updated the task description. May 16 2019, 7:17 PM

lvs1014 idrac is configured and is connected to all the switches

lvs1014 eth0 asw2-b:xe-7/0/29
lvs1014 eth1 asw2-a:xe-4/0/18
lvs1014 eth2 asw2-c:xe-2/0/13
lvs1014 eth3 asw2-d:xe-7/0/4

lvs1013 idrac is configured and connected to all ports and all switches

lvs1013 eth0 asw2-a:xe-7/0/34
lvs1013 eth1 asw2-b:xe-4/0/15
lvs1013 eth2 asw2-c:xe-4/0/32
lvs1013 eth3 asw2-d:xe-2/0/12

I don't think DC-Ops is holding this task up any longer.

asw2-b:xe-4/0/15 doesn't see the SFP, please replace it.

ayounsi@asw2-b-eqiad> show interfaces xe-4/0/15 
error: device xe-4/0/15 not found

All switch ports are configured.
asw2-c:xe-2/0/13 is up
asw2-d:xe-7/0/4 is up
All the others report Enabled, Physical link is Down

Chris replaced the SFP for asw2-b:xe-4/0/15; the port is now present but down.

Vgutierrez added a comment. Edited May 18 2019, 9:28 PM

| host | nic | mac |
| --- | --- | --- |
| lvs1013 | enp4s0f0 | F4:E9:D4:DB:0C:00 |
| lvs1013 | enp4s0f1 | F4:E9:D4:DB:0C:02 |
| lvs1013 | enp5s0f0 | F4:E9:D4:CF:40:D0 |
| lvs1013 | enp5s0f1 | F4:E9:D4:CF:40:D2 |
| lvs1014 | enp4s0f0 | F4:E9:D4:DB:27:40 |
| lvs1014 | enp4s0f1 | F4:E9:D4:DB:27:42 |
| lvs1014 | enp5s0f0 | F4:E9:D4:C8:88:F0 |
| lvs1014 | enp5s0f1 | F4:E9:D4:C8:88:F2 |

Change 511113 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Setup for lvs101[345]

https://gerrit.wikimedia.org/r/511113

Change 511113 merged by BBlack:
[operations/puppet@production] Setup for lvs101[345]

https://gerrit.wikimedia.org/r/511113

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1014.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905182243_bblack_8549.log.

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1013.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905182243_bblack_8547.log.

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905182245_bblack_8750.log.

Completed auto-reimage of hosts:

['lvs1015.eqiad.wmnet']

and were ALL successful.

Change 511118 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1015: switch A and B cross-row ports

https://gerrit.wikimedia.org/r/511118

Change 511118 merged by BBlack:
[operations/puppet@production] lvs1015: switch A and B cross-row ports

https://gerrit.wikimedia.org/r/511118

Completed auto-reimage of hosts:

['lvs1013.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['lvs1014.eqiad.wmnet']

and were ALL successful.

Change 511119 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs101[345]: Add classes to lvs config

https://gerrit.wikimedia.org/r/511119

Change 511119 merged by BBlack:
[operations/puppet@production] lvs101[345]: Add classes to lvs config

https://gerrit.wikimedia.org/r/511119

Note https://gerrit.wikimedia.org/r/c/operations/puppet/+/511118 - I had to switch the lvs1015 cross-row ports for rows A and B (enp4s0f1 and enp5s0f0) backwards at the software level to match the physical reality shown by lldpcli show neighbors, which was backwards from the documented table of ports at the top of this task. The current config works and we can keep it if we want. Note that I didn't make any other related changes, so if we keep this config, we probably need to edit the software port labels in the switch configurations to match, and possibly any physical labeling in the DC, to avoid future confusion. Alternatively, before we put this machine in service, we could physically swap the cables back to the intended config at the rear of lvs1015, revert the mentioned puppet patch, and reimage the server again. Either way, there's probably some followup to do on this.
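
The mismatch described in this note can be expressed as a simple diff between the documented and observed cabling. This is a sketch for illustration only. `DOCUMENTED` is copied from the port table in the task description, while `OBSERVED` is reconstructed from the statement that the row A and B ports were swapped, not from captured lldpcli output. The function names are hypothetical.

```python
# Documented cabling for lvs1015, from the table in the task description.
DOCUMENTED = {
    "enp4s0f0": "asw2-c-eqiad",
    "enp4s0f1": "asw2-b-eqiad",
    "enp5s0f0": "asw2-a-eqiad",
    "enp5s0f1": "asw2-d-eqiad",
}

# Illustrative observed state: rows A and B swapped, as described in the note.
OBSERVED = dict(DOCUMENTED, enp4s0f1="asw2-a-eqiad", enp5s0f0="asw2-b-eqiad")


def cabling_differences(documented, observed):
    """Return {interface: (documented switch, observed switch)} for each mismatch."""
    return {
        iface: (switch, observed.get(iface))
        for iface, switch in documented.items()
        if observed.get(iface) != switch
    }


def is_pure_swap(differences):
    """True when exactly two interfaces have traded switches with each other."""
    if len(differences) != 2:
        return False
    (doc_a, obs_a), (doc_b, obs_b) = differences.values()
    return doc_a == obs_b and doc_b == obs_a


if __name__ == "__main__":
    diff = cabling_differences(DOCUMENTED, OBSERVED)
    for iface, (doc, obs) in diff.items():
        print(f"{iface}: documented {doc}, observed {obs}")
    print("pure swap:", is_pure_swap(diff))
```

A pure swap means either fix works: change the software and labels to match the cables, or swap the two cables back.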

Change 511717 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] eqiad LVS: All hosts to intended class/primacy

https://gerrit.wikimedia.org/r/511717

Change 511717 merged by BBlack:
[operations/puppet@production] eqiad LVS: All hosts to intended class/primacy

https://gerrit.wikimedia.org/r/511717

Change 511759 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1016: add to high-traffic[12] classes

https://gerrit.wikimedia.org/r/511759

Change 511759 merged by BBlack:
[operations/puppet@production] lvs1016: add to high-traffic[12] classes

https://gerrit.wikimedia.org/r/511759

Current status of transition:

New hosts:
lvs1013 is primary for high-traffic1
lvs1014 is primary for high-traffic2
lvs1015 is primary for low-traffic
lvs1016 is secondary for all

Old hosts:
lvs1001 + lvs1004 are still live backups for high-traffic1
lvs1002 + lvs1005 are still live backups for high-traffic2
lvs1006 is still a live backup for low-traffic
lvs1003 is still off in a spare role and out.

Next step is moving the static fallbacks on cr[12]-eqiad to the new destinations (lvs1013, lvs1014, lvs1015).

After that, I think we should try re-pooling edge traffic and sit on this for a day or two, before we begin stopping service on all the legacy machines and removing them from puppet setup and router config, etc.

I am removing the ops-eqiad tag on this task; if you need additional DC-Ops work, please add the tag back.

BBlack closed this task as Resolved. Jul 22 2019, 2:43 PM

These have been in-service for a while now, closing!