
Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006
Closed, Declined (Public)

Description

lvs1007-lvs1012 were racked and installed over a year ago (T104458). Unfortunately, due to asw-d-eqiad issues (T112781) we have refrained from using them in production. They are installed and configured with the appropriate puppet role, but they are currently marked as BGP-disabled via a regex in hieradata/regex.yaml, and the BGP configuration on the routers is missing as well. Moreover:
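
For context, a minimal sketch of what checking that BGP-disable regex could look like. The entry layout and key names ("__regex", "bgp") are hypothetical assumptions for illustration only; the real hiera data may be structured differently:

```python
# Minimal sketch (not the real hiera layout): test a hostname against a
# BGP-disabling regex entry in hieradata/regex.yaml. The key names
# ("__regex", "bgp") and the example entry are hypothetical.
import re
import yaml

def bgp_disabled(hostname, path="hieradata/regex.yaml"):
    with open(path) as f:
        entries = yaml.safe_load(f) or {}
    for entry in entries.values():
        pattern = entry.get("__regex")   # hypothetical key holding the host regex
        if pattern and re.search(pattern, hostname):
            return str(entry.get("bgp", "yes")).lower() == "no"
    return False

# e.g. an entry like {lvs_new_eqiad: {__regex: 'lvs10(0[7-9]|1[0-2])', bgp: 'no'}}
print(bgp_disabled("lvs1007.eqiad.wmnet"))
```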

< bblack> we may have disabled some of their ports in other rows during some testing, or who knows what.  pretty much have to audit it all again after the D move
< bblack> (and almost certainly even if basic network links look ok and no SNMP strangeness, they need reinstalls all over again before thinking about traffic)

Since last night, they have been moved to our new asw2-d-eqiad switch. The new switch is not in production yet (cf. T148506) but will be very soon. In the meantime, we should prepare for this final stage of the LVS upgrade.

Event Timeline

BBlack renamed this task from Re-setup lvs1007-lvs1012, replace lvs1001-lvs1005 to Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006. Nov 9 2016, 5:28 PM

Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478. The idea was to try our best to ensure that a given vlan/row's LVS connections are FPC-redundant between the primaries and secondaries. For example, if lvs1007 (primary for high-traffic1) connects to row C on asw-c-eqiad FPC 5, then lvs1010 (secondary for high-traffic1) needs to connect to row C / asw-c-eqiad on some FPC other than 5. Row D connections have probably changed entirely since that last table was made, and there were some pending moves/fixups listed there as well which may or may not have already happened.

Current situation:

| host | port | switch | switch port | redundancy issues |
| lvs1007 | eth0 | asw2-a5-eqiad | xe-0/0/8.0 | lvs1010 eth1 also on asw2-a5 xe-0 |
| lvs1007 | eth1 | asw-c-eqiad | xe-8/0/26.0 | lvs1010 eth0 also on asw-c xe-8 |
| lvs1007 | eth2 | asw-b-eqiad | xe-5/1/0.0 | |
| lvs1007 | eth3 | asw2-d-eqiad | xe-2/0/45 | |
| lvs1010 | eth0 | asw-c-eqiad | xe-8/0/23.0 | lvs1007 eth1 also on asw-c xe-8 |
| lvs1010 | eth1 | asw2-a5-eqiad | xe-0/0/11.0 | lvs1007 eth0 also on asw2-a5 xe-0 |
| lvs1010 | eth2 | asw-b-eqiad | xe-6/1/0.0 | |
| lvs1010 | eth3 | asw2-d-eqiad | xe-7/0/45 | |
| lvs1008 | eth0 | asw2-a5-eqiad | xe-0/0/9.0 | lvs1011 eth1 also on asw2-a5 xe-0 |
| lvs1008 | eth1 | asw-c-eqiad | xe-8/0/27.0 | lvs1011 eth0 also on asw-c xe-8 |
| lvs1008 | eth2 | asw-b-eqiad | xe-5/1/2.0 | |
| lvs1008 | eth3 | asw2-d-eqiad | xe-2/0/46 | |
| lvs1011 | eth0 | asw-c-eqiad | xe-8/0/24.0 | lvs1008 eth1 also on asw-c xe-8 |
| lvs1011 | eth1 | asw2-a5-eqiad | xe-0/0/12.0 | lvs1008 eth0 also on asw2-a5 xe-0 |
| lvs1011 | eth2 | asw-b-eqiad | xe-6/1/2.0 | |
| lvs1011 | eth3 | asw2-d-eqiad | xe-7/0/46 | |
| lvs1009 | eth0 | asw2-a5-eqiad | xe-0/0/10.0 | lvs1012 eth1 also on asw2-a5 xe-0 |
| lvs1009 | eth1 | asw-c-eqiad | xe-8/0/28.0 | lvs1012 eth0 also on asw-c xe-8 |
| lvs1009 | eth2 | asw-b-eqiad | ge-8/0/45.0 | |
| lvs1009 | eth3 | asw2-d-eqiad | xe-2/0/47 | |
| lvs1012 | eth0 | asw-c-eqiad | xe-8/0/25.0 | lvs1009 eth1 also on asw-c xe-8 |
| lvs1012 | eth1 | asw2-a5-eqiad | xe-0/0/13.0 | lvs1009 eth0 also on asw2-a5 xe-0 |
| lvs1012 | eth2 | asw-b-eqiad | ge-8/0/46.0 | |
| lvs1012 | eth3 | asw2-d-eqiad | xe-7/0/47 | |

Also notable: the lvs1009 and lvs1012 connections to row B (eth2) are on 1GbE ports rather than 10GbE.

I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the ports and the thinking.
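
To make the FPC-redundancy requirement easy to re-check against the table, here's a minimal sketch that flags a primary/secondary pair whose links into the same row land on the same switch and FPC. The connection data is transcribed by hand from the table above; only the lvs1007/lvs1010 pair is filled in, and the other pairs would be added the same way:

```python
# Sketch: flag LVS primary/secondary pairs whose links into the same row land
# on the same switch and FPC (i.e. no FPC redundancy). Connection data is
# transcribed from the table above; only lvs1007/lvs1010 is filled in.
from collections import defaultdict

# (host, iface) -> (switch, port); the FPC is the first number in the port name
links = {
    ("lvs1007", "eth0"): ("asw2-a5-eqiad", "xe-0/0/8"),
    ("lvs1007", "eth1"): ("asw-c-eqiad",   "xe-8/0/26"),
    ("lvs1007", "eth2"): ("asw-b-eqiad",   "xe-5/1/0"),
    ("lvs1007", "eth3"): ("asw2-d-eqiad",  "xe-2/0/45"),
    ("lvs1010", "eth0"): ("asw-c-eqiad",   "xe-8/0/23"),
    ("lvs1010", "eth1"): ("asw2-a5-eqiad", "xe-0/0/11"),
    ("lvs1010", "eth2"): ("asw-b-eqiad",   "xe-6/1/0"),
    ("lvs1010", "eth3"): ("asw2-d-eqiad",  "xe-7/0/45"),
    # lvs1008/lvs1011 and lvs1009/lvs1012 would be added the same way
}
pairs = [("lvs1007", "lvs1010"), ("lvs1008", "lvs1011"), ("lvs1009", "lvs1012")]

def fpc(port):
    return port.split("-")[1].split("/")[0]   # "xe-8/0/26" -> "8"

for primary, secondary in pairs:
    by_switch = defaultdict(lambda: defaultdict(set))
    for (host, _), (switch, port) in links.items():
        if host in (primary, secondary):
            by_switch[switch][host].add(fpc(port))
    for switch, fpcs in by_switch.items():
        shared = fpcs[primary] & fpcs[secondary]
        if shared:
            print(f"{primary}/{secondary} share FPC {','.join(sorted(shared))} on {switch}")
```

Run against the table above, this reports the asw2-a5 and asw-c FPC overlaps already noted in the "redundancy issues" column.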

Running down the notable issues from the table above, and the status of all related things:

  1. lvs1009+lvs1012 connections to row B are 1GbE. Unfixable. Row B's 8x switches each have only 2x 10GbE ports, giving us 16 total there: 8 are used as core router uplinks, 4 for labnet100[12], and the remaining 4 for lvs100{7,8,10,11} (see the port-budget tally after this list). I think we chose lvs1009 and lvs1012 to take the downgraded links for now because they are the low-traffic pair (as opposed to high-traffic[12]). Keep in mind also that we're upgrading from the old lvs100[1-6], which have only 1GbE links on all ports, so this is not a regression. I don't think we can fix this until the switch hardware in row B is upgraded.
  2. Row A redundancy issues. Unfixable. asw-a-eqiad's 10GbE ports are all used for router uplinks or connections to asw2-a5-eqiad. asw2-a5-eqiad has enough 10GbE ports, and all of the new LVSes connect to row A through it. Relatedly, all of the cp10xx machines in row A (11 of them) are also connected to asw2-a5-eqiad. There is some room to install a couple of PICs in asw-a-eqiad and add a handful of 10GbE ports, but in practice that would make little difference for redundantly reaching the caches. So it is a pretty big single point of failure, and unfixable until the switch hardware situation in row A changes.
  3. Row C redundancy issues. Unfixable. asw-c-eqiad doesn't have any available 10GbE ports outside of FPC 8. As with the situation above, 11 caches are also plugged into this FPC for the same reason, so even if we did plug a PIC into some other FPC to gain a few LVS ports, it wouldn't fundamentally change the SPOF that exists in FPC 8.
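
The row B port-budget arithmetic from item 1, written out (numbers exactly as stated above):

```python
# Row B 10GbE port budget, per item 1 above.
total_10g      = 8 * 2   # 8 switches in row B, 2x 10GbE ports each = 16
router_uplinks = 8       # core router uplinks
labnet         = 4       # labnet1001 / labnet1002
lvs_10g        = 4       # lvs1007, lvs1008, lvs1010, lvs1011
spare = total_10g - router_uplinks - labnet - lvs_10g
print(spare)             # 0 -> lvs1009 and lvs1012 have to use 1GbE ports in row B
```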

So basically the ports we're plugged into now are what we're sticking with, and the LVS upgrade is still a viable plan. @ema already re-confirmed that the above ports are the correct mapping (nothing's mis-labeled/connected), and they all have link up at the expected speeds.
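
For the "link up at the expected speeds" part, a minimal sketch of the sort of per-host check that could be run, assuming ethtool is installed and the interfaces are named eth0-eth3 as in the table (this is not necessarily the procedure @ema actually used):

```python
# Sketch: verify that each LVS interface reports the expected link speed.
# Assumes ethtool is installed; expected speeds follow the port table above
# (eth2 on lvs1009/lvs1012 is 1GbE, everything else 10GbE).
import re
import socket
import subprocess

EXPECTED = {"eth0": 10000, "eth1": 10000, "eth2": 10000, "eth3": 10000}
if socket.gethostname().startswith(("lvs1009", "lvs1012")):
    EXPECTED["eth2"] = 1000

for iface, want in EXPECTED.items():
    out = subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout
    m = re.search(r"Speed:\s*(\d+)Mb/s", out)
    got = int(m.group(1)) if m else None
    status = "OK" if got == want else "MISMATCH"
    print(f"{iface}: expected {want} Mb/s, got {got} ({status})")
```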

Next step is to reinstall the machines (to undo any experimentation that happened earlier), and then we can look at turning on BGP and updating the eqiad router configs, etc.

Change 356605 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356605 merged by BBlack:
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356833 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

Change 356833 merged by BBlack:
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1011.eqiad.wmnet', 'lvs1012.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706021602_bblack_10096.log.

Completed auto-reimage of hosts:

['lvs1011.eqiad.wmnet', 'lvs1012.eqiad.wmnet']

Of which those FAILED:

set(['lvs1011.eqiad.wmnet'])

[lvs1011 above just had some minor salt keying issues, fixed+rebooted]

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet', 'lvs1008.eqiad.wmnet', 'lvs1009.eqiad.wmnet', 'lvs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706021701_bblack_29574.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706051252_bblack_31674.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706051423_bblack_22214.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706071314_bblack_2991.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706071353_bblack_10978.log.

Mentioned in SAL (#wikimedia-operations) [2017-06-08T00:54:41Z] <mutante> cp4019 - powercycled (same as others) | lvs1007 - sits at installer - waiting for IP to be configured (T150256)

Gave up on these machines!