Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006
Closed, Declined · Public

Assigned To

None

Authored By

	faidon
	Nov 8 2016, 12:51 PM

Description

lvs1007-lvs1012 were racked and installed over a year ago (T104458). Unfortunately, due to asw-d-eqiad issues (T112781), we have refrained from using them in production. They are installed and configured with the appropriate puppet role, but a regex in hieradata/regex.yaml currently marks them as BGP-disabled, and the BGP configuration on the routers is missing as well. Moreover:

< bblack> we may have disabled some of their ports in other rows during some testing, or who knows what.  pretty much have to audit it all again after the D move
< bblack> (and almost certainly even if basic network links look ok and no SNMP strangeness, they need reinstalls all over again before thinking about traffic)

As of last night, they have been moved to our new asw2-d-eqiad switch. The new switch is not in production yet (cf. T148506) but will be very soon. In the meantime, we should prepare for this final part of the LVS upgrade.

Details

	Subject	Repo	Branch	Lines +/-
	LVS refactor: service IPs and sparing out lvs101[12]	operations/puppet	production	+63 -56
	LVS: new redundancy layout for new eqiad+ulsfo hosts	operations/puppet	production	+18 -16

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T150256 Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006
		Declined		• Cmjohnson	T167299 Upgrade BIOS/RBSU/etc on lvs1007

Event Timeline

faidon created this task. Nov 8 2016, 12:51 PM

Restricted Application added a subscriber: Aklapper. Nov 8 2016, 12:51 PM

BBlack moved this task from Backlog to TLS on the Traffic board. Nov 8 2016, 12:56 PM

BBlack renamed this task from Re-setup lvs1007-lvs1012, replace lvs1001-lvs1005 to Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006. Nov 9 2016, 5:28 PM

BBlack moved this task from TLS to LoadBalancer on the Traffic board. Nov 15 2016, 2:04 PM

Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vlan/row's LVS connections are FPC-redundant between the primaries and secondaries. e.g. if lvs1007 (primary for high-traffic1) connects to row C in asw-c-eqiad FPC 5, then lvs1010 (secondary for high-traffic1) needs to connect to row C / asw-c-eqiad in some FPC other than 5. Row D connections have probably changed entirely since that last table was made, and there were some pending moves/fixups listed there as well which may or may not have already happened.

Current situation:

host	port	switch	port	redundancy issues
lvs1007	eth0	asw2-a5-eqiad	xe-0/0/8.0	lvs1010 eth1 also on asw2-a5 xe-0
lvs1007	eth1	asw-c-eqiad	xe-8/0/26.0	lvs1010 eth0 also on asw-c xe-8
lvs1007	eth2	asw-b-eqiad	xe-5/1/0.0
lvs1007	eth3	asw2-d-eqiad	xe-2/0/45
lvs1010	eth0	asw-c-eqiad	xe-8/0/23.0	lvs1007 eth1 also on asw-c xe-8
lvs1010	eth1	asw2-a5-eqiad	xe-0/0/11.0	lvs1007 eth0 also on asw2-a5 xe-0
lvs1010	eth2	asw-b-eqiad	xe-6/1/0.0
lvs1010	eth3	asw2-d-eqiad	xe-7/0/45
-
lvs1008	eth0	asw2-a5-eqiad	xe-0/0/9.0	lvs1011 eth1 also on asw2-a5 xe-0
lvs1008	eth1	asw-c-eqiad	xe-8/0/27.0	lvs1011 eth0 also on asw-c xe-8
lvs1008	eth2	asw-b-eqiad	xe-5/1/2.0
lvs1008	eth3	asw2-d-eqiad	xe-2/0/46
lvs1011	eth0	asw-c-eqiad	xe-8/0/24.0	lvs1008 eth1 also on asw-c xe-8
lvs1011	eth1	asw2-a5-eqiad	xe-0/0/12.0	lvs1008 eth0 also on asw2-a5 xe-0
lvs1011	eth2	asw-b-eqiad	xe-6/1/2.0
lvs1011	eth3	asw2-d-eqiad	xe-7/0/46
-
lvs1009	eth0	asw2-a5-eqiad	xe-0/0/10.0	lvs1012 eth1 also on asw2-a5 xe-0
lvs1009	eth1	asw-c-eqiad	xe-8/0/28.0	lvs1012 eth0 also on asw-c xe-8
lvs1009	eth2	asw-b-eqiad	ge-8/0/45.0
lvs1009	eth3	asw2-d-eqiad	xe-2/0/47
lvs1012	eth0	asw-c-eqiad	xe-8/0/25.0	lvs1009 eth1 also on asw-c xe-8
lvs1012	eth1	asw2-a5-eqiad	xe-0/0/13.0	lvs1009 eth0 also on asw2-a5 xe-0
lvs1012	eth2	asw-b-eqiad	ge-8/0/46.0
lvs1012	eth3	asw2-d-eqiad	xe-7/0/47

Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE?
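
The FPC-redundancy rule described above can be checked mechanically. The following is only a rough sketch, assuming the FPC is the first number in a Juniper-style port name (e.g. 8 in xe-8/0/26.0); the helper names fpc_of and shared_fpcs are made up for illustration, and only the lvs1007/lvs1010 rows from the table are included.

```python
import re
from collections import defaultdict

# (host, switch, switch port) rows copied from the table above
# (lvs1007/lvs1010 pair only, to keep the sketch short).
LINKS = [
    ("lvs1007", "asw2-a5-eqiad", "xe-0/0/8.0"),
    ("lvs1007", "asw-c-eqiad", "xe-8/0/26.0"),
    ("lvs1007", "asw-b-eqiad", "xe-5/1/0.0"),
    ("lvs1007", "asw2-d-eqiad", "xe-2/0/45"),
    ("lvs1010", "asw-c-eqiad", "xe-8/0/23.0"),
    ("lvs1010", "asw2-a5-eqiad", "xe-0/0/11.0"),
    ("lvs1010", "asw-b-eqiad", "xe-6/1/0.0"),
    ("lvs1010", "asw2-d-eqiad", "xe-7/0/45"),
]


def fpc_of(port):
    """Return the FPC number from a name such as 'xe-8/0/26.0' (assumed layout: type-fpc/pic/port[.unit])."""
    match = re.match(r"^[a-z]+-(\d+)/\d+/\d+(?:\.\d+)?$", port)
    if match is None:
        raise ValueError(f"unrecognised port name: {port!r}")
    return int(match.group(1))


def shared_fpcs(links, primary, secondary):
    """List (switch, fpc) pairs where both hosts are plugged into the same FPC."""
    seen = defaultdict(set)  # (switch, fpc) -> set of hosts seen there
    for host, switch, port in links:
        if host in (primary, secondary):
            seen[(switch, fpc_of(port))].add(host)
    return sorted(key for key, hosts in seen.items() if len(hosts) == 2)


conflicts = shared_fpcs(LINKS, "lvs1007", "lvs1010")
print(conflicts)
```

For this pair the conflicts are the same two flagged in the "redundancy issues" column: asw-c-eqiad FPC 8 and asw2-a5-eqiad FPC 0. Rows B and D come out clean.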

BBlack mentioned this in T165614: LLDP on cache hosts. May 17 2017, 5:40 PM

I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the ports and the thinking.

Running down the issues notable from the table above, and status of all related things:

lvs1009+lvs1012 connections to Row B are 1GbE. Unfixable. Row B's 8x switches each have only 2x 10GbE ports, giving us 16 total there. 8 are used as core router uplinks, 4 for labnet100[12], and the remaining 4 are used for lvs100{7,8,10,11}. I think we chose 9 and 12 to have the downgraded links here for now because those are the low-traffic pair (as opposed to high-traffic[12]). Keep in mind also that we're upgrading from the old lvs100[1-6], which have only 1GbE links on all ports, so it's not a regression. I don't think we can fix this until the switch hardware in row B is upgraded.

Row A redundancy issues. Unfixable. asw-a-eqiad's 10GbE ports are all used for router uplinks or connections to asw2-a5-eqiad. asw2-a5-eqiad has enough 10GbE ports, and all of the new LVSes connect to row A through it. Relatedly, all of the cp10xx machines in row A (11 of them) are also connected to asw2-a5-eqiad. There is some room to install a couple of PICs in asw-a-eqiad and add a handful of 10GbE ports, but in practice it would make little difference for redundantly reaching the caches, since the caches themselves would still all sit behind asw2-a5-eqiad. So it's a pretty big single point of failure, and unfixable until the switch hardware scenario in Row A changes.

Row C redundancy issues. Unfixable. asw-c-eqiad doesn't have any available 10GbE ports outside of FPC 8. As with the situation above, 11 caches are also plugged into this FPC for the same reason, so even if we did plug a PIC into some other FPCs to gain a few LVS ports it doesn't fundamentally change the SPOF that exists in FPC 8 here.
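
As a quick sanity check on the Row B port budget quoted above, the arithmetic works out as follows. This is just the figures from the comment restated; the variable names are illustrative.

```python
# Row B 10GbE port budget, using only the figures quoted in the comment above.
switches = 8
ports_per_switch = 2
total = switches * ports_per_switch

used = {
    "core router uplinks": 8,
    "labnet100[12]": 4,
    "lvs100{7,8,10,11}": 4,
}

# Ports left over for lvs1009/lvs1012 after the existing allocations.
free = total - sum(used.values())
print(total, free)
```

With all 16 ports already allocated, nothing is left for lvs1009 and lvs1012, which is why they fall back to 1GbE ports in row B.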

So basically the ports we're plugged into now are what we're sticking with, and the LVS upgrade is still a viable plan. @ema already re-confirmed that the above ports are the correct mapping (nothing's mis-labeled/connected), and they all have link up at the expected speeds.

The next step is to reinstall the machines (to undo any experimentation that happened earlier), and then we can look at the process of turning on BGP and updating eqiad router configs, etc.

BBlack mentioned this in T165765: Refactor pybal/LVS config for shared failover. May 19 2017, 2:20 PM

Change 356605 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS: new redundancy layout for new eqiad+ulsfo hosts

https://gerrit.wikimedia.org/r/356605

gerritbot added a project: Patch-For-Review. Jun 1 2017, 3:07 PM

Change 356605 merged by BBlack:
[operations/puppet@production] LVS: new redundancy layout for new eqiad+ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356833 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

Change 356833 merged by BBlack:
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1011.eqiad.wmnet', 'lvs1012.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706021602_bblack_10096.log.

Completed auto-reimage of hosts:

['lvs1011.eqiad.wmnet', 'lvs1012.eqiad.wmnet']

Of which those FAILED:

set(['lvs1011.eqiad.wmnet'])

[lvs1011 above just had some minor salt keying issues, fixed+rebooted]

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet', 'lvs1008.eqiad.wmnet', 'lvs1009.eqiad.wmnet', 'lvs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706021701_bblack_29574.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706051252_bblack_31674.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706051423_bblack_22214.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706071314_bblack_2991.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706071353_bblack_10978.log.

BBlack created subtask T167299: Upgrade BIOS/RBSU/etc on lvs1007. Jun 7 2017, 2:05 PM

Mentioned in SAL (#wikimedia-operations) [2017-06-08T00:54:41Z] <mutante> cp4019 - powercycled (same as others) | lvs1007 - sits at installer - waiting for IP to be configured (T150256)