Refactor pybal/LVS config for shared failover
Open, Normal, Public

Description

In the shorter term before we reach T165764, I'd like to get to a halfway-point that I think can work without new pybal features. This halfway-point maintains the current N+1 redundancy per traffic class, but uses a single shared failover host for all traffic classes. For a hypothetical eqiad, the configuration looks something like this:

| Host | Traffic classes configured | Host-level MED |
| ----- | ----- | ----- |
| lvs1001 | high-traffic1 | 10 |
| lvs1002 | high-traffic2 | 10 |
| lvs1003 | low-traffic | 10 |
| lvs1004 | high-traffic1 + high-traffic2 + low-traffic | 20 |
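
As a rough illustration of how the host-level MED values in the table above are meant to interact, the sketch below picks, for each traffic class, the live host advertising the lowest MED. It is a toy model in Python, not pybal or router code: `LAYOUT` and `active_host` are hypothetical names, and it assumes the routers prefer the lowest MED among otherwise-equal BGP routes.

```python
# Toy model of the table above: host -> (traffic classes, host-level MED).
# Hypothetical structure for illustration only; not how pybal is configured.
LAYOUT = {
    "lvs1001": ({"high-traffic1"}, 10),
    "lvs1002": ({"high-traffic2"}, 10),
    "lvs1003": ({"low-traffic"}, 10),
    "lvs1004": ({"high-traffic1", "high-traffic2", "low-traffic"}, 20),
}


def active_host(traffic_class, down=frozenset()):
    """Return the live host with the lowest MED for a class, or None."""
    candidates = [
        (med, host)
        for host, (classes, med) in LAYOUT.items()
        if traffic_class in classes and host not in down
    ]
    # Assumption: lowest MED wins; ties are broken by host name here
    # only to keep the toy model deterministic.
    return min(candidates)[1] if candidates else None
```

With every host up, `active_host("high-traffic1")` selects `lvs1001`; passing `down={"lvs1001"}` selects the shared backup `lvs1004`. If both are down the function returns `None`; the model does not attempt to cover the static router fallbacks.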

The static fallbacks in the routers would still be as they are today; we'd just have two fewer idle LVS hosts in the core-site case. For the edge-site case, the picture is as above but with 2x traffic classes and 3x LVS hosts, instead of 3x traffic classes and 4x LVS hosts. There are some puppetization refactoring issues to sort through here, as the puppetization currently assumes a 1:1 mapping (in at least one direction) of traffic-class:host.
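
The host-count arithmetic behind this can be sketched as follows. The sketch assumes the current layout dedicates one primary and one backup host to each traffic class, which is inferred from the numbers in this task rather than stated outright; `host_counts` is a hypothetical helper.

```python
def host_counts(num_classes):
    """Return (hosts with dedicated backups, hosts with one shared backup)."""
    dedicated = 2 * num_classes  # assumed today: primary + backup per class
    shared = num_classes + 1     # proposed: primary per class + one shared backup
    return dedicated, shared


for site, classes in (("core", 3), ("edge", 2)):
    dedicated, shared = host_counts(classes)
    print(f"{site}: {dedicated} -> {shared} hosts ({dedicated - shared} fewer)")
```

Under that assumption, a core site goes from 6 hosts to 4 and an edge site from 4 to 3.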

The original first target for this new config was the new 3xLVS hosts being deployed in ulsfo (already arrived), but things are complicated there by a lack of excess power/space and a desire to do a single outage to replace all the servers and re-cable, which doesn't give us a ton of overlap time to sort out the LVS refactoring.

I think a better option at this point is to pursue this new config on the 10G eqiad LVSes in T150256. We can do that and sort out the details on them while lvs1001-6 continue running for now, and it happens to solve another medium-term problem there with lack of a full complement of 10G ports in eqiad for the new LVSes anyways, by reducing the new LVS count from 6 to 4. Once the puppetization work is done there, it should be easy to bring up similarly-altered new clusters in ulsfo and then asia.

Event Timeline

BBlack created this task. May 19 2017, 2:20 PM
Restricted Application added a project: Operations. May 19 2017, 2:20 PM
Restricted Application added a subscriber: Aklapper.
BBlack updated the task description. May 19 2017, 2:21 PM
ema triaged this task as Normal priority. May 22 2017, 4:14 PM
ema moved this task from Triage to LoadBalancer on the Traffic board.

Change 356605 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356605 merged by BBlack:
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356833 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

Change 356833 merged by BBlack:
[operations/puppet@production] LVS refactor: service IPs and sparing out lvs101[12]

https://gerrit.wikimedia.org/r/356833

What's missing here is turning on BGP peering with all local routers, which is available in our current 1.15 pybal releases. Will fix that up here and then resolve (the rest has been live for a while for all new LVS deploys).

T180069 - Ticket from the feature add for pybal itself

Change 536324 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] codfw backup LVS: BGP sessions with both routers

https://gerrit.wikimedia.org/r/536324

Change 536324 merged by BBlack:
[operations/puppet@production] codfw backup LVS: BGP sessions with both routers

https://gerrit.wikimedia.org/r/536324

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.