
ASW single-point of failure for LVS VIPs at POPs
Open, Medium, Public

Description

Apologies if this has been raised before; I did a quick search and didn't find an old task. Obviously before too long we will address this by introducing Liberica.

Problem

The issue is that in our current POP design (for instance in drmrs and esams) we have a single point of failure for traffic going to VIPs announced by the 'high-traffic1' LVS instances. These are the servers configured for that class at our POPs:

$lvs_class_hosts = {
    'high-traffic1' => $::realm ? {
        'production' => $::site ? {
            'esams' => [ 'lvs3008', 'lvs3010' ],
            'drmrs' => [ 'lvs6001', 'lvs6003' ],

In both cases the two servers configured for this traffic have primary interfaces (where BGP is done) connected to the same switch:

cmooney@asw1-bw27-esams> show configuration protocols bgp group PyBal | display set | match descr 
set protocols bgp group PyBal neighbor 10.80.0.3 description lvs3008
set protocols bgp group PyBal neighbor 10.80.0.2 description lvs3010
cmooney@asw1-b12-drmrs> show configuration protocols bgp group PyBal | display set | match descr 
set protocols bgp group PyBal neighbor 10.136.0.16 description lvs6001
set protocols bgp group PyBal neighbor 10.136.0.17 description lvs6003

So if that single switch fails, all traffic to these important services will fail :(

Solutions

I'm not 100% sure what the best fix is here; traffic team, I'm open to suggestions. Two ideas that spring to mind:

  1. Make the LVS servers also BGP peer with the other switch at the POP (see the sketch after this list)
    1. Each already has a link to that switch, used for back-end traffic to realservers in the other rack
    2. If we create BGP peerings to the other switch over this link, we protect against the first switch dying
    3. This may not be possible with PyBal of course, although it does dual-peerings to CRs elsewhere, so maybe?
  2. Change the backup configuration somehow, so that for instance lvs3009 is the backup for lvs3008's VIPs
    1. Or make a 3-tier backup setup, so that lvs3009 would take over as a last resort
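
To make option 1 a bit more concrete, the rough idea is a second PyBal BGP group on the switch in the other rack, peering with the LVS hosts over their interfaces facing that switch. A minimal sketch in the same style as the output above, with a made-up group name and made-up neighbor addresses purely for illustration (a real config would also need the same peer-as and import/export policy as the existing PyBal group):

set protocols bgp group PyBal_backup neighbor 10.80.1.3 description lvs3008
set protocols bgp group PyBal_backup neighbor 10.80.1.2 description lvs3010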

Event Timeline

cmooney triaged this task as Medium priority. Wed, Apr 17, 11:55 AM
cmooney created this task.

Change #1020843 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add new BGP group for cross-rack PyBal peerings at L3 POPs

https://gerrit.wikimedia.org/r/1020843

Change #1020844 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust LVS config in esams, drmrs to peer with both ASWs

https://gerrit.wikimedia.org/r/1020844

I believe the two patches above, once merged, will add the required redundancy, following option 1 above: creating backup peerings from the LVS hosts to the switch in the other rack.

Perhaps one option would be to hold off on the puppet patch changing drmrs and esams for now, but merge the Homer one and configure the magru LVS hosts in puppet to peer with both switches?

That way we can test/trial it in magru to build confidence. I've partially labbed it up and have no doubts about it, but it probably makes sense to use the opportunity the magru buildout presents.

@ayounsi pointed out another way - along the lines of option 2 in the description - that might allow us to overcome the problem simply.

Using drmrs as an example, we currently have 3 LVS hosts as follows:

| Rack | LVS     | Role                     |
| ---- | ------- | ------------------------ |
| B12  | lvs6001 | High traffic 1 - primary |
| B12  | lvs6003 | Backup                   |
| B13  | lvs6002 | High traffic 2 - primary |

The current scenario means that if the switch in rack B12 fails, both the primary and the backup for high_traffic_1 go down, and those VIPs are fully offline.

Discussing with Arzhel we pondered changing the roles as follows:

| Rack | LVS     | Role                     |
| ---- | ------- | ------------------------ |
| B12  | lvs6001 | High traffic 1 - primary |
| B12  | lvs6003 | High traffic 2 - primary |
| B13  | lvs6002 | Backup                   |

In other words, put both normally-active servers into the same rack. That means if the switch in that rack fails, the backup in the other rack will take over for both traffic classes. The downside is that our traffic balance between racks becomes skewed: all the ingress traffic would normally go through just one of our switches. That said, our POP WAN circuits are 10G and we have 100G links between CRs and switches, so this is unlikely to be a major issue. I'm also conscious that any work we do here is only temporary while we wait for Liberica.
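
For illustration only, and assuming the first host in each lvs_class_hosts list is the primary and the second the backup (as the current config and the table above suggest), the drmrs entries after such a swap might look roughly like this (not an actual patch):

$lvs_class_hosts = {
    'high-traffic1' => $::realm ? {
        'production' => $::site ? {
            'drmrs' => [ 'lvs6001', 'lvs6002' ],
            # other sites unchanged
        },
    },
    'high-traffic2' => $::realm ? {
        'production' => $::site ? {
            'drmrs' => [ 'lvs6003', 'lvs6002' ],
            # other sites unchanged
        },
    },
}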

Changing the LVS roles as per the above seems another fairly simple and viable way to remove the single point of failure, so I wanted to document the option. It would also address the issue in eqsin and ulsfo (where the current design involves L2 switches and PyBal peering to the CRs), unlike the backup BGP approach described above.