Apologies if this has been raised before; I did a quick search and didn't find an old task. Obviously before too long we will address this by introducing Liberica.
Problem
The issue is that in our current POP design - for instance in drmrs and esams - we have a single point of failure for traffic going to VIPs announced by the 'high-traffic1' LVS instances. These are the servers set for this class in our POPs:
```
$lvs_class_hosts = {
    'high-traffic1' => $::realm ? {
        'production' => $::site ? {
            'esams' => [ 'lvs3008', 'lvs3010' ],
            'drmrs' => [ 'lvs6001', 'lvs6003' ],
```
In both cases the two servers configured for this traffic have their primary interfaces (where BGP is done) connected to the same switch:
```
cmooney@asw1-bw27-esams> show configuration protocols bgp group PyBal | display set | match descr
set protocols bgp group PyBal neighbor 10.80.0.3 description lvs3008
set protocols bgp group PyBal neighbor 10.80.0.2 description lvs3010
```
```
cmooney@asw1-b12-drmrs> show configuration protocols bgp group PyBal | display set | match descr
set protocols bgp group PyBal neighbor 10.136.0.16 description lvs6001
set protocols bgp group PyBal neighbor 10.136.0.17 description lvs6003
```
So basically, if that single switch fails, all traffic to these important services will fail :(
Solutions
I'm not 100% sure what the best fix is here, so I'm open to hearing suggestions from the traffic team. Two ideas that spring to mind are:
- Make the LVS servers also BGP peer with the other switch at the POP
  - Each server already has a link to that switch, used for back-end traffic to realservers in the other rack
  - If we create BGP peerings to the other switch over this link, we protect against the first switch dying
  - This may not be possible with PyBal of course, although it does dual-peerings to the CRs elsewhere, so maybe?
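For illustration, if PyBal can handle the dual peering, the esams side might end up looking something like this. This is only a sketch: the second switch's hostname (asw2-bw27-esams) and the neighbor addresses in 10.80.0.128/25 are hypothetical placeholders for whatever the cross-rack links actually use, not real config:

```
# Hypothetical second set of peerings on the other switch
cmooney@asw2-bw27-esams> show configuration protocols bgp group PyBal | display set | match descr
set protocols bgp group PyBal neighbor 10.80.0.130 description lvs3008
set protocols bgp group PyBal neighbor 10.80.0.131 description lvs3010
```

Each LVS server would then announce its VIPs over both sessions, so losing either switch leaves one announcement path intact.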
- Change the backup configuration somehow, so that for instance lvs3009 is the backup for lvs3008's VIPs
  - Or make a 3-tier backup setup, so that lvs3009 would take over as a last resort
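A minimal sketch of what the second idea could mean for the hash above, assuming lvs3009 and a drmrs counterpart (lvs6002 here is purely a guess) are connected to the other switch in each POP. The point is just to pair each class across racks/switches; the exact hostnames would need checking:

```
$lvs_class_hosts = {
    'high-traffic1' => $::realm ? {
        'production' => $::site ? {
            # Hypothetical: pair servers on different switches so a
            # single switch failure can't take out both hosts
            'esams' => [ 'lvs3008', 'lvs3009' ],
            'drmrs' => [ 'lvs6001', 'lvs6002' ],
```

This keeps the existing PyBal/Puppet mechanics and only changes which hosts back each other up, at the cost of reshuffling which classes each server carries.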