In the shorter term, before we reach T165764, I'd like to get to a halfway point that I think can work without new pybal features. This halfway point maintains the current N+1 redundancy per traffic class, but uses a single shared failover host for all traffic classes. For a hypothetical eqiad, the configuration looks something like this:
|Host||Traffic classes configured||Host-level MED|
|lvs1004||high-traffic1 + high-traffic2 + low-traffic||20|
The static fallbacks in the routers would stay as they are today; we'd just have two fewer idle LVS hosts in the core-site case. For the edge-site case, the picture is as above but with 2x traffic classes and 3x LVS hosts, instead of 3x traffic classes and 4x LVS hosts. There are some puppetization refactoring issues to sort through here, as the puppetization currently assumes a 1:1 mapping (in at least one direction) of traffic-class:host.
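As a rough illustration of the layout described above, here is a minimal Python sketch of the host-to-traffic-class mapping and how failover would resolve under it. Only the lvs1004 entry (all three classes, MED 20) comes from the table; the primary host names lvs1001-3, their MED of 10, and the helper names `announcers_by_class` and `active_host` are assumptions made up for the example, and it is not how pybal or the puppetization actually models this. It relies on the standard BGP behaviour that the route with the lowest MED is preferred.

```python
from collections import defaultdict

# Hypothetical layout. Only the lvs1004 row is taken from the table above;
# the primary hosts and their MED value of 10 are illustrative assumptions.
HOSTS = {
    "lvs1001": {"classes": ["high-traffic1"], "med": 10},
    "lvs1002": {"classes": ["high-traffic2"], "med": 10},
    "lvs1003": {"classes": ["low-traffic"], "med": 10},
    "lvs1004": {
        "classes": ["high-traffic1", "high-traffic2", "low-traffic"],
        "med": 20,
    },
}


def announcers_by_class(hosts):
    """Map each traffic class to its (med, host) pairs, lowest MED first."""
    by_class = defaultdict(list)
    for host, cfg in hosts.items():
        for traffic_class in cfg["classes"]:
            by_class[traffic_class].append((cfg["med"], host))
    return {tc: sorted(pairs) for tc, pairs in by_class.items()}


def active_host(hosts, traffic_class, down=()):
    """Return the surviving host with the lowest MED for a class, or None."""
    candidates = [
        (med, host)
        for med, host in announcers_by_class(hosts).get(traffic_class, [])
        if host not in down
    ]
    return candidates[0][1] if candidates else None
```

In this sketch every traffic class has exactly two announcers (its primary plus the shared failover), which is the N+1 property, and the host-to-class mapping is no longer 1:1 in the host-to-class direction. A `None` result corresponds to the case where no LVS host is left for a class, which is where the routers' static fallbacks would come in.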
The original first target for this new config was the new 3x LVS hosts being deployed in ulsfo (already arrived), but things are complicated there by a lack of excess power/space and a desire to do a single outage to replace all the servers and re-cable. That doesn't give us a ton of overlap time to sort out the LVS refactoring.
I think a better option at this point is to pursue this new config on the 10G eqiad LVSes in T150256. We can sort out the details on them while lvs1001-6 continue running for now. It also happens to solve another medium-term problem there: eqiad lacks a full complement of 10G ports for the new LVSes, and this reduces the new LVS count from 6 to 4. Once the puppetization work is done there, it should be easy to bring up similarly-altered new clusters in ulsfo and then asia.