The cloudlb servers have a special setup in which they are connected like this:
- to wikiland production networks (10.x) natively (default)
- to cloud-private subnets (172.20.x) on a VLAN interface.
- to the internet via BGP using a VIP through the cloud-private VLAN.
This VIP can receive traffic from the internet. But as of this writing, the return traffic will use the default route on the host, which points to the wikiland production network. Therefore the traffic never makes it back to the client, because of the asymmetric routing.
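To illustrate the problem, here is a minimal sketch of the route lookup on a cloudlb host. The prefixes, interface names and client address are assumptions made up for the example (only the 10.x / 172.20.x split comes from the setup described above), and the lookup is a plain longest-prefix match, not the actual kernel logic.

```python
import ipaddress

# Hypothetical routing table of a cloudlb host: (prefix, outgoing interface).
# Prefix lengths and interface names are illustrative assumptions.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "production"),  # default route
    (ipaddress.ip_network("10.0.0.0/8"), "production"),
    (ipaddress.ip_network("172.20.0.0/16"), "cloud-private-vlan"),
]


def egress_interface(destination: str) -> str:
    """Return the interface selected by longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, dev) for net, dev in ROUTES if addr in net]
    _, dev = max(matches, key=lambda match: match[0].prefixlen)
    return dev


# Traffic to the VIP arrives through the cloud-private VLAN.
INGRESS = "cloud-private-vlan"
client = "198.51.100.7"  # documentation-range address standing in for an internet client

reply_dev = egress_interface(client)
asymmetric = reply_dev != INGRESS
print(f"request in via {INGRESS}, reply out via {reply_dev}, asymmetric={asymmetric}")
```

With a single routing table, the reply to any internet client matches only the default route, so it leaves through the production network instead of the VLAN it arrived on.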
We already anticipated this problem when we were originally planning this project, see https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Iteration_on_network_isolation#Default_route_for_servers_connected_to_cloud-private
The options to address this include:
Option 1
Change the default native network to be cloud-private rather than wikiland production.
Option 2
Introduce a VRF / l3mdev in cloudlb servers, to allow having 2 separate routing tables with 2 different default routes.
We will need to configure the services to use the right VRF for their operations. This includes:
- BIRD BGP session with cloudsw. @cmooney has validated that BIRD can work in this setup.
- HAProxy backend connectivity.
A VRF is what cloudgw uses for similar reasons.
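A rough sketch of what option 2 could look like at the iproute2 level. The VRF name, table number, VLAN interface name and gateway address below are all hypothetical placeholders, not the real cloudlb values, and the real change would presumably be done through configuration management rather than ad-hoc commands. The script only prints the commands, it does not run them.

```python
import shlex


def vrf_setup_commands(vrf: str, table: int, vlan_dev: str, gateway: str) -> list[list[str]]:
    """Build the iproute2 commands for a VRF with its own default route."""
    return [
        # Create the VRF device, bound to a dedicated routing table.
        ["ip", "link", "add", vrf, "type", "vrf", "table", str(table)],
        ["ip", "link", "set", "dev", vrf, "up"],
        # Enslave the cloud-private VLAN interface to the VRF.
        ["ip", "link", "set", "dev", vlan_dev, "master", vrf],
        # Second default route, living only in the VRF table.
        ["ip", "route", "add", "default", "via", gateway, "table", str(table)],
    ]


# All values below are assumptions for illustration only.
commands = vrf_setup_commands(
    vrf="vrf-cloud", table=10, vlan_dev="vlan100", gateway="172.20.5.1"
)
for cmd in commands:
    print(shlex.join(cmd))
```

The production default route stays in the main table, while anything bound to the VRF device would use the default route in table 10. Services then need to be bound to the VRF somehow, for example by starting them with `ip vrf exec` or by binding their sockets to the VRF device.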
Option 3
Introduce a Linux network namespace (netns). Similar to option 2, but more transparent to BIRD / HAProxy.
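A rough sketch of what option 3 could look like at the iproute2 level. The namespace name, interface name and addresses are hypothetical placeholders. The script only prints the commands, it does not run them.

```python
import shlex


def netns_setup_commands(netns: str, vlan_dev: str, address: str, gateway: str) -> list[list[str]]:
    """Build the iproute2 commands to move a VLAN interface into a netns."""
    inside = ["ip", "netns", "exec", netns]
    return [
        ["ip", "netns", "add", netns],
        # Moving an interface into a namespace drops its addresses,
        # so they have to be configured again inside the namespace.
        ["ip", "link", "set", "dev", vlan_dev, "netns", netns],
        inside + ["ip", "link", "set", "dev", "lo", "up"],
        inside + ["ip", "addr", "add", address, "dev", vlan_dev],
        inside + ["ip", "link", "set", "dev", vlan_dev, "up"],
        # The namespace has its own routing table and its own default route.
        inside + ["ip", "route", "add", "default", "via", gateway],
    ]


# All values below are assumptions for illustration only.
commands = netns_setup_commands(
    netns="cloud-private", vlan_dev="vlan100",
    address="172.20.5.2/24", gateway="172.20.5.1",
)
for cmd in commands:
    print(shlex.join(cmd))
```

A daemon started with `ip netns exec cloud-private ...` would only see the interfaces and routes of that namespace, which is why this approach should need no VRF awareness in the daemon itself. The flip side is that such a daemon has no direct access to the production network.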
Option 4
Use some kind of hack to allow the asymmetric routing, for example disabling the reverse path filter somewhere.
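One detail to keep in mind if the reverse path filter is touched on a Linux host: the kernel uses the maximum of `net.ipv4.conf.all.rp_filter` and `net.ipv4.conf.<interface>.rp_filter` when validating sources on an interface, so relaxing only the per-interface value may have no effect. A small sketch of that rule follows. The function name is made up for the example, and this says nothing about filtering done by other devices on the path.

```python
# Meaning of the rp_filter sysctl values on Linux.
RP_FILTER_MODES = {
    0: "no source validation",
    1: "strict mode",
    2: "loose mode",
}


def effective_rp_filter(conf_all: int, conf_iface: int) -> int:
    """Return the value the kernel applies on an interface: max(all, interface)."""
    return max(conf_all, conf_iface)


# Disabling the filter only on the interface is not enough if 'all' is strict.
value = effective_rp_filter(conf_all=1, conf_iface=0)
print(f"effective rp_filter={value} ({RP_FILTER_MODES[value]})")
```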