
Run IPVS in a separate network namespace
Open, MediumPublic


Our LVS directors (lvsNNNN servers) currently have service IPs bound on loopback (label lo:LVS); this is a limitation of how IPVS works in the kernel, and it has the effect that any service that runs on the server itself and binds to INADDR_ANY/in6addr_any also listens on the service IPs. This has resulted in long-standing issues such as T103882, as well as hacky commits like rOPUP64df843e00e7 (for T100519), which in turn protects against the race of port 22 being answered by the host itself before IPVS is configured.
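To illustrate the problem (the service IP 10.2.1.1 below is made up, and this needs root): a daemon bound to the wildcard address answers on any IP configured on the host, including IPs added to lo:

```shell
# A wildcard-bound daemon (e.g. sshd on 0.0.0.0:22) answers on a
# service IP as soon as it's configured on loopback:
ip address add 10.2.1.1/32 dev lo label lo:LVS
ss -lnt '( sport = :22 )'    # 0.0.0.0:22 matches *every* local address
nc -vz 10.2.1.1 22           # the host itself answers, not IPVS
```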

Besides being dirty, all of this is actually insecure, especially if we take into account that lvsNNNN boxes are currently not protected by netfilter due to performance considerations, but are nevertheless exposed to the Internet with multiple public IPs.

A much cleaner solution would be to completely separate, within Linux, the IPs of IPVS-as-a-load-balancer from those of the LVS host itself, and treat them separately for all intents and purposes.

A way to do this is by employing Linux's network namespaces. I've experimented a bit on my computer and I think the following solution would work:

  • Create an "ipvs" network namespace (ip netns add ipvs)
  • Bring lo up (ip netns exec ipvs ip link set lo up) and set up service IPs bound on it: ip netns exec ipvs ip address add … dev lo label lo:LVS (which is, confusingly, what the wikimedia-lvs-realserver package does right now).
  • Set sysctls net.ipv4.conf.all.arp_ignore=1 and net.ipv4.conf.all.arp_announce=2 also under that namespace (currently also done by wikimedia-lvs-realserver — not sure if it's needed at all). Also, optionally but preferably, disable IPv6 autoconfiguration in that namespace.
  • Move the VLAN interfaces to the namespace itself, ip link set eth0.100 netns ipvs (after potentially creating them with ip link add link eth0 name eth0.100 type vlan id 100, if /etc/network/interfaces doesn't do it for us, see below) and assign IPs to them ip netns exec ipvs ip address add … dev eth0.100. Note that while for the "other row" interfaces (eth1/2/3), we could in theory move the whole interface into the namespace first, then create the VLAN subinterfaces *in* the namespace, care should be taken regarding the /sys/class/net/ethN hierarchy which we heavily rely on for our RPS/RSS/XPS adjustments.
  • Assuming all of the above is done, there is still an issue with the traffic intended for the server's primary VLAN, which is typically bound on eth0. The problem here is, essentially, that we use eth0 both for the server's own traffic, and for directing traffic to realservers that are on the same VLAN. There are two solutions to this that I can see:
    1. Wire the primary VLAN separately; keep eth0 for the "main" connectivity and eth1/2/3/4 for the rows interconnection. This is the cleanest but a) requires us to rewire all LVS servers, b) most of our servers don't have a separate 1G connection, I think...
    2. Create a macvlan interface on eth0, with a different set of IPs and then assign *that* to the namespace, essentially emulating the separate network card above. The interface could be named after the VLAN it represents, so e.g. for esams: ip link add link eth0 eth0.103 type macvlan; ip link set eth0.103 netns ipvs; ip netns exec ipvs ip link set eth0.103 up; ip netns exec ipvs ip address add … dev eth0.103
  • Finally, give the namespace a default gw (ip netns exec ipvs ip route add default via …) using one of the interfaces so that ICMP (pings, unreachables) etc. work.
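Put together, the whole sequence might look like the following sketch. VLAN ID 100 and all IPs are illustrative, this needs root, and it hasn't been validated on an actual lvsNNNN host:

```shell
# Create the namespace and bring its loopback up
ip netns add ipvs
ip netns exec ipvs ip link set lo up

# Bind the service IPs on lo, as wikimedia-lvs-realserver does today
ip netns exec ipvs ip address add 10.2.1.1/32 dev lo label lo:LVS

# ARP sysctls, scoped to the namespace
ip netns exec ipvs sysctl -w net.ipv4.conf.all.arp_ignore=1
ip netns exec ipvs sysctl -w net.ipv4.conf.all.arp_announce=2

# Create a VLAN subinterface, move it into the namespace, and address it
ip link add link eth0 name eth0.100 type vlan id 100
ip link set eth0.100 netns ipvs
ip netns exec ipvs ip link set eth0.100 up
ip netns exec ipvs ip address add 10.64.0.12/22 dev eth0.100

# Default gateway inside the namespace, so ICMP etc. work
ip netns exec ipvs ip route add default via 10.64.0.1
```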

The following changes would need to be implemented to do all this, which I'm still researching:

  • ifupdown (/etc/network/interfaces) does not support namespaces (see Debian bugs #651919, #743309). It's possible we could run a separate instance of it with a separate interfaces file, but then we need a separate state file as well, which is not configurable, so this essentially means a privately-mounted /run in a mount namespace… it gets hairy real quick. It might be just simpler to do all of this IP configuration ourselves.
  • The wikimedia-lvs-realserver package would need to be adjusted — or even better, deprecated entirely, at least for the directors (for which it's confusingly named anyway). Its contents could be folded into whatever solution we find for the above, as part of the overall IP setup.
  • Pybal needs to be adjusted; we'll need to wrap the ipvsadm calls with ip netns exec (which is fully supported and tested). If/when Pybal switches to using Netlink sockets directly, we'll need to figure out a way to do that in a separate thread and call setns() in it, which isn't going to be very straightforward (thread management in Python, ctypes, etc…). There is another pretty serious catch here too: the inbound traffic would need to arrive inside the namespace, and thus the advertised BGP routes shouldn't have "next-hop self", but instead point to the IP bound to the aforementioned macvlan interface or one of the cross-row interfaces (or even load-balance across them; or pick the row which has the most in-subnet realservers configured for that service IP!)
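For the ipvsadm path, the wrapping itself is straightforward; a sketch (the service IP, realserver IP, and weight below are made up):

```shell
# Configure an IPVS service inside the namespace instead of the main one
ip netns exec ipvs ipvsadm -A -t 10.2.1.1:80 -s wrr
ip netns exec ipvs ipvsadm -a -t 10.2.1.1:80 -r 10.64.0.50:80 -g -w 10

# Listing the tables also has to happen inside the namespace
ip netns exec ipvs ipvsadm -L -n
```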

Other things that need further investigation:

  • The performance impact of both running IPVS in a namespace and using a macvlan are largely unknown. Namespaces are supposed to be lightweight but our LVS workload can be pretty demanding.
  • If all of this gets implemented, it's quite possible that this will pave the road to implementing regular netfilter rules in the main namespace (which will be, in theory, oblivious to all the IPVS stuff happening in the other namespace) and thus have those servers be a more "regular" part of the infrastructure. This would need to be explored further and in particular whether this will present performance issues as well.
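As a sketch of what that could eventually allow, the main namespace could carry an ordinary host firewall, oblivious to the IPVS traffic; these rules are entirely hypothetical and not a proposed policy:

```shell
# Plain stateful host firewall in the main namespace; IPVS traffic
# flows through the "ipvs" namespace and never hits these chains.
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
iptables -P INPUT DROP
```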

Event Timeline

faidon raised the priority of this task to Medium.
faidon updated the task description.
faidon added projects: acl*sre-team, Traffic, Pybal.
faidon added a subscriber: faidon.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!