
Separate WMCS control and management plane traffic
Open, Low, Public

Description

Management vs Control-Plane Traffic

The current WMCS physical network setup (i.e. outside OpenStack / Neutron) is documented here.

Broadly speaking, cloud hosts have two connected NICs, one belonging to the "production" WMF realm and one belonging to the "cloud" realm. Recent changes have increased this isolation by introducing VRFs on the cloud switches, supporting routing and failover at layer 3 while retaining the required isolation.

Ideally the "production realm" link from cloud hosts would only be used for management functions, including:

  • Communicating with install* hosts during reimage (DHCP, pulling OS image, apt packages etc.)
  • Scraping from WMF prometheus or other monitoring
  • Running cookbooks from cumin hosts
  • SSH access to the host OS

Right now, however, the production realm / private network is also used by various cloud services / daemons for communication. I am not familiar with all of these, but for instance RabbitMQ traffic goes through the production realm, as does Ceph "public network" traffic from clients to the ceph cluster nodes.

New Network

Logically it makes more sense, I think, for traffic between cloud hosts, for cloud services, to remain in the cloud realm / vrf. So basically I'd propose that a new network/subnets be created, with private IPv4 addressing and unrouted IPv6 global unicast (if v6 required), to serve as transport for such protocols. This network would be placed into the existing cloud vrf on those switches, or a separate VRF if a strong requirement for isolation from cloud-instance or cloud-storage networks is identified.

Such a setup might also reduce the requirement for certain cloud services to sit on public vlans. It appears that certain services, like those running on the cloudcontrol and cloudrabbit hosts, are currently built to use public IPs to ensure cloud instances (VMs) can communicate with them. Instances are blocked from communicating directly with private WMF space otherwise.

If that is in fact the case, it would seem a waste of public IPv4 space, and it potentially opens up an attack surface (services running on public IPs and thus routable from the internet) when that is not required. The cloudgw nodes could instead be connected to the proposed new network and route traffic coming from the cloud realm directly to services running on it, without leaving the cloud realm / vrf. Or the cloudnet hosts could do this, depending on where the WMCS team feels it would best sit.

802.1q handoff

Connecting a third physical NIC to each cloud host is not practical from a cost and operational point of view. So to connect to this third "control" network we would probably want to hand off from datacenter switches to hosts as "vlan trunks", logically segmenting the different vlans through the use of 802.1q vlan tags.

That may require updates to what facts we expose in Puppet (see T296832), to allow for automatically setting switch ports to trunk mode with the correct vlan list. We also need a mechanism to define tagged interfaces on the cloud hosts. Our LVS hosts already configure sub-interfaces via puppet, see here and here, which might be an example to follow.
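As a rough sketch of what a tagged sub-interface could look like on a cloud host (the interface name, vlan ID, and addressing below are hypothetical; in practice this would be templated via Puppet, as on the LVS hosts):

```shell
# Sketch of an 802.1q tagged sub-interface on a cloud host, assuming the
# physical NIC is eno1 and the (hypothetical) cloud-private vlan ID is 2151.
# The switch port must be set to trunk mode carrying this vlan tag.

# Create the tagged sub-interface and bring it up:
ip link add link eno1 name eno1.2151 type vlan id 2151
ip link set dev eno1.2151 up

# Assign the host's address on the (hypothetical) per-rack subnet:
ip addr add 172.20.1.10/24 dev eno1.2151
```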

Conclusion

I'm opening this task to solicit feedback on the idea; comments are more than welcome. I appreciate that migrating from the current setup might be a long or tricky job, but if we can at least establish whether the end result is desirable, we can then consider whether it is practical to get there.

Overall I think it definitely makes sense to have this kind of traffic separated from the management stuff. In the long run I think it'd make things easier and more flexible for all teams.

Event Timeline

cmooney created this task.

Thanks for this task and the clear write-up. I agree with the overall problem statement and ideas to solve it. Adding some thoughts and (historical) context.

This is what case 4 of https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines#Case_4:_using_isolation_mechanisms is intended for

Some work was started in that regard with T297587: PoC: have cloud hardware servers to the cloud realm using neutron VLAN.
The main difference here (as I understand it) is that this PoC doesn't go as far as creating a new vlan but uses the existing cloud-instances one, which has the potential of covering cases such as the cloudcontrol/cloudrabbit hosts.
Pros and cons of creating a new vlan should be evaluated against this approach, based on the limitations of the software involved. For example, only having Ceph replication traffic on the cloud-hosts network could be fine.

There were also some concerns from SRE about segmentation and isolation, as in: someone breaking into such a host from the cloud realm could gain access to the prod realm. But I don't think I agree with those concerns.

On the 802.1q handoff, I can't find a task, but there were also discussions about moving away from the 2 NICs per host to a trunk model, with tests to be done in the codfw cloud realm. The main benefits are lower hardware cost (we currently use twice as many switches per rack in WMCS as in prod), as well as ease of management (fewer cables to handle, fewer interfaces to configure).

I had a good chat with @aborrero today on some ideas on how to progress towards this goal. Some notes / additional thoughts based on this

New Vlans
  • We can call the vlans for the cloud realm control-plane traffic "cloud-private-<rack>"
    • This matches the use of "private1-<rack>" in production and is hopefully fairly intuitive
  • As a rule of thumb all cloud servers have a leg in their local cloud-private subnet
    • We should choose a 'supernet' for the cloud-private ranges, and allocate all the per-rack subnets from it
    • Each cloudsw will be the default gateway for the local vlan/subnet (using .1 addr, configured within the cloud vrf)
    • Cloud hosts will need a static route for the 'supernet' towards this IP
      • Matching what we did with the cloud-storage vlans
    • Reason for static is we can't have two default routes, and existing default is via prod realm 10.x gateway
  • Probably makes sense to choose a /16 from 172.16.0.0/12 for the supernet, and allocate per-rack /24s from this.
  • We should probably dedicate a separate /24 from it for service IPs/VIPs
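The per-host routing described above could look something like the following (purely illustrative: 172.20.0.0/16 is a hypothetical choice of supernet from 172.16.0.0/12, and the gateway/interface names are assumptions):

```shell
# Hypothetical example, assuming 172.20.0.0/16 is chosen as the cloud-private
# supernet, the local per-rack subnet is 172.20.1.0/24, and the cloudsw is the
# gateway on .1. The host keeps its existing default route via the prod realm
# 10.x gateway, and adds a single static route covering all cloud-private
# subnets (matching what was done for the cloud-storage vlans):
ip route add 172.20.0.0/16 via 172.20.1.1 dev eno1.2151
```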
Cloud Load-Balancers
  • New CloudLBs will use BGP to announce service IPs / VIPs to their directly connected cloudsw
    • Active/Passive is probably the easiest way to operate this on day 1; the backup LB should announce service IPs with the as-path prepended.
    • If active host dies then backup routes get used instead
    • HAproxy, or other software on the box, can also manipulate the BGP attributes, withdraw routes etc. to affect which LB is used
    • Full active/active Anycast is also an option, but we can consider that additional complexity later probably
  • /32 Service IPs should be from the cloud-private supernet if the service only needs to be reachable within the cloud realm
  • /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud
  • Cloud-private ranges are not announced to CR routers by the cloudsw's.
  • CloudLB forwards traffic back out via the cloud-private vlan to "real servers" running the various services
    • HAproxy controls this
  • Real servers can do "direct return" (via cloudsw IP on cloud-private) for return traffic
    • i.e. no need for the return traffic to route via the CloudLB in that direction
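One conventional way to set up "direct return" on a real server is to bind the VIP to the loopback so the host accepts forwarded traffic while replies leave directly via the cloudsw gateway. A minimal sketch, with a hypothetical VIP from the assumed service /24:

```shell
# Sketch of direct return on a "real server", assuming 172.20.255.10 is the
# (hypothetical) service VIP the CloudLB forwards traffic to. Binding it to
# the loopback lets the host accept packets addressed to the VIP; return
# traffic then routes straight out via the cloudsw IP on cloud-private.
ip addr add 172.20.255.10/32 dev lo

# Suppress ARP for the VIP so the host never answers for it on the shared vlan:
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```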
Routing for cloud instances / OpenStack VMs
  • Cloudnet hosts will have a leg in the cloud-private vlan, to reach services, but this will be in the main netns
  • VM traffic continues to route through the neutron-created netns, getting forwarded to cloudgw as before
  • CloudGW becomes the control point, where fw rules etc. can be added to control routing between cloud-instances and cloud-private
  • There are some complications in terms of what to do on CloudGW
    • Simplest option is probably if cloudgw is not directly connected to the cloud-private subnet
      • As they already have a routed link into the cloudsw cloud vrf (Vlan1120)
    • VM traffic from cloudnet will be forwarded using this existing link
      • Just need to make sure it's not NAT'd like traffic for the internet
      • So add an earlier nftables rule matching destination <cloud-private> and do not snat
      • Return traffic to VM range 172.16.0.0/21 already routes from cloudsw back to cloudgw VIP on this interface
    • If cloudgw itself needs to reach services on the cloud-private range (from main vrf), a static via the VRF table can be added:
      • ip route add <cloud-private-supernet> via 185.15.56.242 dev vrf-cloudgw
      • Hosts on cloud-private subnets may need a specific route back to 185.15.56.240/29 to support this
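The "do not SNAT" exception on cloudgw might be sketched as below (an assumption about how the existing ruleset is organised: it presumes NAT happens in an `ip nat postrouting` chain and uses the hypothetical 172.20.0.0/16 supernet):

```shell
# Sketch of the nftables exception on cloudgw, assuming the cloud-private
# supernet is 172.20.0.0/16 and internet-bound VM traffic is currently SNAT'd
# in a postrouting chain of the "ip nat" table. Inserting an accept rule at
# the head of the chain stops evaluation before the snat/masquerade rule, so
# VM traffic towards cloud-private services keeps its original source address:
nft insert rule ip nat postrouting ip daddr 172.20.0.0/16 accept
```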
Result
  • This approach minimizes the amount of cross-realm traffic we have
    • Cloud nodes no longer need to cross realms to connect to OpenStack services, rabbitmq, Swift etc.
  • It minimizes the use of public IPv4 space
    • /32s can be announced by the CloudLB
    • Cloud services which don't need to be reachable publicly can use private service VIP within cloud realm

I had a good chat with @aborrero today on some ideas on how to progress towards this goal. Some notes / additional thoughts based on this

I agree with all this information.

I'll try to capture this in https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Iteration_on_network_isolation for posterity

Probably makes sense to choose a /16 from 172.16.0.0/12 for the supernet, and allocate per-rack /24s from this.

Please keep in mind Docker uses at least 172.17.0.0/16 - I'd prefer avoiding those so we don't need to implement workarounds/config changes

/32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud

Umm. So eqiad1 services can use internal IPs, but codfw1dev services need to use public IPs in the eqiad1 space. Wouldn't it make more sense to also use (possibly separate) internal subnet for codfw1dev services?

Real servers can do "direct return" (via cloudsw IP on cloud-private) for return traffic

I don't think you can do that with HAProxy - are you planning to use some other (L4) load balancer?

/32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud

Umm. So eqiad1 services can use internal IPs, but codfw1dev services need to use public IPs in the eqiad1 space.

I'm not sure I understand you fully, but on the face of it that's not what I had in mind. The point here is that the cloud-private ranges in eqiad are local-only, anything that needs to support connections from elsewhere should use a public IP.

Wouldn't it make more sense to also use (possibly separate) internal subnet for codfw1dev services?

Absolutely. We'll mirror anything we do in Eqiad in codfw, and likely test there first.

I am not anticipating stretching the cloud-vrf between cloudsw in eqiad and codfw, however. In other words this plan does not include a way for cloud-private-eqiad subnets to talk directly to cloud-private-codfw subnets.

@aborrero we should maybe discuss that, I'm not sure what cross-site traffic you have already. The private prod ranges are routed across sites, so perhaps some of that is being done?

In theory it would be possible to extend the cloud vrf between sites, maintaining segmentation while allowing the cloud-private networks at either site to talk directly. But it would be a very large job and bring new levels of complexity to our core/WAN network that I'm not sure are justified.

Real servers can do "direct return" (via cloudsw IP on cloud-private) for return traffic

I don't think you can do that with HAProxy - are you planning to use some other (L4) load balancer?

I believe Arturo is considering HAProxy, but it's a matter for the cloud team; we're fairly agnostic either way. I think HAProxy supports it provided things are configured correctly. If not, the traffic can route symmetrically back through the LB. But if the option exists, it's probably better to take the more direct path.

https://www.haproxy.com/documentation/aloha/latest/load-balancing/direct-server-return/

About HAproxy: In my mind, deciding on which particular software we will use in the new service-abstraction / load-balancing layer is for a later stage.
However, we currently have HAproxy running on the cloudcontrol servers (in proxy mode) that we could just relocate to the new cloudlb servers as a means of bootstrapping the architecture.
For other services requiring DSR or UDP abstractions, we could consider other options. HAproxy uses LVS/ipvsadm for them under the hood IIRC, so perhaps we could consider some other more modern options like nftlb (package, source).

/32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud

Umm. So eqiad1 services can use internal IPs, but codfw1dev services need to use public IPs in the eqiad1 space.

I'm not sure I understand you fully, but on the face of it that's not what I had in mind. The point here is that the cloud-private ranges in eqiad are local-only, anything that needs to support connections from elsewhere should use a public IP.

Your comment was written in a way that made me understand that everything used in codfw1dev would need to use public addressing - I'm now guessing that was meant to mean 'everything accessed from the other cloud site where the service is hosted'?

HAproxy uses LVS/ipvsadm for them under the hood IIRC

HAProxy does not. I think you're confusing it with Keepalived which does?

Your comment was written in a way that made me understand that everything used in codfw1dev would need to use public addressing - I'm now guessing that was meant to mean 'everything accessed from the other cloud site where the service is hosted'?

Apologies, I was only talking about the setup in eqiad. I referenced codfw only in relation to accessing things in eqiad from there. Sorry for the confusion!

HAProxy does not. I think you're confusing it with Keepalived which does?

I'd hope we can minimize the use of keepalived, and thus the requirement for L2 adjacency between nodes, as much as possible. Just FYI. I understand that's not always possible of course.

HAproxy uses LVS/ipvsadm for them under the hood IIRC

HAProxy does not. I think you're confusing it with Keepalived which does?

Right, I think I was confused by this: https://www.haproxy.com/documentation/aloha/latest/load-balancing/protocols/udp/ which does indeed use LVS.
Anyway, we won't use that.