
move cloudelastic behind cloudlb
Closed, Declined · Public

Description

We have a new load balancer layer cloudlb that can be used to expose services to the WMCS realm from hardware in the production realm without having to waste prod-realm public IP addresses by using LVS. We should investigate if we can move the cloudelastic service behind that.

Event Timeline

Thanks @taavi.

Is there a high-level view of what these hosts do? Checking on some of them, I notice a bunch of connections from MediaWiki servers directly to the Elasticsearch API.

There are also a smaller number of connections coming from our LVS, again to Elasticsearch. I guess that's the primary service they run? What exactly makes these "cloud" servers?

taavi moved this task from Backlog to CloudElastic on the Data-Services board.

These are mirrors of the MediaWiki search indexes for WMCS clients, and I believe they are maintained by Data-Platform-SRE (formerly search platform?). So MediaWiki code updates the Elasticsearch contents, and clients in WMCS query those contents. I think the main question from the network side is how to handle the MediaWiki->Elasticsearch data flow.

I see there's new hardware coming to replace the previous ones in T342538: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org, so the timing here is almost perfect to get them racked in WMCS racks instead.

Thanks @taavi! Adding Search Platform SWEs and SREs so we can discuss this proposal.

As for Data Platform SRE, we are split between Data Engineering and Search Platform, but @RKemper and I still handle most of the Search Platform duties.

Gehel triaged this task as Low priority. Oct 11 2023, 8:41 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.

@taavi a few questions to clarify scope and amount of work required, since we've already been asked to move these servers from public to private IPs.

  • Is re-racking the hosts a hard requirement for moving behind cloudlb?
  • Are there other examples of migrating a service behind cloudlb?
  • Are there any other docs or tasks you think would be useful for us to review?

@taavi a few questions to clarify scope and amount of work required, since we've already been asked to move these servers from public to private IPs.

Overall I think moving to private IPs makes sense. The existing approach - using a public IP to support connections from VMs without breaking our cross-realm traffic policy - is an anti-pattern that needlessly wastes public IPs and adds security concerns.

  • Is re-racking the hosts a hard requirement for moving behind cloudlb?

It really depends on how we set it up. If we follow the exact setup of existing servers behind CloudLB, then yes. But this setup needs some cross-realm connectivity we didn't need for other services, so there is no exact template we can follow.

  • Are there other examples of migrating a service behind cloudlb?

Yes (see T341060), although none are a good fit.

There are basically three options as I see it. @ayounsi I'd be interested in what your thoughts were here:

Option 1

  • Keep cloudelastic behind WMF LVS load-balancers, but move them to private IPs
    • Clients in cloud would continue to use the public VIP to connect to the service

The downside with this is that we still have one public IP used for the service (which doesn't need to be internet reachable), and we still have VM traffic connecting directly to something on the WMF side (LVS).

Option 2

  • Cloudelastic stays outside cloud racks, but move to the wmf private (10.x) vlans
    • Cloud LB announces a VIP to the cloud-private network that cloud hosts/VMs can use to reach the service
    • Cloud LB makes the back-end connections to cloudelastic hosts over its 10.x interface (we'd need to allow this in the ACLs)

The difference there is that the original concept of CloudLB was that the real (back-end) servers would be within the cloud realm. This change makes CloudLB front a service whose real servers are in 10.x WMF land.

Option 3

  • Move cloudelastic to the cloud-racks, keeping the LB -> cloudelastic traffic inside cloud-private
    • We'd reimage the cloudelastic hosts with the normal cloud networking setup
      • i.e. main link on cloud-hosts vlan in wmf realm, with secondary/vlan interface connected to the cloud-private network
    • The CloudLB would announce a VIP on 172.20.255.x as in option 2, cloud hosts / VMs connect to this
    • The back-end connections from CloudLB to cloudelastic would be over the 172.20.x.x cloud-private network
    • The cloudelastic hosts would however need to make cross-realm connections to mediawiki over their 10.x prod link (with ACLs updated to allow)

Preference

All of the options involve some "cross realm" traffic. That is unavoidable as the source of the data (mediawiki) is in WMF prod, and the nodes accessing it (via cloudelastic) are Cloud VMs.
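To make that claim concrete, here is a minimal Python sketch that lists the flows for each option and picks out the ones that cross realms. The realm labels are my reading of the option descriptions above (in particular, treating CloudLB as being on the cloud side), and the names `FLOWS` and `cross_realm_flows` are made up for illustration. It is not a description of any real tooling.

```python
# Hypothetical model of the three options; realm labels are assumptions
# based on the descriptions in this thread, not authoritative data.
WMF = "wmf"
CLOUD = "cloud"

# Each flow is (source, source_realm, destination, destination_realm).
FLOWS = {
    1: [
        ("cloud VM", CLOUD, "LVS VIP", WMF),
        ("LVS", WMF, "cloudelastic", WMF),
        ("MediaWiki", WMF, "cloudelastic", WMF),
    ],
    2: [
        ("cloud VM", CLOUD, "CloudLB VIP", CLOUD),
        ("CloudLB", CLOUD, "cloudelastic", WMF),
        ("MediaWiki", WMF, "cloudelastic", WMF),
    ],
    3: [
        ("cloud VM", CLOUD, "CloudLB VIP", CLOUD),
        ("CloudLB", CLOUD, "cloudelastic", CLOUD),
        ("cloudelastic", CLOUD, "MediaWiki", WMF),
    ],
}


def cross_realm_flows(option):
    """Return the (source, destination) pairs whose endpoints are in different realms."""
    return [
        (src, dst)
        for src, src_realm, dst, dst_realm in FLOWS[option]
        if src_realm != dst_realm
    ]


if __name__ == "__main__":
    for option in sorted(FLOWS):
        print(option, cross_realm_flows(option))
```

Under these assumptions every option ends up with exactly one realm-crossing flow; the options differ only in where that crossing happens.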

Option 1 is probably the simplest. However, we are using a public IP to work around our own restrictions on what cloud VMs can talk to, which is an anti-pattern we want to get rid of.

Option 2 introduces a model whereby the CloudLB fronts a service where the real servers are in the WMF realm. This somewhat goes against the concept behind the cloud-private model: that the cloud hosts would only use their 10.x link for control-plane traffic (SSH, puppet, monitoring etc). But option 3 also requires such cross-realm connections.

Option 2 is my preference overall. Making CloudLB the proxy between realms seems a good design choice, providing a central point for policy enforcement. It also does not require us to move any servers physically, and removes any requirement for public IPs.

Next steps

If we agree on option 2 the next steps would be:

  1. Assign an IP for the cloud-elastic service from the cloud-private vip range
  2. Configure the CloudLBs to announce this in BGP to cloudsw
  3. Configure the CloudLB HAProxy to load-balance connections for this IP towards the cloudelastic hosts
  4. Allow the connectivity from CloudLB -> cloudelastic on our core router ACLs
  5. Reconfigure clients of the service to use the new svc.private.eqiad.wikimedia.cloud hostname/VIP

I think the CloudLB could load-balance to cloudelastic hosts on either public or private vlans. So we could probably set up the above with them on their current IPs, then when that is in place migrate hosts one-by-one from public to private. We just need to update the CloudLB with the new IPs as they change.
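As a rough illustration of that one-by-one migration, the Python sketch below keeps a host-to-IP map, swaps one host's address at a time, and re-renders an HAProxy-style backend stanza after each change. Everything specific in it is an assumption for the example: the host names, the documentation-range addresses, the backend name, and port 9200 (the Elasticsearch default, which may not be what cloudelastic actually listens on). It does not reflect how the CloudLB configuration is really generated.

```python
def render_backend(name, servers, port=9200):
    """Render an HAProxy-style backend stanza from a {host: ip} mapping.

    The port default is an assumption (stock Elasticsearch HTTP port).
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    for host, ip in sorted(servers.items()):
        lines.append(f"    server {host} {ip}:{port} check")
    return "\n".join(lines)


def migrate(servers, host, new_ip):
    """Return a copy of the mapping with one host moved to its new IP."""
    if host not in servers:
        raise KeyError(f"unknown host: {host}")
    updated = dict(servers)
    updated[host] = new_ip
    return updated


if __name__ == "__main__":
    # Hypothetical hosts on placeholder "public" addresses (192.0.2.0/24 is
    # reserved for documentation).
    servers = {"host-a": "192.0.2.10", "host-b": "192.0.2.11"}

    # Move one host at a time to a placeholder private address and
    # regenerate the stanza after each step.
    for host, new_ip in [("host-a", "10.0.0.10"), ("host-b", "10.0.0.11")]:
        servers = migrate(servers, host, new_ip)
        print(render_backend("cloudelastic", servers))
        print()
```

Between the two steps the backend contains a mix of old and new addresses, which matches the point that the load balancer does not care which vlan each back-end is on as long as it can reach it.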

Thanks for the thorough comment!

My vote goes to option 1 :)

  • It's a design we've done 1000 times (expose a prod service externally through the LVS), so it's "battle tested"
  • Doesn't increase complexity, doesn't introduce a new "domain crossing flow"
  • Easier to transition to
  • More flexible (eg. if this service needs to be exposed to more than CloudVMs)
  • I don't see the extra/current IP usage as a big enough downside
  • "still have VM traffic connecting directly to something on the WMF side (LVS)." that will always be true as the data is in prod WMF anyway.

My vote goes to option 1 :)

Ok. I've no strong objection.

  • "still have VM traffic connecting directly to something on the WMF side (LVS)." that will always be true as the data is in prod WMF anyway.

It's probably not great to expose it to the internet, but I'm sure nftables rules restrict access. The other major difference is that the CloudLB could be used to enforce some kind of policy or restriction on exactly which VM IPs are allowed to connect, etc. But that's also not something that's in place, so I've no objection.

@bking I think you can proceed. We can reimage the servers onto the private vlan and everything else stays the same as it was. The process for that is a little clunky but fairly straightforward. I can step you through it if needed, just ping me.

Thanks @cmooney, @taavi and @ayounsi. I've created T355617 for the private IP migration and will reach out after discussing the timetable with my team lead @Gehel.