cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Closed, Resolved (Public)

Description

What do we do when we have a cloud-dedicated hardware server and we need it to provide service to both the cloud realm and the internet?

Ideas:

  • Allocate a public IPv4 subnet behind cloudgw and put a NIC of each cloud-dedicated server on this subnet. How would load balancing work then?
  • Use neutron VIP as load balancer
  • Use cloudgw to NAT to a private cloudswift subnet
  • Run your own LVS
  • BGP to advertise IP, have VLAN terminate on cloudsw and not cloudgw

This also means extending or better shaping https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines#Case_4:_using_isolation_mechanisms

Event Timeline

Restricted Application added a subscriber: Aklapper.

Thanks @arturo, I think that sums up the options we discussed. A few notes:

Existing Production LVS

One other option that was mentioned, and I'm not sure how viable it is, would be to utilize the existing production LVS. But you made a good point that this doesn't align with the goal of keeping cloud services as a separate realm.

Anycast

The last option you list is basically Anycasting the traffic from the cloudsw across a bunch of servers. This works well: the switch forwards each flow (based on src+dst IP) to one of the servers. If a server dies, its BGP session goes down too, so the switch loses that route and redirects traffic to the remaining servers. You can also add health checks that force the BGP session down on a server when a more subtle problem with a service does not bring the box itself down.
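
As a rough illustration of the behaviour described above, here is a minimal Python sketch of stateless per-flow next-hop selection. It is not how any particular switch ASIC hashes traffic (real hash functions and inputs are vendor-specific); the function names, the server names and the `bgp_up`/`healthy` fields are all hypothetical.

```python
import hashlib
import ipaddress


def advertised_servers(servers):
    """Return the servers that still announce the anycast IP.

    `servers` maps a (hypothetical) server name to a dict with two flags:
    `bgp_up` (the BGP session is established) and `healthy` (the extra
    service health check passes). A failed health check is modelled as
    withdrawing the route, just like a dead BGP session.
    """
    return sorted(
        name for name, state in servers.items()
        if state["bgp_up"] and state["healthy"]
    )


def pick_next_hop(src_ip, dst_ip, next_hops):
    """Pick one next hop for a flow by hashing src+dst IP (no per-flow state)."""
    if not next_hops:
        raise ValueError("no next hops left: the anycast route is withdrawn")
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]
```

The same src+dst pair always maps to the same server as long as the set of advertised servers is unchanged, which is what keeps a flow pinned to one backend without any state on the switch.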

Anycasting is usually stateless on the switch. One problem this causes is that all flows can get remapped if the number of available backends changes, but often that's not an issue, or it can be tolerated in the rare case of failure. More sophisticated load-balancing techniques generally keep track of which flows go to which servers, minimizing the disruption if one fails. They can also have better awareness of the load on each backend, and use more granular criteria to decide which flows go to which box. But if you can get away without that, Anycasting works very well and leaves all the heavy lifting to the switch ASIC.
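
To make the remapping trade-off concrete, the following Python sketch compares a stateless hash-modulo pick with a simple flow table that remembers assignments. Both are simplified assumptions for illustration only: the flow keys, backend names and the `FlowTable` class are hypothetical, and actual switches may use different (for example resilient) hashing schemes.

```python
import hashlib


def stateless_pick(flow, backends):
    """Hash the flow key and take it modulo the number of backends."""
    digest = hashlib.sha256(flow.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def remapped_fraction(flows, before, after):
    """Fraction of flows that sat on a surviving backend but moved anyway."""
    survivors = set(after)
    on_survivor = [f for f in flows if stateless_pick(f, before) in survivors]
    moved = [
        f for f in on_survivor
        if stateless_pick(f, after) != stateless_pick(f, before)
    ]
    return len(moved) / len(on_survivor)


class FlowTable:
    """Stateful alternative: keep each flow on its backend while it exists."""

    def __init__(self):
        self.assignments = {}

    def pick(self, flow, backends):
        current = self.assignments.get(flow)
        if current not in backends:
            # New flow, or its backend is gone: only then choose again.
            current = stateless_pick(flow, backends)
            self.assignments[flow] = current
        return current
```

With a uniform hash and a plain modulo, going from four backends to three moves a large share of the flows that were on healthy servers, whereas the flow table only reassigns flows whose backend disappeared. The price is per-flow state that the load balancer has to hold.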

Existing servers on production public subnets

Finally, we should consider things like cloudcontrol, cloudservices, cloudstore and cloudelastic, which are on the current production public VLAN. Ideally, whichever option is selected would also support migrating those to the cloud realm at some stage in the future.

I did not forget this task, but I have been busy the last few days and have not yet had time to write down my thoughts.

I created and shared a spreadsheet that tries to capture and compare the different options being considered.

We had a meeting today, rough summary:

The idea is roughly:

  • get a couple of new hardware servers (cloudlb?), dual-homed (cloud-dedicated VLAN <-> cloud-host VLAN, exact details TBD).
  • give them a public IPv4 address. Only 1, for a VIP. This IP is allocated from a cloud-dedicated IPv4 pool/CIDR and it will be associated with the wikimediacloud.org domain.
  • introduce keepalived (VRRP) and haproxy (proxying/load balancing) on the new servers
  • have all new services be backends of the above. Initially cloudswift, others likely to follow (cloudcontrol, cloudservices, etc).
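
The idea above can be sketched as a toy model in Python: one VIP held by whichever load balancer wins a priority-based election, with requests spread over the backends behind it. This is only an illustration under stated assumptions, not keepalived or haproxy configuration. The election is a simplification of VRRP (the highest-priority live node holds the VIP; real VRRP tie-breaking and timers are ignored), and the node names, backend names and priorities are hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class LbNode:
    name: str       # hypothetical host name
    priority: int   # higher wins, as with VRRP priorities
    alive: bool = True


def vip_holder(nodes):
    """Return the live node with the highest priority, or None if all are down."""
    live = [node for node in nodes if node.alive]
    if not live:
        return None
    return max(live, key=lambda node: node.priority)


def round_robin(backends):
    """Yield backends in turn, a stand-in for a proxy's round-robin balancing."""
    return cycle(backends)
```

If the node holding the VIP fails, the election moves the VIP to the other node, while clients keep using the same address and the backend list is unchanged.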

Our next steps will be:

  • create a draft with the plans, diagrams and some initial implementation details -- Arturo to bootstrap this in the next few weeks.
  • iterate over the draft until we feel it sounds like an actual plan -- both WMCS & SRE/IF to iterate over it.
  • once we have a clear picture of what the architecture is and how the service will work, talk to the Traffic SRE team and coordinate with them -- both WMCS & SRE/IF to participate in this meeting.
aborrero triaged this task as Medium priority. Dec 13 2021, 10:18 AM
aborrero changed the task status from Open to Stalled. Dec 14 2021, 5:54 PM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

We just re-shifted team priorities. We won't be working on this for now.

aborrero changed the task status from Stalled to In Progress. Oct 17 2022, 3:06 PM
aborrero claimed this task.
aborrero raised the priority of this task from Medium to High.
aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.