
LVS in Analytics VLANs
Open, LowPublic

Description

This task follows a chat between me and Brandon about the options for getting LVS endpoints within the Analytics VLANs.

Background: there is currently no way in Puppet (or elsewhere) to add an LVS endpoint that forwards traffic to a backend composed of hosts in the Analytics VLANs.
Use cases: Analytics/Data Engineering would need low-traffic LVS VIPs for services like the Druid Analytics Brokers (for example, Turnilo and Superset have to specify a single Druid hostname:port combination in their configs as the backend target to fetch data from) and the Hive servers (we currently have one active and one standby, but they could work active/active). More use cases may come in the future, also from the ML side.

Options:

  1. Add a new interface to the low-traffic LVSes for each Analytics VLAN (there are four in eqiad, one for each row). This would allow the LVS hosts to L2-forward to the Analytics VLANs, but it may be controversial from a security perspective. It would add a little tech debt and is not a clean solution, but it should be feasible.
  2. Buy two more LVS nodes (they should be very basic and cheap) to be used only within the Analytics VLANs. This would require time to set them up (with the Traffic team's help), and Analytics would also need to manage them long term (probably shared ownership with SRE). This would be a cleaner solution, but it could mean a lot of work for Analytics. There is probably also some work to be done on the DC-Ops side, since the new nodes would need to be connected to multiple switches, and cross-cabling may be a problem in eqiad these days.
  3. Make the Analytics VLANs part of production, removing the problem entirely. The motivation is that things have changed a lot since the Analytics VLANs were first introduced, so they may not be needed nowadays. Last time we tried (T157806#3075311) the answer was a mild no :)
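For reference, option 1 boils down to adding a tagged sub-interface per Analytics VLAN on each low-traffic LVS host. A minimal sketch with iproute2 (the interface name `eno1`, VLAN ID `1030` and addresses are hypothetical placeholders, not the real eqiad values):

```shell
# Hypothetical example: give an LVS host a leg in one Analytics VLAN.
# "eno1" and VLAN ID 1030 are placeholders; the real trunked interface
# and per-row VLAN IDs would come from Netbox/Puppet.
ip link add link eno1 name eno1.1030 type vlan id 1030
ip link set dev eno1.1030 up
# No IP is strictly required for L2 forwarding with LVS direct routing,
# but an address in the VLAN is useful for health checks:
ip addr add 10.64.53.2/24 dev eno1.1030
```

This would need to be repeated for each of the four per-row Analytics VLANs (in practice via the existing Puppet interface definitions rather than by hand).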

The preferred/suggested solution from the Traffic team seems to be 2).
It is also important to note that the Traffic team will be reviewing alternative solutions to LVS for the public endpoints/load balancers, so in the long term LVS-based load balancing may become deprecated (but we are talking about a long time horizon).

The Analytics/Data Engineering team should review the above and decide whether the use cases are worth pursuing, and which road to choose :)

Event Timeline

I'm also in favour of option 2. I think that it's the cleanest solution and ultimately presents the lowest risk of the three options.
I appreciate that this involves work for both the DC-Ops team and the Traffic team in the design and commissioning stages, and that there would be an ongoing management responsibility, but I still feel that it would be worth the effort.

Moving back to Analytics to reprioritize with the team.

odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Personally I think options 1 and 3 are the best here.

Option 1 is relatively straightforward; after adding a bunch of new sub-interfaces to the existing LVS hosts in eqiad for the new rows recently, it doesn't seem such a difficult thing to do. My instinct is typically to share infrastructure and resources where there isn't a clear security or other requirement for separation (and even then, logical separation is normally sufficient).

Option 3 is, I think, a broader discussion than just the LVS issue. We had been approaching a conclusion, in T298087, to keep the Analytics VLAN separate and apply a uRPF filter on it. This is mostly to avoid having to create per-VLAN filters with the appropriate local subnets on each device, and is the approach currently taken on the new Analytics VLANs in rows E/F in eqiad.

However, if we are going to host LVS-backed services in the Analytics VLAN then we need to use a filter/ACL rather than uRPF, which removes the rationale for the separate VLAN itself, so maybe option 3 is the better long-term choice.

Personally my instincts are to go with option 1, and consider the potential merging of analytics and private Vlans separately in the longer term.

Now that the analytics vlan outbound firewall restrictions have been removed in {T298087} - what is the impact on this ticket?

Is it still the case that the current LVS servers can't L2 forward packets to the analytics hosts because they don't have a sub-interface in those vlans?
If so, is there anything stopping us from adding those additional interfaces now?

There's still nothing urgent, but there are a couple of things like the druid brokers (an-druid100[1-5]) where LVS could potentially be useful, so I'm just interested to know where we are with this option. Thanks.

Now that the analytics vlan outbound firewall restrictions have been removed in {T298087} - what is the impact on this ticket?

Direct impact is that it makes all 3 options easier to implement.

Is it still the case that the current LVS servers can't L2 forward packets to the analytics hosts because they don't have a sub-interface in those vlans?

This is correct.

If so, is there anything stopping us from adding those additional interfaces now?

We first need to agree on which option to go with :)

Option 3 is option 1 with extra steps, as we would either need to:

  • Change the analytics hosts' VLANs to private1-a/b/c/d-eqiad and thus renumber (re-image!) all the servers, which is quite impactful, in order to benefit from the existing LVS leg in the private1 VLANs
  • Rename analytics1-a/b/c/d-eqiad to private2-a/b/c/d-eqiad and update the various firewall rules and network/data.yaml to make them part of prod. This has zero impact on the hosts, but it would require private2 to be trunked on the LVS hosts

So we should focus on option 1 vs. 2.

I'd be interested to know more about (@BBlack):

The preferred/suggested solution from the Traffic team seems to be 2).

As well as (@elukey):

It would add a little tech debt and is not a clean solution

As I don't want to make false assumptions and there might be limitations I'm not thinking about.

Option 1 seems the cleanest option to me as:

  • Uses a unified LVS layer regardless of the backend team/servers
  • Doesn't have the limitations listed for option 2 (cabling, ownership, etc.)
  • Is easier to implement

One small downside concerns traffic flows: if I understand correctly, most clients are in the analytics VLAN, so traffic will do something like:
client (analytics) -> LVS (private) -> real server (analytics)
But as the analytics hosts are spread across multiple rows, the physical path traffic takes with option 1 is similar to option 2; only the logical part (the VLAN) differs.
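To make the flow concrete: with LVS in its usual direct-routing mode, the return path already skips the LVS host, because the realserver answers the client directly. A hypothetical ipvsadm configuration for a Druid broker VIP (all addresses are made up for illustration; 8082 is Druid's default broker port):

```shell
# Hypothetical VIP for the Druid brokers; addresses are placeholders.
# -g selects direct routing (gatewaying), so realservers reply straight
# to the client rather than back through the LVS host.
ipvsadm -A -t 10.2.2.40:8082 -s wrr
ipvsadm -a -t 10.2.2.40:8082 -r 10.64.53.11:8082 -g -w 10
ipvsadm -a -t 10.2.2.40:8082 -r 10.64.53.12:8082 -g -w 10
```

So only the client-to-VIP leg crosses the private VLAN; the (typically larger) response traffic stays between analytics hosts.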

Depending on the requirements, advertising the VIP using BGP directly from the hosts to the routers (or switches) could be an option as well (see https://wikitech.wikimedia.org/wiki/Anycast), but so far my preference goes to option 1.
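For completeness, the BGP-from-the-host alternative mentioned above would look roughly like the following Bird (1.x) fragment, with the VIP held on the loopback and withdrawn when a health check removes the address; all ASNs, addresses and protocol names here are hypothetical placeholders:

```
# Hypothetical /etc/bird/bird.conf fragment; ASNs and addresses are placeholders.
router id 10.64.53.11;

protocol device { }

# Pick up the VIP from the loopback; removing the address withdraws the route.
protocol direct {
    interface "lo";
}

protocol bgp uplink {
    local as 64700;
    neighbor 10.64.53.1 as 14907;
    import none;
    export where net = 10.3.0.5/32;  # the service VIP
}
```

This trades the LVS layer for per-host routing state, which is why it only fits services where simple ECMP between equivalent backends is acceptable.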

Hm, I don't think there are many hosts/services for which we need the LVS. Perhaps we can do Option 1 or 3 for just the hosts we need LVS for?

Yeah if we don't expect much traffic it might be hard to justify dedicated hardware / option 2.

Are there any concrete proposals for services to place behind the LVS, so that we could make some basic estimates to assist the decision?

One small downside concerns traffic flows: if I understand correctly, most clients are in the analytics VLAN, so traffic will do something like:

We need to bear this in mind. The current uRPF filter on the CR routers would block the return traffic from the realservers on the Analytics Vlan.

I don't believe it's a blocker. We've already had to move away from uRPF in favour of ACLs on the switches, so maybe it's best to do that everywhere.