
weighted maglev viability for low-traffic services
Open, Medium, Public

Description

Katran only provides a weighted maglev scheduler, so if we decide to migrate from IPVS to Katran we won't be able to replicate the current low-traffic setup 1:1.

As an alternative, we could load balance on both the source IP and the source port to get a proper distribution of incoming requests across the available realservers. As a side effect, a single client (in most cases a CDN server) will potentially hit several realservers instead of just one.
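To make the proposal concrete, here is a minimal Python sketch of a weighted maglev lookup table plus a per-connection lookup keyed either on the source IP alone or on source IP and source port. It is an illustration under stated assumptions, not Katran's implementation: SHA-256 stands in for Katran's real hash, the table size is an arbitrary prime (65537) so that every realserver's permutation visits every slot, and the weighting scheme (each realserver claims `weight` free slots per round) is a simple illustrative variant. `build_table` and `pick` are hypothetical helpers.

```python
import hashlib
from typing import Dict, List, Optional


def _h(data: str, salt: str) -> int:
    # Stand-in hash (assumption): SHA-256 truncated to 64 bits.
    digest = hashlib.sha256(f"{salt}|{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(weights: Dict[str, int], size: int = 65537) -> List[str]:
    """Fill a maglev lookup table; `size` must be prime."""
    if sum(weights.values()) <= 0:
        raise ValueError("at least one realserver needs a positive weight")
    names = sorted(weights)
    # Each realserver walks its own permutation of the slots: offset + j * skip.
    offset = {n: _h(n, "offset") % size for n in names}
    skip = {n: _h(n, "skip") % (size - 1) + 1 for n in names}
    nxt = {n: 0 for n in names}  # position reached in each permutation
    table: List[Optional[str]] = [None] * size
    filled = 0
    while True:
        for n in names:
            # Illustrative weighting: a realserver claims `weight` free slots per round.
            for _ in range(weights[n]):
                while True:
                    slot = (offset[n] + nxt[n] * skip[n]) % size
                    nxt[n] += 1
                    if table[slot] is None:
                        break
                table[slot] = n
                filled += 1
                if filled == size:
                    return table  # type: ignore[return-value]


def pick(table: List[str], src_ip: str, src_port: Optional[int] = None) -> str:
    """Map a connection to a realserver by hashing src IP (and optionally src port)."""
    key = src_ip if src_port is None else f"{src_ip}:{src_port}"
    return table[_h(key, "flow") % len(table)]


if __name__ == "__main__":
    table = build_table({"rs1": 1, "rs2": 1, "rs3": 2})
    # Source IP only: every connection from this client lands on one realserver.
    print({pick(table, "10.0.0.1") for _ in range(3)})
    # Source IP + source port: the same client spreads over all realservers.
    print({pick(table, "10.0.0.1", p) for p in range(40000, 40100)})
```

Hashing on the source IP alone gives per-client stickiness; adding the port trades that stickiness for a per-connection spread that follows the realserver weights.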

Is this something we can accept? Are there any known blockers to following this approach?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Vgutierrez triaged this task as Medium priority. Wed, Jun 26, 2:51 PM

Strictly on the network side, there is no blocker one way or the other.

I think I'm missing some context: what's the current low-traffic setup? What's the downside of only using the source IP in the hashing? Not enough clients for proper balancing?

> I think I'm missing some context: what's the current low-traffic setup?

Most services use wrr (weighted round robin) to balance traffic across their nodes.
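For contrast with maglev, here is a minimal Python sketch of per-connection weighted round robin, following the pseudo-code published for the LVS/IPVS wrr scheduler (the kernel code may differ in detail). The class name and the realserver names are hypothetical. The key property is that the choice never looks at the client, so even a single client's connections are spread over all realservers in proportion to their weights.

```python
from functools import reduce
from math import gcd
from typing import Dict, Optional


class WeightedRoundRobin:
    """Per-connection wrr: realservers are chosen in proportion to weight."""

    def __init__(self, weights: Dict[str, int]) -> None:
        self.names = list(weights)
        self.weights = [weights[n] for n in self.names]
        self.max_w = max(self.weights)
        self.gcd_w = reduce(gcd, self.weights)
        self.i = -1  # index of the last realserver selected
        self.cw = 0  # current weight threshold

    def pick(self) -> Optional[str]:
        n = len(self.names)
        while True:
            self.i = (self.i + 1) % n
            if self.i == 0:
                # Completed a pass: lower the threshold, wrap around at zero.
                self.cw -= self.gcd_w
                if self.cw <= 0:
                    self.cw = self.max_w
                    if self.cw == 0:
                        return None  # every weight is zero
            if self.weights[self.i] >= self.cw:
                return self.names[self.i]


if __name__ == "__main__":
    wrr = WeightedRoundRobin({"rs1": 4, "rs2": 3, "rs3": 2})
    # ['rs1', 'rs1', 'rs2', 'rs1', 'rs2', 'rs3', 'rs1', 'rs2', 'rs3']
    print([wrr.pick() for _ in range(9)])
```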

> What's the downside of only using the source IP in the hashing? Not enough clients for proper balancing?

Yep, that's what I suspect: given the relatively small pool of source IPs, hashing on the source IP alone would send significantly more traffic to some realservers than to others. It also looks like this maglev behavior (every connection from a given client IP lands on the same realserver) is already being (ab)used by the thanos-web low-traffic service:

# Needed for SSO sessions to stick. As of Dec 2022 backend
# selection from varnish -> ATS is always random for pass traffic
# therefore even with this we need at most one thanos-fe host
# pooled at a time (per site).
scheduler: mh

It is pretty clear to me that the only way to get fair load balancing with maglev is to include the remote (source) port in the consistent hash as well.

For virtually every service we have fewer client IPs than backends, so hashing on the source IP alone could never give us a fair distribution: with k client IPs, at most k realservers would ever receive traffic.
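The pigeonhole argument can be checked with a toy simulation. This is a sketch assuming a SHA-256 hash-mod as a stand-in for maglev (the bound holds for any deterministic function of the source IP); the four client IPs, the ten backends and the port range are hypothetical, and `bucket`/`load` are illustrative helpers.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Optional


def bucket(key: str, n_backends: int) -> int:
    # Stand-in for any deterministic flow hash: same key, same backend.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big") % n_backends


def load(client_ips: Iterable[str], n_backends: int,
         ports: Optional[Iterable[int]] = None) -> Dict[int, int]:
    """Flows per backend, keyed on IP only or on (IP, port) when ports are given."""
    if ports is None:
        # IP-only hashing: each client maps to exactly one backend,
        # however many connections it opens.
        keys = list(client_ips)
    else:
        port_list = list(ports)
        keys = [f"{ip}:{p}" for ip in client_ips for p in port_list]
    return dict(Counter(bucket(k, n_backends) for k in keys))


if __name__ == "__main__":
    cdn_ips = [f"10.0.0.{i}" for i in range(1, 5)]  # hypothetical: 4 client IPs
    print(load(cdn_ips, 10))  # can never show more than len(cdn_ips) backends
    print(load(cdn_ips, 10, ports=range(32768, 33768)))
```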