
toolforge: explore options to introduce egress network quotas
Open, Medium, Public

Description

There are a number of ACLs, rate-limits, quotas, and network policies on the production wiki endpoints, to protect the services from misbehaving clients, or single clients consuming too many resources. Some of these policies are IP-based.

Some of these rules, policies, and constraints, as seen from the Toolforge side, are documented here: https://wikitech.wikimedia.org/wiki/Help:Toolforge#Constraints_of_Toolforge

To mitigate Cloud VPS and Toolforge hitting these limits, we have an exemption mechanism from the general Cloud VPS egress NAT for wiki endpoints. As of this writing, all traffic leaving Toolforge to establish a connection with production wiki endpoints is exempt from this general Cloud VPS egress NAT.

Unfortunately, such wiki endpoints could still see many tools behind a single source IP address: that of the Toolforge k8s worker node.

We have seen Toolforge tools consuming the wiki endpoints' quotas in different ways; in some cases, in ways that would prevent or limit what other "neighbor" tools running on the same Toolforge k8s worker node can do network-wise with the wiki endpoints.

This ticket is to explore whether we could introduce some egress network quotas to limit what a single tool can consume from the wiki endpoints, such as:

  • number of concurrent open connections
  • bandwidth
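For the bandwidth half of this, one possible building block (an assumption about direction, not a settled plan) is the CNI bandwidth plugin, which enforces per-pod traffic shaping based on pod annotations. A minimal sketch, assuming the plugin is enabled in the cluster's CNI configuration; the pod and image names are made up:

```yaml
# Hypothetical sketch: cap a tool pod's egress bandwidth via the
# CNI bandwidth plugin's well-known annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-tool
  annotations:
    kubernetes.io/egress-bandwidth: "1M"   # shape egress to ~1 Mbit/s
spec:
  containers:
    - name: tool
      image: example-image
```

Note this shapes traffic per pod, not per tool namespace, and says nothing about connection counts, so it would only address part of the problem described here.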

Some examples of the problems described here:

Event Timeline

I love the idea of attempting to create a more equitable resource distribution for networking in Toolforge. I'm not sure yet, however, how this would actually work as hoped in practice unless there were deep integration with the Kubernetes scheduler.

A naive implementation that says "you get N connections per namespace" would do nothing to stop M tools, each staying under its connection quota, from being scheduled onto the same Kubernetes node, and thus the same egress IP, which in aggregate still blows out the upstream per-IP restriction. If we respond to that by lowering the N-connection limit, then we are deliberately starving some tools to reduce conflicts, but in a way that really doesn't have a floor.
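The aggregation worry above can be made concrete with back-of-the-envelope arithmetic (all numbers are invented for illustration; none are real Toolforge or wiki-side limits):

```python
# Illustrative numbers only: a per-namespace quota that looks safe can
# still exceed an upstream per-IP limit once several tools share a
# worker node (and therefore an egress IP).
upstream_per_ip_limit = 100   # hypothetical wiki-side connection cap per IP
per_tool_quota = 20           # hypothetical per-namespace connection quota
tools_on_node = 8             # tools scheduled onto the same worker node

aggregate = per_tool_quota * tools_on_node
print(aggregate)                          # 160
print(aggregate > upstream_per_ip_limit)  # True: quota respected, limit blown
```

The only way to make the per-tool quota safe against every packing is to divide the per-IP limit by the worst-case number of co-scheduled tools, which is exactly the "starving without a floor" problem.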

If we think of Toolforge as essential infrastructure for the operation of the content wikis (this was part of the pitch that created WMCS, so not a wild thing to think), then how would we look at solving this problem? By that I guess I mean: if these rate limits were keeping a "production" process from working because too many connections were seen coming from the wikikube Kubernetes cluster, how would we try to address that? Would we be likely to throttle the work done on each wikikube node, or would we look for ways to create more abundance in the restricted services?

in your opinion, should we decline this task and focus on the other angle you mention?


I don't want to veto the entire concept. I know you know much more about the technical possibilities than I do, and if you see value in exploring the idea you should do that. If my worries about limits not actually solving the problem in an equitable way ring true to you, then maybe we can think of a way to pivot that seems more likely to improve the general situation. Do we already have, or would it be reasonably possible to build, some kind of tracking where we can see relative network consumption clustered by exec node and namespace?
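On that last question: as a rough starting point, one could imagine sampling connection state on each worker node and grouping it by tool. A minimal sketch, assuming conntrack-style text records and a hypothetical pod-IP-to-tool mapping (on a real node the records would come from the conntrack table and the mapping from the Kubernetes API; the IPs and tool names below are made up):

```python
import re
from collections import Counter

# Hypothetical sample in /proc/net/nf_conntrack style; trailing fields
# are elided with "..." since only state and src matter here.
SAMPLE = """\
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.10 dst=203.0.113.5 sport=5511 dport=443 ...
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.10 dst=203.0.113.5 sport=5512 dport=443 ...
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.11 dst=203.0.113.5 sport=6001 dport=443 ...
"""

# Hypothetical mapping from pod IP to tool namespace; in practice this
# would be built from pod status.podIP per namespace via the k8s API.
POD_TO_TOOL = {"192.0.2.10": "tool-a", "192.0.2.11": "tool-b"}

def connections_per_tool(conntrack_text: str) -> Counter:
    """Count ESTABLISHED connections, grouped by tool namespace."""
    counts: Counter = Counter()
    for line in conntrack_text.splitlines():
        if "ESTABLISHED" not in line:
            continue
        match = re.search(r"src=(\S+)", line)
        if match:
            counts[POD_TO_TOOL.get(match.group(1), "unknown")] += 1
    return counts

print(connections_per_tool(SAMPLE))  # Counter({'tool-a': 2, 'tool-b': 1})
```

Even without any enforcement, exporting counts like these per node and namespace would give the visibility asked about above, and would show whether the aggregation problem is actually occurring in practice.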