
toolforge: explore options to introduce egress network quotas
Open, Medium, Public

Description

There are a number of ACLs, rate-limits, quotas, and network policies on the production wiki endpoints, to protect the services from misbehaving clients, or single clients consuming too many resources. Some of these policies are IP-based.

Some of these rules, policies, and constraints, as seen from the Toolforge side, are documented here: https://wikitech.wikimedia.org/wiki/Help:Toolforge#Constraints_of_Toolforge

To mitigate Cloud VPS and Toolforge hitting these limits, we have an exemption mechanism from the general Cloud VPS egress NAT for wiki endpoints. As of this writing, all traffic leaving Toolforge to establish a connection with production wiki endpoints is exempt from this general Cloud VPS egress NAT.

Unfortunately, such wiki endpoints could still see many tools behind a single source IP address: that of the Toolforge k8s worker node.

We have seen Toolforge tools consuming the wiki endpoints' quotas in different ways; in some cases, in ways that would prevent or limit what other "neighbor" tools running on the same Toolforge k8s worker node can do network-wise with the wiki endpoints.

This ticket is to explore whether we could introduce some egress network quotas to limit what a single tool can consume from the wiki endpoints, such as:

  • number of concurrent open connections
  • bandwidth
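For the bandwidth half of this, one possible building block (an assumption about direction, not a settled plan) is the CNI bandwidth plugin, which enforces per-pod traffic shaping based on pod annotations. A minimal sketch, assuming the plugin is enabled in the cluster's CNI configuration; the pod and image names are made up:

```yaml
# Hypothetical sketch: cap a tool pod's egress bandwidth via the
# CNI bandwidth plugin's well-known annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-tool
  annotations:
    kubernetes.io/egress-bandwidth: "1M"   # shape egress to ~1 Mbit/s
spec:
  containers:
    - name: tool
      image: example-image
```

Note this shapes traffic per pod, not per tool namespace, and says nothing about connection counts, so it would only address part of the problem described here.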

Some examples of the problems described here:

Event Timeline

I love the idea of attempting to create a more equitable resource distribution for networking in Toolforge. I'm not sure yet, however, how this would actually work as hoped in practice unless there were deep integration with the Kubernetes scheduler.

A naive implementation that says "you get N connections per namespace" would do nothing to stop M tools, each staying under its connection quota, from being scheduled onto the same Kubernetes node, and thus the same egress IP, which in aggregate still blows out the upstream per-IP restriction. If we respond to that by lowering the N-connection limit, then we are deliberately starving some tools to reduce conflicts, but in a way that really doesn't have a floor.
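The aggregation worry above can be made concrete with back-of-the-envelope arithmetic (all numbers are invented for illustration; none are real Toolforge or wiki-side limits):

```python
# Illustrative numbers only: a per-namespace quota that looks safe can
# still exceed an upstream per-IP limit once several tools share a
# worker node (and therefore an egress IP).
upstream_per_ip_limit = 100   # hypothetical wiki-side connection cap per IP
per_tool_quota = 20           # hypothetical per-namespace connection quota
tools_on_node = 8             # tools scheduled onto the same worker node

aggregate = per_tool_quota * tools_on_node
print(aggregate)                          # 160
print(aggregate > upstream_per_ip_limit)  # True: quota respected, limit blown
```

The only way to make the per-tool quota safe against every packing is to divide the per-IP limit by the worst-case number of co-scheduled tools, which is exactly the "starving without a floor" problem.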

If we think of Toolforge as essential infrastructure for the operation of the content wikis (this was part of the pitch that created WMCS, so not a wild thing to think), then how would we look at solving this problem? By that I guess I mean: if these rate limits were keeping a "production" process from working because too many connections were seen coming from the wikikube Kubernetes cluster, how would we try to address that? Would we be likely to throttle the work done on each wikikube node, or would we look for ways to create more abundance in the restricted services?

in your opinion, should we decline this task and focus on the other angle you mention?


I don't want to veto the entire concept. I know you know much more about the technical possibilities than I do, and if you see value in exploring the idea you should do that. If my worries about limits not actually solving the problem in an equitable way ring true to you, then maybe we can think of a way to pivot that seems more likely to improve the general situation. Do we already have, or would it be reasonably possible to build, some kind of tracking where we can see relative network consumption clustered by exec node and namespace?
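On that last question: as a rough starting point, one could imagine sampling connection state on each worker node and grouping it by tool. A minimal sketch, assuming conntrack-style text records and a hypothetical pod-IP-to-tool mapping (on a real node the records would come from the conntrack table and the mapping from the Kubernetes API; the IPs and tool names below are made up):

```python
import re
from collections import Counter

# Hypothetical sample in /proc/net/nf_conntrack style; trailing fields
# are elided with "..." since only state and src matter here.
SAMPLE = """\
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.10 dst=203.0.113.5 sport=5511 dport=443 ...
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.10 dst=203.0.113.5 sport=5512 dport=443 ...
ipv4 2 tcp 6 431999 ESTABLISHED src=192.0.2.11 dst=203.0.113.5 sport=6001 dport=443 ...
"""

# Hypothetical mapping from pod IP to tool namespace; in practice this
# would be built from pod status.podIP per namespace via the k8s API.
POD_TO_TOOL = {"192.0.2.10": "tool-a", "192.0.2.11": "tool-b"}

def connections_per_tool(conntrack_text: str) -> Counter:
    """Count ESTABLISHED connections, grouped by tool namespace."""
    counts: Counter = Counter()
    for line in conntrack_text.splitlines():
        if "ESTABLISHED" not in line:
            continue
        match = re.search(r"src=(\S+)", line)
        if match:
            counts[POD_TO_TOOL.get(match.group(1), "unknown")] += 1
    return counts

print(connections_per_tool(SAMPLE))  # Counter({'tool-a': 2, 'tool-b': 1})
```

Even without any enforcement, exporting counts like these per node and namespace would give the visibility asked about above, and would show whether the aggregation problem is actually occurring in practice.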