There are a number of ACLs, rate-limits, quotas, and network policies on the production wiki endpoints, to protect the services from misbehaving clients, or single clients consuming too many resources. Some of these policies are IP-based.
Some of these rules, policies and constraints are documented here, as seen from the Toolforge side: https://wikitech.wikimedia.org/wiki/Help:Toolforge#Constraints_of_Toolforge
To reduce the chance of Cloud VPS and Toolforge hitting these limits, we have an exemption mechanism for the general Cloud VPS egress NAT for wiki endpoints. As of this writing, all traffic leaving Toolforge to establish a connection with production wiki endpoints is exempt from this general Cloud VPS egress NAT.
Unfortunately, the wiki endpoints can still see traffic from many tools arriving from a single source IP address: the Toolforge k8s worker node address.
We have seen Toolforge tools consuming the wiki endpoint quotas in different ways. In some cases, they did so in ways that would prevent or limit what other "neighbor" tools running on the same Toolforge k8s worker node could do network-wise with the wiki endpoints.
This ticket is to explore whether we could introduce some egress network quotas to limit what a single tool can consume from the wiki endpoints, such as:
- number of concurrent open connections
- bandwidth
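
To make the two limits above concrete, here is a minimal Python sketch of what a per-tool quota could mean: a cap on concurrently open connections plus a token bucket for bandwidth. This only illustrates the intended semantics. The class name `ToolEgressQuota`, its parameters and the example numbers are all hypothetical, and an actual implementation would most likely be enforced at the network level rather than inside each tool.

```python
import threading
import time


class ToolEgressQuota:
    """Hypothetical per-tool quota: caps open connections and average bandwidth."""

    def __init__(self, max_connections, bytes_per_second, burst_bytes,
                 clock=time.monotonic):
        # One semaphore slot per allowed concurrent connection.
        self._slots = threading.BoundedSemaphore(max_connections)
        # Token bucket: refills at bytes_per_second, holds at most burst_bytes.
        self._rate = float(bytes_per_second)
        self._capacity = float(burst_bytes)
        self._tokens = float(burst_bytes)
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_open_connection(self):
        """Return True if the tool is still under its connection cap."""
        return self._slots.acquire(blocking=False)

    def close_connection(self):
        """Give back a connection slot taken by try_open_connection()."""
        self._slots.release()

    def try_send(self, nbytes):
        """Return True if nbytes fit in the current bandwidth budget."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._last
            self._last = now
            # Refill tokens for the elapsed time, never above the burst size.
            self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
            if nbytes <= self._tokens:
                self._tokens -= nbytes
                return True
            return False


if __name__ == "__main__":
    # Example numbers are arbitrary and only for illustration.
    quota = ToolEgressQuota(max_connections=2, bytes_per_second=1000,
                            burst_bytes=1500)
    print([quota.try_open_connection() for _ in range(3)])
    print(quota.try_send(1500), quota.try_send(1))
```

With a quota like this per tool, a single tool that exhausts its own budget would be throttled before it consumes the whole allowance that the wiki endpoints grant to the shared worker node address.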
Some examples of the problems described here:
- T308931: Error 429: too many requests for stream.wikimedia.org
- T329327: Frequent `429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange` errors in SULWatcher
- T356164: [toolforge] several tools get periods of connection refused (104) when connecting to wikis
- T356163: ChieBot: Intermittent connection reset by peer errors
- T356160: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs