We want to start adding systematic per-client rate-limiting at the edge. Each class of users will get its own rate limit:
- Logged-in users and requests coming from toolsforge will not be rate-limited at the edge; they will still be subject to our anti-flood limits, which are very generous (about 50 rps for a single client)
- Known bots (either ones we've classified or ones submitted via the trusted bots program) will get their own dedicated rate limit, namely the one in the Robots Policy, unless other agreements exist with the bot operator
- Regular traffic that is likely to come from a browser will get a high rate-limit
- Traffic that is likely not to come from a browser, per T400270, will get rate-limited at or below the Robots Policy limit; egregious, returning, or distributed abusers will be blocked individually.
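The class-based selection above can be sketched as a simple decision chain. This is an illustrative sketch only: the request attributes stand in for the real signals (known-bots data from T400100, browser detection from T400270, verified sessions from T398815), and all limit values except the ~50 rps anti-flood figure are placeholders, not decided numbers.

```python
from dataclasses import dataclass

@dataclass
class Request:
    # Stand-ins for the real classification signals; names are illustrative.
    logged_in: bool
    from_toolsforge: bool
    known_bot: bool
    looks_like_browser: bool

# Limits in requests/second. Only ANTI_FLOOD_RPS (~50) comes from the text;
# the other two are placeholder values pending tuning.
ANTI_FLOOD_RPS = 50
ROBOTS_POLICY_RPS = 5
BROWSER_RPS = 30

def edge_limit(req: Request) -> int:
    """Return the per-client rate limit (rps) for a request's class."""
    if req.logged_in or req.from_toolsforge:
        return ANTI_FLOOD_RPS       # no edge limit, only anti-flood applies
    if req.known_bot:
        return ROBOTS_POLICY_RPS    # dedicated limit per the Robots Policy
    if req.looks_like_browser:
        return BROWSER_RPS          # high limit for regular browser traffic
    return ROBOTS_POLICY_RPS        # non-browser traffic: at/below robots policy
```

The ordering matters: the logged-in/toolsforge check comes first so trusted traffic is never caught by the browser-detection heuristics.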
For now, the limits will deliberately be quite high, as we still need to fine-tune parameters in browser detection and related heuristics.
I think it would also make sense to start constraining these rate limits to specific URLs, or at least to exclude extremely lightweight resources like /w/load.php or /static from the rate-limit counts, given that most scrapers only request actual articles and no CSS or other bundles.
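The exclusion could be as simple as a prefix check before incrementing the counter. A minimal sketch, using the two paths mentioned above (the real exclusion list would need to be agreed on):

```python
# Paths excluded from rate-limit accounting: lightweight bundles that
# browsers fetch alongside articles but scrapers typically never request.
EXCLUDED_PREFIXES = ("/w/load.php", "/static")

def counts_toward_limit(path: str) -> bool:
    """True if a request path should be counted against the client's limit."""
    return not path.startswith(EXCLUDED_PREFIXES)
```

A side effect worth noting: because real browsers fetch these bundles and scrapers don't, excluding them also narrows the effective request budget of article-only scrapers relative to browsers at the same nominal limit.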
Pre-conditions for implementing separate rate limits:
- Bring the known bots code to production (T400100)
- Finish and bring to production the browser detection routines (T400270)
- Bring to 100% production the new verifiable MW sessions (T398815)
- Add rate-limiting keyed on a unique cookie, with a fallback to IP or user-agent + fingerprints, to both HIDDENPARMA and the edge CDN code.
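The last pre-condition amounts to a key-derivation rule for the rate-limit buckets. A hedged sketch of one way to do it, where the cookie name and the `c:`/`f:` prefixes are purely hypothetical and hashing is used so raw IPs and session tokens never appear in counter keys:

```python
import hashlib
from typing import Optional

def rate_limit_key(cookies: dict, ip: str, user_agent: str,
                   fingerprint: Optional[str]) -> str:
    """Derive the counting key: prefer the unique session cookie,
    fall back to IP combined with user-agent + fingerprint."""
    session = cookies.get("WMF-Uniq")  # hypothetical unique-cookie name
    if session:
        return "c:" + hashlib.sha256(session.encode()).hexdigest()
    material = "|".join((ip, user_agent, fingerprint or ""))
    return "f:" + hashlib.sha256(material.encode()).hexdigest()
```

Keying on the cookie first means a NAT'd office full of logged-out browsers gets one bucket per client rather than one shared IP bucket, while cookie-less scrapers fall back to the coarser IP + fingerprint bucket.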