We have a very clear user-agent policy (https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy), which states that:
- Requests without a user-agent should be blocked
- Requests not coming from a browser should identify themselves with a distinctive name and a contact email or URL; otherwise they should be blocked
We've always been exceedingly liberal in enforcing this policy, but that leniency has clearly become unsustainable.
We will progressively rate limit, then block, requests that use generic user-agents. Our goal is to block all traffic from unidentified clients unless it comes from authorized actors, such as Toolforge or our internal APIs.
Below is the proposed schedule: requests are limited first to 10 requests per second (rps) per IP, then to 5, then to 1, and finally blocked completely.
| user-agent pattern | 10 rps/ip | 5 rps/ip | 1 rps/ip | block |
| No user agent | - | - | Aug 11 | Aug 18 |
| library default | - | Aug 11 | Aug 18 | Aug 25 |
| curl/wget CLI | - | Aug 11 | Aug 18 | - |
| external mw-related | Aug 11 | Aug 18 | Aug 25 | Sept 1 |
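
For client authors who want to stay under whichever per-IP limit currently applies, the following is a minimal sketch of a client-side throttle. The `Throttle` class and its parameters are hypothetical names introduced for illustration, not part of any Wikimedia tooling. The sketch only spaces out requests made from a single process, so several processes sharing one IP would need to split the budget between them.

```python
import time


class Throttle:
    """Space out calls so that at most `max_rps` of them start per second.

    Hypothetical helper for illustration only. `clock` and `sleep` are
    injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, max_rps, clock=time.monotonic, sleep=time.sleep):
        if max_rps <= 0:
            raise ValueError("max_rps must be positive")
        self.min_interval = 1.0 / max_rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request may be sent; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            # Re-read the clock in case sleep() overshot the target time.
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval
        return delay
```

A caller would create one instance, for example `throttle = Throttle(max_rps=1)`, and call `throttle.wait()` immediately before each request. Picking the lowest limit in the schedule means the client does not need to change again as the limits tighten.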
Definitions of patterns:
- No user agent: requests without a user-agent header, or with an empty value for it
- library default: requests with the default user-agent string for common software libraries like python-requests, curl, okhttp, go-httpclient, etc.
- external mw-related: requests with user-agent strings set by MediaWiki (like ForeignApiRepo) or by other mw-related software like WDQS Updater
In the specific case of MediaWiki-generated user-agents, we can't completely block them at the moment because MediaWiki not only uses a non-policy-compliant UA string by default, but also doesn't allow overriding it.
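
To avoid falling into the "No user agent" or "library default" buckets, a client needs to send its own descriptive user-agent. The sketch below builds one in the `name/version (contact) library/version` shape, which is our reading of the format the policy recommends. The helper names `build_user_agent` and `make_request`, the bot name, and the `example.org` addresses are placeholders chosen for illustration. No request is actually sent.

```python
import urllib.request


def build_user_agent(tool_name, version, contact, library=None):
    """Return a descriptive User-Agent string.

    Hypothetical helper: `contact` should be an email address, a URL,
    or both, so that operators can reach the client's maintainer.
    """
    if not tool_name or not version:
        raise ValueError("tool_name and version are required")
    if "@" not in contact and not contact.startswith(("http://", "https://")):
        raise ValueError("contact must contain an email address or a URL")
    user_agent = f"{tool_name}/{version} ({contact})"
    if library:
        # Optionally append the underlying HTTP library for debugging.
        user_agent += f" {library}"
    return user_agent


def make_request(url, user_agent):
    """Build (but do not send) a request carrying the custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


ua = build_user_agent(
    "ExampleBot",                                # placeholder tool name
    "1.2",
    "https://example.org/bot; bot@example.org",  # placeholder contact details
    library="python-urllib/3",
)
request = make_request("https://example.org/w/api.php", ua)
```

Passing `request` to `urllib.request.urlopen` would then send the header in place of the library default. Most HTTP libraries offer an equivalent way to override the header.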
