Even with the current rate limiting, some crawlers are regularly causing issues and wasting precious SRE time.
I'd like to revisit this task to be stricter on user agents, perhaps progressively tightening how we enforce our policy. For example:
- Keep rate limiting for generic curl and other command-line/testing tools
- Forbid generic scripting UAs (e.g. python-requests, empty) from cloud providers
- Ideally later on, forbid generic scripting UAs from the whole Internet (except WMCS)
A variant could be to apply the above only on the upload cluster, but the fewer exceptions the better. A sketch of the tiered logic follows.
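For illustration, here's a minimal Python sketch of that tiered policy. The UA prefixes, network ranges, and the `internet_wide` flag are all made-up placeholders to show the decision flow, not our actual edge config:

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges for illustration only (TEST-NET blocks, not real data);
# real lists would come from published cloud-provider ranges and WMCS.
CLOUD_PROVIDER_NETS = [ip_network("203.0.113.0/24")]
WMCS_NETS = [ip_network("198.51.100.0/24")]

# Hypothetical UA prefixes; actual matching would be more thorough.
GENERIC_SCRIPTING_UAS = ("python-requests", "python-urllib", "go-http-client")
CLI_TESTING_UAS = ("curl", "wget")

def classify(user_agent: str | None, client_ip: str,
             internet_wide: bool = False) -> str:
    """Return 'allow', 'rate_limit', or 'forbid' for a request."""
    ua = (user_agent or "").lower()
    ip = ip_address(client_ip)
    in_cloud = any(ip in net for net in CLOUD_PROVIDER_NETS)
    in_wmcs = any(ip in net for net in WMCS_NETS)

    # Generic scripting UAs, including an empty UA
    if not ua or ua.startswith(GENERIC_SCRIPTING_UAS):
        if in_wmcs:
            return "rate_limit"  # proposed exemption (contested in the reply below)
        if in_cloud or internet_wide:
            return "forbid"      # cloud providers first, whole Internet later
        return "rate_limit"
    # curl and other command-line/testing tools keep the existing rate limiting
    if ua.startswith(CLI_TESTING_UAS):
        return "rate_limit"
    return "allow"
```

The `internet_wide` flag stands in for the later rollout phase: flipping it moves generic scripting UAs from "forbidden only from cloud providers" to "forbidden everywhere".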
Agreed to all that, though I would not exempt WMCS: it can generate significant amounts of traffic much faster by virtue of already being in the cluster, and people using WMCS are generally Wikimedians who should be more familiar with our policies than someone who just wants to scrape wiki pages.