Even with the current rate limiting, some crawlers are regularly causing issues, wasting precious SRE time.
I'd like to revisit this task to be stricter about user agents, perhaps progressively escalating how we enforce our policy. For example:
- Keep rate limiting for generic curl and other command line/testing tools
- Forbid generic scripting UAs (e.g. python-requests, empty) from cloud providers
- Ideally later on, forbid generic scripting UAs from the whole Internet (except WMCS)
A variant could be to only apply the above on the upload cluster, but the fewer exceptions the better.
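The tiered policy above could be sketched as a simple classifier. This is a minimal illustration only, not the actual Varnish/VCL configuration used in production; the prefix lists, the `from_cloud_provider` flag, and all names are hypothetical, and the WMCS question is left out for brevity:

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    FORBID = "forbid"

# Hypothetical examples of UA prefixes for each tier.
CLI_TOOL_PREFIXES = ("curl", "wget")
GENERIC_SCRIPTING_PREFIXES = ("python-requests", "python-urllib", "go-http-client")

def classify(user_agent: str, from_cloud_provider: bool) -> Action:
    ua = (user_agent or "").strip().lower()
    if ua == "" or ua.startswith(GENERIC_SCRIPTING_PREFIXES):
        # First phase: forbid only from cloud providers; a later phase
        # would return FORBID regardless of origin.
        return Action.FORBID if from_cloud_provider else Action.RATE_LIMIT
    if ua.startswith(CLI_TOOL_PREFIXES):
        # Generic command line/testing tools keep the existing rate limit.
        return Action.RATE_LIMIT
    return Action.ALLOW
```

For instance, `classify("python-requests/2.25", from_cloud_provider=True)` would yield `Action.FORBID`, while the same UA from elsewhere would only be rate limited until the stricter phase is rolled out.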
Agreed to all that, though I would not exempt WMCS: WMCS can generate significant amounts of traffic much faster by virtue of already being in the cluster, and people using WMCS are generally Wikimedians who should be more familiar with our policies than someone who just wants to scrape wiki pages.
We responded to another set of pages today, and most of the offending requests were coming from a public cloud provider with no User-Agent, so we've banned those requests from the upload cluster: https://gerrit.wikimedia.org/r/702003
I'm not really sure who or which team needs to approve this or whether no one opposes it and someone just needs to do it.
Changeset banning empty user agents: https://gerrit.wikimedia.org/r/702027
The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!