For more than two months, my copyvios tool has been receiving unusual web traffic. I currently block it with a uwsgi rule so that the requests get an immediate 403, but they still fill my logs with many thousands of junk entries a day, and whoever is sending them could easily work around the rule by tweaking the parameters a bit. I would like to see whether we can block this at a different point in the stack, or at least think a bit about what's going on.
Each request is to a URL like https://copyvios.toolforge.org/?lang=en&project=wikipedia&oldid=887576204&action=compare&url=google.ee. The revision ID in the "oldid" field is constant (this is what I am blocking on), but the "url" field varies. "http://hasty.ai" often appears either in the HTTP Referer header or somewhere else in the request headers, but not always.
[Tue May 4 05:44:01 2021] GET /?lang=en&project=wikipedia&oldid=887576204&action=compare&url=google.ad => generated 0 bytes in 0 msecs (- http://hasty.ai HTTP/1.1 403) 2 headers in 89 bytes User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36 Referer: http://webservices.icodes.co.uk/transfer2.php?location=https%3A%2F%2Fcopyvios.toolforge.org%2F%3Flang%3Den%26project%3Dwikipedia%26oldid%3D887576204%26action%3Dcompare%26url%3Dgoogle.ad - http%3A%2F%2Fhasty.ai
In P15679 I've pastebinned a sample of how these requests look on my end. The logs are from May 4 but the traffic is the same today.
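To get a clearer picture of how the junk requests vary, something like the following could tally the "url" values and count the entries whose protocol field is not a plain HTTP/x.y. This is only a sketch: the regex assumes every log line has the same layout as the sample entry above, and `parse_line` and `summarize` are hypothetical helper names, not part of the tool.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumed layout, taken from the sample entry:
# [timestamp] METHOD URI => generated N bytes in M msecs (PROTO STATUS) ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<method>\S+) (?P<uri>\S+) => generated \d+ bytes "
    r"in \d+ msecs \((?P<proto>.*?) (?P<status>\d{3})\)"
)
NORMAL_PROTO_RE = re.compile(r"^HTTP/\d\.\d$")


def parse_line(line):
    """Return a dict of the interesting fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    query = parse_qs(urlsplit(m["uri"]).query)
    return {
        "timestamp": m["ts"],
        "oldid": query.get("oldid", [""])[0],
        "url": query.get("url", [""])[0],
        "proto": m["proto"],
        "status": int(m["status"]),
    }


def summarize(lines):
    """Count 'url' values and entries whose protocol field looks abnormal."""
    urls = Counter()
    odd_proto = 0
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip lines in some other format
        urls[rec["url"]] += 1
        if not NORMAL_PROTO_RE.match(rec["proto"]):
            odd_proto += 1
    return urls, odd_proto
```

Passing an open log file to `summarize` would show whether the "url" values are simply a list of Google country domains and how often the odd protocol string appears.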
In particular, it's not clear to me what hasty.ai has to do with this at all (is some service they provide compromised and being used to DoS me?). I can't see IPs, so I have no clue where this is coming from. I also don't understand why hasty.ai sometimes shows up in the uwsgi log as part of the protocol: %(proto) is normally HTTP/1.1, but with this traffic it is sometimes - http://hasty.ai HTTP/1.1. This almost looks like a bug in uwsgi's request parsing or log formatting, but I can't trace it down because I have no obvious way to see the raw HTTP traffic, and my attempts to simulate a garbled HTTP request with telnet did not produce any logs like this.
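One unconfirmed guess is that the client sends a request target containing unescaped spaces, as in the "location" value visible in the Referer of the sample entry. A parser that splits the request line on the first two spaces would then treat everything after the second space as the protocol. The sketch below only illustrates that guess. The naive split is an assumption, not uwsgi's verified behavior, and `build_raw_request`, `naive_split_request_line`, and `send_raw` are hypothetical helpers meant for a local test instance only.

```python
import socket


def build_raw_request(host, target):
    """Build an HTTP/1.1 request whose target is sent exactly as given (spaces stay unescaped)."""
    return (
        f"GET {target} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def naive_split_request_line(raw):
    """Split the request line on the first two spaces only (assumed parser behavior)."""
    line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, uri, proto = line.split(" ", 2)
    return method, uri, proto


def send_raw(host, port, payload, timeout=5.0):
    """Send the bytes unchanged to a test server and return whatever it replies."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# Target modeled on the sample log entry, with the trailing " - http://hasty.ai".
target = (
    "/?lang=en&project=wikipedia&oldid=887576204"
    "&action=compare&url=google.ad - http://hasty.ai"
)
payload = build_raw_request("localhost", target)
method, uri, proto = naive_split_request_line(payload)
```

With this split, `proto` comes out as `- http://hasty.ai HTTP/1.1`, which matches the odd log value. Sending `payload` with `send_raw` to a local uwsgi instance, bypassing any front proxy, would show whether uwsgi really logs it that way.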
I get a few of these requests per second, whereas normal tool usage might be a few requests per minute on average. If unblocked, they flood the tool.
Longer term, I will require OAuth to use the tool, which will help block this sort of thing more securely, but it's not ready yet, and it won't stop the requests from reaching my uwsgi process either.