
Copyvios tool: investigate/block suspicious web traffic
Closed, ResolvedPublic

Description

For the past two-plus months, my copyvios tool has been receiving unusual web traffic. I currently block it with a uwsgi rule so the requests return 403 immediately, but they still fill my logs with many thousands of junk entries a day, and whoever is sending them could easily work around the rule by tweaking the parameters. I would like to see whether we can block this at a different point in the stack, or at least think about what is going on.

Each request is to a URL like https://copyvios.toolforge.org/?lang=en&project=wikipedia&oldid=887576204&action=compare&url=google.ee. The revision ID in the "oldid" field is constant (this is what I am blocking based on) but the "url" field varies. "http://hasty.ai" often appears in either the HTTP referrer or somewhere else in the request headers, but not always.

For example:

[Tue May  4 05:44:01 2021] GET /?lang=en&project=wikipedia&oldid=887576204&action=compare&url=google.ad => generated 0 bytes in 0 msecs (- http://hasty.ai HTTP/1.1 403) 2 headers in 89 bytes
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36
Referer: http://webservices.icodes.co.uk/transfer2.php?location=https%3A%2F%2Fcopyvios.toolforge.org%2F%3Flang%3Den%26project%3Dwikipedia%26oldid%3D887576204%26action%3Dcompare%26url%3Dgoogle.ad - http%3A%2F%2Fhasty.ai
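
To make the Referer above easier to read, here is a minimal Python sketch that percent-decodes its "location" parameter and then parses the embedded copyvios URL. It uses only the standard library; the helper name decode_location is made up for illustration, and the Referer string is copied from the log sample above.

```python
from urllib.parse import parse_qs, urlsplit

# Referer value copied from the log sample above.
referer = (
    "http://webservices.icodes.co.uk/transfer2.php?location="
    "https%3A%2F%2Fcopyvios.toolforge.org%2F%3Flang%3Den%26project%3Dwikipedia"
    "%26oldid%3D887576204%26action%3Dcompare%26url%3Dgoogle.ad"
    " - http%3A%2F%2Fhasty.ai"
)


def decode_location(referer_value):
    """Return the percent-decoded `location` parameter of a Referer URL."""
    query = urlsplit(referer_value).query
    return parse_qs(query)["location"][0]


# The decoded target is the copyvios URL with " - http://hasty.ai" appended.
target = decode_location(referer)

# Parsing the embedded URL shows which parameter absorbs the extra text.
inner = parse_qs(urlsplit(target).query)

print(target)
print(inner["oldid"], inner["url"])
```

In this sample the trailing " - http://hasty.ai" ends up inside the "url" parameter of the embedded copyvios URL, so it travels as part of the link rather than as a separate header.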

In P15679 I've pastebinned a sample of how these requests look on my end. The logs are from May 4 but the traffic is the same today.

In particular, it's not clear to me what hasty.ai has to do with this at all (or if some service they provide is compromised and being used to DOS me?). I can't see IPs so I have no clue where this is coming from. I don't understand why hasty.ai sometimes shows up in the uwsgi log as part of the protocol (%(proto) is normally HTTP/1.1 but with this traffic is sometimes - http://hasty.ai HTTP/1.1). This almost sounds like it's exposing a bug in uwsgi's request parsing or log formatting but I'm again unable to really trace this down as I don't have an obvious way to see the raw HTTP traffic and my attempts to simulate a garbled HTTP request with telnet did not produce any logs like this.

I get a few of these requests per second, whereas normal tool usage might be a few requests per minute on average. If unblocked, they flood the tool.

Longer term I will require OAuth to use the tool, which will help to block this sort of thing more securely, but it's not ready yet and it won't stop the requests from coming to my uwsgi process either.
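
For reference, the matching criteria described above (the constant "oldid", plus "hasty.ai" appearing somewhere in the request) could be sketched as a small WSGI middleware. This is only an illustration, not the uwsgi rule currently in place: the names BLOCKED_OLDID, BLOCKED_MARKER, is_suspicious and BlockJunkTraffic are hypothetical, and because it runs inside the application process it would not keep the requests out of the uwsgi logs.

```python
from urllib.parse import parse_qs

BLOCKED_OLDID = "887576204"  # constant revision ID seen in the junk requests
BLOCKED_MARKER = "hasty.ai"  # string seen in the Referer or request line


def is_suspicious(environ):
    """Return True if a WSGI environ matches the junk-traffic pattern."""
    query_string = environ.get("QUERY_STRING", "")
    params = parse_qs(query_string)
    if BLOCKED_OLDID in params.get("oldid", []):
        return True
    # Fall back to the marker string, which shows up in several places.
    haystack = " ".join([
        query_string,
        environ.get("HTTP_REFERER", ""),
        environ.get("SERVER_PROTOCOL", ""),
    ])
    return BLOCKED_MARKER in haystack


class BlockJunkTraffic:
    """WSGI middleware that answers matching requests with an empty 403."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_suspicious(environ):
            start_response("403 Forbidden", [("Content-Length", "0")])
            return [b""]
        return self.app(environ, start_response)
```

Checking the marker string as well as the revision ID makes the filter slightly harder to dodge by changing "oldid" alone, but a sender who drops both would still get through.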

Event Timeline

In particular, it's not clear to me what hasty.ai has to do with this at all (or if some service they provide is compromised and being used to DOS me?). I can't see IPs so I have no clue where this is coming from.

The pattern is indeed quite strange. I took a look this morning (around 06:00 UTC), and at that time today's nginx logs (which we rotate at midnight UTC) had 89 unique IP addresses when grepping for "hasty.ai". Most of those look like cloud/hosting providers of some sort, but some look like residential addresses, and they are spread all around the address space. There are no clear user-agent patterns either (just a bunch of different browser-like UAs). Most of the requests have a referrer header that carries the target Toolforge URL as a GET parameter; I tested a few of those referrers and they didn't immediately look like open redirects.
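
The unique-IP count above can be reproduced with something like the following Python sketch. It assumes the client address is the first whitespace-separated field of each log line (as in nginx's default "combined" format); the actual Toolforge proxy log format is not shown in this task, and the sample lines use made-up documentation addresses rather than real data.

```python
def unique_client_ips(lines, needle="hasty.ai"):
    """Collect the first field of every log line that contains `needle`."""
    ips = set()
    for line in lines:
        if needle in line:
            fields = line.split(None, 1)
            if fields:
                ips.add(fields[0])
    return ips


# Placeholder lines (RFC 5737 documentation addresses), not real log data.
sample = [
    '192.0.2.10 - - "GET /?oldid=887576204&url=google.cl - http://hasty.ai HTTP/1.1" 403',
    '198.51.100.7 - - "GET /?oldid=887576204&url=google.ee - http://hasty.ai HTTP/1.1" 403',
    '192.0.2.10 - - "GET /?oldid=887576204&url=google.ad - http://hasty.ai HTTP/1.1" 403',
    '203.0.113.5 - - "GET /?oldid=12345&url=example.org HTTP/1.1" 200',
]

print(len(unique_client_ips(sample)))
```

On a real log file the same function can be fed an open file object, for example `unique_client_ips(open(path))` with `path` pointing at the day's access log.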

I don't understand why hasty.ai sometimes shows up in the uwsgi log as part of the protocol (%(proto) is normally HTTP/1.1 but with this traffic is sometimes - http://hasty.ai HTTP/1.1). This almost sounds like it's exposing a bug in uwsgi's request parsing or log formatting but I'm again unable to really trace this down as I don't have an obvious way to see the raw HTTP traffic and my attempts to simulate a garbled HTTP request with telnet did not produce any logs like this.

This is what it looks like on our front proxy: GET /?lang=en&project=wikipedia&oldid=887576204&action=compare&url=google.cl - http://hasty.ai HTTP/1.1
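
One possible reading of the odd %(proto) value, sketched below in Python: if the request line contains literal spaces and a parser takes everything after the second token as the protocol, the result is exactly "- http://hasty.ai HTTP/1.1". Whether uwsgi actually parses request lines this way has not been verified here; the functions split_lenient and split_strict are hypothetical and only illustrate the difference.

```python
# Request line as seen on the front proxy.
request_line = (
    "GET /?lang=en&project=wikipedia&oldid=887576204"
    "&action=compare&url=google.cl - http://hasty.ai HTTP/1.1"
)


def split_lenient(line):
    """Split off method and target at the first two spaces; keep the rest."""
    method, _, rest = line.partition(" ")
    target, _, proto = rest.partition(" ")
    return method, target, proto


def split_strict(line):
    """Require exactly three space-separated tokens, else reject the line."""
    parts = line.split(" ")
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % line)
    return tuple(parts)


method, target, proto = split_lenient(request_line)
print(proto)

try:
    split_strict(request_line)
except ValueError as exc:
    print("strict parser rejects it:", exc)
```

Under this assumption the extra text is sent unencoded by the client as part of the request target, and a lenient split folds it into the protocol field instead of rejecting the request.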

Samwalton9-WMF subscribed.

Looks like the OAuth feature is now integrated? If so should we put this in Tech News?

I wanted to put this in next week's Tech News, but the deploy missed the drafting deadline by 3 hours. I've left a message about the change on the talk page for the following week's Tech News.

This is amazing! Look at the difference it's making:

Screenshot from 2024-10-10 15-48-45.png (330×797 px, 35 KB)

I think running out of our daily quota is officially a thing of the past.

Can this be closed as resolved now?

Chlod claimed this task.

Marking as resolved since the underlying issue (bot scraping of the tool) has been dealt with. As for the odd appearance of - http://hasty.ai in the HTTP request line in violation of RFC 9112, this is something that should be split off into its own task (if it's even worth investigating).