
Automatic detection of distributed bots
Closed, Resolved · Public

Description

We have repeatedly seen bots evade the WDQS throttling mechanisms by using either multiple hosts or randomized user agents. Such bots can create significant load on the service, causing massive lag. We can throttle them properly by adding their query patterns to the pattern.txt file; however, finding which bot is causing the load and what its pattern is remains a largely manual process.

I wonder if it's possible to automate this process. It could work like this:

  1. Detect sets of "substantially similar" queries. How to do this is the least clear part; maybe by hashing the first N characters of the query? Those are usually similar.
  2. Collect all IPs for each such set. If the set of IPs is small and the set of queries is large, the set is suspicious.
  3. Check how big and frequent these queries are. If they are over a certain threshold, output the set as a potential bot query, along with statistics on IPs and user agents.
  4. We might also start the search for "suspicious" queries from queries that time out (i.e. scan the list of timed-out queries first and, if there are "frequent flyers", see if there are more of them).
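
To make the steps above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not an existing WDQS tool: the record fields (`query`, `ip`, `user_agent`, `duration_ms`, `timed_out`), the function names, and all thresholds are hypothetical, and collapsing whitespace before hashing is an extra assumption on top of the "hash the first N characters" idea.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Group:
    """Statistics for one set of "substantially similar" queries."""
    sample: str = ""
    count: int = 0
    timeouts: int = 0
    total_ms: float = 0.0
    ips: set = field(default_factory=set)
    agents: set = field(default_factory=set)


def fingerprint(query, prefix_len=200):
    """Step 1: hash the first prefix_len characters of the query.

    Whitespace is collapsed first (an assumption) so that formatting
    differences do not split otherwise identical queries.
    """
    normalized = re.sub(r"\s+", " ", query.strip())
    return hashlib.sha1(normalized[:prefix_len].encode("utf-8")).hexdigest()


def find_suspects(records, min_queries=100, max_ips=10,
                  min_avg_ms=1000.0, prefix_len=200):
    """Group log records by fingerprint and return the suspicious groups.

    Each record is a dict with the hypothetical keys named above.
    All thresholds are placeholders that would need tuning.
    """
    groups = defaultdict(Group)
    for rec in records:
        g = groups[fingerprint(rec["query"], prefix_len)]
        g.sample = g.sample or rec["query"]
        g.count += 1
        g.total_ms += rec["duration_ms"]
        g.timeouts += 1 if rec["timed_out"] else 0
        g.ips.add(rec["ip"])              # step 2: collect IPs per group
        g.agents.add(rec["user_agent"])   # kept for the report in step 3

    suspects = [
        g for g in groups.values()
        if g.count >= min_queries                 # many queries ...
        and len(g.ips) <= max_ips                 # ... from few IPs (step 2)
        and g.total_ms / g.count >= min_avg_ms    # and expensive (step 3)
    ]
    # Step 4: list groups with the most timed-out queries first.
    suspects.sort(key=lambda g: (g.timeouts, g.count), reverse=True)
    return suspects
```

Each returned group carries a sample query plus its IP and user-agent sets, so a person can still review it before anything is added to pattern.txt. A fixed-length prefix only works when the varying part of the query comes after the first N characters, so it should be treated as a starting point rather than a reliable similarity measure.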

That's the idea so far, but more substantiation is needed. There could be manual components in this, but the goal is to make detecting high-frequency distributed bots easier so we can properly throttle them.

Event Timeline

Restricted Application added a subscriber: Aklapper.

This should probably be implemented as part of a more generic throttling / access strategy. It does not make sense to invest in this if we're keeping it just for WDQS.

Gehel claimed this task.

This should probably be implemented as part of a more generic throttling / access strategy. It does not make sense to invest in this if we're keeping it just for WDQS.

Did you mean to close it as "Declined"? Or do you really mean that such throttling has been implemented in the meantime?