
Label high volume bot spikes in pageview data as automated traffic
Open, Needs Triage (Public)

Description

Our pageview pipeline labels as “user” traffic many requests that we know actually come from bots crawling our site. Because we cannot classify these requests as automated, our pageview stats (especially top pageviews) are distorted. At the time of this writing, our percentage of bot requests is reported to be about 20%; in reality it is probably quite a bit higher, as much as 5-8% higher overall per our research on this matter. This is the parent task to keep track of the work to deploy the "high volume bot spike" detection code.

The bot spikes we are after look like this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2018-11&end=2019-10&pages=Line_shaft

They are sharp and large in terms of traffic.

Also see the recent bot spikes on Hungarian Wikipedia: T237282

Event Timeline

Nuria created this task. Thu, Nov 14, 7:20 PM
Ottomata moved this task from Incoming to Bots on the Analytics board. Mon, Nov 18, 4:44 PM
Isaac added a subscriber: Isaac. Tue, Nov 26, 9:51 PM

Hey @Nuria -- I had been doing some of my own research on this as part of some background work around re-use of Wikimedia content. I wanted to throw in a few thoughts in case they're useful (and am largely excited about the proposed spike detection!):

  • +1 to identifying weblight traffic via user-agent string. It's a large proportion of the "None" referers, which clouds that data. I suspect it's mostly search but obviously don't know that.
  • The weblight data got me thinking about bot-like traffic that is really VPNs or other proxies. I took a look at some of these userhashes that have very high numbers of pageviews per hour and have generated a few hypotheses:
    • Some of the userhashes have pageviews that are nearly all for a single project (e.g., en.wikipedia) and/or repeatedly hit the same title (e.g., the userhash behind this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Simple_Mail_Transfer_Protocol) -- those feel very likely to be bots. VPNs/proxies, though, often seem to mix projects (because lots of different users are coming in via the same "device") and have an expected share of visits to Wikipedia's Main Page (~1%). So personally I think a high pageview count combined with a more uniform distribution of projects / titles for a single userhash might be good evidence of a VPN/proxy as opposed to a bot. I don't have a great recommendation for what that threshold is right now, but would be happy to work with you on it.
    • I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs
    • It looks like Google Translate preserves the user-agent even though the IP seems to maybe be Google servers and not the actual client, so I doubt it would show up in the data but they'd also be simple to exclude via presence of x_analytics_map translationengine.
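
A rough sketch of the concentration idea in the bullets above, in Python. This is not the actual detection code: the function names, the input shape (a list of (project, title) pairs for one userhash in one hour) and every threshold are hypothetical placeholders, since no threshold has been agreed on yet.

```python
from collections import Counter


def top_share(values):
    """Fraction of all items taken by the single most common value."""
    counts = Counter(values)
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / sum(counts.values())


def guess_source(pageviews, min_views=1000, project_threshold=0.95,
                 title_threshold=0.5):
    """Guess whether one high-volume userhash looks like a bot or a proxy.

    pageviews: list of (project, title) tuples for a single userhash
    within one hour. All thresholds are made-up placeholders.
    """
    if len(pageviews) < min_views:
        # Not high volume, so this heuristic does not apply.
        return "unclassified"

    project_share = top_share(project for project, _ in pageviews)
    title_share = top_share(pageviews)

    # Traffic concentrated on one project and/or one title looks bot-like.
    if project_share >= project_threshold or title_share >= title_threshold:
        return "likely bot"

    # High volume spread across projects and titles is more consistent
    # with many real users behind one VPN or proxy.
    return "possible VPN/proxy"
```

For example, 1,500 hourly views of a single title from one userhash would come back as "likely bot", while the same volume spread evenly over several projects and many titles would come back as "possible VPN/proxy". The Main Page share (~1%) and the device mix could be added as further signals in the same way.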

@Isaac weblight data will be excluded from the classification entirely: the way it gets to us, it does not have any client IP that we can use. This is true for any other proxy, as our traffic layer for the most part does not forward the client IP, and this is not likely to change in the near term. See: T232795

I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs

This is what our community does right now to exclude bots from top lists traffic. See: https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions

It looks like Google Translate preserves the user-agent even though the IP seems to maybe be Google servers

Google Translate is high volume for event data but not that high for pageview data, so I had not considered it. I can certainly exclude it from the classification explicitly.

Isaac added a comment. Wed, Nov 27, 6:25 PM

weblight data will be excluded from the classification entirely, the way it gets to us it does not have any client IP that we can use. This is true for any other proxy as our traffic layer does not forward for the most part the client IP, this is not likely to change in the near term. See: T232795

Thanks for the pointer!