Label high volume bot spikes in pageview data as automated traffic
Closed, Resolved | Public | 8 Estimated Story Points

Assigned To

None

Authored By

	• Nuria
	Nov 14 2019, 7:20 PM

Description

Our pageview pipeline labels as “user” traffic many requests that we know actually come from bots crawling our site. Because we cannot classify these requests as automated in origin, our pageview stats (especially top pageviews) are distorted. At the time of this writing our percentage of bot requests is reported as about 20%; in reality it is probably quite a bit higher, as much as 5-8% higher overall per our research on this matter. This is the parent task to keep track of the work to deploy the "high volume bot spike" detection code.

The bot spikes we are after look like this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2018-11&end=2019-10&pages=Line_shaft

They are sharp and large in terms of traffic.

Also see recent bot spikes on Hungarian Wikipedia: T237282
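To make "sharp and large" concrete, here is a minimal sketch of how such a spike could be flagged in a per-page daily pageview series. It is not the detection code this task deploys. The function name `find_spikes`, the `factor` and `min_views` thresholds, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import median

def find_spikes(daily_views, factor=10.0, min_views=1000):
    """Return indices of days whose views stand far above the page's baseline.

    Hypothetical heuristic: a day is a spike when its count exceeds both
    `factor` times the median of the series and an absolute floor `min_views`.
    """
    baseline = median(daily_views)
    threshold = max(factor * baseline, min_views)
    return [i for i, views in enumerate(daily_views) if views > threshold]

# Made-up series: a quiet page with two days of sudden heavy traffic.
views = [120, 130, 110, 125, 9000, 8700, 115, 140]
print(find_spikes(views))  # [4, 5]
```

The median is used as the baseline because a few extreme days barely move it, whereas a mean would be pulled up by the spike itself.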

Details

	Subject	Repo	Branch	Lines +/-
	Count automated traffic as bots in turnilo's homescreen	operations/puppet	production	+2 -2


Related Objects

Status	Assigned	Task
Open	None	T138207 [Open question] Improve bot identification at scale
Resolved	None	T238357 Label high volume bot spikes in pageview data as automated traffic
Resolved	None	T238358 Deploy high volume bot spike detector to hungarian wikipedia
Resolved	• Nuria	T238360 Hourly Feature extraction for bot detection from webrequest
Resolved	• Nuria	T238361 Hourly labeling of "automated" traffic before loading of pageviews into pageview_hourly
Resolved	JAllemandou	T238363 Vet high volume bot spike detection code
Resolved	JAllemandou	T247342 Create UDF for actor id generation
Resolved	JAllemandou	T247344 Automated deletion of actor data for bot prediction after 90 days
Resolved	None	T239532 "Venuše (planeta)" on cs.wp has surprisingly high numbers in Pageviews Analysis (and also Topviews Analysis)
Resolved	JAllemandou	T250744 Unique devices, retrofit with bot detection code
Resolved	JAllemandou	T255467 Create intermediate dataset: pageview with actor information

Event Timeline

• Nuria created this task.Nov 14 2019, 7:20 PM

• Nuria mentioned this in T232992: Manipulation of pageview statistics German Wikipedia.Nov 16 2019, 4:11 AM

Ottomata moved this task from Incoming to Datasets on the Analytics board.Nov 18 2019, 4:44 PM

leila removed a project: Research-Freezer.Nov 20 2019, 12:22 AM

Hey @Nuria -- I had been doing some of my own research on this as part of some background work around re-use of Wikimedia content. I wanted to throw in a few thoughts in case they're useful (and am largely excited about the proposed spike detection!):

+1 to identifying weblight traffic via user-agent string. It's a large proportion of the "None" referers, which clouds that data. I suspect it's mostly search but obviously don't know that.
The weblight data got me thinking about bot-like traffic that is really VPNs or other proxies. I took a look at some of these userhashes that have very high numbers of pageviews per hour and have generated a few hypotheses:
- Some of the userhashes have pageviews that are nearly all for a single project (e.g., en.wikipedia) and/or repeatedly hit the same title (e.g., the userhash behind this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Simple_Mail_Transfer_Protocol) -- those feel like they are very likely bots. VPN/proxies though often seem to mix projects (because lots of different users are coming in via the same "device") and have an expected number of visits to Wikipedia's Main Page (~1%), so personally I think a high pageview count but more uniform distribution of projects / titles associated with a single userhash might be good evidence of a VPN/proxy as opposed to bot. I don't have a great recommendation for what that threshold is right now, but would be happy to work with you on it.
- I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs
- It looks like Google Translate preserves the user-agent even though the IP seems to maybe be Google servers and not the actual client, so I doubt it would show up in the data but they'd also be simple to exclude via presence of x_analytics_map translationengine.
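The hypotheses above can be sketched as a rough heuristic that separates a high-volume userhash concentrated on one title or project (more bot-like) from one spread across many titles and projects (more VPN/proxy-like). This is only an illustration of the idea, not code from the pipeline. The function name, the labels, and every threshold (`high_volume`, `title_share`, `project_share`) are assumptions, since no threshold has been agreed on.

```python
from collections import Counter

def classify_userhash(pageviews, high_volume=500, title_share=0.5, project_share=0.95):
    """Label one userhash from its hourly pageviews.

    pageviews: list of (project, title) tuples seen for the userhash in one hour.
    All thresholds are hypothetical placeholders.
    """
    total = len(pageviews)
    if total < high_volume:
        return "low_volume"
    # Share of traffic going to the single most-viewed (project, title) pair.
    top_title = Counter(pageviews).most_common(1)[0][1] / total
    # Share of traffic going to the single most-viewed project.
    top_project = Counter(project for project, _ in pageviews).most_common(1)[0][1] / total
    if top_title >= title_share or top_project >= project_share:
        return "likely_bot"
    return "possible_proxy"

# Made-up examples: one hash hammering a single title, one spread over projects.
bot_like = [("en.wikipedia", "Simple_Mail_Transfer_Protocol")] * 600
projects = ["en.wikipedia", "de.wikipedia", "hu.wikipedia"]
proxy_like = [(projects[i % 3], f"Page_{i}") for i in range(600)]
print(classify_userhash(bot_like))    # likely_bot
print(classify_userhash(proxy_like))  # possible_proxy
```

A device mix or the expected share of Main Page visits could be added as further features in the same way.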

@Isaac weblight data will be excluded from the classification entirely: the way it gets to us, it does not have any client IP that we can use. This is true for any other proxy, as our traffic layer for the most part does not forward the client IP; this is not likely to change in the near term. See: T232795

I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs

This is what our community does right now to exclude bots from top lists traffic. See: https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions

It looks like Google Translate preserves the user-agent even though the IP seems to maybe be Google servers

Google Translate is high volume for event data but not that high for pageview data, so I had not considered it. I can certainly exclude it from the classification explicitly.

weblight data will be excluded from the classification entirely: the way it gets to us, it does not have any client IP that we can use. This is true for any other proxy, as our traffic layer for the most part does not forward the client IP; this is not likely to change in the near term. See: T232795

Thanks for the pointer!

Isaac mentioned this in T239625: Improve quality of external referer data.Dec 2 2019, 3:41 PM

Isaac mentioned this in T235784: Identify data / questions that we can(not) answer regarding external reuse.Dec 23 2019, 5:05 PM

SNowick_WMF subscribed.Jan 7 2020, 6:51 PM

• Nuria added a subtask: T239532: "Venuše (planeta)" on cs.wp has surprisingly high numbers in Pageviews Analysis (and also Topviews Analysis).Apr 6 2020, 1:36 AM

Change 594272 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Count automated traffic as bots in turnilo's homescreen

https://gerrit.wikimedia.org/r/594272

gerritbot added a project: Patch-For-Review.May 4 2020, 7:04 PM

• Nuria closed subtask T238358: Deploy high volume bot spike detector to hungarian wikipedia as Resolved.May 4 2020, 7:07 PM

Change 594272 merged by Ottomata:
[operations/puppet@production] Count automated traffic as bots in turnilo's homescreen

https://gerrit.wikimedia.org/r/594272

Maintenance_bot removed a project: Patch-For-Review.May 8 2020, 4:10 PM

• Nuria closed this task as Resolved.May 14 2020, 2:43 PM

• Nuria set the point value for this task to 8.

• Nuria set Final Story Points to 8.

• Nuria closed subtask T239532: "Venuše (planeta)" on cs.wp has surprisingly high numbers in Pageviews Analysis (and also Topviews Analysis) as Resolved.

• Nuria closed subtask T250744: Unique devices, retrofit with bot detection code as Resolved.Jul 23 2020, 4:36 AM