Goal
Brainstorm potential new approaches to bot detection (and bonus reader session identification) that don't rely on the uniqueness of IP addresses and user-agent strings.
Overview
Bot detection -- i.e. identifying pageviews to Wikimedia projects that are believed not to come from human readers -- occurs in two ways:
- Self-identified bots: pageviews whose user-agent declares a bot (code) are labeled agent_type = spider in the wmf.pageview_actor table. This presumably works well enough, and it is not clear that it needs updating.
- Detected bots: heuristics such as the number of pageviews per "user" are applied, and all pageviews from a "user" that exceeds some threshold are labeled agent_type = automated (code).
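As a rough illustration of the threshold heuristic in the second bullet, here is a minimal Python sketch. It is not the production logic: the function name `label_pageviews`, the record fields `ip` and `user_agent`, and the threshold value are all hypothetical, and a "user" is approximated here by the (IP, user-agent) pair alone.

```python
from collections import Counter

# Hypothetical cutoff for illustration only; the real heuristics and
# thresholds are in the linked code, not reproduced here.
AUTOMATED_THRESHOLD = 3


def label_pageviews(pageviews, threshold=AUTOMATED_THRESHOLD):
    """Label each pageview as 'automated' or 'user'.

    Assumption: a "user" is approximated by the (ip, user_agent) pair.
    Every pageview from a pair whose total count exceeds the threshold
    is labeled 'automated'.
    """
    counts = Counter((pv["ip"], pv["user_agent"]) for pv in pageviews)
    return [
        "automated" if counts[(pv["ip"], pv["user_agent"])] > threshold else "user"
        for pv in pageviews
    ]


# Made-up sample data: one busy (IP, user-agent) pair and one quiet pair.
sample = (
    [{"ip": "198.51.100.7", "user_agent": "UA-A"}] * 5
    + [{"ip": "203.0.113.9", "user_agent": "UA-B"}] * 2
)
labels = label_pageviews(sample)
print(Counter(labels))
```

Note that the sketch cannot tell whether the five pageviews from the first pair came from one script or from several readers behind the same address and device type; it only sees the count per pair.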
The heuristic approach depends on aggregating pageviews per user, where a "user" is defined as the combination of the IP address, the user-agent string, and a few other components associated with a pageview (code). This has two potential drawbacks:
- False negatives: bots that change their device information or IP address evade detection, i.e. they are labeled as user when they should be automated. This probably won't be covered here (we don't have good ground truth, so it's hard to know how big an issue this is), but addressing it would be a bonus!
- False positives: multiple people who share both an IP address and a user-agent (device information) within the same day might register as automated in our data when in fact they are many individual readers who should be labeled as user. This is the main concern of this task, as certain changes to the assignment of IPs and user-agents might lead to an increase in false positives -- e.g.,
Open Questions
- Could we replace our user-agent + IP approach with something that relies on cookies without compromising on user privacy?
- Can we keep user-agent + IP but add some additional context like a simple cookie or other data in webrequests that would help separate users?
- How big of an issue is this? What impact would the Safari change mentioned above have if all pageviews from Safari started coming from a much smaller number of IP addresses? What impact might the generic user-agents have?
