
Revisit approach to automated bot detection
Open, Needs Triage, Public

Description

Goal

Brainstorm potential new approaches to bot detection (and, as a bonus, reader session identification) that don't rely on the uniqueness of IP addresses and user-agent strings.

Overview

Bot detection -- i.e. identifying pageviews to Wikimedia projects that are believed not to come from human readers -- occurs in two ways:

  • Self-identified bots declare themselves via the user-agent (code) and are labeled agent_type = spider in the wmf.pageview_actor table. This presumably works well enough, and it's not clear that it needs updating.
  • Other bots are detected via heuristics such as the number of pageviews per "user", with all pageviews that exceed some threshold labeled agent_type = automated (code).

This second approach depends on aggregating pageviews per user, where a "user" is defined as a combination of the IP address, user-agent string, and a few other components associated with a pageview (code); a rough sketch of this fingerprint-and-threshold labeling follows the list below. This has two potential drawbacks:

  • Bots that change their device information or IP address and thus evade detection (false negatives) -- i.e. they are labeled user when in fact they should be automated. This probably won't be covered here (we don't have good ground truth, so it's hard to know how big an issue this is), but addressing it would be a bonus!
  • Multiple people who share both an IP address and user-agent (device information) within the same day and thus might register as automated in our data when in fact they are just many readers (false positives) who should be labeled user. This is the main concern of this task, since certain changes to how IP addresses and user-agents are assigned might lead to an increase in false positives -- e.g.,
    • Chrome making UAs more generic (T242825)
    • Safari grouping users at generic IP addresses (T289795)
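
As a rough illustration of the heuristic described above, here is a minimal Python sketch of the fingerprint-and-threshold idea. The field names, the fingerprint components, and the threshold are illustrative only; the production logic lives in the linked code and operates on wmf.pageview_actor.

```python
# Minimal, hypothetical sketch of threshold-based "automated" labeling.
# Field names and the threshold are illustrative, not the production values.
import hashlib
from collections import Counter

PAGEVIEW_THRESHOLD = 800  # illustrative daily cutoff per "user"

def actor_fingerprint(ip: str, user_agent: str, accept_language: str) -> str:
    """Combine the request attributes that currently stand in for a 'user'."""
    raw = "|".join([ip, user_agent, accept_language])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def label_pageviews(pageviews: list[dict]) -> list[dict]:
    """Label each pageview as 'user' or 'automated' based on per-actor volume."""
    counts = Counter(
        actor_fingerprint(pv["ip"], pv["user_agent"], pv["accept_language"])
        for pv in pageviews
    )
    labeled = []
    for pv in pageviews:
        fp = actor_fingerprint(pv["ip"], pv["user_agent"], pv["accept_language"])
        agent_type = "automated" if counts[fp] > PAGEVIEW_THRESHOLD else "user"
        labeled.append({**pv, "agent_type": agent_type})
    return labeled
```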

Open Questions

  • Could we replace our user-agent + IP approach with something that relies on cookies without compromising on user privacy? (A hedged sketch of one possibility follows these questions.)
  • Can we keep user-agent + IP but add some additional context like a simple cookie or other data in webrequests that would help separate users?
  • How big of an issue is this? What impact would the Safari change mentioned above have if all pageviews from Safari started coming from a much smaller number of IP addresses? What impact might the generic user-agents have?
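
On the first question above, one hedged possibility is to derive the per-day "actor" signature from a random first-party cookie instead of IP + user-agent, hashing it with a salt that is rotated and discarded daily so values cannot be linked across days. Everything below is a hypothetical sketch, not an existing Wikimedia mechanism, and all names are made up.

```python
# Hypothetical sketch: a random, short-lived first-party cookie replaces the
# IP + user-agent fingerprint. The cookie value carries no user data, and the
# daily-rotating salt is never persisted beyond the day it is used.
import hashlib
import secrets

def new_session_token() -> str:
    """Random token set as a cookie on first pageview; carries no user data."""
    return secrets.token_hex(16)

def daily_actor_signature(session_token: str, daily_salt: bytes) -> str:
    """Server-side signature used only for per-day aggregation, then discarded."""
    return hashlib.sha256(daily_salt + session_token.encode("utf-8")).hexdigest()

# Example: generate a fresh salt once per day and discard it afterwards.
daily_salt = secrets.token_bytes(32)
signature = daily_actor_signature(new_session_token(), daily_salt)
```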

Event Timeline

Expanding on what @EBernhardson said over on Slack, we have implemented a very simple heuristic which, following Erik, we can call SillyBotDetection. (Whether the bots or the detection is silly is left as an exercise for the reader.) We have been primarily focused on generating user search query corpora for training or testing, though Erik may have done some additional bot detection for the clickstream data for training.

Some of the criteria we've used for query corpus generation include (a rough sketch of these filters follows the list):

  • Exclude IPs that make more than n queries a day.
    • n has ranged from 30 to 100.
    • This can exclude shared IPs.
    • This can exclude heavy users, including some editors, but we've decided that's okay in many cases because we want "normal human users", which doesn't include all editors (or search engineers). ;)
    • We don't (or at one point didn't) have easy access to user agents, so we didn't bring that into it.
  • Require queries to (appear to) have come from the search box (as opposed to URL hacking directly to get results, which is more bot-ish).
  • Require that only the content index was searched, which excludes some bots and some power users.
    • This doesn't work on some wikis (Spanish, IIRC) because they have set their default to include additional indexes, so we have to hack around that.
  • As a further bot-mitigation technique, we usually only accept one randomly chosen query from each IP per day.
    • So, if a bot gets through, they can only add one query/day to the corpus, instead of being over-represented with up to n–1 (whatever the limit above is) queries per day.
    • Another variation on this is only considering unique queries from an IP; so we might let multiple queries from the IP through, but not multiple copies of the same query (with or without some minimal normalization).
    • This also over-samples low-volume users, which we hope increases the diversity of our sample. At the very least, it gives equal weight to low-volume and higher-volume users.
    • We have ignored this filter for certain purposes, such as when we need sessionized data or we want to look for query reformulation, etc.—things that require multiple queries from the same user.
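
Here is the rough sketch of those filters as a single pandas pass. The column names and the value of n are hypothetical; the real queries live in Jupyter notebooks and .hql files against webrequest-style data, and vary per use case.

```python
# Illustrative pandas version of the corpus filters above; column names are made up.
import pandas as pd

MAX_QUERIES_PER_IP_PER_DAY = 50  # "n" above has ranged from roughly 30 to 100

def build_query_corpus(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (illustrative): ip, day, query, from_search_box, indices_searched."""
    # Keep queries that appear to come from the search box and only hit the content index.
    df = df[df["from_search_box"] & (df["indices_searched"] == "content")]
    # Exclude IPs that made more than n queries on a given day.
    per_ip_day = df.groupby(["ip", "day"])["query"].transform("size")
    df = df[per_ip_day <= MAX_QUERIES_PER_IP_PER_DAY]
    # Accept one randomly chosen query per IP per day to limit over-representation.
    return df.groupby(["ip", "day"], group_keys=False).sample(n=1, random_state=0)
```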

All of this has been done semi-manually, in that we have queries in Jupyter notebooks or .hql files that we modify each time we need a query corpus.

For our user query dashboard (i.e., finding the most common query on enwiki or Commons), we also have an informal heuristic (actually, I don't recall if Erik implemented it as a filter, or if we just keep it in mind when looking at the dashboard), which is that frequent queries without any variation are probably bot-like. For example, if there are 500 searches for English Wikipedia with no variation in spacing or punctuation, that's a bot (or at least a link to that specific search); real humans will have lots of variation: english wikipedia, English Wikipedia, ENGLISH WIKIPEDIA, eNGLISH wIKIPEDIA, english.wikipedia, english  wikipedia, etc.
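
A hedged sketch of that "no variation" heuristic: group queries under a normalization function, then flag groups that are frequent yet only ever arrive in one exact surface form. The threshold, input shape, and default normalizer here are illustrative stand-ins, not what the dashboard actually does.

```python
# Frequent queries with zero surface variation are treated as bot-like (or as a
# shared link to that exact search). All parameters here are illustrative.
from collections import defaultdict
from typing import Callable, Iterable

def flag_unvaried_queries(
    raw_queries: Iterable[str],
    normalize: Callable[[str], str] = str.lower,  # stand-in; see the heavier normalization below
    min_count: int = 100,
) -> set[str]:
    counts: dict[str, int] = defaultdict(int)
    variants: dict[str, set[str]] = defaultdict(set)
    for q in raw_queries:
        key = normalize(q)
        counts[key] += 1
        variants[key].add(q)
    # Frequent, yet only one exact raw form ever seen: probably not human typing.
    return {key for key, n in counts.items() if n >= min_count and len(variants[key]) == 1}
```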

This is not really bot detection anymore, but I'll include it anyway: for the dashboards, because we are interested in intent, we also do "heavy" normalization and strip all punctuation, including quotes, to group similar queries. All of the variants of english wikipedia above should get the same results, while "english wikipedia" (with quotes) will get fewer results. However, in both cases the users probably have the same or similar intent.
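
A rough sketch of that "heavy" normalization, assuming punctuation is mapped to spaces (rather than simply deleted) and whitespace is then collapsed; the exact rules in our notebooks may differ.

```python
# Illustrative heavy normalization: lowercase, turn punctuation (including quotes)
# into spaces, and collapse whitespace, so the english wikipedia variants above
# all map to the same key.
import re
import string

_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})

def normalize(query: str) -> str:
    q = query.lower().translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", q).strip()

assert normalize('"English  Wikipedia"') == normalize("english.wikipedia") == "english wikipedia"
```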

BTullis added subscribers: odimitrijevic, BTullis.

I have discovered this ticket: T310846: Improve Bot Detection Heuristics

Perhaps that should be merged as a duplicate of this. @odimitrijevic, what do you think?

Product Analytics reports monthly metrics, and we have observed a spike in automated bot traffic in recent months of 2022 (Ref: Key Product Metrics).
Questions we want to address:

cc @JAllemandou , we discussed this in the DE office hours today.

We are once again seeing a rise in automated traffic in March 2023. This time it is showing up in external (search engine) referral traffic (chart), in addition to none/direct traffic.


March 2023 Key Product Metrics

leila added subscribers: Miriam, leila.

Context for moving the task to the freezer: We don't have plans to prioritize working on this task in the coming 6 months. @Miriam would like to hold a conversation within the Data Science and Engineering group on this topic. She will create a dedicated task for that work when the time comes.