[EPIC] Detect bots from searches (and learn C++/Rcpp)
Closed, Invalid · Public


The goal of this task is two-fold:

  1. Learn C++ and Rcpp (because this is a very important set of skills for at least one analyst on the Discovery/Analysis team to have), and
  2. Produce a classifier (the deliverable) that estimates whether a given search was most likely made by a bot, or whether the user behind it is a bot, using predictors such as:
    • User-agent / accept-language metadata
    • Country of origin
    • Query features
    • Zero results (maybe?)
    • Number of searches made by that user per hour
    • Time elapsed between searches
    • Number of valid words in the search query, as determined using a dictionary picked by TextCat (possibly?)
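
Two of the predictors above, searches per hour and time elapsed between searches, can be derived from a user's search timestamps alone. The following is a minimal sketch of that derivation, not code from this task: the `SearchTiming` struct, the `summarize_timing` function, the one-hour floor on the observation window, and the use of the median gap are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-user summary; field names are illustrative only.
struct SearchTiming {
    double searches_per_hour;   // searches divided by the active span in hours
    double median_gap_seconds;  // median time between consecutive searches
};

// `timestamps` holds one user's search times in seconds, sorted ascending.
SearchTiming summarize_timing(const std::vector<double>& timestamps) {
    assert(std::is_sorted(timestamps.begin(), timestamps.end()));

    // NaN marks "no gap available" (fewer than two searches).
    SearchTiming out{0.0, std::numeric_limits<double>::quiet_NaN()};
    if (timestamps.empty()) return out;

    // Assumption: floor the span at one hour so that a short burst is
    // reported as a count per hour rather than extrapolated to a huge rate.
    const double span_hours = (timestamps.back() - timestamps.front()) / 3600.0;
    out.searches_per_hour =
        static_cast<double>(timestamps.size()) / std::max(span_hours, 1.0);

    if (timestamps.size() < 2) return out;

    // Collect the gaps between consecutive searches.
    std::vector<double> gaps;
    gaps.reserve(timestamps.size() - 1);
    for (std::size_t i = 1; i < timestamps.size(); ++i) {
        gaps.push_back(timestamps[i] - timestamps[i - 1]);
    }

    // Median of the gaps; average the two middle values for an even count.
    std::sort(gaps.begin(), gaps.end());
    const std::size_t mid = gaps.size() / 2;
    out.median_gap_seconds = (gaps.size() % 2 == 1)
                                 ? gaps[mid]
                                 : 0.5 * (gaps[mid - 1] + gaps[mid]);
    return out;
}
```

The median is used here instead of the mean so that one long pause does not hide an otherwise rapid, regular search pattern; whether that is the right summary for this classifier is an open question.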

This work (which will be kept in this repository) will allow us to identify and filter out probable bot searches that affect our zero results rate (ZRR) but are not explicitly labelled as bot searches by our current heuristics, which rely on a list of hand-picked user agents.

We expect the training and classification process to be:

  1. Query CirrusSearchRequestSet in Hive and do as much request refinement there as possible
  2. Perform additional processing in R (such as creating a properly formatted matrix of predictors and augmenting the data)
  3. Use Rcpp to hook into a C++ machine learning library that includes a classifier
    • Why a C++ ML library when R already has many ML packages? We deal with A LOT of searches, so speed is important. Also this makes for a good end-goal :)
    • Which library? Well, there's:
      • mlpack (implements Naive Bayes), which is what I'm currently focusing on
      • Shark (implements random forest, neural network)
      • dlib (implements SVM)
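
To make step 3 concrete, here is a minimal, self-contained sketch of the kind of classifier involved: a Gaussian Naive Bayes written in plain standard C++. It is not mlpack's API and not the code from this task; the `GaussianNB` struct, its `fit`/`predict` methods, and the small variance floor are assumptions made for illustration. In practice the library's own implementation would be called through Rcpp instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical hand-rolled Gaussian Naive Bayes (illustration only).
struct GaussianNB {
    std::vector<double> log_prior;               // [class]
    std::vector<std::vector<double>> mean, var;  // [class][feature]

    // X: one row per search/user, y: class labels in 0..n_classes-1.
    void fit(const std::vector<std::vector<double>>& X,
             const std::vector<int>& y, int n_classes) {
        assert(!X.empty() && X.size() == y.size());
        const std::size_t d = X[0].size();
        std::vector<double> count(n_classes, 0.0);
        log_prior.assign(n_classes, 0.0);
        mean.assign(n_classes, std::vector<double>(d, 0.0));
        var.assign(n_classes, std::vector<double>(d, 0.0));

        // Per-class feature sums, then means.
        for (std::size_t i = 0; i < X.size(); ++i) {
            count[y[i]] += 1.0;
            for (std::size_t j = 0; j < d; ++j) mean[y[i]][j] += X[i][j];
        }
        for (int c = 0; c < n_classes; ++c) {
            assert(count[c] > 0.0);  // every class needs at least one example
            for (std::size_t j = 0; j < d; ++j) mean[c][j] /= count[c];
        }

        // Per-class feature variances.
        for (std::size_t i = 0; i < X.size(); ++i) {
            for (std::size_t j = 0; j < d; ++j) {
                const double diff = X[i][j] - mean[y[i]][j];
                var[y[i]][j] += diff * diff;
            }
        }
        for (int c = 0; c < n_classes; ++c) {
            for (std::size_t j = 0; j < d; ++j) {
                // Assumed small floor to avoid division by zero.
                var[c][j] = var[c][j] / count[c] + 1e-9;
            }
            log_prior[c] = std::log(count[c] / static_cast<double>(X.size()));
        }
    }

    // Returns the class with the highest log posterior (up to a constant).
    int predict(const std::vector<double>& x) const {
        const double two_pi = 6.283185307179586;
        int best = 0;
        double best_score = -std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < log_prior.size(); ++c) {
            double score = log_prior[c];
            for (std::size_t j = 0; j < x.size(); ++j) {
                const double diff = x[j] - mean[c][j];
                score -= 0.5 * (std::log(two_pi * var[c][j]) +
                                diff * diff / var[c][j]);
            }
            if (score > best_score) {
                best_score = score;
                best = static_cast<int>(c);
            }
        }
        return best;
    }
};
```

With rows such as (searches per hour, median gap in seconds) and labels 0 for "person" and 1 for "bot", `fit` would be called once on the labelled matrix prepared in R and `predict` once per new row. The feature choice and labels here are placeholders, not results from this task.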
mpopov created this task. · Oct 28 2016, 7:13 PM
Restricted Application added a subscriber: Aklapper. · Oct 28 2016, 7:13 PM

Documenting my progress in these notes: https://github.com/bearloga/learning-rcpp

My latest endeavor was figuring out how to use mlpack's k-means in R. The next step is learning to serialize/de-serialize C++ objects, which will let me store a trained classifier in R itself, rather than keeping it in RAM on the C++ side with only a pointer to it in R.
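
As a rough illustration of the serialize/de-serialize idea, the sketch below turns a model's parameters into a plain string and back, using only the standard library. The `ModelParams` struct and both functions are hypothetical stand-ins; this does not use mlpack's own model-saving facilities, which would be the natural choice for a real mlpack classifier.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a trained model's parameters.
struct ModelParams {
    std::vector<double> weights;
};

// Write the element count followed by each value, using enough digits
// for every double to survive a round trip through text.
std::string serialize(const ModelParams& m) {
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    out << m.weights.size();
    for (double w : m.weights) out << ' ' << w;
    return out.str();
}

// Rebuild the parameters from the text produced by serialize().
ModelParams deserialize(const std::string& text) {
    std::istringstream in(text);
    std::size_t n = 0;
    in >> n;
    ModelParams m;
    m.weights.resize(n);
    for (std::size_t i = 0; i < n; ++i) in >> m.weights[i];
    assert(!in.fail());  // the text must contain exactly what was promised
    return m;
}
```

A string like this could be handed back to R as an ordinary character value and saved alongside other R objects, so the model no longer lives only behind a pointer into C++ memory.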

mpopov removed the point value for this task.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov changed the title from "[EPIC] Learning C++/Rcpp through probabilistic bot detection" to "[EPIC] Learning C++/Rcpp through ML bot detection". · Oct 28 2016, 8:42 PM
mpopov edited the task description.
mpopov edited the task description. · Oct 29 2016, 12:17 AM
mpopov changed the title from "[EPIC] Learning C++/Rcpp through ML bot detection" to "[EPIC] Detect bots from searches (and learn C++/Rcpp)". · Oct 31 2016, 6:57 PM
debt triaged this task as "Normal" priority. · Nov 10 2016, 9:17 PM

@TJones, @chelsyx, and I discussed this today, and we're starting to think it would be more useful to classify users as "typical" vs "atypical" rather than "bot" vs "person", because there may be real people using search in really weird and highly specific ways that would still negatively impact our metrics and make it difficult for us to gauge how well we're serving regular users.

debt closed this task as "Invalid". · Tue, Mar 28, 8:15 PM
debt added a subscriber: debt.

We'll go ahead and close this as invalid, since it'll be extremely hard to tell bot from person. This is based on the chat earlier this year.