The goal of this task is two-fold:
1. Learn C++ and Rcpp (an important skill set for at least one analyst on the Discovery Analysis team to have), and
2. Produce a classifier (the deliverable) that predicts whether a given search was most likely made by a bot, using predictors like
- User-agent / accept-language metadata
- Country of origin
- Query features
- Whether the search yielded zero results (maybe?)
- Number of searches made by that user per hour
- Time elapsed between searches
- Number of valid words in the search query, as determined using the dictionary picked by TextCat (possibly?)
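To make the per-user behavioral predictors concrete (searches per hour, time elapsed between searches), here is a self-contained C++ sketch of how they might be computed from a user's sorted search timestamps. The function names and the epoch-seconds representation are assumptions for illustration, not part of the deliverable:

```cpp
#include <cstddef>
#include <vector>

// Mean time (seconds) between consecutive searches by one user.
// Timestamps are epoch seconds, assumed sorted ascending.
double mean_inter_search_gap(const std::vector<long>& ts) {
    if (ts.size() < 2) return 0.0;
    double total = 0.0;
    for (std::size_t i = 1; i < ts.size(); ++i)
        total += static_cast<double>(ts[i] - ts[i - 1]);
    return total / static_cast<double>(ts.size() - 1);
}

// Search rate (searches/hour) over the span from first to last search.
// Degenerate spans (fewer than 2 searches, or all at the same second)
// fall back to the raw count.
double searches_per_hour(const std::vector<long>& ts) {
    if (ts.size() < 2) return static_cast<double>(ts.size());
    double span_sec = static_cast<double>(ts.back() - ts.front());
    if (span_sec <= 0.0) return static_cast<double>(ts.size());
    return static_cast<double>(ts.size()) / (span_sec / 3600.0);
}
```

Extremely small gaps and high hourly rates are exactly the signatures we would expect automated traffic to show.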
This work will allow us to identify and filter out probable bot searches that affect our zero results rate (ZRR) but are not explicitly labelled as bot searches by our current heuristics, which rely on a hand-picked list of user agents.
We expect the training and classification process to be:
1. Query [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/avro/mediawiki/CirrusSearchRequestSet/CirrusSearchRequestSet.idl | CirrusSearchRequestSet ]] in Hive and do as much request refinement there as possible
2. Perform additional processing in R (such as creating a properly formatted matrix of predictors, augmenting data)
3. Use Rcpp to hook into a C++ machine learning library that includes a classifier
- **Why a C++ ML library when R already has many ML packages?** We deal with a very large volume of searches, so speed is important. It also makes for a good end-goal :)
- **Which library?** Well, there's:
- [[ http://www.mlpack.org/ | mlpack ]] (implements Naive Bayes), which is what I'm currently focusing on
- [[ http://image.diku.dk/shark/ | Shark ]] (implements random forest, neural network)
- [[ http://dlib.net/ | dlib ]] (implements SVM)
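To illustrate what step 3 would be wrapping, here is a minimal two-class Gaussian naive Bayes in plain, dependency-free C++. This is a stand-in sketch only: the real work would call mlpack's naive Bayes implementation through Rcpp, and the `NaiveBayes` struct and its fit/predict interface here are hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy two-class Gaussian naive Bayes: one mean/variance per class per
// feature, plus class log-priors. Illustrative stand-in for mlpack.
struct NaiveBayes {
    std::vector<std::vector<double>> mean, var;  // [class][feature]
    std::vector<double> log_prior;               // [class]

    void fit(const std::vector<std::vector<double>>& X,
             const std::vector<int>& y, int n_classes) {
        std::size_t d = X[0].size();
        mean.assign(n_classes, std::vector<double>(d, 0.0));
        var.assign(n_classes, std::vector<double>(d, 0.0));
        std::vector<double> count(n_classes, 0.0);
        // Per-class feature means.
        for (std::size_t i = 0; i < X.size(); ++i) {
            count[y[i]] += 1.0;
            for (std::size_t j = 0; j < d; ++j) mean[y[i]][j] += X[i][j];
        }
        for (int c = 0; c < n_classes; ++c)
            for (std::size_t j = 0; j < d; ++j) mean[c][j] /= count[c];
        // Per-class feature variances (with a small smoothing term).
        for (std::size_t i = 0; i < X.size(); ++i)
            for (std::size_t j = 0; j < d; ++j) {
                double dlt = X[i][j] - mean[y[i]][j];
                var[y[i]][j] += dlt * dlt;
            }
        log_prior.assign(n_classes, 0.0);
        for (int c = 0; c < n_classes; ++c) {
            for (std::size_t j = 0; j < d; ++j)
                var[c][j] = var[c][j] / count[c] + 1e-9;
            log_prior[c] = std::log(count[c] / X.size());
        }
    }

    // Returns the class with the highest posterior log-probability.
    int predict(const std::vector<double>& x) const {
        const double kTwoPi = 6.283185307179586;
        int best = 0;
        double best_lp = -1e300;
        for (std::size_t c = 0; c < mean.size(); ++c) {
            double lp = log_prior[c];
            for (std::size_t j = 0; j < x.size(); ++j) {
                double dlt = x[j] - mean[c][j];
                lp += -0.5 * std::log(kTwoPi * var[c][j])
                      - dlt * dlt / (2.0 * var[c][j]);
            }
            if (lp > best_lp) { best_lp = lp; best = static_cast<int>(c); }
        }
        return best;
    }
};
```

The appeal of the mlpack/Rcpp route is that the R side only needs to hand over the predictor matrix from step 2; training and scoring stay in compiled code.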