Background
In T128118, we analyzed how query features affect the outcome of a query (zero results vs. some results) using the variable importance measures from random forest classification. The report inspired Trey to look into the problem of question marks in greater detail, and has led to us stripping question marks from queries (see T133711 for more details). It will be interesting to see which features float up to the top now that we have eliminated the question mark as a major influence on the zero results rate.
Objective
In this task, you will analyze search queries to determine which query features affect the likelihood of a query returning zero results. You are welcome to use random forests, logistic regression, and/or any other methodology to answer the question.
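As a rough illustration of the logistic regression option, here is a minimal sketch in plain Python that fits a model by batch gradient descent and compares coefficients. The feature names (`has_quotes`, `has_wildcard`) and the toy rows are made up for the example; they are not taken from the search logs or from the original report, and a real analysis would more likely use an established library on the actual deconstructed query features.

```python
import math

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    rows: list of 0/1 feature indicator lists.
    labels: 1 if the query returned zero results, else 0.
    Returns (weights, bias) on the log-odds scale.
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(zero results)
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical feature names and made-up toy data, for illustration only.
FEATURES = ["has_quotes", "has_wildcard"]
toy = [
    # (has_quotes, has_wildcard, zero_results)
    (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 1),
]
rows = [[q, w] for q, w, _ in toy]
labels = [y for _, _, y in toy]

weights, bias = fit_logistic(rows, labels)
for name, w in zip(FEATURES, weights):
    # A larger positive coefficient means higher odds of zero results.
    print(f"{name}: {w:+.3f}")
```

In this toy data, quoted queries get zero results far more often than unquoted ones while the wildcard indicator makes no difference, so the fitted coefficient for `has_quotes` should come out clearly positive and the one for `has_wildcard` close to zero. Unlike random forest variable importance, coefficients also show the direction of the effect, which may be a reason to try both approaches.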
Optional
We can also import TSS2 data into Hive to join with the search logs (see P4095 for more details), which would let you investigate the relationship between query features and clickthrough, letting you answer questions like "when users perform advanced searches and get results, do they click more often than users who perform simple searches?" Ask Mikhail to help you with getting the data if you choose to do this.
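Once the joined data is available, the advanced vs. simple comparison could be sketched as below. This is only an assumption about what the joined records might look like: the field names (`is_advanced`, `n_results`, `clicked`) are hypothetical, and the sample rows are invented for the example rather than drawn from TSS2 or the search logs.

```python
from collections import defaultdict

def clickthrough_by_group(records):
    """Return {group: clickthrough rate} over searches that returned results."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for r in records:
        if r["n_results"] == 0:
            continue  # zero-result searches have nothing to click on
        group = "advanced" if r["is_advanced"] else "simple"
        shown[group] += 1
        if r["clicked"]:
            clicked[group] += 1
    return {g: clicked[g] / shown[g] for g in shown}

# Made-up joined records; field names are hypothetical.
sample = [
    {"is_advanced": True, "n_results": 12, "clicked": True},
    {"is_advanced": True, "n_results": 3, "clicked": False},
    {"is_advanced": True, "n_results": 0, "clicked": False},
    {"is_advanced": False, "n_results": 20, "clicked": True},
    {"is_advanced": False, "n_results": 8, "clicked": False},
    {"is_advanced": False, "n_results": 5, "clicked": False},
    {"is_advanced": False, "n_results": 40, "clicked": False},
]

rates = clickthrough_by_group(sample)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

Restricting to searches with at least one result matches the "get results" condition in the question above; a real comparison would also need enough sessions per group to tell a genuine difference from noise.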
Tips & Links
- Here is the GitHub repo for the original report's codebase: https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures
- You will need to use the search deconstructor UDF in Hive. See instructions for using UDFs. The UDF relies on SearchQuery.java and SearchQueryFeatureRegex.java.
- You might be interested in the variable importance chapter of Gilles Louppe's PhD thesis, "Understanding Random Forests: From Theory to Practice".
- As always, don't hesitate to ask questions or to ask for help/clarification :D