Summary
Currently, the CirrusSearch API provides a morelike feature that returns a list of similar pages that have been calculated upfront by Apache Lucene's More Like This. The goal of this task is to expose the relatedness information calculated by the Apache Flink-based Citolytics project in the same way. A good query prefix would be citolytics:"Pagetitle". Citolytics would be an additional source for the read more feature (see the Related Pages project page). Because its algorithmic approach differs from the current morelike system, we hope to increase user engagement by providing better recommendations.
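To illustrate how a client might use the proposed prefix, here is a minimal sketch that builds a search API request in the same way morelike queries are sent today (through the srsearch parameter of list=search). The endpoint URL, the function name build_search_url and the default limit are assumptions made for illustration; only the citolytics:"Pagetitle" syntax comes from this task.

```python
from urllib.parse import urlencode


def build_search_url(api_endpoint, page_title, keyword="citolytics", limit=3):
    """Build a MediaWiki search API URL for a keyword-prefixed query.

    Hypothetical helper: the keyword defaults to the prefix proposed in
    this task; passing keyword="morelike" yields the existing query form.
    """
    params = {
        "action": "query",
        "list": "search",
        # Proposed syntax from the task: citolytics:"Pagetitle"
        "srsearch": f'{keyword}:"{page_title}"',
        "srlimit": limit,
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # The endpoint below is a placeholder, not a real deployment.
    print(build_search_url("https://example.org/w/api.php", "Albert Einstein"))
```

Keeping the keyword as a parameter makes it easy to request both sources for the same page and compare their results.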
Implementation
- The article recommendations can be integrated with a custom KeywordFeature, e.g. CitolyticsKeywordFeature, that is triggered by the citolytics: prefix and modifies the search query.
- Recommendation data can be stored in an additional field of the CirrusSearch index.
- The Flink job that generates the recommendations from a Wiki XML dump can write its output in ES bulk format. This output can be loaded into the CirrusSearch index either manually or automatically from within the Oozie pipeline.
- The pull request can be found on Gerrit: https://gerrit.wikimedia.org/r/#/c/329626/
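As a rough illustration of the ES bulk format mentioned above, the following sketch turns a mapping of page IDs to recommended titles into newline-delimited bulk update actions. The field name citolytics, the use of page IDs as document IDs and the shape of the input are assumptions; the actual output of the Flink job may differ.

```python
import json


def to_bulk_lines(recommendations, field="citolytics"):
    """Serialize recommendations as Elasticsearch bulk "update" actions.

    recommendations: dict mapping a page ID to a list of related page titles
    (hypothetical input shape). Each entry becomes two lines: an action line
    naming the document and a partial document that sets the extra field.
    """
    lines = []
    for page_id, related in recommendations.items():
        lines.append(json.dumps({"update": {"_id": str(page_id)}}))
        lines.append(json.dumps({"doc": {field: related}}))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {42: ["Page A", "Page B"]}  # made-up example data
    print(to_bulk_lines(sample), end="")
```

Using partial updates rather than full index actions would leave the other fields of an existing CirrusSearch document untouched, which is why the sketch uses "update" with a "doc" body.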
Demo
- A MediaWiki setup that demonstrates the feature based on simplewiki can be found here: http://citolytics-demo.wmflabs.org/
- A guide for setting up the demo can be found here: https://github.com/mschwarzer/citolytics-demo/