Top-level task for organizing the backend data engineering needed to support the learning-to-rank pipeline.
High level design:
- Join webrequest data with the CirrusSearch request set to create a table of search click-throughs
- Train a DBN on a suitably sized sample of the click-throughs and record the resulting relevance labels in a Hive table
- Generate elasticsearch queries for the search + page id combinations we want feature scores for, and push them into Kafka from the analytics network
- A Kafka consumer in the production network reads the queries inserted by analytics, runs them against an elasticsearch cluster, and pushes the generated feature scores into a separate Kafka topic
- Read the feature topic back from Kafka and store it in HDFS
- Join the DBN labels with the elasticsearch feature scores into a combined dataset, output in a format suitable for training a machine learning model (XGBoost, RankLib, or LightGBM)
- Train models against that dataset, generating output suitable for loading into the elasticsearch LTR plugin
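The first step above amounts to a join keyed on the token that ties a click event back to the search request that produced it. A minimal sketch of that shape, in plain Python rather than the Hive/Spark job that would actually run it — the field names (`request_set_token`, `clicked_page_id`, `hit_page_ids`) are illustrative assumptions, not the real schema:

```python
# Sketch of the click-through join: pair each CirrusSearch request with the
# pages later clicked from its result set. Field names are assumed, not the
# actual webrequest / CirrusSearch request set schema.

def join_clickthroughs(cirrus_requests, webrequests):
    """Emit one row per search request, carrying its query, hits, and clicks."""
    # Index click events by the token linking a click back to its search.
    clicks_by_token = {}
    for wr in webrequests:
        clicks_by_token.setdefault(wr["request_set_token"], []).append(
            wr["clicked_page_id"]
        )
    return [
        {
            "query": cr["query"],
            "hit_page_ids": cr["hit_page_ids"],
            "clicked_page_ids": clicks_by_token.get(cr["request_set_token"], []),
        }
        for cr in cirrus_requests
    ]

cirrus = [{"request_set_token": "t1", "query": "apple", "hit_page_ids": [10, 11, 12]}]
web = [{"request_set_token": "t1", "clicked_page_id": 11}]
print(join_clickthroughs(cirrus, web))
```

The resulting rows are what the DBN samples from: each row is a query, the pages shown, and the pages clicked.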
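The Kafka round trip (queries out of analytics, feature scores back from production) is essentially a message contract. A runnable sketch of that contract using in-process queues in place of Kafka topics — topic layout, message fields, and the scoring function are all assumptions; the real consumer would execute the queries against an elasticsearch cluster:

```python
import json
from collections import deque

query_topic = deque()    # stands in for the analytics -> production topic
feature_topic = deque()  # stands in for the production -> analytics topic

def produce_queries(pairs):
    """Analytics side: push the (query, page_id) combinations to score."""
    for query, page_id in pairs:
        query_topic.append(json.dumps({"query": query, "page_id": page_id}))

def fake_feature_scores(query, page_id):
    # Placeholder for running the generated query against elasticsearch.
    return {"tf_idf": float(len(query)), "page_id_mod": page_id % 7}

def consume_and_score():
    """Production side: read queries, score them, emit feature rows."""
    while query_topic:
        msg = json.loads(query_topic.popleft())
        features = fake_feature_scores(msg["query"], msg["page_id"])
        feature_topic.append(json.dumps({**msg, "features": features}))

produce_queries([("apple", 10), ("apple", 11)])
consume_and_score()
rows = [json.loads(m) for m in feature_topic]
print(rows)
```

The rows drained from the feature topic are what lands in HDFS, keyed by query and page id so they can later be joined with the DBN labels.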
All of the above should be automated to the point where the first step can be kicked off and everything else does its job, with trained models coming out the other end. It also needs to be relatively easy to change the set of generated features, so that different features can be evaluated for the machine learning models without re-running the first few steps.
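One way to get the "swap features without re-running earlier steps" property is to make the feature set a named configuration that only the feature-collection and training steps depend on. A sketch under assumed names (neither the config format nor the feature names are real):

```python
# Sketch: feature sets as named config. The click-through join and DBN labels
# are keyed only by (query, page_id), so changing the feature set below only
# re-runs elasticsearch feature collection and model training.

FEATURE_SETS = {
    "baseline": ["title_match", "popularity"],
    "experiment_v2": ["title_match", "popularity", "incoming_links"],
}

def queries_for_feature_set(name, query_page_pairs):
    """Expand (query, page_id) pairs into per-feature scoring requests."""
    return [
        {"query": q, "page_id": p, "feature": f}
        for q, p in query_page_pairs
        for f in FEATURE_SETS[name]
    ]

reqs = queries_for_feature_set("baseline", [("apple", 10)])
print(reqs)  # two requests: one per feature in the baseline set
```

Evaluating a new feature set then means adding an entry to the config and re-running only the Kafka/feature-collection and training stages against the existing labels.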