
backend data engineering and plumbing for LTRank
Closed, ResolvedPublic

Description

Top level task for organizing backend data engineering for supporting the learn to rank pipeline.

High level design:

  • Join webrequest data with the CirrusSearch request set to create a table containing search click-throughs
  • Train a DBN on a suitably sized sample of the search click-throughs and record relevance labels into a hive table
  • Generate elasticsearch queries for search+page id combinations we want to generate feature labels for and push them into Kafka from analytics network
    • Kafka consumer in production network reads queries inserted by analytics, runs them against some elasticsearch cluster, and the generated feature scores get pushed back into a different Kafka log
    • Feature logs are read back from Kafka and stored in HDFS
  • Join together DBN labels with ES feature generation to make a combined dataset and output in a format suitable for training a machine learning model (XGBoost, RankLib, or LightGBM).
  • Train models against dataset, generating output suitable for loading into elasticsearch LTR plugin
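The first step above, joining webrequest click events against search request logs, can be sketched roughly as follows. This is an illustrative pure-Python stand-in for what would be a Hive/Spark join in practice; the field names (search_id, page_id, query, hit_page_ids) are hypothetical, not the real schema.

```python
# Hypothetical sketch: attach clicked page ids to the search request
# that produced them, keyed on a shared search/request id. In the real
# pipeline this is a join over the webrequest and CirrusSearch tables.

def join_clickthroughs(search_requests, click_events):
    """Return one click-through record per search request."""
    clicks_by_search = {}
    for click in click_events:
        clicks_by_search.setdefault(click["search_id"], []).append(click["page_id"])
    return [
        {
            "query": req["query"],
            "hit_page_ids": req["hit_page_ids"],
            "clicked_page_ids": clicks_by_search.get(req["search_id"], []),
        }
        for req in search_requests
    ]

searches = [{"search_id": "s1", "query": "apple", "hit_page_ids": [10, 11, 12]}]
clicks = [{"search_id": "s1", "page_id": 11}]
result = join_clickthroughs(searches, clicks)
```

The resulting records (query, result set, clicked pages) are exactly what the DBN needs as input to estimate relevance labels.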

All of the above should be automated to the point where the first step can run and everything else does its job, with models coming out the other end. It also needs to be relatively easy to change the set of features generated, so that different features can be evaluated for machine learning models without re-running the first few steps.

Event Timeline

EBernhardson lowered the priority of this task from High to Normal.Apr 3 2017, 3:31 PM
EBernhardson created this task.
EBernhardson updated the task description. (Show Details)Apr 3 2017, 4:16 PM
EBernhardson updated the task description. (Show Details)
EBernhardson added a comment.EditedApr 3 2017, 4:38 PM

There are also some high level architectural questions to answer:

  • Should we continue with writing single scripts in directories of wikimedia/discovery/analytics repository, or should we build a python library that can be distributed as an egg to spark?
  • How do we deploy dependencies of the plugins, such as the clickmodels python library?
  • Do we use the o19s elasticsearch plugin, or do we update search/ltr repository to work with ES5?
EBernhardson updated the task description. (Show Details)Apr 3 2017, 8:30 PM

I'm also not entirely sure about the part that takes the labeled query+page ids, ships them to prod, gets the features back, and then generates datasets suitable for running a learning algorithm. In an ideal world I think this should probably be a single step, from the developer's point of view, such that it is very easy to train models with new sets of features. The pipeline from a feature engineering perspective should ideally be:

  • Define a set of elasticsearch queries we want to use as features
  • Run some command that selects a suitable sample of results, collects features, merges them with the DBN labels, spits out data formatted for a training library, trains with optional hyperparameter optimization, and outputs both a set of performance scores indicating how well this set of features did and a model file that can be loaded into relforge to run manual/external relevance tests.
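The merge step in that command, combining DBN labels with collected feature scores, would emit one line per (query, page) pair in the SVMRank-style text format that XGBoost, RankLib, and LightGBM can all consume. A minimal sketch, with hypothetical names and in-memory dicts standing in for the hive tables:

```python
# Illustrative sketch: join relevance labels with feature vectors and
# format each (query, page) pair as "label qid:N 1:v1 2:v2 ...", the
# common text format for learning-to-rank training libraries.

def to_training_rows(labels, features):
    """labels: {(qid, page_id): relevance}; features: {(qid, page_id): [floats]}"""
    rows = []
    for key in sorted(labels):
        qid, _page_id = key
        feats = " ".join("%d:%g" % (i + 1, v) for i, v in enumerate(features[key]))
        rows.append("%d qid:%d %s" % (labels[key], qid, feats))
    return rows

labels = {(1, 10): 3, (1, 11): 0}
features = {(1, 10): [0.5, 1.25], (1, 11): [0.1, 0.0]}
rows = to_training_rows(labels, features)
```

Keeping all rows for a given qid contiguous matters here, since the ranking objectives in these libraries group examples by query.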

There is also a potential optimization available to generate only "new" features, reusing features we've already collected, but I think it is a bit premature to work that out.

Should we continue with writing single scripts in directories of wikimedia/discovery/analytics repository, or should we build a python library that can be distributed as an egg to spark?

I've been thinking about this, and I think to support multiple use cases, such as automated updates of models in production with new data and feature engineering, the best route here will probably be to build out a python library. We will already have to figure out what the deployment plan looks like to get the clickmodels .egg distributed, so including a second egg doesn't seem that much of an ask. This would allow keeping the python scripts in the wikimedia/discovery/analytics repo quite small: mostly just creating the spark context and calling into the right pieces, and perhaps also handling the paths where things are loaded from and stored to.
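That thin-script/library split could look something like the following. The function and module names are made up for illustration; the point is only that the repo-side script reduces to wiring paths into a library call, with the real work living in the distributable egg.

```python
# Hypothetical sketch of the thin-entry-point pattern: the script in
# wikimedia/discovery/analytics only handles paths and delegates to a
# library function. In the real library this function would create the
# spark context, read the click-through table, run the click model,
# and write relevance labels; here it just echoes its wiring.

def train_dbn(clickthrough_path, output_path):
    """Stand-in for the library-side implementation."""
    return {"input": clickthrough_path, "output": output_path}

def main(argv):
    """What the small repo-side script would reduce to."""
    clickthrough_path, output_path = argv
    return train_dbn(clickthrough_path, output_path)

result = main(["hdfs://example/clickthroughs", "hdfs://example/labels"])
```

One nice property of this split is that the library is testable without a cluster, while the entry scripts stay trivial enough that deployment changes rarely touch them.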

EBernhardson renamed this task from backend data engineering and plumbing for LTR to backend data engineering and plumbing for LTRank.Apr 5 2017, 8:14 PM
debt added a subscriber: debt.

All the work has effectively been done on this ticket and we can open more as needed. Yay! :)

debt closed this task as Resolved.Dec 15 2017, 6:27 PM