
backend data engineering and plumbing for LTRank
Closed, ResolvedPublic

Description

Top level task for organizing backend data engineering for supporting the learn to rank pipeline.

High level design:

  • Join webrequest data with the CirrusSearch request set to create a table containing search click-throughs
  • Train a DBN on a suitably sized sample of the search click-throughs and record relevance labels into a hive table
  • Generate elasticsearch queries for search+page id combinations we want to generate feature labels for and push them into Kafka from analytics network
    • Kafka consumer in production network reads queries inserted by analytics, runs them against some elasticsearch cluster, and the generated feature scores get pushed back into a different Kafka log
    • Feature logs are read back from Kafka and stored in HDFS
  • Join together DBN labels with ES feature generation to make a combined dataset and output in a format suitable for training a machine learning model (XGBoost, RankLib, or LightGBM).
  • Train models against dataset, generating output suitable for loading into elasticsearch LTR plugin
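The first step above, joining webrequest click events against search request logs, can be sketched roughly as follows. This is an illustrative pure-Python stand-in for what would be a Hive/Spark join in practice; the field names (search_id, page_id, query, hit_page_ids) are hypothetical, not the real schema.

```python
# Hypothetical sketch: attach clicked page ids to the search request
# that produced them, keyed on a shared search/request id. In the real
# pipeline this is a join over the webrequest and CirrusSearch tables.

def join_clickthroughs(search_requests, click_events):
    """Return one click-through record per search request."""
    clicks_by_search = {}
    for click in click_events:
        clicks_by_search.setdefault(click["search_id"], []).append(click["page_id"])
    return [
        {
            "query": req["query"],
            "hit_page_ids": req["hit_page_ids"],
            "clicked_page_ids": clicks_by_search.get(req["search_id"], []),
        }
        for req in search_requests
    ]

searches = [{"search_id": "s1", "query": "apple", "hit_page_ids": [10, 11, 12]}]
clicks = [{"search_id": "s1", "page_id": 11}]
result = join_clickthroughs(searches, clicks)
```

The resulting records (query, result set, clicked pages) are exactly what the DBN needs as input to estimate relevance labels.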

All of the above should be automated to the point where the first step can run and everything else does its job, with models coming out the other end. It also needs to be relatively easy to change the set of features generated, so that different features can be evaluated for machine learning models without re-running the first few steps.

Event Timeline

EBernhardson lowered the priority of this task from High to Normal.Apr 3 2017, 3:31 PM
EBernhardson created this task.
EBernhardson updated the task description. (Show Details)Apr 3 2017, 4:16 PM
EBernhardson updated the task description. (Show Details)
EBernhardson added a comment.EditedApr 3 2017, 4:38 PM

There are also some high level architectural questions to answer:

  • Should we continue with writing single scripts in directories of wikimedia/discovery/analytics repository, or should we build a python library that can be distributed as an egg to spark?
  • How do we deploy dependencies of the plugins, such as the clickmodels python library?
  • Do we use the o19s elasticsearch plugin, or do we update search/ltr repository to work with ES5?
EBernhardson updated the task description. (Show Details)Apr 3 2017, 8:30 PM

I'm also not entirely sure about the part that takes the labeled query+page ids, ships them to prod, gets the features back, and then generates datasets suitable for running a learning algorithm. In an ideal world I think this should probably be a single step, from the developer's point of view, such that it is very easy to train models with new sets of features. The pipeline from a feature engineering perspective should ideally be:

  • Define a set of elasticsearch queries we want to use as features
  • Run some command that selects a suitable sample of results, collects features, merges them with the DBN labels, spits out data formatted for a training library, trains with optional hyperparameter optimization, and outputs both a set of performance scores indicating how well this set of features did and a model file that can be loaded into relforge to run manual/external relevance tests.
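The merge step in that command, combining DBN labels with collected feature scores, would emit one line per (query, page) pair in the SVMRank-style text format that XGBoost, RankLib, and LightGBM can all consume. A minimal sketch, with hypothetical names and in-memory dicts standing in for the hive tables:

```python
# Illustrative sketch: join relevance labels with feature vectors and
# format each (query, page) pair as "label qid:N 1:v1 2:v2 ...", the
# common text format for learning-to-rank training libraries.

def to_training_rows(labels, features):
    """labels: {(qid, page_id): relevance}; features: {(qid, page_id): [floats]}"""
    rows = []
    for key in sorted(labels):
        qid, _page_id = key
        feats = " ".join("%d:%g" % (i + 1, v) for i, v in enumerate(features[key]))
        rows.append("%d qid:%d %s" % (labels[key], qid, feats))
    return rows

labels = {(1, 10): 3, (1, 11): 0}
features = {(1, 10): [0.5, 1.25], (1, 11): [0.1, 0.0]}
rows = to_training_rows(labels, features)
```

Keeping all rows for a given qid contiguous matters here, since the ranking objectives in these libraries group examples by query.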

There is also a potential optimization available to generate only "new" features, reusing features we've already collected, but I think it is a bit premature to work that out.

Should we continue with writing single scripts in directories of wikimedia/discovery/analytics repository, or should we build a python library that can be distributed as an egg to spark?

I've been thinking about this, and I think to support multiple use cases, such as automated updates of models in production with new data and feature engineering, the best route here will probably be to build out a python library. We will already have to figure out what the deployment plan looks like to get the clickmodels .egg distributed, so including a second egg doesn't seem that much of an ask. This would allow keeping the python scripts in the wikimedia/discovery/analytics repo quite small: mostly just creating the spark context and calling into the right pieces, and perhaps also handling the paths where things are loaded from and stored to.
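That thin-script/library split could look something like the following. The function and module names are made up for illustration; the point is only that the repo-side script reduces to wiring paths into a library call, with the real work living in the distributable egg.

```python
# Hypothetical sketch of the thin-entry-point pattern: the script in
# wikimedia/discovery/analytics only handles paths and delegates to a
# library function. In the real library this function would create the
# spark context, read the click-through table, run the click model,
# and write relevance labels; here it just echoes its wiring.

def train_dbn(clickthrough_path, output_path):
    """Stand-in for the library-side implementation."""
    return {"input": clickthrough_path, "output": output_path}

def main(argv):
    """What the small repo-side script would reduce to."""
    clickthrough_path, output_path = argv
    return train_dbn(clickthrough_path, output_path)

result = main(["hdfs://example/clickthroughs", "hdfs://example/labels"])
```

One nice property of this split is that the library is testable without a cluster, while the entry scripts stay trivial enough that deployment changes rarely touch them.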

EBernhardson renamed this task from backend data engineering and plumbing for LTR to backend data engineering and plumbing for LTRank.Apr 5 2017, 8:14 PM
debt added a subscriber: debt.

All the work has effectively been done on this ticket and we can open more as needed. Yay! :)

debt closed this task as Resolved.Dec 15 2017, 6:27 PM