
Free the mjolnir datasets
Open, Needs Triage, Public

Description

Mjolnir builds various intermediate datasets while running, but these datasets are locked up inside its bowels. Break the pipeline up into separate jobs that expose the data for other use cases. The split should start at the very front of the pipeline; the final part (the machine learning) should likely stay in mjolnir.

Pieces to split out:

  • The initial data cleaning. This could be a weekly job that takes in the last 80 days of data and emits a clean set of clicks to learn from, written to discovery.query_clicks_ltr.
  • Query clustering. Right now we do a clustering of queries, but offer no way for any other process to use it (or an easy way to swap in a different clustering). This could be a job that reads in discovery.query_clicks_ltr and emits a new table discovery.query_clustering that has the fields <snapshot: string, algorithm: string, wikiid: string, query: string, cluster: int>.
  • Query result labeling. The cleaned data and the clustering information can be joined and the DBN run to label things. Again, this can be a weekly job that writes to a table with fields <snapshot: string, algorithm: string, wikiid: string, query: string, page_id: int, label: float>.

Potentially further pieces of the pipeline could be generalized, but the above would expose the clustering and the labeled (query, page) pairs which could both be useful.
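As a rough illustration of the proposed tables, here is a minimal sketch in plain Python. The field names and types of QueryClustering and QueryLabel are taken from the lists above; ClickRow and its columns are an assumption made only for this example (the task does not specify the columns of discovery.query_clicks_ltr), and join_clicks_to_clusters is a hypothetical helper, not existing mjolnir code. The DBN labeling itself is not shown.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class QueryClustering(NamedTuple):
    """One row of the proposed discovery.query_clustering table."""
    snapshot: str
    algorithm: str
    wikiid: str
    query: str
    cluster: int


class QueryLabel(NamedTuple):
    """One row of the proposed labeled (query, page) output table."""
    snapshot: str
    algorithm: str
    wikiid: str
    query: str
    page_id: int
    label: float


class ClickRow(NamedTuple):
    """Hypothetical shape of a cleaned click row (columns are assumed)."""
    wikiid: str
    query: str
    page_id: int


def join_clicks_to_clusters(
    clicks: Iterable[ClickRow],
    clustering: Iterable[QueryClustering],
) -> Iterator[Tuple[ClickRow, int]]:
    """Attach a cluster id to each click by joining on (wikiid, query).

    Clicks whose query has no cluster assignment are dropped (inner join).
    """
    cluster_of = {(c.wikiid, c.query): c.cluster for c in clustering}
    for click in clicks:
        key = (click.wikiid, click.query)
        if key in cluster_of:
            yield click, cluster_of[key]
```

In a real job the join would presumably run in Spark over the Hive tables; the in-memory version only shows which keys tie the three datasets together.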

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 12 2019, 2:53 AM
EBernhardson updated the task description. · Feb 12 2019, 6:34 PM

I think the main idea will be to keep all of the code that performs transformations in mjolnir, and add oozie jobs to wikimedia/search/analytics. The new jobs can be Python scripts that import mjolnir and run the appropriate transformations; we already build venvs with dependencies for transfer_to_es. The scripts would primarily be concerned with where to load data from and where to store the output. The algorithms would stay in mjolnir.
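A minimal sketch of what such a thin script could look like, assuming the split described above. Everything here is hypothetical: run_job, its command-line flags, and the load/transform/store callables are illustrative names, not existing mjolnir or oozie interfaces. In a real job, transform would be a function imported from mjolnir, and load/store would read and write Hive tables.

```python
import argparse
from typing import Callable, Iterable, List, Sequence

Rows = List[dict]


def run_job(
    argv: Sequence[str],
    load: Callable[[str], Rows],
    transform: Callable[[Rows], Iterable[dict]],
    store: Callable[[str, str, Rows], None],
) -> int:
    """Parse I/O locations, run one transformation, and write the result.

    The script only decides where data comes from and where it goes;
    the algorithm is passed in as `transform` (it would live in mjolnir).
    Returns the number of rows written.
    """
    parser = argparse.ArgumentParser(description="Hypothetical dataset job")
    parser.add_argument("--input-table", required=True)
    parser.add_argument("--output-table", required=True)
    parser.add_argument("--snapshot", required=True)
    args = parser.parse_args(argv)

    rows = load(args.input_table)
    result = list(transform(rows))
    store(args.output_table, args.snapshot, result)
    return len(result)
```

Passing the transformation in as a parameter keeps the script free of algorithm code, which matches the idea that only the loading and storing logic belongs in the job repository.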

EBernhardson updated the task description. · Feb 12 2019, 6:41 PM