Mjolnir builds various intermediate datasets while running, but these datasets are locked up inside its bowels. Break the pipeline up into separate jobs that expose the data for other use cases. The split should start at the very front of the pipeline; the final part of the pipeline (the machine learning) should likely stay in mjolnir.
Pieces to split out:
- The initial data cleaning. This could be a weekly job that takes in the last 80 days of data and writes a clean set of clicks to learn from into discovery.query_clicks_ltr.
- Query clustering. Right now we cluster queries, but offer no way for any other process to use the result, and no easy way to swap in a different clustering. This could be a job that reads in discovery.query_clicks_ltr and emits a new table discovery.query_clustering that has the fields <snapshot: string, algorithm: string, wikiid: string, query: string, cluster: int>.
- Query result labeling. The cleaned data and the clustering information can be joined and the DBN run to label things. Again, this can be a weekly job that outputs to a table with the fields <snapshot: string, algorithm: string, wikiid: string, query: string, page_id: int, label: float>.
Potentially further pieces of the pipeline could be generalized, but the above would expose the clustering and the labeled (query, page) pairs, both of which could be useful.
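To make the proposed hand-offs concrete, here is a rough sketch in plain Python of how the clustering and labeling jobs could consume and emit rows with the fields listed above. Everything other than those field names is an assumption for illustration: the Click row shape, the function names, the snapshot value, the "lowercase_ws" clustering (queries equal after lowercasing and whitespace collapsing share a cluster) and the click-through-rate label, which is only a stand-in for the DBN. It is not how mjolnir currently does either step.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    # Hypothetical row shape for discovery.query_clicks_ltr.
    wikiid: str
    query: str
    page_id: int
    clicked: bool


class ClusterRow(NamedTuple):
    # Fields proposed for discovery.query_clustering.
    snapshot: str
    algorithm: str
    wikiid: str
    query: str
    cluster: int


class LabelRow(NamedTuple):
    # Fields proposed for the labeling output table.
    snapshot: str
    algorithm: str
    wikiid: str
    query: str
    page_id: int
    label: float


def cluster_queries(clicks, snapshot, algorithm="lowercase_ws"):
    """Stand-in clustering: one row per distinct (wikiid, query)."""
    cluster_ids = {}
    seen = set()
    rows = []
    for c in clicks:
        if (c.wikiid, c.query) in seen:
            continue
        seen.add((c.wikiid, c.query))
        normalized = (c.wikiid, " ".join(c.query.lower().split()))
        cluster = cluster_ids.setdefault(normalized, len(cluster_ids))
        rows.append(ClusterRow(snapshot, algorithm, c.wikiid, c.query, cluster))
    return rows


def label_results(clicks, clusters, snapshot):
    """Stand-in labeling: click-through rate per (cluster, page), not the DBN."""
    by_query = {(r.wikiid, r.query): r for r in clusters}
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for c in clicks:
        key = (c.wikiid, by_query[(c.wikiid, c.query)].cluster, c.page_id)
        shown[key] += 1
        clicked[key] += c.clicked
    rows = []
    emitted = set()
    for c in clicks:
        if (c.wikiid, c.query, c.page_id) in emitted:
            continue
        emitted.add((c.wikiid, c.query, c.page_id))
        r = by_query[(c.wikiid, c.query)]
        key = (c.wikiid, r.cluster, c.page_id)
        rows.append(LabelRow(snapshot, r.algorithm, c.wikiid, c.query,
                             c.page_id, clicked[key] / shown[key]))
    return rows


# Made-up cleaned clicks, standing in for the output of the cleaning job.
clicks = [
    Click("enwiki", "Thor", 1, True),
    Click("enwiki", "thor ", 1, False),
    Click("enwiki", "thor ", 2, True),
    Click("enwiki", "loki", 3, False),
]
clusters = cluster_queries(clicks, snapshot="20180101")
labels = label_results(clicks, clusters, snapshot="20180101")
```

Because the labeling step only depends on the clustering rows and not on how they were produced, a different clustering could be swapped in by writing rows with another algorithm value, which is the main point of splitting the jobs.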