Standardize research pipelines - Dataset generation
Open, High, Public

Description

The research team produces datasets for production use cases. There are existing datasets that are generated in an ad hoc fashion in Jupyter notebooks, as well as new datasets for FY24. The goal of this task is to standardize how the research team implements dataset pipelines:

  • GitLab repositories are used to maintain/share the pipeline code
  • GitLab features like CI / package registry are used to run tests and build artifacts for distributed compute
  • datasets are stored in appropriate production environments (e.g. ML training/evaluation datasets in the research HDFS folder, report datasets in dedicated Hive databases, etc.)
  • datasets are documented and discoverable (e.g. DataHub)
  • dataset execution is orchestrated using Airflow where appropriate
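
As an illustration only, the criteria above could be tracked per dataset as a simple checklist. The sketch below is not part of any existing research tooling: the class `DatasetPipelineSpec`, its field names, and the helper `missing_requirements` are all hypothetical, and the values used are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetPipelineSpec:
    """Hypothetical description of one dataset pipeline (illustrative field names)."""
    name: str
    gitlab_repo: Optional[str] = None       # repository holding the pipeline code
    ci_configured: bool = False             # tests and artifacts are built by CI
    storage_location: Optional[str] = None  # e.g. an HDFS folder or a Hive table
    catalog_entry: Optional[str] = None     # e.g. a reference to a catalog entry
    needs_orchestration: bool = True        # False for datasets that are not scheduled
    airflow_dag_id: Optional[str] = None    # DAG that runs the pipeline, if any


def missing_requirements(spec: DatasetPipelineSpec) -> List[str]:
    """Return the criteria from the task description that the spec does not meet yet."""
    missing = []
    if not spec.gitlab_repo:
        missing.append("code in a GitLab repository")
    if not spec.ci_configured:
        missing.append("CI for tests and artifacts")
    if not spec.storage_location:
        missing.append("production storage location")
    if not spec.catalog_entry:
        missing.append("documentation / discoverability")
    # Orchestration is only required "where appropriate".
    if spec.needs_orchestration and not spec.airflow_dag_id:
        missing.append("Airflow orchestration")
    return missing


if __name__ == "__main__":
    # Placeholder values; a notebook-only dataset meets none of the criteria.
    notebook_only = DatasetPipelineSpec(name="example_dataset")
    for item in missing_requirements(notebook_only):
        print("missing:", item)
```

A checklist like this only records whether each criterion is claimed to be met; it does not verify that the repository, storage location, or DAG actually exist.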

Event Timeline

leila triaged this task as High priority.

Weekly updates:

  • Created subtask T342915, started design discussions

Weekly updates:

  • Merge request for spark pipeline to create training dataset for revert risk model
  • Started to collect open questions that need to be addressed in the design for a standardized ML dataset generation pipeline

Weekly updates:

  • Dataset generation spark pipeline for revert risk type models complete
  • Started airflow dag implementation for orchestration

Weekly updates:

  • MR for airflow dag for revert risk training dataset pipeline

Weekly updates:

  • Started to collect requirements for validation of training dataset generated via airflow; created T346473 to track.
fkaelin renamed this task from "Standardize research ML pipeline - Dataset generation" to "Standardize research pipelines - Dataset generation". Oct 7 2023, 1:40 AM
fkaelin updated the task description.

Updates:

  • Removed the "ML" from the title and refined the task description, as there is a big overlap with other dataset pipelines that will be productionized
  • T343065: the code to get started has been added to knowledge-integrity-risk-index (via T341777)
  • Advanced planning for a pipeline to generate embeddings at scale. Initial phab: T348367