The research team produces datasets for production use cases. Existing datasets are generated in an ad hoc fashion in Jupyter notebooks, and new datasets are planned for FY24. The goal of this task is to standardize how the research team implements dataset pipelines:
- GitLab repositories are used to maintain and share the pipeline code
- GitLab features like CI and the package registry are used to run tests and build artifacts for distributed compute
- datasets are stored in appropriate production environments (e.g. ML training/evaluation datasets in the research HDFS folder, report datasets in dedicated Hive databases, etc.)
- datasets are documented and discoverable (e.g. via DataHub)
- dataset execution is orchestrated using Airflow where appropriate
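The CI and package-registry points above could look roughly like the sketch below. This is a minimal, illustrative `.gitlab-ci.yml`; the stage names, base image, and Python packaging workflow are assumptions, not the team's actual configuration (the `CI_JOB_TOKEN`, `CI_API_V4_URL`, and `CI_PROJECT_ID` variables are GitLab's standard predefined CI variables):

```yaml
# Hypothetical .gitlab-ci.yml sketch for a dataset pipeline repo.
# Stage names, image, and packaging commands are assumptions for illustration.
stages:
  - test
  - build

test:
  stage: test
  image: python:3.11          # assumed base image
  script:
    - pip install -e ".[test]"
    - pytest tests/

build_and_publish:
  stage: build
  image: python:3.11
  script:
    - pip install build twine
    - python -m build
    # Publish the wheel to the project's GitLab PyPI package registry so it
    # can be installed on distributed-compute workers.
    - >
      TWINE_USERNAME=gitlab-ci-token TWINE_PASSWORD="$CI_JOB_TOKEN"
      twine upload --repository-url
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/packages/pypi" dist/*
  rules:
    - if: $CI_COMMIT_TAG      # build and publish only on tagged releases
```

Publishing only on tags keeps the registry limited to versioned releases, which makes it easier for downstream jobs (e.g. Airflow tasks) to pin a known-good pipeline version.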