The ML team has recently started implementing offline ML workflows. The Research team has built a number of offline ML pipelines, along with tooling and reusable code. The goal of this task is to make the relevant and useful tooling and code developed by Research available in the new ML-team-owned repository. This will facilitate the migration of existing offline ML workflows (e.g. add-a-link, revert risk) from Research to ML (handled in dedicated tasks).
Which components will be included:
- Command API: base models for configuring Airflow seamlessly
- CI configuration: building development conda environments for ML workflows on Spark, running unit tests, and linting
- Unit testing, including snapshot-based Spark tests
- Reusable code: common transformations (stratified sampling, joining with the parent revision, model evaluation, etc.)
- Style guide: a proposal for the ML team based on the style guide created by Research, which focuses on easy reuse of code between development/experimental work in Jupyter notebooks and production workloads
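To illustrate the kind of reusable transformation listed above, here is a minimal sketch of stratified sampling in plain Python. All names here are hypothetical, and the actual implementation in ml-pipelines would most likely operate on Spark DataFrames (e.g. via `DataFrame.sampleBy`); this sketch only shows the idea of sampling a fixed fraction from each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Sample roughly `fraction` of rows from each stratum.

    rows: iterable of records (e.g. dicts)
    key: function mapping a record to its stratum (e.g. wiki, label)
    fraction: target sampling fraction per stratum
    seed: optional seed for reproducible samples
    """
    rng = random.Random(seed)

    # Group records by stratum.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    # Sample each stratum independently, keeping at least one record
    # so small strata are not dropped entirely.
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

A typical use would be drawing, say, 10% of labeled revisions per wiki while preserving the relative class balance within each wiki; keeping the sampling seeded makes the resulting datasets reproducible across pipeline runs.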
What repos will be affected: code will be migrated from research-datasets to ml-pipelines, and a new file will be added to the airflow-dags repo for the ml instance. The code for all components will be added in a separate MR, to be reviewed and evaluated by the ML team.
What is not included: migration of the workflows themselves (e.g. training for the revert-risk and add-a-link models, inference pipelines, etc.). These can be migrated by the engineers responsible for the specific tasks (tracked and prioritized separately). Having the code from this task in place will facilitate those migrations, as well as future handovers from Research to ML.