As part of T391940 (FY2024-25 Q4 Goal: Productionize tone check model), the ML team would like to create a reproducible and maintainable training pipeline using WMF's ML Airflow instance.
This will make model training iterations easier and the process less error-prone, since the code will be version-controlled and changes will go through code review.
We'll tackle the following:
- answer initial planning questions based on the ML team's discussions:
  - Why do we use Airflow, and what are its main benefits?
  - Which repo are we going to use as a codebase? What practices are followed in WMF?
  - What data access do we need to tackle this work?
  - What steps do we need to perform in each DAG?
  - What is the status of the Airflow instances? What is the difference between the old and the k8s Airflow instances? Which Airflow instance are we going to use?
  - Which Airflow operators should we use for each of the pipeline steps?
  - How do we ship the code? Do we package everything in a Docker image? How is the DAG logic shipped to the Airflow instance?
- restructure the training notebook and consider a retraining/fine-tuning workflow
- figure out how to use WMFKubernetesPodOperator for model training
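To make the "what steps do we need in each DAG" question concrete, here is a minimal sketch of a training pipeline as a chain of plain Python callables, where each callable would map to one Airflow task (and the heavy steps would run inside a pod via WMFKubernetesPodOperator). The step names, their order, and their bodies are assumptions for illustration, not the team's actual pipeline:

```python
# Hypothetical pipeline steps for a tone-check training DAG.
# All names and step contents below are illustrative assumptions;
# the real DAG would be defined with Airflow operators instead.
from typing import Callable


def fetch_training_data(ctx: dict) -> dict:
    # Placeholder: the real task would read labeled tone data from
    # whichever data source the team is granted access to.
    ctx["examples"] = [("some wikitext", 0), ("other wikitext", 1)]
    return ctx


def train_model(ctx: dict) -> dict:
    # Placeholder: retrain or fine-tune the tone-check model on the examples.
    ctx["model"] = {"n_examples": len(ctx["examples"])}
    return ctx


def evaluate_model(ctx: dict) -> dict:
    # Placeholder: compute held-out metrics before publishing anything.
    ctx["metrics"] = {"evaluated": True}
    return ctx


def publish_model(ctx: dict) -> dict:
    # Placeholder: upload the model artifact to storage for serving.
    ctx["published"] = True
    return ctx


# Airflow would express this linear chain as task dependencies
# (fetch >> train >> evaluate >> publish) rather than a Python loop.
STEPS: list[Callable[[dict], dict]] = [
    fetch_training_data,
    train_model,
    evaluate_model,
    publish_model,
]


def run_pipeline() -> dict:
    ctx: dict = {}
    for step in STEPS:
        ctx = step(ctx)
    return ctx
```

Structuring the notebook code as independent, single-purpose callables like this is also what makes the restructuring step above tractable: each function becomes one reviewable, individually testable DAG task.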


