The goal is to develop building blocks for the Airflow training pipeline. This task tracks progress and updates toward that goal.
The research team has contributed to this goal, including:
- The revert risk training workflow, implemented in this research-datasets branch.
- The revert risk training DAG.
- An example notebook for training.
Note that the training code uses the PySpark integration for XGBoost, which is incompatible with AMD GPUs. As a result, the Hadoop GPU is unused for now.
I plan to contribute the following:
- adding a function in the evaluation step to compare the metrics of the newly trained model with those of the production model on Lift Wing.
- adding a train_bert_model in the training step that can use the Hadoop GPU (now available in YARN's gpus queue, see this patch).
- adding a component publish_model in the revert risk DAG to publish the new model.
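
As a rough illustration of the metric comparison planned for the evaluation step, here is a minimal sketch. It is not the actual implementation: the function name compare_metrics, the tolerance parameter, and the metric names are all hypothetical. It assumes both sets of metrics are already available as plain dictionaries and that higher values are better for every metric. How the production model's metrics are obtained from Lift Wing is out of scope here.

```python
from typing import Dict, Mapping


def compare_metrics(
    new: Mapping[str, float],
    prod: Mapping[str, float],
    tolerance: float = 0.0,
) -> Dict[str, object]:
    """Compare a candidate model's metrics with the production model's.

    Hypothetical helper: assumes higher is better for every metric and
    only compares metrics that are present in both inputs.
    """
    shared = sorted(set(new) & set(prod))
    if not shared:
        raise ValueError("no metrics in common between the two models")

    # Positive delta means the new model scored higher than production.
    deltas = {name: new[name] - prod[name] for name in shared}

    # A metric regresses if it drops by more than the allowed tolerance.
    regressions = [name for name in shared if deltas[name] < -tolerance]

    return {
        "deltas": deltas,
        "regressions": regressions,
        "promote": not regressions,
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    candidate = {"f1": 0.82, "roc_auc": 0.90}
    production = {"f1": 0.80, "roc_auc": 0.91}
    print(compare_metrics(candidate, production, tolerance=0.005))
```

The "promote" flag could then gate a downstream step such as publish_model, so that a model that regresses on any shared metric is not published automatically.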