This task is a continuation of T391940: FY2024-25 Q4 Goal: Productionize tone check model
As a machine learning engineer,
I want to deploy to production a model that detects peacock language by parsing the content of a paragraph and highlighting the problematic words.
I also want a reproducible way to generate and improve the model, making iteration easier and the process less error-prone by version-controlling the code and putting it through code review.
Description
Event Timeline
Update
- Ported the tone-check retraining code from the notebook ✅
- Added CI/CD Kokkuri pipelines for testing the image and publishing it to docker-registry.wikimedia.org/repos/machine-learning/ml-pipelines, using the amd-pytorch23:2.3.0rocm6.0 base image ✅
- Tested locally with a python:3.11-slim base image, using mounted volumes to simulate the actual PVC-backed run ✅
- Running it on Airflow using the PVC (work in progress)
I pushed two MRs to implement the ML training pattern from T396495#11151194:
- ML training job logic refactor: The retraining job logic was updated to be fully configurable, with pinned dependencies for reproducibility and a cleaner blubberfile.
- ML training Airflow DAG: The retraining DAG now correctly mounts the PVC in writable mode and uses node affinity to target the MI210 GPU (a sketch of this setup follows below).
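A minimal sketch of that PVC mount and node-affinity setup, assuming the upstream KubernetesPodOperator (the WMF wrapper exposes a similar interface); the claim name, image tag, node label, and entrypoint are all illustrative assumptions, not the actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# PVC mounted read-write so the training job can persist checkpoints and outputs.
volume = k8s.V1Volume(
    name="training-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
        claim_name="tone-check-pvc"  # assumed claim name
    ),
)
volume_mount = k8s.V1VolumeMount(name="training-data", mount_path="/data", read_only=False)

# Node affinity pinning the pod to MI210 GPU nodes; the label key/value are assumptions.
affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
            node_selector_terms=[
                k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="gpu-model", operator="In", values=["MI210"]
                        )
                    ]
                )
            ]
        )
    )
)

with DAG("tone_check_train_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    train = KubernetesPodOperator(
        task_id="train_tone_check",
        image="docker-registry.wikimedia.org/repos/machine-learning/ml-pipelines:latest",  # illustrative tag
        cmds=["python", "train.py"],  # assumed entrypoint
        volumes=[volume],
        volume_mounts=[volume_mount],
        affinity=affinity,
    )
```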
The retraining DAG now runs successfully in Airflow, as shown in the screenshot below:
Started working on tone-check data generation job logic in T404722:
- Test/dev iteration cycles were taking a very long time
- Added logs at major pipeline steps for visibility
- Added a development limit to restrict the number of rows processed and enable much faster development iterations
- Removed the dev limit, since the small sample size resulted in no rows surviving to the end of the pipeline
- Improved resource allocation for faster job processing
In T404722, I pushed two MRs that implement the tone-check data generation pipeline, following the pattern from T396495#11151194:
- Data generation job logic: Incorporated the notebook into the training data generation job logic, which uses:
  - generate_training_data.py: script that performs the main ETL work, including template matching, joining historical data, and transforming it into a structured, model-ready format.
  - split_training_data.py: script that takes the full model-ready dataset and performs a reproducible 80/10/10 split into train, validation, and test sets (a minimal sketch follows below).
- Data generation Airflow DAG: This DAG runs two sequential tasks, generate_training_data and split_training_data, each using the SparkSubmitOperator with tuned memory allocation (sketched below). The tasks run their respective job logic to read tables from the data lake, generate the model-ready training data, and split it into train, validation, and test sets, writing all outputs as parquet files to HDFS.
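A minimal sketch of the reproducible split in split_training_data.py, assuming PySpark and placeholder HDFS paths; the fixed seed is what keeps the 80/10/10 split repeatable across runs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_training_data").getOrCreate()

# Placeholder input path; the real job reads the model-ready dataset from HDFS.
df = spark.read.parquet("hdfs:///tone-check/model_ready_data")

# A fixed seed makes the 80/10/10 split reproducible
# (given the same input data and partitioning).
train, validation, test = df.randomSplit([0.8, 0.1, 0.1], seed=42)

for name, split_df in (("train", train), ("validation", validation), ("test", test)):
    split_df.write.mode("overwrite").parquet(f"hdfs:///tone-check/splits/{name}")
```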
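And a sketch of the DAG wiring, assuming the stock SparkSubmitOperator; the connection ID and memory figures are placeholders rather than the actual tuned values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("tone_check_data_generation_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    generate = SparkSubmitOperator(
        task_id="generate_training_data",
        application="generate_training_data.py",
        conn_id="spark_default",  # assumed connection ID
        driver_memory="8g",       # illustrative memory tuning
        executor_memory="8g",
    )
    split = SparkSubmitOperator(
        task_id="split_training_data",
        application="split_training_data.py",
        conn_id="spark_default",
        driver_memory="4g",
        executor_memory="4g",
    )
    generate >> split  # the split only runs after generation succeeds
```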
The data generation DAG now runs successfully in Airflow, as shown in the screenshot below:
I pushed an MR to implement the HDFS to PVC copy pattern from T396495#11151194:
- Copy Airflow DAG: This DAG copies the tone-check training data files from HDFS to a PVC mount, making them available for downstream model training and evaluation tasks (sketched below).
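A hedged sketch of the copy task, assuming a pod image with an HDFS client installed and the same PVC mount as the training DAG; all names and paths are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

volume = k8s.V1Volume(
    name="training-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="tone-check-pvc"),
)
volume_mount = k8s.V1VolumeMount(name="training-data", mount_path="/data", read_only=False)

with DAG("tone_check_copy_hdfs_to_pvc_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    copy_data = KubernetesPodOperator(
        task_id="copy_hdfs_to_pvc",
        image="hdfs-client:latest",  # assumed image with the hdfs CLI available
        cmds=["hdfs", "dfs", "-copyToLocal", "-f", "hdfs:///tone-check/splits", "/data/"],
        volumes=[volume],
        volume_mounts=[volume_mount],
    )
```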
The copy DAG now runs successfully in Airflow, as shown in the screenshot below:
- Updated the tone-check training pipeline to use parquet files instead of CSV (job logic, DAGs)
- Connected the tone-check data-generation, data-copy, and model-training DAGs using the TriggerDagRunOperator (MR; see the sketch after this list)
- Investigated why tone_check_copy_hdfs_to_pvc_dag failed in prod even though it worked in staging (T406302#11244807)
- Fixed the import order in mw_content_xml_export_dags (MR)
- Followed up on Slack for reviews
- Tested TriggerDagRunOperator with stripped-down tone-check DAGs in prod and ran into permission issues (T406302#11249205)
- Followed up with DPE SRE on Slack to fix permissions in prod
- Our DAGs were granted permission by DPE SRE to spin up pods in the airflow-ml instance (T406302#11250084)
- The model training task failed because it requested a GPU while all available GPUs were in use by other jobs (T406302#11252998)
- Enabled deferrable execution for the training operator to wait for GPU resources gracefully (see the sketch after this list) and discovered that the Airflow triggerer process was not running (T406302#11258029)
- Following SRE advice, we manually started the triggerer by execing into the scheduler pod, but the DAG task failed because of a kube-config issue (P83720)
- Requested SRE to enable and properly configure the Airflow triggerer process to support deferrable operators (T406958)
- Also had a follow-up discussion on Slack in case there were other solutions to handle this issue
- A fix is in progress
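The TriggerDagRunOperator sketch referenced above: a minimal example of the final task of the data-generation DAG handing off to the copy DAG, which in turn would end with a trigger for the training DAG (DAG IDs assumed from the task names in this update):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG("tone_check_data_generation_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # Final task of the data-generation DAG: kick off the HDFS -> PVC copy DAG.
    trigger_copy = TriggerDagRunOperator(
        task_id="trigger_copy_dag",
        trigger_dag_id="tone_check_copy_hdfs_to_pvc_dag",
        wait_for_completion=True,  # surface downstream failures in this DAG run
    )
```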
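On the deferrable-execution item above: recent versions of the cncf-kubernetes provider expose a deferrable flag on KubernetesPodOperator that hands the wait over to the Airflow triggerer process instead of occupying a worker slot, which is why a missing triggerer breaks it. A sketch with placeholder arguments:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("tone_check_train_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    train = KubernetesPodOperator(
        task_id="train_tone_check",
        image="ml-pipelines:latest",  # placeholder image
        cmds=["python", "train.py"],  # assumed entrypoint
        # Deferrable mode frees the worker while the pod is pending/running,
        # but it only works if the Airflow triggerer process is running.
        deferrable=True,
    )
```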
- Started merging the tone-check DAGs into a single tone_check_training_dag for simplified orchestration (T407212)
- Tested the tone_check_training_dag in airflow-devenv and the data-generation and data-copy tasks ran successfully (T407212#11271755)
- The train_tone_check task failed with an OOM issue (P83875)
- Ran the tone-check training job locally with model-ready training data to measure memory usage, since the 8 GB limit in WMF Airflow caused the job to fail (T407212#11280133)
- Tested GPU node labels that were set up by SRE. (T373806#11275873)
- With the GPU node selector in place, the tone-check model training task completed in 8 hours in staging on an MI210 GPU with 64 GB of VRAM (P83966; see the sketch after this list)
- Pushed an MR to prod containing the single DAG for the tone-check training pipeline.
- Copied tone-check base model from HDFS to PVC in prod. (MR)
- Followed up to confirm that the triggerer process enabled by DPE SRE works as expected in both airflow-ml and airflow-devenv (T406958#11288441)
- The tone-check training DAG ran and completed end-to-end in prod (T407212#11288359)
- Wrote first version of the airflow ML Pipelines docs: https://wikitech.wikimedia.org/wiki/Machine_Learning/Airflow_ML_Pipelines
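The GPU scheduling sketch referenced above, again assuming the upstream KubernetesPodOperator; the node label, the memory sizing, and the amd.com/gpu resource name (the standard AMD device plugin's resource) are assumptions, not the exact prod values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Request one AMD GPU and enough memory to clear the 8 GB ceiling that caused the OOM.
resources = k8s.V1ResourceRequirements(
    requests={"memory": "16Gi", "amd.com/gpu": "1"},  # illustrative sizing
    limits={"memory": "16Gi", "amd.com/gpu": "1"},
)

with DAG("tone_check_train_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    train = KubernetesPodOperator(
        task_id="train_tone_check",
        image="ml-pipelines:latest",           # placeholder image
        cmds=["python", "train.py"],           # assumed entrypoint
        container_resources=resources,
        node_selector={"gpu-model": "MI210"},  # assumed SRE-provisioned node label
    )
```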
Update
With T406217 completed and deployed, the tone-check retraining pipeline is in place.
The current Airflow pipeline for tone-check looks like this:
Generate Data (SparkSubmitOperator) -> Train/Validation/Test split (SparkSubmitOperator) -> Copy from HDFS to a PVC (WMFKubernetesPodOperator) -> Train model on GPU pod (WMFKubernetesPodOperator) -> Copy retrained model to S3 (PythonOperator)
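A skeleton of that consolidated DAG, assuming the stock operators in place of the WMF wrappers; images, application paths, and the S3 upload body are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator


def copy_model_to_s3() -> None:
    """Placeholder: upload the retrained model artifacts to S3 (e.g. via boto3)."""


with DAG("tone_check_training_dag", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    generate = SparkSubmitOperator(task_id="generate_training_data", application="generate_training_data.py")
    split = SparkSubmitOperator(task_id="split_training_data", application="split_training_data.py")
    copy_data = KubernetesPodOperator(
        task_id="copy_hdfs_to_pvc",
        image="hdfs-client:latest",  # assumed image with the hdfs CLI
        cmds=["hdfs", "dfs", "-copyToLocal", "-f", "hdfs:///tone-check/splits", "/data/"],
    )
    train = KubernetesPodOperator(
        task_id="train_tone_check",
        image="ml-pipelines:latest",  # placeholder image
        cmds=["python", "train.py"],
        deferrable=True,  # wait for the GPU pod via the triggerer
    )
    upload = PythonOperator(task_id="copy_model_to_s3", python_callable=copy_model_to_s3)

    generate >> split >> copy_data >> train >> upload
```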
The next steps will be planned and organised in dedicated Phabricator tasks, e.g. Evaluation -> Deployment Decision -> A/B Testing -> Continuous Monitoring.



