Goal
Create a standardized training pipeline for the Revert Risk Language Agnostic (RRLA) model.
Tasks
- Generate a training dataset: Adapt the existing Revert Risk Multilingual code to generate training data for RRLA.
- Automate the dataset generation process: Create an Airflow job that builds a new dataset every month from the last 6 months of data.
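As a rough illustration of the monthly refresh, the 6-month window for a given run could be computed as in the sketch below. The function name `training_window` is hypothetical, and aligning the window to full calendar months (with an exclusive end date) is an assumption, not something this task specifies; the actual Airflow job may define the window differently.

```python
from datetime import date


def training_window(run_date: date, months: int = 6) -> tuple[date, date]:
    """Return (start, end) for the training data window.

    Assumption: the window covers the `months` full calendar months
    that precede the month of `run_date`; `end` is exclusive.
    """
    # The window ends at the first day of the run's month (exclusive).
    end = run_date.replace(day=1)
    # Count months from year 0 so the subtraction crosses year boundaries cleanly.
    total_months = end.year * 12 + (end.month - 1) - months
    start = date(total_months // 12, total_months % 12 + 1, 1)
    return start, end


if __name__ == "__main__":
    start, end = training_window(date(2024, 3, 15))
    print(f"Training window: {start} (inclusive) to {end} (exclusive)")
```

A monthly Airflow schedule could pass each run's date into a helper like this, so that every run selects revisions from the same relative window.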
Context
The Revert Risk Language Agnostic model is currently a dependency for several projects and teams: Automoderator (T345092), Wikimedia Enterprise (T345931), and the ORES deprecation.
Our research shows that keeping the models trained with recent data improves their performance significantly.
A standardized training pipeline would also make it easier to introduce model improvements that address some known issues.
Requested but not currently prioritized
- Train model: Based on this research notebook, write code to retrain the RRLA model using the data generated in the previous step.
Instead of prioritizing the above request, we would like to plan and prioritize an abstraction of it that addresses the needs of multiple people on the team. See T351009.