In T386645, we evaluated the existing peacock detection models' effectiveness at a finer granularity using paragraph- and sentence-level data. The research team suggested retraining the peacock detection models for production use, as the current models were not properly serialized.
We'd like to follow the training process outlined in the binary_classification_lm notebook to retrain the mBERT and/or XLM-R Longformer models.
Context:
In Q2 (SDS 1.2.1 B), the research team trained four models for peacock behavior detection to test AI technologies for Wikimedia Foundation products:
- mBERT trained on English (max sequence length 512 tokens)
- mBERT trained on 7 languages* (max sequence length 512 tokens)
- XLM-Roberta trained on 7 languages* (max sequence length 512 tokens)
- XLM-R Longformer trained on 10 languages* (max sequence length 4096 tokens)
Article-level evaluation results (taken from the report):
| Model | Accuracy | F1-score | Precision | Recall |
|---|---|---|---|---|
| XLM-R Longformer | 0.572 | 0.643 | 0.551 | 0.771 |
| mBERT - trained on all langs | 0.608 | 0.527 | 0.663 | 0.437 |
| mBERT - trained on enwiki | 0.534 | 0.388 | 0.565 | 0.296 |
| XLM-Roberta | 0.615 | 0.541 | 0.669 | 0.454 |
*The 7 languages are English, French, Spanish, Japanese, Russian, Chinese, and Arabic.
*The 10 languages are English, French, Spanish, Japanese, Russian, Chinese, Arabic, German, Portuguese, and Dutch.
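As a quick sanity check on the table above, the reported F1 scores can be recomputed from the reported precision and recall via F1 = 2PR / (P + R). A minimal sketch (the dictionary keys are just shorthand for the table rows):

```python
# (precision, recall, reported F1) per model, copied from the
# article-level evaluation table above.
results = {
    "XLM-R Longformer": (0.551, 0.771, 0.643),
    "mBERT - all langs": (0.663, 0.437, 0.527),
    "mBERT - enwiki":    (0.565, 0.296, 0.388),
    "XLM-Roberta":       (0.669, 0.454, 0.541),
}

for model, (precision, recall, f1_reported) in results.items():
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    # Each recomputed F1 matches the reported value to 3 decimals.
    assert round(f1, 3) == f1_reported, (model, f1)
    print(f"{model}: F1 = {f1:.3f}")
```

All four rows check out, so the table's F1 column is internally consistent with its precision and recall columns.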