
Retrain peacock detection model for production use
Closed, Resolved · Public

Description

In T386645, we evaluated the existing peacock detection model's effectiveness at a finer granularity using paragraph/sentence data. The research team suggested retraining the peacock detection model for production use, as the current models were not properly serialized.

We'd like to follow the training process outlined in the binary_classification_lm notebook to retrain mBERT and/or RoBERTa-Longformer models.
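A minimal sketch of what that retraining flow could look like with Hugging Face transformers. The model names, hyperparameters, and the windowing helper are assumptions for illustration, not the notebook's actual code:

```python
# Hedged sketch of the retraining flow (model names, hyperparameters, and
# the windowing helper are illustrative assumptions, not the notebook's code).

ENCODER_MAX_LEN = {
    "bert-base-multilingual-cased": 512,                    # mBERT
    "markussagen/xlm-roberta-longformer-base-4096": 4096,   # XLM-R Longformer
}

def windows(token_ids, max_len=512, stride=None):
    """Split a long token sequence into overlapping windows so that
    articles longer than the encoder limit are not silently truncated."""
    stride = stride or max_len // 2
    if len(token_ids) <= max_len:
        return [token_ids]
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - max_len + stride, stride)]

def train(model_name="bert-base-multilingual-cased"):
    # Requires `transformers`; meant to run on a GPU host, not inline here.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)            # binary: peacock / not peacock
    args = TrainingArguments(output_dir="peacock-detector",
                             per_device_train_batch_size=8,
                             num_train_epochs=3)
    # ... tokenize the labelled paragraphs/sentences into train/eval sets,
    # then: Trainer(model=model, args=args, train_dataset=..., ...).train()
    model.save_pretrained("peacock-detector")  # proper serialization for prod
```

Saving via `save_pretrained` (rather than pickling the notebook session) addresses the serialization problem that motivated this task.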

Context:
In Q2 (SDS 1.2.1 B), the research team trained four models for peacock behavior detection to test AI technologies for Wikimedia Foundation products:

  1. mBERT trained on English - 512 tokens
  2. mBERT trained on 7 languages* - 512 tokens
  3. XLM-Roberta trained on 7 languages* - 512 tokens
  4. XLM-R Longformer trained on 10 languages* - 4096 tokens

The evaluation results at article-level (taken from the report):

Model                         | accuracy | f1-score | precision | recall
XLM-R Longformer              | 0.572    | 0.643    | 0.551     | 0.771
mBERT - trained on all langs  | 0.608    | 0.527    | 0.663     | 0.437
mBERT - trained on enwiki     | 0.534    | 0.388    | 0.565     | 0.296
XLM-Roberta                   | 0.615    | 0.541    | 0.669     | 0.454

*7 languages are English, French, Spanish, Japanese, Russian, Chinese, Arabic
*10 languages are English, French, Spanish, Japanese, Russian, Chinese, Arabic, German, Portuguese, Dutch
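As a quick sanity check, the reported f1-scores are consistent with the precision/recall columns (F1 is the harmonic mean of precision and recall):

```python
# Verify the reported F1 against precision/recall: F1 = 2PR / (P + R).
rows = {
    "XLM-R Longformer":             (0.551, 0.771, 0.643),
    "mBERT - trained on all langs": (0.663, 0.437, 0.527),
    "mBERT - trained on enwiki":    (0.565, 0.296, 0.388),
    "XLM-Roberta":                  (0.669, 0.454, 0.541),
}
for model, (p, r, f1_reported) in rows.items():
    f1 = 2 * p * r / (p + r)
    assert round(f1, 3) == f1_reported, model  # all four rows check out
```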

Event Timeline

Restricted Application added a subscriber: alaa. · Mar 7 2025, 9:21 AM

Another training script (mBERT - trained on enwiki) is available for reference.

Updating after a brainstorming session with @achou:

hey,

Sharing some thoughts:

  • Dataset insights:
    • Can we generate some insights from the dataset to understand it better? This can also help us figure out whether the dataset should be cleaned further.
    • These insights might be:
      • the percentage of positive labels expected in production.
      • the most frequent articles in the dataset.
      • the most frequent edit/revert users in the dataset.
      • the peacock label percentage by term.
      • some hypotheses to validate, as time permits, to understand the data better:
        • impact of the distribution of added/removed text lengths in revisions on the labels.
        • impact of article metadata (e.g. topic, such as biography, if available) on the labels.
        • impact of editing-user metadata (e.g. number of edits, days since account creation) on the labels.
        • impact of context (e.g. local time of day).
  • Baseline model:
    • Can we create a baseline model to challenge our current model?
    • This can be a statistics-based approach (e.g. most frequent peacock words, kNN) or an easy-to-train model (e.g. CatBoost with text features).
    • https://catboost.ai/docs/en/features/text-features . CatBoost is a gradient-boosting model with out-of-the-box preprocessing for text inputs, e.g. converting text to tokens, TF-IDF features, or embeddings, depending on preference.
  • Some questions:
    • How do we do the negative sampling? Maybe we can try different percentages to see the impact.
    • Can we get some feedback from the research team about the model?
      • How is the dataset generated? (Check the dataset generation notebook.) It is currently 50% positive; the peacock template is used to filter positive labels (~2K edits). We want to check whether we can use more keywords to find positive labels.
      • What worked well and what didn't work well during training? Any feedback is useful, but some examples could be around the classification head, e.g. the sizes of the fully connected layers tested, the activation functions tested, etc.
      • Any documentation/publication for us to focus on? (Check the following document in more detail: https://docs.google.com/document/d/1x3xR60MC-XEgus6XMtBNZFLBFZ-GQKFENUKtSt7KAOc/edit?tab=t.0#heading=h.7bpdmd9t0rv)
  • Notes for production:
    • I think we should be aware that the label distribution in the current test set is different from the distribution we expect in production. The percentage of positive labels in the test set is 50%, and we expect a much lower percentage in production; we still need to find the expected positive-label percentage there. Therefore, the evaluation scores on the test set may not reflect the scores we will get in production. Given that, I think we have a couple of options:
      • Evaluate the model on a test set that is similar to production. If we get too many false positives because the original training set has a high percentage of positive items:
        • Adjust the thresholds to the expected production requests (this may yield a better AUC).
        • Adjust the train and test sets to the expected production requests.
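One concrete way to handle the 50% training prior vs. a much lower production prior is to rescale the model's predicted probabilities with Bayes' rule before thresholding. A sketch, assuming reasonably calibrated scores; the 5% production positive rate is a placeholder, not a measured value:

```python
def adjust_for_prior(p, train_prior=0.5, prod_prior=0.05):
    """Rescale a predicted positive probability from the training-set
    class prior to the expected production prior (Bayes correction).
    Assumes the raw score p is reasonably calibrated."""
    pos = p * prod_prior / train_prior
    neg = (1 - p) * (1 - prod_prior) / (1 - train_prior)
    return pos / (pos + neg)

# A score of 0.8 under the 50/50 training prior drops sharply once we
# account for positives being rare in production (~0.17 here):
adjusted = adjust_for_prior(0.8, train_prior=0.5, prod_prior=0.05)
```

This is an alternative to retraining on a rebalanced set: the decision threshold is then tuned on the adjusted scores against a production-like test set.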