
Create a multilingual model to predict reverts on Wikipedia
Open, HighPublic

Description

The Research team, in collaboration with the ML-Platform team, is creating a new service to help patrollers detect revisions that might be reverted (see the parent task).

This task is focused on a multilingual approach using cross-lingual language models.

Details

Other Assignee
Trokhymovych

Event Timeline

diego triaged this task as High priority.Aug 26 2022, 5:02 PM
diego updated Other Assignee, added: Trokhymovych.
  • Performed EDA for wmf.mediawiki_history
    • Got familiar with data
    • Found insights important for building the training dataset (a significant number of self-reverts, different revert rates for different groups of users)
  • Manually explored recent changes (text differences) for ukwiki, ruwiki, enwiki
    • Found differences in the causes of reverts across languages; these should be taken into account while modeling.
    • Got familiar with the logic of reverts and revert wars
    • Came up with logic that can reduce noise in the training dataset by filtering out "bad" reverts caused by revert wars.
  • Got familiar with the Analytics cluster
  • Collected a dataset of changes for ruwiki, enwiki, and ukwiki, along with text changes (inserts, changes, removals)
  • Performed EDA for collected datasets
  • Built toxicity features for inserts and changes based on the detoxify package and checked their predictive power -> they improve the baseline performance slightly, but not significantly.
  • Checked meta-features of text inserts and changes extracted using https://pypi.org/project/mwedittypes/ and checked their predictive power -> they improve the baseline performance.
  • Performed initial analysis of changes in references.
  • Checked the hypothesis that user location affects the model's ability to detect revisions that will be reverted. The experiment showed that location features have good predictive power, comparable with the text-change meta-features. In addition, I created a more detailed report on the correlation between anonymous users' locations and revert rates, which can be found in the attachment.
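The noise-filtering idea above (dropping self-reverts and revisions caught in revert wars) can be sketched as follows. The record keys (`page_id`, `user`, `is_revert`, `reverted_user`) and the war threshold are illustrative assumptions, not the actual wmf.mediawiki_history schema:

```python
def is_self_revert(rev):
    """A revert performed by the same user who made the reverted edit."""
    return rev["is_revert"] and rev.get("reverted_user") == rev["user"]

def filter_training_revisions(revisions, war_threshold=3):
    """Keep revisions suitable for training: drop self-reverts and pages
    caught in revert wars (more than `war_threshold` reverts among the
    sampled revisions for that page)."""
    reverts_per_page = {}
    for rev in revisions:
        if rev["is_revert"]:
            reverts_per_page[rev["page_id"]] = reverts_per_page.get(rev["page_id"], 0) + 1
    kept = []
    for rev in revisions:
        if is_self_revert(rev):
            continue  # self-reverts are noise, not patrolling signal
        if reverts_per_page.get(rev["page_id"], 0) > war_threshold:
            continue  # likely a revert war; labels are unreliable
        kept.append(rev)
    return kept
```

This is only a sketch of the filtering logic, not the production query against the cluster.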

  1. Recollected the dataset for anonymous users only and fixed minor bugs in text processing.
  2. Checked the profanity-check package, which is based on a list of bad words (https://pypi.org/project/profanity-check/) -> not useful, very weak signal
  3. Parsed pages' semantic information for further processing (article categories) and added wikidata_id.
  4. Attempted to obtain Wikidata embeddings, but without success. Pretrained models are either huge or include less than 20% of the needed entities.
  5. Checked the topic classification and country classification tools: https://wiki-topic.toolforge.org/. I found the country classification tool very insightful. Previously, I found that the location of anonymous users influences the revert rate. Combining page location and user location gives even more interesting insights that can be useful for modeling revert events.
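The combination of user and page location from step 5 can be turned into simple model features. A minimal sketch, where the field names and the feature set are hypothetical (the real values come from the country classification tool):

```python
def location_features(user_country, page_countries):
    """Derive simple boolean features from an (anonymous) user's country
    and the set of countries an article is associated with."""
    return {
        "user_country_known": user_country is not None,
        "page_has_country": len(page_countries) > 0,
        # Does the user edit from a country the article is about?
        "user_matches_page_country": user_country is not None
        and user_country in page_countries,
    }
```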

Experiments working with pure text:

  1. Got familiar with how to use the Wikimedia GPU for experiments.
  2. Experiment 1. Title semantics: I used only title texts to calculate embeddings using the pretrained multilingual sentence-transformers model (paraphrase-multilingual-MiniLM-L12-v2). After that, I built models with two preprocessing techniques:

a) Take the training dataset as collected (all revisions, with title duplicates) and balance the classes (random downsampling). Build a dummy CatBoost model on article titles only, as a categorical feature (which applies target encoding by default). The model shows ~70% accuracy. -> I interpret this as title leakage: some articles are changed frequently, and the revert rate of previous revisions can be a powerful feature. It also means that on further iterations we should do a train-test split that takes timestamps into account.
b) Take the training dataset as collected, drop records with duplicated titles ("take the first" strategy), and balance by is_reverted (the target). In this formulation the dummy model gives, as expected, random results (50% accuracy), so there is no title-frequency leakage and we can observe semantic information. I then take embeddings of the titles and build a binary classifier on top of them. It shows an accuracy of 56% on the balanced dataset (which is not bad, considering this information is very limited). So title semantics do influence model performance.
c) I tried to find patterns in the semantic information that would define clusters of articles with a higher revert rate -> unsuccessful.
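The preprocessing from step (b), deduplication by title with a "take the first" strategy followed by class balancing via random downsampling, can be sketched like this. The record layout is a simplified assumption:

```python
import random

def dedup_by_title(rows):
    """Keep only the first record seen for each title."""
    seen, out = set(), []
    for row in rows:
        if row["title"] not in seen:
            seen.add(row["title"])
            out.append(row)
    return out

def balance_downsample(rows, label="is_reverted", seed=0):
    """Randomly downsample the majority class so both classes are equal."""
    pos = [r for r in rows if r[label]]
    neg = [r for r in rows if not r[label]]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)
```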

  3. Experiment 2: Text changes model (enwiki only). The idea is to build a model that takes a pair of texts (before, after) and predicts whether a revert will occur.

For this experiment, I used only revisions that have exactly one text change. (Why? If there are no text changes, we can't say anything; if there is more than one change, the signal is too noisy for training, since we can't tell which change was the reason for the revert or non-revert.) I also balanced the dataset using downsampling.
I used bert-base-multilingual-cased and the huggingface framework to tune the model. The result is ~60% accuracy on balanced data.
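The filtering rule described above, keeping only revisions with exactly one text change, can be sketched as a small generator. The `changes` field (a list of `(before, after)` pairs per revision) is an assumed layout for illustration:

```python
def single_change_pairs(revisions):
    """Yield (before, after, is_reverted) training examples from revisions
    that contain exactly one text change; zero changes carry no signal and
    multiple changes make the label ambiguous."""
    for rev in revisions:
        if len(rev["changes"]) == 1:
            before, after = rev["changes"][0]
            yield before, after, rev["is_reverted"]
```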

  4. Experiment 3: Text inserts model (enwiki only). The idea is to build a model that takes an inserted text and predicts whether a revert will occur.

For this experiment, I used only revisions that have exactly one text insert, and I balanced the dataset using downsampling.
I used bert-base-multilingual-cased and the huggingface framework to tune the model. The result is ~70% accuracy on balanced data.

  1. Collected datasets for more languages (pl, de, es) and recollected the previous ones to proceed with the time-dependent experiment
  2. Experimented with multilingual model training for inserts and changes as preparation for finetuning on multiple languages.
  3. I rely heavily on the mwedittypes package, which uses mwparserfromhell for wikitext parsing. There is a known open issue where mwparserfromhell can take a very long (potentially infinite) time to parse. This could be a signal of vandalism, which is exactly the kind of signal we want to detect. I checked that ~1.0% of revisions had problems with wikitext parsing (roughly the same for all languages). The revert rate for parsed and unparsed revisions was the same, so this signal is probably not as strong as expected. I decided to account for it in the final model with a binary feature, is_parsed.
  4. Implemented a bootstrap strategy for computing confidence intervals of the statistics calculated in the regions analysis (to understand the confidence of the results I got while analyzing the revert rate for different user/page regions)
  5. Designed the architecture of an end-to-end model that considers both revision features and text features. Started implementing pipeline for experiments with such architecture.
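The bootstrap strategy from step 4 can be sketched as follows: resample the per-revision revert labels with replacement and take percentiles of the resampled revert rates. The parameter values are illustrative, not the ones used in the actual analysis:

```python
import random

def bootstrap_ci(labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the revert rate of a
    list of 0/1 labels (e.g. reverts for one user/page region)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        # Resample with replacement and record the resampled revert rate.
        sample = [labels[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```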

Current status: prepared training data; implemented the train-test split strategy, plus a separate split for text-model tuning and classifier tuning to avoid leakage. I started implementing the FeatureExtractor module for final-model feature generation. Built a baseline based on revision features + user location only (~70% accuracy on the balanced, timestamp-based hold-out dataset).
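The timestamp-based splitting described above can be sketched as two nested cutoffs: the oldest data tunes the text models, the middle slice trains the classifier on top of them, and the newest slice is the hold-out. Field names and cutoff handling are assumptions for illustration:

```python
def time_split(rows, cutoff_ts):
    """Everything strictly before the cutoff goes to train, the rest to test,
    so no future information leaks into training."""
    train = [r for r in rows if r["timestamp"] < cutoff_ts]
    test = [r for r in rows if r["timestamp"] >= cutoff_ts]
    return train, test

def nested_split(rows, text_cutoff, clf_cutoff):
    """Three-way time split: text-model tuning, classifier training, hold-out."""
    text_part, rest = time_split(rows, text_cutoff)
    clf_part, holdout = time_split(rest, clf_cutoff)
    return text_part, clf_part, holdout
```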

Future steps: finetune models for text-based feature generation (comments, title semantics, changes, inserts), add these signals to the model, and analyze the results.

This week I was working on a complex model that considers meta-features and text changes. What was done:

  1. Finetuned models for text-based feature generation (comments, title semantics, changes, inserts) and evaluated them separately. I then extracted features for the complex model from the last layer before softmax plus the softmax-layer outputs of each model (except title semantics, as it was trained as a regression model).
  2. Trained a model with all the text features added, using the part of the data that was not used for text-model finetuning, to avoid leakage. As a result, I got a boost in accuracy (~70% -> 74% on the balanced test set)
  3. Started error analysis and results observation.
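The feature extraction in step 1, concatenating each text model's pre-softmax logits with its softmax probabilities, can be sketched in pure Python. The logits here are placeholders; in the real pipeline they come from the last layer of each finetuned transformer:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def text_model_features(logits):
    """Concatenate raw pre-softmax logits with their softmax probabilities,
    giving the downstream classifier both calibrated and uncalibrated views."""
    return list(logits) + softmax(logits)
```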

Next steps:

  1. Make a detailed error analysis to identify possible modeling gaps and further steps
  2. Define inference specifics and benchmark model efficiency (inference time)
  3. Compare the results with existing solutions: the new language-agnostic model based on meta-features, and ORES.
  1. I was working on interpreting the model results and prepared a notebook with examples of per-sample SHAP values for the final model. I also investigated methods to interpret each independent text model, for better understanding and further improvement.
  2. Prepared and held a presentation of intermediate research results.
  3. Later, I worked on model validation using one week's data, including data collection, feature collection, and building a report. I also investigated the package that implements the language-agnostic model, for possible use in my model's inference feature engineering. Finalized the report for the complete hold-out on the one-week dataset and performed a sample-wise analysis of the differences between the models.
  4. Studied the possibility of building a model on top of the language-agnostic, ORES, and multilingual models to generalize their knowledge. Evaluated it using the hold-out set.
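The stacking idea from step 4 can be sketched as a small meta-model over the three models' scores. Here a hand-weighted logistic blend stands in for the trained meta-model; the weights and bias are illustrative placeholders, not learned values:

```python
import math

def stacked_score(p_agnostic, p_ores, p_multilingual,
                  weights=(1.0, 1.0, 1.0), bias=-1.5):
    """Blend the language-agnostic, ORES, and multilingual model scores
    through a logistic function to produce one revert-risk probability."""
    z = bias + sum(w * p for w, p in zip(weights,
                                         (p_agnostic, p_ores, p_multilingual)))
    return 1.0 / (1.0 + math.exp(-z))
```

In practice the weights would be fitted on the part of the data held out from the base models, exactly to avoid the leakage discussed earlier.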

For the record, here is a snippet (by @achou) to try this model:

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revert-risk-model:predict" -d @input.json -H "Host: revert-risk-model.experimental.wikimedia.org" --http1.1 -k

An example for input.json: { "lang": "ru", "rev_id": 123855516 }
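For completeness, the same request can be built in Python with the standard library. This is a sketch: the staging endpoint is only reachable from inside the Wikimedia infrastructure, and the curl `-k` flag (skipping TLS verification) is not replicated here:

```python
import json
import urllib.request

HOST_HEADER = "revert-risk-model.experimental.wikimedia.org"
URL = ("https://inference-staging.svc.codfw.wmnet:30443"
       "/v1/models/revert-risk-model:predict")

def build_request(lang, rev_id):
    """Build the POST request equivalent to the curl snippet above."""
    payload = json.dumps({"lang": lang, "rev_id": rev_id}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=payload,
        headers={"Host": HOST_HEADER, "Content-Type": "application/json"},
    )

# To actually send it (requires network access to the staging cluster):
# response = urllib.request.urlopen(build_request("ru", 123855516))
```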