
Repeat Automoderator testing process with Multilingual Revert Risk data
Open, In Progress, Medium, Public

Description

In T346916 we generated datasets which we made available through a testing process to judge the accuracy of the Language-Agnostic Revert Risk model. We would like to investigate moving to the Multilingual Revert Risk model, which will require a new round of testing. We want to know whether it is reliable, how it differs from the Language-Agnostic version, and what sensible thresholds to set in Automoderator's community configuration.

To start, please generate datasets for the same wikis as in T346916 so that we can make these datasets available to the community. We can then create a second version of the testing spreadsheet incorporating these datasets.

We can start with 2,000 edits per wiki - the 25,000 in the original ticket turned out to be far more than we needed!
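For scoring the sampled edits, one option is the LiftWing inference endpoints. A minimal sketch, assuming the documented LiftWing payload and response shape (the endpoint paths and User-Agent here are illustrative, not from this task; the multilingual endpoint may need extra access):

# Sketch: score one revision with a LiftWing revert-risk model.
# Assumes the documented payload {"rev_id", "lang"} and response shape.
import requests

LIFTWING = "https://api.wikimedia.org/service/lw/inference/v1/models"

def score_revision(rev_id: int, lang: str,
                   model: str = "revertrisk-multilingual") -> float:
    """Return the predicted revert-risk probability for one revision."""
    resp = requests.post(
        f"{LIFTWING}/{model}:predict",
        json={"rev_id": rev_id, "lang": lang},
        headers={"User-Agent": "automoderator-threshold-testing (sketch)"},
    )
    resp.raise_for_status()
    return resp.json()["output"]["probabilities"]["true"]

# e.g. build a 2,000-edit dataset per wiki from pre-sampled revision IDs:
# scores = {rev: score_revision(rev, "en") for rev in sampled_rev_ids}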

Event Timeline

We also need to repeat T351057 for this model - shall I make that a separate ticket?

@Samwalton9-WMF yes, please create a separate ticket for that.

I would also like to tag @Pablo @diego: do we have regular snapshots of revert risk scores based on the multilingual model as well, or even a single snapshot for a few months?

Something similar to risk_observatory.revert_risk_predictions, or a one-off snapshot like /user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet that I previously used.
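(For reference, a sketch of how those could be read from an analytics client notebook, assuming the wmfdata Spark helper; the table and path are the ones named above, and the exact session options may differ:)

# Sketch: read the existing snapshots from a Spark notebook via wmfdata.
import wmfdata

spark = wmfdata.spark.get_session(app_name="revertrisk-snapshots")

# Regular snapshot table of revert-risk scores:
regular = spark.table("risk_observatory.revert_risk_predictions")

# One-off parquet snapshot referenced above:
one_off = spark.read.parquet(
    "/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet"
)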

I don't know, maybe @fkaelin knows.

There is no precomputed dataset available. The implementation is general, e.g. by passing a multilingual model URL it could create predictions for that model. See the risk observatory pipeline code for an example that uses this transformation end-to-end; it can be run via a notebook (pip install the repo, then import and execute the run method). To create a pipeline that generates predictions for the multilingual model via an Airflow DAG, we would need to file a research engineering request.
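A hypothetical sketch of that notebook workflow; the module path and run() signature below are assumptions, so check the actual risk observatory pipeline code for the real entry point:

# Assumed module path and signature, for illustration only.
# First: pip install the research datasets repo into the notebook kernel.
from research_datasets.revert_risk import pipeline  # assumed

predictions = pipeline.run(
    # The transformation is general: pass the multilingual model URL to
    # generate predictions for that model instead of the language-agnostic one.
    model_url=(
        "https://analytics.wikimedia.org/published/wmf-ml-models/"
        "revertrisk/multilingual/20230810110019/model.pkl"
    ),
)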

KCVelaga_WMF changed the task status from Open to In Progress (Sep 10 2024, 5:20 AM).
KCVelaga_WMF moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

@fkaelin thank you! I am running into an error trying to load the model. I tried both versions of the model: 20230810110019 & 20230320192952.

When I try

from research_datasets.revert_risk.model import load_model
model = load_model(file_uri="https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl")

it leads to an error that I am unable to resolve:

AttributeError: module '__main__' has no attribute 'MultilingualRevertRiskModel'

Here is the full stack trace: P68760

Also, loading the language-agnostic model works fine.
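For what it's worth, this AttributeError is the usual pickle failure mode: the model was saved from a script where MultilingualRevertRiskModel was defined in __main__, so unpickling looks the class up there. A possible workaround sketch, with the defining module path assumed:

import pickle
import __main__

# pickle looks for MultilingualRevertRiskModel on __main__, because that is
# where the class lived when the model was saved. Re-registering it there
# before loading usually works; the defining module below is an assumption.
from knowledge_integrity.models.revertrisk_multilingual import (  # assumed
    MultilingualRevertRiskModel,
)
__main__.MultilingualRevertRiskModel = MultilingualRevertRiskModel

with open("model.pkl", "rb") as f:  # local copy of the published pickle
    model = pickle.load(f)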

@KCVelaga_WMF I misread this code previously - for now the model loading/inference for the various variants of the revert risk model is not unified (we plan to do that though). There are separate loading and classify methods for the multilingual model. The feature extraction pipeline for all models is generalized, but the inference step needs some more work. I started updating this notebook to work with the multilingual model but ran into some torch/transformer version issues.
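Roughly, the non-unified path looks like this (module path, names, and result shape are assumed for illustration, not taken from the repo):

# Illustration only: the multilingual model has its own load/classify
# functions rather than the unified load_model() used above. Names assumed.
from knowledge_integrity.models.revertrisk_multilingual import load_model, classify

model = load_model("model.pkl")     # local copy of the pickle
result = classify(model, revision)  # `revision` comes from the feature pipeline
print(result)                       # assumed result shape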

@Samwalton9-WMF The predictions are available on this spreadsheet for testing. Note: for this task and the reverts-per-day numbers, I had to limit the data to the first week of August, as I ran into memory issues.

Thank you! I ran through ~200 edits with revert risk scores between 0.9 and 1 on enwiki. My unscientific accuracy results at different score levels, conservatively counting cases where I wasn't sure as false positives:

0.9: 38%
0.93: 56%
0.95: 75%
0.96: 84%
0.97: 80%
0.98: 81%
0.99: 91%
0.999: 100%
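For anyone repeating this, a small sketch of tallying such numbers from a labeled sample (the file and column names are assumptions; this counts cumulatively at or above each threshold, which may differ slightly from the per-score buckets above):

# Sketch: accuracy at each threshold from a hand-labeled CSV. Assumes a
# 'score' column (model output) and a boolean 'should_revert' label.
import pandas as pd

sample = pd.read_csv("enwiki_labeled_sample.csv")  # hypothetical file
for t in [0.9, 0.93, 0.95, 0.96, 0.97, 0.98, 0.99, 0.999]:
    flagged = sample[sample["score"] >= t]
    if len(flagged) > 0:
        rate = flagged["should_revert"].mean()
        print(f"{t}: {rate:.0%} of {len(flagged)} edits at or above threshold")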

My initial ballpark thresholds to test, then, would be something like buckets of 0.01 starting at 0.95. @OTichonova is doing a similar runthrough for de.wiki and ru.wiki.

Threshold   de.wiki   ru.wiki
.99         96%       96%
.98         83%       83%
.97         63%       63%
.96         67%       71%
.95         50%       50%
.94         37%       54%

An important note from @diego about accuracy for the multilingual model: it varies substantially between languages, and therefore we're unlikely to be able to pick one set of thresholds for all wikis, unlike with the language-agnostic model. This poses some challenges for our testing process and UX. We may need to let each community set a very specific (i.e. numeric) threshold, rather than giving them predetermined buckets.

See teal columns in top graph, or bottom-left plot:

[Attachment: image.png (913×769 px, 122 KB), per-language accuracy plots]