
Repeat Automoderator testing process with Multilingual Revert Risk data
Open, In Progress, Medium, Public

Description

In T346916 we generated datasets which we made available through a testing process to judge the accuracy of the Language-Agnostic Revert Risk model. We would like to investigate moving to the Multilingual Revert Risk model, which will require a new round of testing. We want to know whether it is reliable, how it differs from the Language-Agnostic version, and what sensible thresholds to set in Automoderator's community configuration.

To start, please generate datasets for the same wikis as in T346916 so that we can make these datasets available to the community. We can then create a second version of the testing spreadsheet incorporating these datasets.

We can start with 2,000 edits per wiki - the 25,000 in the original ticket turned out to be far more than we needed!
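For scoring the sampled edits, one option is the LiftWing inference endpoints. A minimal sketch, assuming the documented LiftWing payload and response shape (the endpoint paths and User-Agent here are illustrative, not from this task; the multilingual endpoint may need extra access):

# Sketch: score one revision with a LiftWing revert-risk model.
# Assumes the documented payload {"rev_id", "lang"} and response shape.
import requests

LIFTWING = "https://api.wikimedia.org/service/lw/inference/v1/models"

def score_revision(rev_id: int, lang: str,
                   model: str = "revertrisk-multilingual") -> float:
    """Return the predicted revert-risk probability for one revision."""
    resp = requests.post(
        f"{LIFTWING}/{model}:predict",
        json={"rev_id": rev_id, "lang": lang},
        headers={"User-Agent": "automoderator-threshold-testing (sketch)"},
    )
    resp.raise_for_status()
    return resp.json()["output"]["probabilities"]["true"]

# e.g. build a 2,000-edit dataset per wiki from pre-sampled revision IDs:
# scores = {rev: score_revision(rev, "en") for rev in sampled_rev_ids}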

Event Timeline

We also need to repeat T351057 for this model - shall I make that a separate ticket?

@Samwalton9-WMF yes, please create a separate ticket for that.

I would also like to tag @Pablo @diego: do we have regular snapshots of revert risk scores based on the multilingual model as well, or even a single snapshot for a few months?

Something similar to risk_observatory.revert_risk_predictions, or a one-off snapshot like /user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet that I previously used.
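(For reference, a sketch of how those could be read from an analytics client notebook, assuming the wmfdata Spark helper; the table and path are the ones named above, and the exact session options may differ:)

# Sketch: read the existing snapshots from a Spark notebook via wmfdata.
import wmfdata

spark = wmfdata.spark.get_session(app_name="revertrisk-snapshots")

# Regular snapshot table of revert-risk scores:
regular = spark.table("risk_observatory.revert_risk_predictions")

# One-off parquet snapshot referenced above:
one_off = spark.read.parquet(
    "/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet"
)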

I don't know, maybe @fkaelin knows.

There is no precomputed dataset available. The implementation is general, e.g. by passing a multilingual model URL it could create predictions for that model. See the risk observatory pipeline code for an example that uses this transformation end-to-end; it can be run via a notebook (pip install the repo, then import and execute the run method). To create a pipeline that generates predictions for the multilingual model via an Airflow DAG, we would need to file a research engineering request.
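A hypothetical sketch of that notebook workflow; the module path and run() signature below are assumptions, so check the actual risk observatory pipeline code for the real entry point:

# Assumed module path and signature, for illustration only.
# First: pip install the research datasets repo into the notebook kernel.
from research_datasets.revert_risk import pipeline  # assumed

predictions = pipeline.run(
    # The transformation is general: pass the multilingual model URL to
    # generate predictions for that model instead of the language-agnostic one.
    model_url=(
        "https://analytics.wikimedia.org/published/wmf-ml-models/"
        "revertrisk/multilingual/20230810110019/model.pkl"
    ),
)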

KCVelaga_WMF changed the task status from Open to In Progress (Sep 10 2024, 5:20 AM).
KCVelaga_WMF moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

@fkaelin thank you! I am running into an error trying to load the model. I tried both versions of the model: 20230810110019 & 20230320192952.

When I try

from research_datasets.revert_risk.model import load_model
model = load_model(file_uri="https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl")

it leads to an error that I am unable to resolve:

AttributeError: module '__main__' has no attribute 'MultilingualRevertRiskModel'

Here is the full stack trace: P68760

Also, loading the language-agnostic model works fine.
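For what it's worth, this AttributeError is the usual pickle failure mode: the model was saved from a script where MultilingualRevertRiskModel was defined in __main__, so unpickling looks the class up there. A possible workaround sketch, with the defining module path assumed:

import pickle
import __main__

# pickle looks for MultilingualRevertRiskModel on __main__, because that is
# where the class lived when the model was saved. Re-registering it there
# before loading usually works; the defining module below is an assumption.
from knowledge_integrity.models.revertrisk_multilingual import (  # assumed
    MultilingualRevertRiskModel,
)
__main__.MultilingualRevertRiskModel = MultilingualRevertRiskModel

with open("model.pkl", "rb") as f:  # local copy of the published pickle
    model = pickle.load(f)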

@KCVelaga_WMF I misread this code previously - for now the model loading/inference for the various variants of the revert risk model is not unified (we plan to do that though). There are separate loading and classify methods for the multilingual model. The feature extraction pipeline for all models is generalized, but the inference step needs some more work. I started updating this notebook to work with the multilingual model but ran into some torch/transformer version issues.
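Roughly, the non-unified path looks like this (module path, names, and result shape are assumed for illustration, not taken from the repo):

# Illustration only: the multilingual model has its own load/classify
# functions rather than the unified load_model() used above. Names assumed.
from knowledge_integrity.models.revertrisk_multilingual import load_model, classify

model = load_model("model.pkl")     # local copy of the pickle
result = classify(model, revision)  # `revision` comes from the feature pipeline
print(result)                       # assumed result shape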

@Samwalton9-WMF The predictions are available on this spreadsheet for testing. Note: for this task and the reverts-per-day numbers, I had to limit the data to the first week of August, as I ran into memory issues.

Thank you! I ran through ~200 edits with revert risk scores between 0.9 and 1 on enwiki. My unscientific accuracy results at different score levels, conservatively counting cases where I wasn't sure as false positives:

0.9: 38%
0.93: 56%
0.95: 75%
0.96: 84%
0.97: 80%
0.98: 81%
0.99: 91%
0.999: 100%
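For anyone repeating this, a small sketch of tallying such numbers from a labeled sample (the file and column names are assumptions; this counts cumulatively at or above each threshold, which may differ slightly from the per-score buckets above):

# Sketch: accuracy at each threshold from a hand-labeled CSV. Assumes a
# 'score' column (model output) and a boolean 'should_revert' label.
import pandas as pd

sample = pd.read_csv("enwiki_labeled_sample.csv")  # hypothetical file
for t in [0.9, 0.93, 0.95, 0.96, 0.97, 0.98, 0.99, 0.999]:
    flagged = sample[sample["score"] >= t]
    if len(flagged) > 0:
        rate = flagged["should_revert"].mean()
        print(f"{t}: {rate:.0%} of {len(flagged)} edits at or above threshold")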

My initial ballpark thresholds to test, then, would be something like buckets of 0.01 starting at 0.95. @OTichonova is doing a similar runthrough for de.wiki and ru.wiki.

Threshold   de.wiki   ru.wiki
.99         96%       96%
.98         83%       83%
.97         63%       63%
.96         67%       71%
.95         50%       50%
.94         37%       54%

An important note from @diego about accuracy for the multilingual model: it varies substantially between languages, and therefore we're unlikely to be able to pick one set of thresholds for all wikis, unlike with the language-agnostic model. This poses some challenges for our testing process and UX. We may need to let each community set a very specific (i.e. numeric) threshold, rather than giving them predetermined buckets.

See teal columns in top graph, or bottom-left plot:

[Attachment: image.png (913×769 px, 122 KB), per-language accuracy plots]