Enable communities to configure automated reversion of bad edits
Open, Needs Triage, Public

Description

Project page: https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator

Hypothesis

If we enable communities to automatically prevent or revert obvious vandalism, moderators will have more time to spend on other activities. This project will use the Research team's revert prediction model, and we may explore integrating it with AbuseFilter, or creating an automated anti-vandalism revert system. Targets and KRs for this project will be set collaboratively with the community.

Goals

  • Reduce moderation backlogs by preventing bad edits from entering patroller queues
  • Give moderators confidence that automoderation is reliable and is not producing significant false positives
  • Ensure that editors caught in a false positive have clear avenues to flag the error / have their edit reinstated

Proposed solution

Build functionality which enables editors to revert edits which meet a threshold in the revert-prediction model created by the Research team.

Give communities control over the product's configuration, so that they can:

  • Locally turn it on or off
  • Set the revert threshold
  • Audit configuration changes
  • Be notified of configuration changes
  • Customise further aspects of how the feature operates
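
To make the list above concrete, here is a purely hypothetical sketch of what a per-wiki configuration could look like; every key and value is illustrative only and does not reflect an actual setting of the product:

# Hypothetical per-wiki configuration; key names and values are illustrative only.
AUTOMODERATOR_CONFIG = {
    "enabled": True,                         # locally turn the feature on or off
    "revert_threshold": 0.97,                # minimum revert-risk score before reverting
    "log_configuration_changes": True,       # keep an auditable history of changes
    "notify_on_configuration_change": True,  # alert the community when settings change
    "exempt_user_groups": ["sysop", "bot"],  # example of further customisation
}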

Illustrative sketch

image.png (487 KB)

Model

T314384: Develop a ML-based service to predict reverts on Wikipedia(s) (Meta, model card).

There are two versions of the model: language-agnostic and multilingual. The multilingual model is more accurate, but only available in 48 languages, as documented in the model card.

The model is currently trained only on namespace 0 (article/main space). It has been trained on Wikipedia and Wikidata, but could be trained on other projects.
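
For illustration, a minimal sketch of scoring a single revision against the language-agnostic model via the Lift Wing API (the same endpoint used in the curl example later in this task); the 0.97 cut-off is a placeholder, not a recommended threshold:

import requests

# Query the revertrisk-language-agnostic model on Lift Wing for one revision.
# Endpoint, payload and response shape follow the curl example further down this task.
API = "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-language-agnostic:predict"

def revert_risk(rev_id, lang):
    resp = requests.post(API, json={"rev_id": rev_id, "lang": lang})
    resp.raise_for_status()
    # output.probabilities.true is the predicted probability that the edit gets reverted
    return resp.json()["output"]["probabilities"]["true"]

if revert_risk(15206502, "ro") >= 0.97:  # placeholder threshold
    print("candidate for automated revert")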

Existing volunteer-maintained solutions

Project            Bot
en.wiki            ClueBot NG
es.wiki            SeroBOT
fr.wiki & pt.wiki  Salebot
fa.wiki            Dexbot
bg.wiki            PSS 9
simple.wiki        ChenzwBot
ru.wiki            Рейму Хакурей
ro.wiki            PatrocleBot

Relevant community discussions

Key Results

  • Automoderator has a baseline accuracy of 90%
  • Moderator editing activity increases by 10% in non-patrolling workflows
  • Automoderator is enabled on two Wikimedia projects by the end of FY23/24.
  • 5% of patrollers engage with Automoderator tools and processes on projects where it is enabled.
  • 90% of false positive reports receive a response or action from another editor.

See more details in our measurement plan.

Open questions

  • Can we improve the experience of receiving a false positive revert, as compared to the volunteer-maintained bots?
  • How can we quantify the impact this product will have? Positive (reducing moderator burdens) and negative (false positives for new users)


Event Timeline

Thanks @Samwalton9 for bringing up this idea! Having a dedicated score would help tremendously, even if the bot runs are not unified.

An externally-hosted tool using a more standard bot account

Note that at least for ro.wp reverts are done without the bot flag so they are visible to human patrollers. IMO, that has 2 advantages:

  1. wrong reverts are easy to spot
  2. users being auto-reverted get on the radar of human reverters

How can we quantify the impact this product will have? Positive (reducing moderator burdens) and negative (false positives for new users)

I can provide some numbers from the past year for ro.wp: https://ro.wikipedia.org/wiki/Utilizator:PatrocleBot/Rapoarte/csv

  • data - date
  • anulat - reverted
  • total - all non-autopatrolled edits in that interval (these are the edits targeted by the bot)
  • patrulat - patrolled by bot
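
As a rough sketch (assuming the report has been downloaded locally as a comma-separated file with exactly those column names), the revert rate per interval could be computed like this:

import pandas as pd

# "patrocle_report.csv" is an assumed local copy of the report linked above;
# column names follow the glossary: data, anulat, total, patrulat.
df = pd.read_csv("patrocle_report.csv")
df["revert_rate"] = df["anulat"] / df["total"]
print(df[["data", "anulat", "total", "revert_rate"]].tail())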

Note that at least for ro.wp reverts are done without the bot flag so they are visible to human patrollers.

Thanks for highlighting this! I planned to reach out to you and other bot developers to learn about things like this :)

Just so I understand - are patrollers typically hiding bot edits?

I wonder if there are communities where this isn't the norm and they would want a revert bot to use the bot flag.

I can provide some numbers from the past year for ro.wp: https://ro.wikipedia.org/wiki/Utilizator:PatrocleBot/Rapoarte/csv

Very helpful - so does the bot also patrol edits? What criteria does it use?

Just so I understand - are patrollers typically hiding bot edits?

I do believe this is the default in the recent changes. It might be that some are displaying bot edits, but this is not the norm.

Very helpful - so does the bot also patrol edits? What criteria does it use?

Damaging score below the 'very likely good' limit and the goodfaith score above the 'very likely good faith' limit (intervals are available at [[Special:ORESModels]] and are wiki-specific). The double score check seems to be very powerful: I have found bad reverts, but no bad patrols so far.
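
A minimal sketch of that double check against the classic ORES v3 scores API; the two cut-offs below are placeholders, since the real 'very likely good' / 'very likely good faith' intervals are wiki-specific (see [[Special:ORESModels]]):

import requests

# Placeholder cut-offs; the real intervals are wiki-specific.
VERY_LIKELY_GOOD_MAX = 0.15        # damaging probability below this => "very likely good"
VERY_LIKELY_GOODFAITH_MIN = 0.85   # goodfaith probability above this => "very likely good faith"

def can_autopatrol(rev_id, wiki="rowiki"):
    resp = requests.get(f"https://ores.wikimedia.org/v3/scores/{wiki}/",
                        params={"models": "damaging|goodfaith", "revids": rev_id})
    resp.raise_for_status()
    scores = resp.json()[wiki]["scores"][str(rev_id)]
    damaging = scores["damaging"]["score"]["probability"]["true"]
    goodfaith = scores["goodfaith"]["score"]["probability"]["true"]
    # Patrol only when both conditions hold, as described above.
    return damaging < VERY_LIKELY_GOOD_MAX and goodfaith > VERY_LIKELY_GOODFAITH_MIN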

There are two versions of the model: language-agnostic and multilingual. The multilingual model is more accurate, but only available in 48 languages, as documented in the model card.

@diego do you think there would be any reason to allow communities to select which model to use, or is the multilingual model simply the better option?

EDIT: We discussed this on a call. The short version is that it depends, particularly on the performance of the models, which is still in flux.

After doing a comparative study on ~900 edits (unpatrolled, last edit, last 30 days on rowiki) with all 3 models (ORES, multilingual and language-agnostic) I have some data to share. I hope that by sharing my methodology other bot owners will be willing to invest the time to do the same on their wikis. To begin with the conclusion, I believe that offering the choice of model and model retraining early on will be critical for the adoption of the project. Mixing the models together would be ideal (e.g. start with ORES and train the other models with false positives).

The raw data can be found in Google docs.

The assumption I started from was that ORES, being trained on local data only, will be the best fit for the habits of a community, and the question I wanted answered was: how far from it were the other 2 models? A secondary question was: what threshold should we use for the newer models?

Methodology: for each unpatrolled change that can be reverted and was found in the recent changes stream, calculate the ORES, multilingual and agnostic scores. Then calculate the differences between ORES and each of the revertrisk scores, as well as between the multilingual and agnostic models. I then looked at the data globally and manually checked a few edits where the score differences were huge. Initially, I was also planning to use the prediction field, but since this is equivalent to if(score>0.5) prediction=true, I could not really extract useful data from it.
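
A compressed sketch of that difference computation, assuming a dataframe with one row per revision and one score column per model (the diff column names mirror the histogram file names attached below; file and column names are illustrative):

import pandas as pd

# "scores.csv" and its column names (ores_damaging, multilingual, agnostic) are illustrative.
df = pd.read_csv("scores.csv")
df["diff_ores_multilingual"] = df["ores_damaging"] - df["multilingual"]
df["diff_ores_agnostic"] = df["ores_damaging"] - df["agnostic"]
df["diff_multilingual_agnostic"] = df["multilingual"] - df["agnostic"]

# Histograms of the differences, as in the attached images (requires matplotlib).
df[["diff_ores_multilingual", "diff_ores_agnostic"]].hist(bins=40)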

Here is what I make of the data:

  • The limited manual analysis I did showed different strengths in each model:
    • ORES is good at catching local problems (e.g. an IP which makes "net-zero" edits, i.e. just rewordings, which I usually revert because they make the article harder to read)
    • the agnostic model is good at catching technical problems (it caught 2 good faith edits which destroyed infoboxes)
    • the multilingual model catches borderline edits, such as someone changing information in infoboxes
  • The two revertrisk models are about the same distance from ORES, with the language-agnostic one being slightly "closer" (i.e. a histogram more concentrated near the mid-point). I doubt the difference is statistically significant. This is a bit counter-intuitive, given the description of the models.
  • Both revertrisk models return slightly lower scores than ORES, that is, they underestimate how damaging an edit is. The multilingual model is closer to ORES. This shows that language matters. You can see the difference in the bias toward the right in the histograms below.
  • Because of the different strengths, there is no alignment of the scores, that is, it's very hard to create a correspondence between ORES scores (especially a 2-dimensional score) and revertrisk scores. However, there is a clear connection between the revertrisk scores: if both are high, the chance of the change being reverted increases significantly.

Given the above, I believe the development team should consider making the model (and associated threshold) configurable from the first deployment. This will allow communities to experiment with the knobs and find the option that best fits their needs. The significant associated risk that I see is that a community might decide to enforce a more aggressive approach (that is, a lower threshold), which will generate a large number of false positives and undermine trust in the project.

Histograma coloanei diff_ores_multilingual.png (histogram of the diff_ores_multilingual column; 371×600 px, 13 KB)
Histograma coloanei diff_ores_agnostic.png (histogram of the diff_ores_agnostic column; 371×600 px, 12 KB)

Hi @Strainu, Diego here from the WMF Research team.

Thanks for this analysis, it is very valuable. I would like to better understand the raw data you shared, and I think that would also help me understand your conclusions. My main question is: Which column are you considering as ground truth (your labels)?
With that info, I could help compute some stats.
I would be interested in knowing the precision@k for each model (i.e. the Precision-Recall curve), so we can compare their results and not the probabilities (a.k.a. "scores"), which are not necessarily linearly distributed. I'll be happy to compute those numbers; I'll just need the ground-truth column.

Thanks for all your work!

My main question is: Which column are you considering as ground truth (your labels)?

As mentioned before, I believe that the ORES damaging model (column B), being trained on local data, will be the best fit for the habits of a community. I'm a bit reluctant to call it "ground truth", as for ORES the revert results are significantly improved by using both the damaging and goodfaith models, but I think we can nevertheless work with this simplifying assumption.

I would be interested in knowing the precision@k for each model

I think you can get that from the model cards.

@diego since you graciously offered to dive deep on this comparison, maybe you will find it useful to also see a manual review of some diffs. I asked the patrollers at rowiki to manually review a set of changes that received a score >=0.93 from the revertrisk algorithms and the results can be found below (column 2 is "would revert", column 3 is "would not revert"):

Since this is a crowdsourced effort, some revisions might have confusing results (both yes and no answers). I will also retry with a higher threshold in the following days (to let changes accumulate).

Great. Having some manual labels is always valuable.
I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93. For example, this one:

$ curl https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-language-agnostic:predict -X POST -d '{"rev_id": 15206502, "lang": "ro"}' -H "Content-type: application/json" 

{"model_name":"revertrisk-language-agnostic","model_version":"2","wiki_db":"rowiki","revision_id":15206502,"output":{"prediction":false,"probabilities":{"true":0.09646665304899216,"false":0.9035333469510078}}}

Anyhow, I've done some cleaning, and merged the datasets, and then I've computed some scores:

Model     Precision   F1-score
RR LA     0.53        0.69
RR ML     0.52        0.69
ORES      0.53        0.69

Notice that given these revisions were not randomly selected, these numbers are not representative of the models' performance, but they give an idea of model precision on the top side of the Precision-Recall curve.

If you want to see the full analysis, I posted the code and results here.
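
For reference, a minimal sketch of the kind of computation behind the table above, assuming a merged dataframe with a boolean manual label and one probability column per model (the file and column names here are illustrative):

import pandas as pd
from sklearn.metrics import precision_score, f1_score

# "merged_labels.csv" and the column names are illustrative: "label" is the manual
# "would revert" judgement; each model column holds that model's probability.
df = pd.read_csv("merged_labels.csv")
for model in ["rr_la", "rr_ml", "ores_damaging"]:
    pred = df[model] >= 0.5   # the cut-off implied by the models' "prediction" field
    print(model,
          round(precision_score(df["label"], pred), 2),
          round(f1_score(df["label"], pred), 2))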

I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93.

That is very weird. I checked the code and a few of the pages identified in a newer run and did not see any mismatch. Is it possible for the score to change? I know in ORES it was possible under certain conditions.

Anyhow, I've done some cleaning, and merged the datasets, and then I've computed some scores:

These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?

If you want to see the full analysis, I posted the code and results here.

Thanks, I'll do some changes based on the above and see where we end up.

That is very weird. I checked the code and a few of the pages identified in a newer run and did not see any mismatch. Is it possible for the score to change? I know in ORES it was possible under certain conditions.

I can't think of a case where this is possible, but I'll have a look.

These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?

I've done both; you can find them in the Jupyter notebook. But in summary, the precision is very similar (almost identical) to ORES rowiki-damaging.

Thank you @diego for analyzing the data. If my understanding is correct, the precision is the metric of the model that will be visible to the community - how many of the reverts are correct, so this is what I'm looking at.

I've played around with all the parameters (threshold, treatment of draws, whether the commit was indeed reverted) on the two data sets above and the results were indeed similar. I then realised that I was collecting revisions already filtered by the reverter I run, so I changed the data source to recent changes and the results can be seen here. I also ran a pwb version where I logged in to be able to retrieve 5000 changes, with similar results.
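
A small sketch of that kind of threshold sweep, assuming a labelled dataframe like the ones above (file and column names illustrative); it shows the precision a reverter would achieve at each candidate cut-off:

import pandas as pd

# "score" = revert-risk probability, "label" = True if patrollers would revert.
df = pd.read_csv("recentchanges_scores.csv")
for threshold in (0.90, 0.93, 0.95, 0.97, 0.99):
    flagged = df[df["score"] >= threshold]
    precision = flagged["label"].mean() if len(flagged) else float("nan")
    print(f"threshold {threshold:.2f}: {len(flagged)} would-be reverts, precision {precision:.2f}")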

The best I could get in all these scenarios is ~0.7 precision for the LA model. For me, this number looks a bit low to push the model on the communities without the ability to experiment with the alternatives. Therefore, in the context of this task, my suggestions are:

  • allow the communities to configure the model between LA, ML and ORES wherever possible
  • allow the communities to choose a custom threshold score for the chosen model
Samwalton9-WMF renamed this task from "Enable communities to configure automated prevention or reversion of bad edits" to "Enable communities to configure automated reversion of bad edits". Apr 3 2024, 10:57 AM