Enable communities to configure automated reversion of bad edits
Open, Needs Triage, Public

Description

Project page: https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator

Hypothesis

If we enable communities to automatically prevent or revert obvious vandalism, moderators will have more time to spend on other activities. This project will use the Research team's revert prediction model, and we may explore integrating it with AbuseFilter, or creating an automated anti-vandalism revert system. Targets and KRs for this project will be set collaboratively with the community.

Goals

  • Reduce moderation backlogs by preventing bad edits from entering patroller queues
  • Give moderators confidence that automoderation is reliable and is not producing significant false positives
  • Ensure that editors caught in a false positive have clear avenues to flag the error / have their edit reinstated

Proposed solution

Build functionality which enables editors to revert edits which meet a threshold in the revert-prediction model created by the Research team.

Give communities control over the product's configuration, so that they can:

  • Locally turn it on or off
  • Set the revert threshold
  • Audit configuration changes
  • Be notified of configuration changes
  • Customise further aspects of how the feature operates
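A minimal sketch of how the on/off switch and the revert threshold could combine into a revert decision. The class name, field names and the default values are assumptions made for illustration; the task does not specify a configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiConfig:
    """Hypothetical per-wiki settings; names and defaults are illustrative only."""
    enabled: bool = False            # communities can turn the feature on or off locally
    revert_threshold: float = 0.99   # assumed default; each community sets its own value


def should_revert(score: float, config: WikiConfig) -> bool:
    """Return True when the feature is on and the revert-risk score meets the threshold."""
    if not config.enabled:
        return False
    return score >= config.revert_threshold


example_config = WikiConfig(enabled=True, revert_threshold=0.95)
print(should_revert(0.97, example_config))  # True: score is above the threshold
print(should_revert(0.90, example_config))  # False: score is below the threshold
print(should_revert(0.97, WikiConfig()))    # False: feature is disabled by default
```

Keeping the decision in one pure function makes it straightforward to audit what a given configuration change would do before it is applied.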

Illustrative sketch

image.png (1×2 px, 487 KB)

Model

T314384: Develop a ML-based service to predict reverts on Wikipedia(s) (Meta, model card).

There are two versions of the model: language-agnostic and multilingual. The multilingual model is more accurate, but only available in 48 languages, as documented in the model card.

The model is currently trained only on namespace 0 (article/main space). It has been trained on Wikipedia and Wikidata, but could be trained on other projects.

Existing volunteer-maintained solutions

Project              Bot
en.wiki              ClueBot NG
es.wiki              SeroBOT
fr.wiki & pt.wiki    Salebot
fa.wiki              Dexbot
bg.wiki              PSS 9
simple.wiki          ChenzwBot
ru.wiki              Рейму Хакурей
ro.wiki              PatrocleBot

Relevant community discussions

Key Results

  • Automoderator has a baseline accuracy of 90%
  • Moderator editing activity increases by 10% in non-patrolling workflows
  • Automoderator is enabled on two Wikimedia projects by the end of FY23/24.
  • 5% of patrollers engage with Automoderator tools and processes on projects where it is enabled.
  • 90% of false positive reports receive a response or action from another editor.

See more details in our measurement plan.

Open questions

  • Can we improve the experience of receiving a false positive revert, as compared to the volunteer-maintained bots?
  • How can we quantify the impact this product will have? Positive (reducing moderator burdens) and negative (false positives for new users)

Event Timeline

Thanks @Samwalton9 for bringing up this idea! Having a dedicated score would help tremendously, even if the bot runs are not unified.

> An externally-hosted tool using a more standard bot account

Note that, at least for ro.wp, reverts are done without the bot flag so they are visible to human patrollers. IMO, that has 2 advantages:

  1. wrong reverts are easy to spot
  2. users being auto-reverted get on the radar of human reverters

> How can we quantify the impact this product will have? Positive (reducing moderator burdens) and negative (false positives for new users)

I can provide some numbers from the past year for ro.wp: https://ro.wikipedia.org/wiki/Utilizator:PatrocleBot/Rapoarte/csv

  • data - date
  • anulat - reverted
  • total - all non-autopatrolled edits in that interval (these are the edits targeted by the bot)
  • patrulat - patrolled by bot
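As a rough sketch of how these columns could be turned into impact numbers, the snippet below sums the counts and reports the share of targeted edits that were reverted or patrolled. The sample rows are made up for illustration and are not PatrocleBot data; the comma delimiter and exact header names are also assumptions about the linked report.

```python
import csv
import io

# Made-up rows in the column layout described above (not real report data).
SAMPLE_CSV = """data,anulat,total,patrulat
2023-01-01,5,100,40
2023-01-02,3,80,30
"""


def summarise(csv_text):
    """Return the share of targeted edits that the bot reverted and patrolled."""
    reverted = total = patrolled = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        reverted += int(row["anulat"])     # edits reverted by the bot
        total += int(row["total"])         # all non-autopatrolled edits in the interval
        patrolled += int(row["patrulat"])  # edits patrolled by the bot
    if total == 0:
        return {"revert_share": 0.0, "patrol_share": 0.0}
    return {"revert_share": reverted / total, "patrol_share": patrolled / total}


summary = summarise(SAMPLE_CSV)
print(f"reverted: {summary['revert_share']:.1%}, patrolled: {summary['patrol_share']:.1%}")
```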

> Thanks @Samwalton9 for bringing up this idea! Having a dedicated score would help tremendously, even if the bot runs are not unified.

>> An externally-hosted tool using a more standard bot account

> Note that at least for ro.wp reverts are done without the bot flag so they are visible to human patrollers. IMO, that has 2 advantages:

>   1. wrong reverts are easy to spot
>   2. users being auto-reverted get on the radar of human reverters

Thanks for highlighting this! I planned to reach out to you and other bot developers to learn about things like this :)

Just so I understand - are patrollers typically hiding bot edits?

I wonder if there are communities where this isn't the norm and they would want a revert bot to use the bot flag.

>> How can we quantify the impact this product will have? Positive (reducing moderator burdens) and negative (false positives for new users)

> I can provide some numbers from the past year for ro.wp: https://ro.wikipedia.org/wiki/Utilizator:PatrocleBot/Rapoarte/csv

>   • data - date
>   • anulat - reverted
>   • total - all non-autopatrolled edits in that interval (these are the edits targeted by the bot)
>   • patrulat - patrolled by bot

Very helpful - so does the bot also patrol edits? What criteria does it use?

> Just so I understand - are patrollers typically hiding bot edits?

I do believe this is the default in the recent changes. It might be that some are displaying bot edits, but this is not the norm.

> Very helpful - so does the bot also patrol edits? What criteria does it use?

Damaging score below the 'very likely good' limit and the goodfaith score above the 'very likely good faith' limit (intervals are available at [[Special:ORESModels]] and are wiki-specific). The double score check seems to be very powerful: I have found bad reverts, but no bad patrols so far.
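The double score check can be sketched as below. The two limit values are placeholders chosen for illustration; the real intervals are wiki-specific and published at Special:ORESModels, and the function and parameter names are hypothetical rather than taken from the bot's code.

```python
def should_patrol(damaging, goodfaith, damaging_good_limit, goodfaith_good_limit):
    """Patrol only when both ORES scores agree that the edit is fine.

    damaging_good_limit:  upper bound of the 'very likely good' damaging interval
    goodfaith_good_limit: lower bound of the 'very likely good faith' interval
    """
    return damaging < damaging_good_limit and goodfaith > goodfaith_good_limit


# Placeholder limits for illustration only; look up the real ones per wiki.
DAMAGING_GOOD_LIMIT = 0.15
GOODFAITH_GOOD_LIMIT = 0.85

print(should_patrol(0.05, 0.95, DAMAGING_GOOD_LIMIT, GOODFAITH_GOOD_LIMIT))  # True
print(should_patrol(0.05, 0.60, DAMAGING_GOOD_LIMIT, GOODFAITH_GOOD_LIMIT))  # False
print(should_patrol(0.40, 0.95, DAMAGING_GOOD_LIMIT, GOODFAITH_GOOD_LIMIT))  # False
```

Requiring both conditions means a single favourable score is never enough, which fits the observation that the check has produced no bad patrols so far.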

Samwalton9-WMF updated the task description.
Samwalton9-WMF added a subscriber: diego.

> There are two versions of the model: language-agnostic and multilingual. The multilingual model is more accurate, but only available in 48 languages, as documented in the model card.

@diego do you think there would be any reason to allow communities to select which model to use, or is the multilingual model simply the better option?

EDIT: We discussed this on a call. The short version is that it depends, particularly on the performance of the models, which is still in flux.

After doing a comparative study on ~900 edits (unpatrolled, last edit, last 30 days on rowiki) with all 3 models (ORES, multilingual and language-agnostic), I have some data to share. I hope that by sharing my methodology other bot owners will be willing to invest the time to do the same on their wikis. To begin with the conclusion, I believe that offering the model choice and model retraining early on will be critical for the adoption of the project. Mixing the models together would be ideal (e.g. start with ORES and train the other models with false positives).

The raw data can be found in Google docs.

The assumption I started from was that ORES, being trained on local data only, will be the best fit for the habits of a community, and the question I wanted answered was: how far from it were the other 2 models? A secondary question was: what threshold should we use for the newer models?

Methodology: for each unpatrolled change that can be reverted and was found in the recent changes stream, calculate the ORES, multilingual and agnostic scores. Then calculate the differences between ORES and each of the revertrisk scores, as well as between the multilingual and agnostic models. I then looked at the data globally and manually checked a few edits where the score differences were huge. Initially, I was also planning to use the prediction field, but since this is equivalent to if(score>0.5) prediction=true, I could not really extract useful data from it.
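The difference step of this methodology could look roughly like the sketch below. The scores are made-up values, the dictionary keys are hypothetical, and the sign convention (ORES minus revertrisk) is an assumption; the original spreadsheet may compute the columns differently.

```python
from statistics import mean


def score_differences(row):
    """Pairwise differences between the three model scores for one edit."""
    return {
        "diff_ores_multilingual": row["ores"] - row["multilingual"],
        "diff_ores_agnostic": row["ores"] - row["agnostic"],
        "diff_multilingual_agnostic": row["multilingual"] - row["agnostic"],
    }


def mean_abs_difference(diffs, key):
    """Average distance between two models, ignoring the direction of the gap."""
    return mean(abs(d[key]) for d in diffs)


def prediction(score):
    """The prediction field carries no extra information beyond the score."""
    return score > 0.5


# Made-up scores for two edits, for illustration only.
rows = [
    {"ores": 0.80, "multilingual": 0.60, "agnostic": 0.50},
    {"ores": 0.30, "multilingual": 0.40, "agnostic": 0.20},
]
diffs = [score_differences(r) for r in rows]
print(mean_abs_difference(diffs, "diff_ores_multilingual"))
print(mean_abs_difference(diffs, "diff_ores_agnostic"))
```

Edits with the largest absolute differences are the natural candidates for the manual check described above.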

Here is what I make of the data:

  • The limited manual analysis I did showed different strengths in each model:
    • ORES is good at catching local problems (e.g. an IP making "net-zero" edits, i.e. pure rewordings, which I usually revert because they make the article harder to read)
    • the agnostic model is good at catching technical problems (it caught 2 good-faith edits which destroyed infoboxes)
    • the multilingual model catches borderline edits, such as someone changing information in infoboxes
  • The two revertrisk models are about the same distance from ORES, with the language-agnostic one being slightly "closer" (i.e. a histogram more concentrated near the mid-point). I doubt the difference is statistically significant. This is a bit counter-intuitive, given the description of the models.
  • Both revertrisk models return slightly lower scores than ORES, that is, they underestimate how damaging an edit is. The multilingual model is closer to ORES. This shows that language matters. You can see the difference by the bias toward the right in the histograms below.
  • Because of the different strengths, there is no alignment of the scores, that is, it's very hard to create a correspondence between ORES scores (especially a 2-dimensional score) and revertrisk scores. However, there is a clear connection between the revertrisk scores: if both are high, the chance of the change being reverted increases significantly.

Given the above, I believe the development team should consider making the model (and associated threshold) configurable from the first deployment. This will allow communities to experiment with the knobs and find the option that best fits their needs. The significant associated risk that I see is that a community might decide to enforce a more aggressive approach (that is, a lower threshold), which will generate a large number of false positives and undermine trust in the project.

Histograma coloanei diff_ores_multilingual.png (371×600 px, 13 KB)
Histograma coloanei diff_ores_agnostic.png (371×600 px, 12 KB)

Hi @Strainu, Diego here from the WMF Research team.

Thanks for this analysis, it is very valuable. I would like to better understand the raw data you shared, and I think that would also help me understand your conclusions. My main question is: Which column are you considering as ground truth (your labels)?
With that info, I could help to compute some stats.
I would be interested in knowing the precision@k for each model (i.e., the Precision-Recall curve), so we can compare their results and not the probabilities (a.k.a. "scores"), which are not necessarily linearly distributed. I'll be happy to compute those numbers; I'll just need the ground-truth column.

Thanks for all your work!

> My main question is: Which column are you considering as ground truth (your labels)?

As mentioned before, I believe that the ORES damaging model (column B), being trained on local data, will be the best fit for the habits of a community. I'm a bit reluctant to call it "ground truth", since for ORES the revert results improve significantly when both the damaging and goodfaith models are used, but I think we can nevertheless work with this simplifying assumption.

> I would be interested in knowing the precision@k for each model

I think you can get that from the model cards.

@diego since you graciously offered to dive deep into this comparison, maybe you will find it useful to also see a manual review of some diffs. I asked the patrollers at rowiki to manually review a set of changes that received a score >= 0.93 from the revertrisk algorithms, and the results can be found below (column 2 is "would revert", column 3 is "would not revert"):

Since this is a crowdsourced effort, some revisions might have confusing results (both yes and no answers). I will also retry with a higher threshold in the following days (to let changes accumulate).

Great. Having some manual labels is always valuable.
I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93. For example, this one:

$ curl https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-language-agnostic:predict -X POST -d '{"rev_id": 15206502, "lang": "ro"}' -H "Content-type: application/json" 

{"model_name":"revertrisk-language-agnostic","model_version":"2","wiki_db":"rowiki","revision_id":15206502,"output":{"prediction":false,"probabilities":{"true":0.09646665304899216,"false":0.9035333469510078}}}
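For anyone who wants to repeat this check from a script, here is a sketch of the same request using only the Python standard library. The URL, payload and response shape are copied from the call above; the function names are made up for this example, and authentication or rate limits on the endpoint are not handled.

```python
import json
import urllib.request

URL = ("https://api.wikimedia.org/service/lw/inference/v1/models/"
       "revertrisk-language-agnostic:predict")


def build_request(rev_id, lang):
    """Build the same POST request as the curl call above, without sending it."""
    body = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def revert_probability(response_text):
    """Extract the probability that the revision will be reverted."""
    return json.loads(response_text)["output"]["probabilities"]["true"]


def fetch_revert_probability(rev_id, lang):
    """Send the request; needs network access, so it is not called here."""
    with urllib.request.urlopen(build_request(rev_id, lang), timeout=10) as resp:
        return revert_probability(resp.read().decode("utf-8"))


# The response shown above, reused as an offline sample.
SAMPLE_RESPONSE = (
    '{"model_name":"revertrisk-language-agnostic","model_version":"2",'
    '"wiki_db":"rowiki","revision_id":15206502,"output":{"prediction":false,'
    '"probabilities":{"true":0.09646665304899216,"false":0.9035333469510078}}}'
)
print(revert_probability(SAMPLE_RESPONSE) >= 0.93)  # False: below the 0.93 cut-off
```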

Anyhow, I've done some cleaning, and merged the datasets, and then I've computed some scores:

Model    Precision    F1-score
RR LA    0.53         0.69
RR ML    0.52         0.69
ORES     0.53         0.69

Note that, since these revisions were not randomly selected, these numbers are not representative of the models' performance, but they give an idea of the model precision on the top side of the Precision-Recall curve.
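For reference, a small sketch of how precision and F1 are computed from manual labels and model predictions. This is a generic illustration with made-up labels, not the code from the linked analysis. As a purely arithmetic observation, a precision of 0.53 combined with a recall of 1.0 gives an F1 of about 0.69, which would be consistent with a sample in which every revision was flagged by the model.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(labels, predictions):
    """labels: True if reviewers would revert; predictions: True if the model flags the edit."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # flagged and should be reverted
    fp = sum(1 for y, p in pairs if not y and p)    # flagged but fine
    fn = sum(1 for y, p in pairs if y and not p)    # missed bad edits
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up labels for five edits that were all flagged by the model.
labels = [True, True, False, True, False]
predictions = [True] * 5
print(precision_recall_f1(labels, predictions))  # (0.6, 1.0, 0.75)
print(round(f1_score(0.53, 1.0), 2))             # 0.69
```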

If you want to see the full analysis, I posted the code and results here.

> Great. Having some manual labels is always valuable.
> I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93. For example, this one:

That is very weird. I checked the code and a few of the pages identified in a newer run and did not see any mismatch. Is it possible for the score to change? I know in ORES it was possible in certain conditions.

> Anyhow, I've done some cleaning, and merged the datasets, and then I've computed some scores:

These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?

> If you want to see the full analysis, I posted the code and results here.

Thanks, I'll make some changes based on the above and see where we end up.

>> Great. Having some manual labels is always valuable.
>> I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93. For example, this one:

> That is very weird. I checked the code and a few of the pages identified in a newer run and did not see any mismatch. Is it possible for the score to change? I know in ORES it was possible in certain conditions.

I can't think of a case where this is possible, but I'll have a look.

>> Anyhow, I've done some cleaning, and merged the datasets, and then I've computed some scores:

> These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?

I've done both; you can find them in the Jupyter notebook. But in summary, the precision is very similar (almost identical) to ORES rowiki-damaging.

Thank you @diego for analyzing the data. If my understanding is correct, the precision is the metric of the model that will be visible to the community - how many of the reverts are correct, so this is what I'm looking at.

I've played around with all the parameters (threshold, treatment of draws, whether the commit was indeed reverted) on the two data sets above and the results were indeed similar. I then realised that I was collecting revisions already filtered by the reverter I run, so I changed the data source to recent changes and the results can be seen here. I also ran a pwb version where I logged in to be able to retrieve 5000 changes, with similar results.

The best I could get in all these scenarios is ~0.7 precision for the LA model. For me, this number looks a bit low to push the model on the communities without the ability to experiment with the alternatives. Therefore, in the context of this task, my suggestions are:

  • allow the communities to configure the model between LA, ML and ORES wherever possible
  • allow the communities to choose a custom threshold score for the chosen model
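To illustrate how a community might pick a custom threshold, the sketch below measures precision at several candidate thresholds on a labelled sample and returns the lowest one that reaches a target. The scores and labels are made up, and the function names are hypothetical; the same routine could be run once per model (LA, ML, ORES) to compare them.

```python
def precision_at_threshold(scored, threshold):
    """Share of edits scoring at or above the threshold that reviewers would revert.

    scored: list of (score, would_revert) pairs. Returns None if nothing is flagged.
    """
    flagged = [would_revert for score, would_revert in scored if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def lowest_threshold_meeting(scored, target, candidates):
    """Lowest candidate threshold whose precision reaches the target, else None."""
    for threshold in sorted(candidates):
        precision = precision_at_threshold(scored, threshold)
        if precision is not None and precision >= target:
            return threshold
    return None


# Made-up (score, would_revert) pairs, for illustration only.
sample = [(0.99, True), (0.97, True), (0.95, False),
          (0.90, True), (0.80, False), (0.60, False)]

for t in (0.60, 0.80, 0.90, 0.95, 0.97):
    print(t, precision_at_threshold(sample, t))
print(lowest_threshold_meeting(sample, 0.7, [0.60, 0.80, 0.90, 0.95, 0.97]))  # 0.9
```

On small samples precision does not necessarily rise with the threshold (here it dips at 0.95), so it is worth inspecting the whole sweep rather than only the returned value.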
Samwalton9-WMF renamed this task from Enable communities to configure automated prevention or reversion of bad edits to Enable communities to configure automated reversion of bad edits. Wed, Apr 3, 10:57 AM