No updates this week.
- The Moderation Tools team is running tests and community discussions to implement the Automoderator project; we are coordinating with them to learn about potential areas of improvement for RR.
- There have been some community initiatives to evaluate the quality of the RR models. In T336934 a group of rowiki editors manually labeled a set of risky revisions. We have analyzed these results, which show reasonably good performance.
- The ML team is working on integrating RRLA into the recent changes feed (T348298). We are working on defining the best thresholds for this integration (T351897); a threshold-sweep sketch follows below this list.
- We have been working with Wikimedia Enterprise to clarify some questions about the RRLA model (T346095).
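To give a sense of what "defining the best thresholds" looks like in practice, here is a minimal sketch (my own function and variable names; it assumes a labeled sample with binary revert labels and the raw RRLA scores):

```python
# Minimal threshold sweep: report precision/recall at candidate cutoffs.
# y_true and scores are assumptions: ground-truth revert labels and RRLA scores.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(y_true, scores, thresholds=np.arange(0.50, 1.00, 0.05)):
    for t in thresholds:
        y_pred = scores >= t
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred, zero_division=0)
        print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```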
Fri, Nov 24
Thu, Nov 23
I can't think of a case where this is possible, but I'll have a look.
Anyhow, I've done some cleaning, merged the datasets, and computed some scores:
These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?
I've done both; you can find them in the Jupyter notebook. But in summary, the precision is very similar (almost identical) to ORES rowiki-damaging.
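For context, the score-based analysis amounts to something like the sketch below (assumed variable names, not the notebook code itself): instead of using the model's binary prediction, pick the operating threshold on the raw score that meets a target precision.

```python
# Find the smallest score threshold whose precision meets a target level.
# y_true (ground-truth revert labels) and scores (raw RR scores) are assumptions.
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, scores, target_precision=0.90):
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last point
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= target_precision:
            return t, p, r  # threshold, plus precision and recall at that cutoff
    return None  # the model never reaches the target precision
```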
Tue, Nov 21
- Older revisions.
Concerns around our understanding of the model's limitations for older edits, given the training window. If a user is looking at taking a full snapshot of either our current corpus of Wikipedia or a past version, both include revisions from a broader window of time than the training window, and may contain “latent” bad revisions that either score differently with the LiftWing model or go uncaught.
I am curious what you would recommend for evaluating older content that could have been vandalized without us knowing, due to a lack of revisions/content attention by editors.
I'm not completely sure I'm understanding your question. What I can say is that any model has some time drift; that includes RR and ORES. I think the model's precision would decay if we used it on very old data, but it probably tends to a certain limit (I would assume the same is true for ORES, and that model is probably already working close to its boundaries). The Language Agnostic model shouldn't be difficult to run on a large old dataset. I understand that @fkaelin and @Pablo have been working on running the model on large data, so if you have a specific question to be answered, the four of us could try to design an experiment to answer it; a rough sketch of what that could look like is below.
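To be concrete, the drift experiment I have in mind could look roughly like this (a sketch only; the column names 'rev_timestamp', 'rr_score' and 'was_reverted' are assumptions about how the labeled data would be organized):

```python
# Bucket labeled revisions by year and track precision per bucket, to see
# whether the model degrades on older edits. 0.93 is just an example cutoff.
import pandas as pd

def precision_by_year(df: pd.DataFrame, threshold: float = 0.93) -> pd.Series:
    df = df.copy()
    df["year"] = pd.to_datetime(df["rev_timestamp"]).dt.year
    df["pred_revert"] = df["rr_score"] >= threshold
    return df.groupby("year").apply(
        lambda g: (g["pred_revert"] & g["was_reverted"]).sum()
        / max(g["pred_revert"].sum(), 1)  # avoid division by zero
    )
```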
- Performance on different types of pages. You already addressed this in part, but what I mean by different page types isn’t necessarily subject-related (though cross-language data is helpful as well) but instead based on the metadata of the page.
How does the model typically perform on revisions in pages with low/high pageviews, low/high amounts of content, more/fewer edits, etc.? This is less critical for our use case, but we are imagining cases where a user may want to create their own filtering system based on their tolerance for risk, and may want an approach that segments articles based on metadata.
Let us know if there are potential low-risk exercises we can collaborate on to subset the data.
I don't have such statistics; maybe the Knowledge Integrity Observatory has some data to answer this (@Pablo?).
- What to know that we do not know.
This is what I was trying to pull at with the question on use. If ORES has fallen out of style among some users and/or grown in use with others, why? If we can understand points of friction with use (usability? performance? a different approach needed?) it will help us integrate learnings as we design similar features (credibility signals,
I don't think there is a clear pattern here. I think the adoption/attrition of these tools is opportunistic, in the sense that people use them according to their needs. With no other options, people would use whatever is available. And even with more tools available, developers would use what fits better in their workflows, or even what they have seen working in the past. Unless there is a dramatic difference between models' accuracy, I don't think that differences in model quality are easy for developers to assess.
Probably the attrition is related to the (lack of) success of the tools created using ML models, and not directly to the model itself (although model quality probably has an impact on a tool's success).
Great. Having some manual labels is always valuable.
I have done a quick check and I've seen there are a few cases where the RR scores are not higher than 0.93. For example, this one:
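(For anyone who wants to reproduce these spot checks: this is roughly how a single score can be pulled from LiftWing. The rev_id below is a placeholder, not the case above; the response layout follows the public API docs, and heavy use may require an API token.)

```python
# Sketch: fetch one RRLA score from the public LiftWing inference endpoint.
import requests

LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/revertrisk-language-agnostic:predict"
)

def get_rr_score(lang: str, rev_id: int) -> float:
    resp = requests.post(LIFTWING_URL, json={"lang": lang, "rev_id": rev_id})
    resp.raise_for_status()
    return resp.json()["output"]["probabilities"]["true"]  # P(revert)

print(get_rr_score("ro", 12345))  # placeholder revision id
```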
Tue, Nov 14
Hi @Strainu, Diego here from the WMF Research team.
Mon, Nov 13
We would also include @cwylo!
Mon, Nov 6
Hi, Diego here from Research. I'll take some of the questions you raised:
Oct 31 2023
@MunizaA got it. Would this mean creating another project?
Oct 30 2023
Oct 27 2023
Oct 26 2023
Oct 25 2023
This feature is working correctly. From my perspective this task can be marked as resolved.
Oct 24 2023
Looks good. Thanks.
Oct 23 2023
Oct 17 2023
Hi @DDeSouza, please add the following papers:
Oct 11 2023
Oct 10 2023
Hi! The standard archival process works well. Thanks!
Oct 6 2023
Oct 5 2023
Sep 27 2023
Let's use the updated csv for now. Later, let's coordinate with @fkaelin to periodically update these values, for both the RRLA and article quality models.
Sep 15 2023
Sep 12 2023
@MunizaA, could we please add an action to finish a project? By finish I mean keeping the project data, but no longer showing it in the front-end.
Sure, the data is public (we just remove the labeler username). As I mentioned in the previous comment, the amount of data is pretty low, but you can find it here.
Sep 8 2023
Sep 1 2023
Aug 25 2023
Aug 22 2023
Revert Risk Language Agnostic should do the job; the inference time is below 200 ms.
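If you want to sanity-check that number yourself, here is a rough sketch against the public LiftWing endpoint (placeholder rev_id; the round trip includes network overhead, so it over-estimates pure inference time):

```python
# Rough round-trip latency check for the RRLA endpoint.
import time
import requests

url = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/revertrisk-language-agnostic:predict"
)
start = time.perf_counter()
requests.post(url, json={"lang": "ro", "rev_id": 12345})  # placeholder rev_id
print(f"{(time.perf_counter() - start) * 1000:.0f} ms round trip")
```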
Aug 21 2023
Hi @emwille! Sure, I'll be happy to present. I'll send you an email to coordinate. Thanks!
Aug 15 2023
Thanks for the input. To make the annotations as useful as possible for training the algorithm, it would be good to have labels that are not too specific and can generalize as much as possible. Do you think something like "long-term vandalism" or "hijacked item" could be a good name for the phenomenon you are describing?
@BTullis, yes, this has been solved in the current environment, thanks!
Aug 11 2023
Would you like me to move this in bulk to a new directory within your home, such as: /home/dsaez/paramd-archive
This sounds good, and it's enough!
Aug 10 2023
Hi all, Diego here from WMF Research.
Aug 4 2023
**Weekly Updates**
Also, the experimental model is available through the Knowledge Integrity package.
And if you want to help with the evaluation, please go to this site: https://annotool.toolforge.org/ and help us annotate data :)
Jul 28 2023
Jul 21 2023
Jul 13 2023
Jul 11 2023
I have access to most of the data; I can wait a couple of weeks to get the full dump.
Hi @MoritzMuehlenhoff,
Yes, please. I'll need a copy of all the data, both on the stat machines and HDFS.