
Evaluate and improve the Revert Risk model for Wikidata.
Closed, ResolvedPublic

Description

We have developed a model that evaluates revisions on Wikidata according to their likelihood of being reverted (T333892). The model is available for testing on this PAWS notebook.

For reporting bugs on the Annotation tool, please use ticket T344016.

In this task we report on the model evaluation and improvements. To do this we need to:

  • Release an annotation tool for evaluating the model. (Alpha version can be found here)
  • Launch a labeling campaign
  • Review results and launch a second campaign
  • Expand the model to different edit types (the current version only works for claims and descriptions)
  • If needed (based on the evaluation), improve model accuracy

The model has been improved. Follow the model deployment here: T363718
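
For reference, once deployed, a model like this can be queried over HTTP on LiftWing. A minimal sketch, assuming a revertrisk-wikidata model name and a response shape mirroring the existing revert-risk endpoints (both are assumptions until T363718 lands):

```
import requests

# Hypothetical endpoint: the exact model name and route are defined in the
# deployment task (T363718); "revertrisk-wikidata" is an assumption here.
URL = "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict"

def score_revision(rev_id: int) -> float:
    """Return the predicted probability that a Wikidata revision gets reverted."""
    resp = requests.post(URL, json={"rev_id": rev_id, "lang": "wikidata"})
    resp.raise_for_status()
    # Assumed response shape, mirroring other revert-risk model endpoints.
    return resp.json()["output"]["probabilities"]["true"]

print(score_revision(123456789))
```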

Details

Due Date
Jun 27 2024, 10:00 PM

Event Timeline

leila triaged this task as High priority.
diego renamed this task from "Explore alternatives for Revert Risk model improvements for Wikidata" to ".Evaluate and improve the Revert Risk model for Wikidata.". (Jul 21 2023, 9:53 AM)
diego updated the task description.
diego added a subscriber: Lydia_Pintscher.

Weekly Updates

  • We have a new version of Annotool with the following new features:
    • User authentication, through MediaWiki, has been added.
    • Labeling instructions per project.
    • Prioritization of the data to be labeled: we only show revisions that have received no label (a minimal sketch of this filter follows after this update).

@MunizaA is going to deploy the updated version no later than next Monday.
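
For illustration, the "unlabeled first" rule can be a single anti-join over the label table. A minimal sketch, assuming a SQLite backend with hypothetical revisions/labels tables (Annotool's actual storage may differ):

```
import sqlite3

# Hypothetical schema: revisions(rev_id, ...) and labels(rev_id, label, user).
conn = sqlite3.connect("annotool.db")

def next_unlabeled(limit: int = 20):
    """Serve only revisions that have received no label yet."""
    return conn.execute(
        """
        SELECT r.rev_id
        FROM revisions r
        LEFT JOIN labels l ON l.rev_id = r.rev_id
        WHERE l.rev_id IS NULL
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```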

diego renamed this task from ".Evaluate and improve the Revert Risk model for Wikidata." to "Evaluate and improve the Revert Risk model for Wikidata.". (Aug 4 2023, 10:16 PM)
diego added a subscriber: elukey.

Weekly Updates

Clicking on links in the rendered revision diff information section takes you to a 404. This is important because the items' descriptions aren't shown, so you can't tell the difference between two items with the same name.

Edits to the various sandboxes should be excluded.

Did you change the order of the buttons? I swear Keep was on the left before. Are you intentionally shuffling them?

It's unclear to me what data the model trains on when, looking at item A, claim X is removed and claim Y is added.

Is it label + description of A + labels of X + Q-ID of X + labels of Y + Q-ID of Y?

If that's the case, why is so little information looked at to make the decision? If the model is actually looking at more information, why is that additional information not displayed in the annotation tool?

Sometimes when I click a button to make a decision there is no change to indicate anything has happened. Then I press the button again and nothing happens. And then a little while later the button changes to "Marked as". You should probably disable the button while a request is pending: looking at the network tab, it seems you're sending a bunch of duplicate requests because of this.

If there is no English label, then there is no link to the item.

At the upper right corner there is a wiki selection drop-down. If I try to select a specific language to work in, it returns "nothing more to load". I can only work on items with "All" selected.

Hi all, here Diego from WMF Research.

Thanks for all the feedback, we are taking notes on the bugs and improvements you are suggesting.

Let me answer some of the questions:

Did you change the order of the buttons? I swear Keep was on the left before. Are you intentionally shuffling them?

No, we haven't changed them, but they are in different orders in the two datasets. We will keep a single order for future datasets.

If that's the case, why is so little information looked at to make the decision? If the model is actually looking at more information, why is that additional information not displayed in the annotation tool?

The model considers metadata as well as the information in the diff. You can see the training code here.
In any case, the idea is that you provide your own assessment of whether the revision should be reverted or not.
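
As a rough illustration of that feature mix (the authoritative list is in the linked training code; every field name below is an assumption):

```
# Illustrative only: the real feature set lives in the linked training code.
def extract_features(rev: dict) -> dict:
    return {
        # Metadata about the editor and the revision
        "user_is_anonymous": rev["user"]["id"] is None,
        "user_edit_count": rev["user"].get("edit_count", 0),
        "user_account_age_s": rev["user"].get("account_age", 0),
        "comment_length": len(rev.get("comment", "")),
        # Information extracted from the diff itself
        "claims_added": len(rev["diff"].get("claims_added", [])),
        "claims_removed": len(rev["diff"].get("claims_removed", [])),
        "descriptions_changed": len(rev["diff"].get("descriptions_changed", [])),
    }
```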

For reporting errors/bugs/improvements on the Annotation Tool, please use task T344016. For discussing the model, please keep commenting here.

Weekly Updates

  • We are collecting labels to evaluate and retrain the model.
  • We have already collected 100+ labels.
  • We will keep collecting labels during the next 2 weeks.

At the upper right corner there is a wiki selection drop-down. If I try to select a specific language to work in, it returns "nothing more to load". I can only work on items with "All" selected.

I'm hitting this as well. It's problematic since I only know English, so I'm skipping pretty much all the non-English entries.

For reporting errors/bugs/improvements on the Annotation Tool, please use task T344016. For discussing the model, please keep commenting here.

Ah, apologies, the email on the mailing list linked to this ticket :)

Hi,

Some thoughts whilst I was playing:

  • Some items consist only of vandalism (where neither the Revert nor the Keep state is desired). Will the 'Other' or 'Not Sure' buckets put them in a place for review?
  • As stated earlier, the wiki language dropdown (all, en, de, ar): only 'all' works for me. Choosing any other language gives a 'NOTHING MORE TO LOAD' string.
  • You should be able to undo a change, if you realise your 'Keep' should actually be 'Reverted' or vice versa.
  • If I come back to the tool and am logged in, I am presented with the same items that I have already interacted with.
  • For non-EN languages, I cannot filter them out, so I have to scroll a long, long way down to reach new EN items I have not previously interacted with. Remembering and excluding items that I have already 'decided' on would be helpful.

Hi @Danny_Benjafield_WMDE, thanks for the feedback!

If I come back to the tool and am logged in, I am presented with the same items that I have already interacted with.

Do these items also include revisions you've already labeled? Ideally, you shouldn't see any revisions that you've annotated once you reload, but revisions previously shown to you that you haven't marked would still show up.

Weekly Updates

  • We have been receiving and addressing feedback for the Annotool.
  • We are collecting data.

Hello,

This looks like a great project, and I was wondering if anyone involved with it would like to speak on it at an LD4 Wikidata Affinity Group Talk. We also give our presenters a lot of freedom in choosing topics, and if you had another Wikidata related topic we’d be happy to hear about that as well. We currently have openings on the following Tuesdays, at 9am PT / 12pm ET / 16:00 UTC / 6pm CEST.

3 October 2023
17 October 2023
31 October 2023
14 November 2023
28 November 2023

Talks are about an hour, usually with a 30-45 minute presentation by the speaker(s), followed by an audience Q&A and done via Zoom. You can see recordings of past speakers by following the links to agendas on our project page. And I am happy to answer any questions you have, of course. I wasn’t entirely certain who to contact about this project, and hope it's all right to post this here. Feel free to reply via email if that is preferred. Thank you!

Sincerely,
Eric Willey
emwille@ilstu.edu

Hi @emwille! Sure I'll be happy to present, I'll send you an email to coordinate. Thanks!

Weekly Updates

  • We have collected more than 3K labels to evaluate the model.
  • Several improvements to Annotool have been deployed, see T344016 (great work @MunizaA!)
  • We will be collecting labels for one more week, and then start the evaluation and model improvements.

Weekly Updates

  • We have finalized the first round of labeling.
  • Now I'll start evaluating the model and sharing the labeled data.
  • I'm going to present the results to our collaborators from WMDE and discuss next steps.

Weekly Updates

  • We have analyzed the collected labels:
    • Due to an issue in the collection process, we collected several labels for each revision. Surprisingly, we found that for more than 12% of the revisions there was no agreement on the labels (no class obtained more than 50% of the "votes"), and for ~30% of the revisions there was more than one label per revision (a sketch of this computation follows after this list).
    • These results were unexpected and led us to redesign the label collection methodology. We are coordinating with @Lydia_Pintscher to launch a new labeling task in the following days.
  • @MunizaA is working on modifying Annotool to improve the data collection process, adding a randomization step to increase revision coverage.
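
For reference, the two percentages above can be recomputed from the raw (revision, label) pairs. A minimal sketch (the export format is an assumption):

```
from collections import Counter, defaultdict

def agreement_stats(rows):
    """rows: iterable of (rev_id, label) pairs from the campaign export."""
    by_rev = defaultdict(list)
    for rev_id, label in rows:
        by_rev[rev_id].append(label)

    n = len(by_rev)
    multi = [r for r, labels in by_rev.items() if len(labels) > 1]
    # "No agreement": no class obtained more than 50% of the votes.
    no_majority = [
        r for r in multi
        if max(Counter(by_rev[r]).values()) * 2 <= len(by_rev[r])
    ]
    # Fractions of revisions with >1 label and with no majority label.
    return len(multi) / n, len(no_majority) / n
```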

Would it be difficult to publicly release the data collected so far?

Hi @BrokenSegue

Sure, the data is public (we just removed the labeler usernames). As I mentioned in the previous comment, the amount of data is pretty low, but you can find it here.

Weekly Updates

  • @MunizaA has modified Annotool as planned.
  • To overcome the inconsistencies in the labels (discussed in the previous update), we have defined a new methodology for data annotation (label collection) that considers annotators' tenure (edit count and account creation time) in order to weight the results (a sketch of the weighting idea follows below).
  • Now I've collected a new dataset, and we plan to release a new labeling campaign in the following days.
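
A minimal sketch of the tenure-weighting idea; the thresholds and weights below are placeholders, not the methodology we settled on:

```
from collections import defaultdict

def weighted_label(annotations):
    """annotations: list of (label, edit_count, account_age_days) tuples."""
    scores = defaultdict(float)
    for label, edit_count, age_days in annotations:
        # Placeholder weighting: more tenured annotators count more.
        weight = 1.0
        if edit_count >= 500:
            weight += 1.0
        if age_days >= 365:
            weight += 1.0
        scores[label] += weight
    return max(scores, key=scores.get)

# Example: two newcomers vote "keep", one long-tenured editor votes "revert".
print(weighted_label([("keep", 10, 30), ("keep", 5, 10), ("revert", 5000, 2000)]))
```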

Weekly Updates

  • We are testing the new Annotool features, and reviewing the dataset for the new labeling campaign.
  • Depending on the results from the previous steps, we are going to reach out to @Lydia_Pintscher to coordinate the release of the new campaign.

Thank you for the work in Q1. This work as planned will continue into Q2 and I'll move it to the relevant column now. Good luck!

diego set Due Date to Dec 21 2023, 11:00 PM.

Weekly updates

  • I have coordinated with @Lydia_Pintscher for releasing a new labeling campaign. The campaign will be launched next week.

Updates

  • I have presented the Revert Risk model for Wikidata and the Annotool at the WikiProject LD4 Wikidata gathering.
  • We have started collecting new annotations in the second Wikidata labeling campaign. The campaign is available here. @Lydia_Pintscher is helping us find more annotators (thanks!).

Updates

  • Currently we have around 200 labels. WMDE is helping to increase this number.
  • We are preparing a new dataset for training the RR Wikidata model.

Updates

  • We have obtained 590 labels from 540 different revisions. Data is available here.
  • This is the confusion matrix:

         92    28
         56   364

Given the following scores:

              Revert Risk   ORES
Precision         0.93      0.91
F1                0.90      0.91

This implies that our results are very similar to the baseline. We (@Trokhymovych and myself) are working on an update of our model in order to improve these numbers. We expect to have a new version of the model at the end of January.
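
A quick sanity check of the scores above from the confusion matrix, assuming the usual rows-are-actual / columns-are-predicted layout with the second class as positive:

```
# Confusion matrix from the comment above: [[92, 28], [56, 364]]
tn, fp, fn, tp = 92, 28, 56, 364

precision = tp / (tp + fp)                           # 364 / 392 ≈ 0.93
recall = tp / (tp + fn)                              # 364 / 420 ≈ 0.87
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.90

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```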

Updates

  • We are preparing a new dataset (using diffs) to train the model.
  • We are experimenting with language models, such as mBERT and LaBSE, to evaluate structured (claims) edits.

Updates

  • In order to improve the interaction between structured and text data, I'm experimenting with a full PyTorch approach (sketched below).
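
A minimal sketch of that direction, assuming LaBSE embeddings of a textualized diff concatenated with numeric metadata inside a small PyTorch head (the architecture and feature choices are assumptions, not the final model):

```
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # 768-dim embeddings

class RevertRiskNet(nn.Module):
    """Joint head over text embeddings and structured metadata features."""
    def __init__(self, n_meta: int, text_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + n_meta, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, text_emb, meta):
        return self.head(torch.cat([text_emb, meta], dim=-1))

# Hypothetical textualized diffs plus metadata (edit count, is-registered).
diff_texts = ["added claim: P31 -> Q5", "removed description: 'painter'"]
emb = torch.tensor(encoder.encode(diff_texts))       # shape (2, 768)
meta = torch.tensor([[120.0, 1.0], [3.0, 0.0]])      # shape (2, 2)
logits = RevertRiskNet(n_meta=2)(emb, meta)          # revert-risk logits
```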

Updates

  • We have improved the model accuracy; currently I'm working on making the model faster, so that it can work in real time.

Updates
I have been working on the experimental model that uses a multilingual language model.
It was evaluated and compared with the ORES model on a time-based hold-out dataset of revisions from 2023.

Metrics used: AUC (main metric) and precision at a given recall level.
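
Both metrics can be computed directly from the model scores with scikit-learn; a minimal sketch with placeholder data:

```
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])               # placeholder ground truth
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9])   # placeholder model scores

auc = roc_auc_score(y_true, y_score)

# Precision at a fixed recall level, e.g. best precision with recall >= 0.75.
precision, recall, _ = precision_recall_curve(y_true, y_score)
p_at_r = precision[recall >= 0.75].max()
print(f"AUC={auc:.2f}, precision@recall>=0.75={p_at_r:.2f}")
```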

  • Initial results for all revisions:

[attached image: results table for all revisions]

  • Initial results per language (AUC):

[attached image: per-language AUC results]

@diego: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

diego changed Due Date from Dec 21 2023, 11:00 PM to Jun 27 2024, 10:00 PM. (Apr 17 2024, 3:43 PM)
diego updated the task description.

This task has been resolved, please follow the model deployment here: T363718