
Improve the plagiarism detection bot
Closed, ResolvedPublic

Description

This card tracks a top 10 wish from the Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

Original proposal: Currently we have a bot that analyses "all" new edits to en WP for copyright concerns. The output is here: https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc And there is the potential for it to work in a number of other languages. The problem is that it is not up as reliably as it should be. Also, presentation of the concerns could be improved. Would love to see the output turned into an extension and formatted similarly to en:wp:Special:NewPagesFeed. Currently the output is sortable by WikiProject. It would be nice to create WikiProject-specific modules to go on individual project pages. -- Doc James (talk · contribs · email) 03:45, 4 November 2015 (UTC)

Community Tech preliminary assessment:

Support: Very high. Lots of support on the proposal, with several people specifically asking for a tool that can be used on multiple projects and in multiple languages.

Impact: Medium to High, depending on what we're able to do. Integrating the human-checked false positive/true positive data into EranBot's existing database and improving the API could be particularly useful for research and machine learning projects, potentially improving the bot’s true positive rate and requiring less human involvement. The ability to adapt this for multiple projects and languages would be especially helpful.

Feasibility: There's an existing tool on English Wikipedia - EranBot, aka Plagiabot, based on the Turnitin database - and Community Tech has done some work to make the results more broadly useful in the last couple of months, including displaying the tool's results alongside Copyvios Detector's reports. (There are more details on ticket T110144.) Turning EranBot into an extension would be considerably more difficult than making improvements to the bot.

Risk: Medium, higher for more involved work. We'll need considerable discussion on the scope and definition.

Status: We're confident that we can do some helpful work on this wish this year. We need more investigation and discussion to figure out a clear scope of work. We'll be able to focus on it more in a few months.

Project page: https://meta.wikimedia.org/wiki/Community_Tech/Improve_the_plagiarism_detection_bot


Event Timeline


We talked back at Wikimania about the possibility of moving the interface for this tool from on-wiki to Tool Labs (and also giving it a database back-end in the process). This would remove the need for users to install custom JavaScript and would also make it easier to build a UI for doing things like filtering and searching. @eranroz, @Doc_James, @Ragesoss: Does that still sound like a good idea? Are there any issues that would make that unfeasible?

@kaldari There is already a database backend, with an API for fetching data (although it needs some improvements). The API does not include a write component for recording which hits are false positives, which are true positives, and which are still unchecked. Here's the basic API: http://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs

Some specific suggestions for improving the API:

Agree we need to figure out how to get data from human follow-up entered on Wikipedia into the database.

When creating project-specific boxes of concern, many edits of concern will fall within a number of WikiProjects. If one WikiProject deals with an edit, we need that info to feed back to the database and then out to all the project boxes.

@Ragesoss: Right you are. My memories from Wikimania are a bit fuzzy now :)

Eran added that database after Wikimania.

Any thoughts on moving the interface to Tool Labs? In theory, we could implement it as a special page, but that would be a bigger project and also a bit awkward, as MediaWiki software is generally supposed to be service-agnostic (at least for services that aren't part of MediaWiki itself), and this would depend on a specific API service running on Tool Labs via iThenticate/Turnitin. Personally, I think it would make the most sense to host the entire thing on Tool Labs, but perhaps also have a simple report bot that posts small reports for WikiProjects which link back to the main interface on Tool Labs (possibly pre-filtered for the WikiProject).

I was hoping to have the interfaces built into each Wikiproject. We want to have it in the places people most often go. We still need to bring on board more volunteers to follow up.

How https://en.wikipedia.org/wiki/Special:NewPagesFeed works is excellent IMO.

Re: the API:

In Python, you can use ast.literal_eval to turn it back into a dictionary. Still not a great solution--I'd love to have a proper JSON API, and preferably to store the report data in a more structured form so you don't need to parse wikitext to get the values out.
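For illustration, here is a minimal Python sketch of that workaround, assuming the endpoint still returns a Python-literal string rather than strict JSON (the requests dependency and helper name are mine, not part of the bot):

```python
# Sketch: fetch suspected diffs from the Plagiabot API and parse the
# Python-literal response with ast.literal_eval (json.loads would choke
# on the single-quoted keys).
import ast
import requests

API_URL = "http://tools.wmflabs.org/eranbot/plagiabot/api.py"

def get_suspected_diffs():
    resp = requests.get(API_URL, params={"action": "suspected_diffs"})
    resp.raise_for_status()
    return ast.literal_eval(resp.text)

if __name__ == "__main__":
    for report in get_suspected_diffs()[:5]:
        print(report)
```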

@Niharikakohli and I are discussing this at the #DevSummit16. Questions:

  • Who has the most expertise and who should Community Tech talk with?
  • What are the desired outcomes? Currently noted:
    • A way to get the false/true positive checks into the DB
    • A way to get turnitin to check a revision on demand
    • UI/UX improvements (extensionise?)
    • WikiProject integration (how? This is a fairly enwiki-specific use case--how broadly is it requested?)
    • API improvements
  • What is the minimum viable product for this? What minimal changes will improve the flow/experience?
IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.

I'm going to participate in Wikimedia Hackathon 2016 and would like to work on the copy&paste detection bot.

  1. If there are other developers who are interested in working on this, please let me know so we can plan which sub-tasks to work on here.
  2. As the main developer of the current bot, I will be happy to give an intro to the current implementation to anyone who is interested.

@eranroz Excellent, @Fhocutt and @NiharikaKohli will be at the Hackathon too. We should talk before then...

Re the query from @Fhocutt on WikiProject integration, it comes down to the question of alerting the relevant concerned humans to an edit that needs human attention. The viable options may include:

  1. a bot remark (with difflinks) on the affected article's talkpage, perhaps with an associated hidden maintenance category
  2. a structured bot-post to a centralized noticeboard (similar to WP:ANI) served by a trusted team, substantially larger than the current team
  3. a structured bot-post to a WikiProject's central noticeboard, such as WT:WikiProject Medicine, where editors have somewhat more interest in the affected article, as cued by the article talkpage header {{WikiProject Medicine}}

So one hope, now that we have the data in a database, is that we can have, on each WikiProject, boxes of copyright concerns that relate to that WikiProject.

People can then click true or false in those boxes and the data will automatically feed back to the database and then out to all the WikiProject boxes that contain that entry, since a single article can be part of many projects.

By the way, we have a talk accepted for Wikimania on "Detecting Copyright Concerns in Near Real Time" :-)

Awesome!

@eranroz, I'd be happy to work with you on this task at the Hackathon if you're still looking for collaborators. :)

@Niharika, yes I would like to. See you at the hackathon.

@Niharika, @eranroz: If you decide to take the MediaWiki extension route, you should take a look at https://en.wikipedia.org/wiki/Special:NewPagesFeed. This page is created by the PageTriage extension. Specifically, you'll want to look at:
PageTriage/SpecialNewPagesFeed.php
PageTriage/modules/ext.pageTriage.views.list/*
PageTriage/api/ApiPageTriageList.php
Note that it uses Backbone and Underscore extensively on the client side. If you don't want to use those, you could use Mustache for the templating part (MediaWiki core has built-in support for Mustache: https://www.mediawiki.org/wiki/Manual:HTML_templates).

@kaldari, from my chat with Eran, he doesn't think it is a good idea to have a special page specifically supporting a non-official/external bot. I agree with him, because Labs is not 100% reliable. He suggests developing a Tool Labs interface for the bot instead. This should not be very difficult to achieve, given that his tool maintains a database with information for every copyright violation the bot has found and logged on https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc. Then we can add filters for WikiProjects, date-time range searches, etc.

Thoughts?

@Niharika: Sounds reasonable to me. If you want the interface to use infinite scrolling, you may want to use jQuery Waypoints (http://imakewebthings.com/waypoints/), BTW. I would be happy to help if you guys have a repo set up for it.

In the Hackathon notes I see you are considering integrating ORES into the system. I would love to hear @Halfak's thoughts on this. Do you think it would be possible to train ORES to help detect plagiarism?

DannyH renamed this task from Improve the "copy and paste detection" bot to Improve the plagiarism detection bot.Apr 9 2016, 12:02 AM

@Shizhao It's only working on English Wikipedia now... Once we get it working on Tool Labs for English WP, we'll see what needs to get done to make it available to other languages.

Updates: We've started putting together a prototype tool here: http://tools.wmflabs.org/plagiabot/

And here are some new draft wireframes. I'd love to know what you all think -- these are still developing.

Item structure:

copy patrol 1 - item structure.jpg (373×1 px, 89 KB)

Opening the compare pane:

copy patrol 2 - compare pane opens.jpg (698×1 px, 111 KB)

Clicking on a review setting:

copy patrol 3 - review selected.jpg (369×1 px, 91 KB)

Clicking on the review again, to change your mind:

copy patrol 4 - change review.jpg (369×1 px, 91 KB)

Filters, with autocomplete for WikiProjects:

copy patrol filters 1 - autocomplete.jpg (359×1 px, 44 KB)

Filters, with WikiProjects selected:

copy patrol filters 2 - wikiprojects selected.jpg (199×1 px, 31 KB)

Also, so no one interested misses it: changes on http://tools.wmflabs.org/plagiabot/ are not being logged right now, so don't be afraid to play around.

The bot provides "hints" on the source (see the /rc page): as part of the post-processing, the bot visits the possible source and checks whether the source cites Wikipedia (e.g. a "Mirror") or has a clearly defined CC license. I think the current web-based tool doesn't show this yet (at least I couldn't find it in "Meena", revision time: 2016-05-13T16:14:17Z), so this should be considered.

I really like the design (or the latest updates to the design :) ) of the Plagiabot tool, and in particular the "Edit count" next to the editor.
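As a rough illustration of the hint check Eran describes above, a post-processing pass could look something like the following; the heuristics and function name here are illustrative guesses, not the bot's actual rules:

```python
# Sketch: visit the suspected source and look for signs that it mirrors
# Wikipedia or carries a Creative Commons notice.
import requests

def source_hints(url):
    hints = []
    try:
        html = requests.get(url, timeout=10).text.lower()
    except requests.RequestException:
        return hints  # source unreachable; no hints
    if "wikipedia" in html:
        hints.append("Mirror?")  # the source mentions/cites Wikipedia
    if "creativecommons.org/licenses" in html:
        hints.append("CC?")  # the source declares a CC license
    return hints
```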

Will try to have a look over the weekend.

Besides the obvious need to keep problematic edits out of articles, this tool is also an opportunity to educate editors who make these mistakes. Many problematic edits are done in good faith. Somewhere this tool should mention the need to put notes on user talk pages. Maybe an automatic link to the user's page with a couple templates (warnings) to choose from.

[...] Somewhere this tool should mention the need to put notes on user talk pages. Maybe an automatic link to the user's page with a couple templates (warnings) to choose from.

In my opinion, the tool should allow the operator to perform all operations without leaving the page. So in addition to a button for posting a message on the user's talk page (as Lucas559 suggested in T120435#2293881), a link to undo or roll back the edits would also be useful.
Information about whether the editor has already had other reports or blocks is also very important.

I like the layout of the page; it is clear and simple. Good!

The wireframes above look good. Currently missing from the web version are the percentage of plagiarism detected and the word count for the edit. These would definitely be helpful on smaller wikis where there are fewer users to do plagiarism checks.

As you develop this tool, please consider how it can be used on multiple projects, not just Wikipedia. This would be very useful on en.wikiversity, for example.

The bot provides "hints" on the source (see the /rc page): as part of the post-processing, the bot visits the possible source and checks whether the source cites Wikipedia (e.g. a "Mirror") or has a clearly defined CC license. I think the current web-based tool doesn't show this yet (at least I couldn't find it in "Meena", revision time: 2016-05-13T16:14:17Z), so this should be considered.

I really like the design (or the latest updates to the design :) ) of the Plagiabot tool, and in particular the "Edit count" next to the editor.

Ah, that's where the hints come from! Do you think it makes more sense to store that in the database instead of both tools checking the page individually?

Besides the obvious need to keep problematic edits out of articles, this tool is also an opportunity to educate editors who make these mistakes. Many problematic edits are done in good faith. Somewhere this tool should mention the need to put notes on user talk pages. Maybe an automatic link to the user's page with a couple templates (warnings) to choose from.

That's a good idea, Lucas. I'll create a ticket for it.

a link to undo or roll back the edits would also be useful.

This has come up a few times now and I must raise my concern. In a lot of cases that I went through, the edit was either 1) already reverted or 2) there were more edits on top of it.
Unless you look at the edit immediately after it comes in, such a rollback button is not very useful.
What we can do is: have a rollback button in cases where it is possible to rollback and when you click the rollback button, we perform a check to see if it is still possible to rollback the edit (that is, nobody edited the page in the meantime) and if so, we roll it back. Sounds useful? More ideas welcome!
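A sketch of the check described above, using the standard MediaWiki query API to confirm that the flagged revision is still the page's latest before offering a rollback (the function name and parameters are illustrative, not an existing helper):

```python
# Sketch: return True only if flagged_revid is still the top revision,
# i.e. nobody has edited the page since the suspected edit.
import requests

ENWIKI_API = "https://en.wikipedia.org/w/api.php"

def is_top_revision(title, flagged_revid):
    resp = requests.get(ENWIKI_API, params={
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids",
        "rvlimit": 1,
    })
    resp.raise_for_status()
    page = next(iter(resp.json()["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    return bool(revisions) and revisions[0]["revid"] == flagged_revid
```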

Information about whether the editor has already had other reports or blocks is also very important.

It should be simple to keep a count of how many times an editor's edit has been marked as a Copyvio in the tool. What kind of other blocks do you have in mind?

I like the layout of the page; it is clear and simple. Good!

Thanks!

In T120435#2294641, @Niharika wrote:
This has come up a few times now and I must raise my concern. In a lot of cases that I went through, the edit was either 1) already reverted or 2) there were more edits on top of it.
Unless you look at the edit immediately after it comes in, such a rollback button is not very useful.
What we can do is: have a rollback button in cases where it is possible to rollback and when you click the rollback button, we perform a check to see if it is still possible to rollback the edit (that is, nobody edited the page in the meantime) and if so, we roll it back. Sounds useful? More ideas welcome!

You're right, this scenario is very common. So maybe it would be useful to add these links near 'Diff':

  • 'Diff to current' (&diff=cur&oldid=nnnnnnn) to check whether the edit has already been reverted
  • 'hist' (&action=history) to check whether there are later edits
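For what it's worth, a tiny sketch of building those two links from a page title and the flagged revision id (the helper name is illustrative):

```python
# Sketch: build "diff to current" and "history" links for a flagged edit.
from urllib.parse import quote

def extra_links(title, oldid):
    base = "https://en.wikipedia.org/w/index.php?title=" + quote(title)
    return {
        # Shows whether the flagged edit has already been reverted or changed.
        "diff_to_current": f"{base}&diff=cur&oldid={oldid}",
        # Shows whether there are later edits on top of it.
        "history": f"{base}&action=history",
    }
```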

What kind of other blocks do you have in mind?

Local or global wiki blocks. I know that a block is not only for copyright violations, but it's a good indicator of whether the user is known and acting in good or bad faith.

/w/api.php?action=query&format=json&list=blocks&bkstart=(check_date_time)&bkusers=+Foo&meta=globaluserinfo&guiuser=Foo
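Roughly, that query could be issued like this; "Foo" would be replaced by the real username, and the helper name here is mine:

```python
# Sketch: one action=query call combining list=blocks (local blocks)
# with meta=globaluserinfo (global account information).
import requests

ENWIKI_API = "https://en.wikipedia.org/w/api.php"

def get_block_info(username):
    resp = requests.get(ENWIKI_API, params={
        "action": "query",
        "format": "json",
        "list": "blocks",
        "bkusers": username,
        "meta": "globaluserinfo",
        "guiuser": username,
    })
    resp.raise_for_status()
    data = resp.json()["query"]
    return {
        "local_blocks": data.get("blocks", []),
        "global_info": data.get("globaluserinfo", {}),
    }
```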

Blocking a user directly from the tool is bound to backfire on w:en and likely elsewhere. At most, the tool should create a noticeboard post to draw local admin attention to the offending editor's repeated actions. The bulk of Plagiabot report gnoming should not require admin privileges, but blocking users obviously would. A standardized action request to (e.g.) [[w:en:wp:AIN]] can bridge the gap.

Hi, I think you misunderstood the comment. We're not going to provide a way to block someone from the tool but instead we can provide them the information that the user has had X bans on Y wiki etc.

Everyone, https://tools.wmflabs.org/copypatrol/ is ready for you to look at and use. We're still doing some feature development but all of the basic things are in place. You can start reviewing edits after you log in. Keen to hear your feedback!

For now it is only usable on English Wikipedia, right? Will it also be activated for other wikis in the future?
Is the interface also expected to be translated?

We use iThenticate to detect plagiarism on English Wikipedia. Its database is rich in English content but not so much in other languages. Eran tested running the tool on Hebrew Wikipedia, but the results were not successful.
In the future, if this tool proves to be useful, we would like to try using the Google API service to see if it works better for other languages.

They have a fair bit of coverage in other languages. We could definitely look at trying to get this running in other languages. Coverage was weak in Hebrew, but I am sure it is good in French, German, and Spanish.

One strange question: where is the source code for the CopyPatrol tool, so that people can contribute? There is no link on the tool page, nor on the discussion page, nor here.

@Ladsgroup It's currently at https://github.com/Niharika29/PlagiabotWeb. Contributions welcome :) I'll look into adding a "View source" link to the interface.