
Investigation: Improve copy and paste detection bot
Closed, Resolved · Public · 5 Story Points

Description

Investigation card for a Top 10 wish.

Current bot is by Eran, output: https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc

There's good discussion in T120435: Improve the plagiarism detection bot

People to talk to: Eran, Yuvi, Sage.

Some questions to answer:

  • This tool needs a better interface. What would work better: an interface on Tool Labs or on a special page (via an extension)?
  • What issues are involved in making the interface work for multiple wikis in multiple languages?

Event Timeline

DannyH created this task. · Dec 18 2015, 6:33 PM
DannyH raised the priority of this task from to Medium.
DannyH updated the task description.
DannyH moved this task to To be estimated/discussed on the Community-Tech board.
DannyH added a subscriber: DannyH.
Restricted Application added a subscriber: Aklapper. · Dec 18 2015, 6:33 PM
kaldari updated the task description. · Mar 22 2016, 5:54 PM
DannyH set the point value for this task to 5.
Niharika claimed this task. · Mar 24 2016, 9:02 AM
Niharika moved this task from Ready to In Development on the Community-Tech-Sprint board.

Some questions to answer:

  • This tool needs a better interface. What would work better: an interface on Tool Labs or on a special page (via an extension)?

After some discussions with Eran at the Wikimedia Hackathon, we've decided to go with a Tool Labs interface for the bot.
Pros:

  • Easier to make an interesting/engaging interface
  • Gives us the freedom to use more libraries and go through a less intensive code-review process
  • Easier to set up and launch than an extension
  • Allows us to experiment with the interface/possible gamification

Cons:

  • Does not give us access to the Echo notification system
  • Might lead to less usage of the tool if it's outside the wiki

Why not a special page?

  • Eran pointed out that having a dedicated special page that depends entirely on an external bot is not a good idea. It would not be very useful to many wikis, and it might be problematic when the bot goes down (which can happen, given that Tool Labs is not 100% reliable).

  • What issues are involved in making the interface work for multiple wikis in multiple languages?

The most important issue here is Turnitin, the service the bot uses to scan for suspected plagiarized edits. Its database is not good enough for languages other than English: Eran tried running the bot on Hebrew Wikipedia, and the results were not very successful.
The obvious solution would be to switch to a better service, but unfortunately almost all such services are paid.

Aaron and Eran discussed using ORES to flag edits that could possibly be plagiarized, and sending only those flagged edits to the external service to confirm our suspicion. This would be great for reducing the number of times we hit the external service. Once the ORES model is trained on Plagiabot's dataset, we can think about expanding it to other bots.
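
A rough sketch of the two-stage idea above, for illustration only. `cheap_score` stands in for an ORES-style model and `external_check` for the call to the external service; both are hypothetical callables rather than real APIs, and the threshold value is arbitrary.

```python
from typing import Callable, Iterable, List, Tuple


def screen_edits(
    edits: Iterable[Tuple[int, str]],
    cheap_score: Callable[[str], float],
    external_check: Callable[[str], bool],
    threshold: float = 0.5,
) -> List[int]:
    """Return the revision IDs that the external service confirms.

    Each edit is a (revision ID, added text) pair. Only edits whose cheap
    score reaches the threshold are sent to the external service.
    """
    confirmed = []
    for rev_id, added_text in edits:
        if cheap_score(added_text) < threshold:
            continue  # low suspicion: skip the external request entirely
        if external_check(added_text):
            confirmed.append(rev_id)
    return confirmed


if __name__ == "__main__":
    # Stand-in scorer and checker, purely for demonstration.
    def demo_score(text: str) -> float:
        return min(len(text) / 20, 1.0)

    def demo_check(text: str) -> bool:
        return "copied" in text

    sample = [(1, "typo fix"), (2, "a long copied passage of text")]
    print(screen_edits(sample, demo_score, demo_check))
```

The saving comes from the first `if`: anything the cheap model scores below the threshold never reaches the external service. Where to set the threshold is a trade-off between missed cases and request volume.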

Some open questions:
  • Is it possible that Copyvios, Plagiabot and CorenSearchBot (along with the German bot) are making similar or identical requests to the plagiarism detection service? Is there scope for optimization here?
  • Should we be making an attempt at integrating/reducing redundancy between the different bots as part of this project?

kaldari closed this task as Resolved. · Apr 8 2016, 5:02 PM
kaldari added a subscriber: kaldari.

In reply to your open questions, it seems that the best way to remove redundancy would be to create a centralized API (similar to what Plagiabot already offers), but one that offers more semantic data (rather than HTML) and is integrated with both Turnitin and a general search engine such as Bing.
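
To make that suggestion concrete, here is a minimal sketch of what such a centralized checker could look like. Everything in it is an assumption for illustration: the `Match` fields, the `CentralChecker` class, and the backend callables (which stand in for Turnitin and a search engine) are hypothetical and do not reflect any existing API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class Match:
    source_url: str
    similarity: float  # 0.0 to 1.0


class CentralChecker:
    """Fan a text out to several backends once, then serve repeats from cache."""

    def __init__(self, backends: List[Callable[[str], List[Match]]]):
        self._backends = backends
        self._cache: Dict[str, List[Match]] = {}
        self.backend_calls = 0  # counts real backend requests

    def check(self, text: str) -> str:
        # Identical text from different bots maps to the same cache key.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            matches: List[Match] = []
            for backend in self._backends:
                self.backend_calls += 1
                matches.extend(backend(text))
            self._cache[key] = matches
        # Structured JSON instead of HTML, so any client can consume it.
        return json.dumps(
            {"text_sha256": key, "matches": [asdict(m) for m in self._cache[key]]}
        )
```

If several bots submit the same text, only the first request reaches the backends. Later ones are answered from the cache, which is the redundancy the open questions ask about.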

MusikAnimal moved this task from Backlog to Done on the CopyPatrol board. · Dec 6 2016, 5:25 AM