
Investigation: Improve copy and paste detection bot
Closed, Resolved | Public | 5 Estimated Story Points

Description

Investigation card for a Top 10 wish.

Current bot is by Eran, output: https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc

There's good discussion in T120435: Improve the plagiarism detection bot

People to talk to: Eran, Yuvi, Sage.

Some questions to answer:

  • This tool needs a better interface. What would work better: an interface on Tool Labs or on a special page (via an extension)?
  • What issues are involved in making the interface work for multiple wikis in multiple languages?

Event Timeline

DannyH raised the priority of this task to Medium.
DannyH updated the task description.
DannyH moved this task to Needs Discussion on the Community-Tech board.
DannyH subscribed.
DannyH set the point value for this task to 5.

Some questions to answer:

  • This tool needs a better interface. What would work better: an interface on Tool Labs or on a special page (via an extension)?

After some discussions with Eran at the Wikimedia Hackathon, we've decided to go with a Tool Labs interface for the bot.
Pros:

  • Easier to make an interesting/engaging interface
  • Gives us the freedom to use more libraries and go through less intense code-review processes
  • Easy to set up and launch, compared to an extension
  • Allows us to experiment with the interface/possible gamification

Cons:

  • Does not give us access to Echo notification system
  • Might lead to less usage of the tool if it's outside the wiki

Why not a special page?

  • Eran pointed out that having a dedicated special page that depends solely on an external bot is not a good idea. It would not be very useful to a lot of wikis and might be problematic when the bot goes down (which can happen, given that Tool Labs is not 100% reliable).

  • What issues are involved in making the interface work for multiple wikis in multiple languages?

The most important issue here is Turnitin. Turnitin is the service the bot uses to scan for suspected plagiarized edits, but its database does not cover other languages well enough. Eran tried running the bot on the Hebrew wiki, without much success.
The obvious solution would be to switch to a better service, but unfortunately almost all such services are paid.

Aaron and Eran discussed the idea of using ORES to flag edits that could possibly be plagiarized, and sending only those edits to the external service to confirm our suspicion. This would be great for reducing the number of times we hit the external service. Once the model is trained on Plagiabot's dataset, we can think about expanding it to other bots.
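The two-stage idea above could look roughly like the sketch below. Everything in it is hypothetical: `select_edits_for_external_check`, the `score_edit` callback, the edit dictionaries, and the 0.5 threshold are illustrative names and values, and `toy_score` is a stand-in heuristic, not an ORES model or API call.

```python
from typing import Callable, Iterable, List


def select_edits_for_external_check(
    edits: Iterable[dict],
    score_edit: Callable[[dict], float],
    threshold: float = 0.5,
) -> List[dict]:
    """Keep only the edits whose local score meets the threshold.

    Only the returned edits would be sent on to the external
    plagiarism service; everything else is skipped.
    """
    return [edit for edit in edits if score_edit(edit) >= threshold]


def toy_score(edit: dict) -> float:
    # Placeholder heuristic (assumption): longer additions score higher.
    # A real deployment would call a trained model here instead.
    return min(len(edit["added_text"]) / 1000.0, 1.0)


edits = [
    {"rev_id": 1, "added_text": "x" * 50},
    {"rev_id": 2, "added_text": "x" * 900},
]
suspects = select_edits_for_external_check(edits, toy_score, threshold=0.5)
```

With these toy inputs only the second edit passes the filter, so the external service would be hit once instead of twice.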

Some open questions:
  • Is it possible that Copyvios, Plagiabot and CorenSearchBot (along with the German bot) are making similar or identical requests to the plagiarism detection service? Is there scope for optimization here?
  • Should we be making an attempt at integrating/reducing redundancy between the different bots as part of this project?
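If the bots do turn out to send identical text to the service, one possible optimization is a shared cache keyed on a hash of the normalized text. The sketch below is only an illustration under that assumption: `SharedCheckCache`, `fake_service`, and the whitespace/lowercase normalization are hypothetical and not part of any of the existing bots.

```python
import hashlib
from typing import Callable, Dict


class SharedCheckCache:
    """Caches external-service results keyed by a hash of the normalized text."""

    def __init__(self, check_text: Callable[[str], dict]) -> None:
        self._check_text = check_text
        self._results: Dict[str, dict] = {}
        self.external_calls = 0  # how often the wrapped service was actually hit

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace and lowercase so trivially different
        # submissions of the same passage share one cache entry.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def check(self, text: str) -> dict:
        key = self._key(text)
        if key not in self._results:
            self.external_calls += 1
            self._results[key] = self._check_text(text)
        return self._results[key]


def fake_service(text: str) -> dict:
    # Stand-in for the paid external service (assumption).
    return {"matched": False, "length": len(text)}


cache = SharedCheckCache(fake_service)
first = cache.check("Some added  paragraph.")
second = cache.check("some added paragraph.")
```

Here the second call is served from the cache, so the stand-in service is hit only once.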

kaldari subscribed.

In reply to your open questions, it seems that the best way to remove redundancy would be to create a centralized API (similar to what plagiabot already offers), but one that offers more semantic data (rather than HTML) and is integrated with both Turnitin and a general search engine such as Bing.
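To make "semantic data rather than HTML" concrete, a response from such a centralized API might be a structured record along these lines. This is a sketch only: the `SuspectedEdit` and `SourceMatch` types, their field names, and the sample values are assumptions for illustration, not the format of plagiabot or of any existing API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SourceMatch:
    url: str
    matched_percent: float
    provider: str  # hypothetical values: "turnitin" or "search"


@dataclass
class SuspectedEdit:
    wiki: str
    page_title: str
    rev_id: int
    matches: List[SourceMatch] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, so matches serialize too.
        return json.dumps(asdict(self), sort_keys=True)


record = SuspectedEdit(
    wiki="enwiki",
    page_title="Example",
    rev_id=12345,
    matches=[
        SourceMatch(
            url="https://example.org/a",
            matched_percent=62.5,
            provider="turnitin",
        )
    ],
)
payload = json.loads(record.to_json())
```

Any client (a Tool Labs interface, another bot) could then render or filter the same record without parsing HTML.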