
Investigation: CopyPatrol in other languages
Closed, Resolved | Public | 3 Estimated Story Points

Description

A user from French WP posted on the CopyPatrol talk page, saying that he's interested in adapting the plagiarism detection bot for French.

https://meta.wikimedia.org/wiki/Talk:Community_Tech/Improve_the_plagiarism_detection_bot#Use_EranBot_on_frwiki

What can we do to make EranBot and/or CopyPatrol usable for other languages?
How many languages would Turnitin usefully support?

Event Timeline

From http://www.ithenticate.com/products/faqs:

What languages does iThenticate support?
The iThenticate interface currently supports the following languages: English, Korean, and Japanese.

Which international languages does iThenticate have content for in its database?
iThenticate searches for content matches in the following 30 languages: Chinese (simplified and traditional), Japanese, Thai, Korean, Catalan, Croatian, Czech, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian (Bokmal, Nynorsk), Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swedish, Arabic, Greek, Hebrew, Farsi, Russian, and Turkish. Please note that iThenticate will match text between text of the same language.
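
As a rough way to check a candidate wiki against the FAQ list above, the languages can be turned into a lookup. This is only a sketch: the mapping from the FAQ's language names to Wikipedia language codes is an assumption made here, `is_supported` is a hypothetical helper, and English is added separately because the FAQ list quoted above does not name it even though the existing enwiki setup relies on it.

```python
# Wikipedia language codes for the languages named in the FAQ quoted above.
# The name-to-code mapping is an assumption made for illustration.
ITHENTICATE_LANGS = {
    "zh", "ja", "th", "ko", "ca", "hr", "cs", "da", "nl", "fi",
    "fr", "de", "hu", "it", "no", "nn", "pl", "pt", "ro", "sr",
    "sk", "sl", "es", "sv", "ar", "el", "he", "fa", "ru", "tr",
}

# English is not in the quoted list, but the bot already runs on enwiki.
CANDIDATE_LANGS = ITHENTICATE_LANGS | {"en"}


def is_supported(wiki_db: str) -> bool:
    """Return True if a database name such as 'frwiki' maps to a listed language."""
    suffix = "wiki"
    if not wiki_db.endswith(suffix):
        return False
    lang = wiki_db[: -len(suffix)]
    return lang in CANDIDATE_LANGS


if __name__ == "__main__":
    for db in ("frwiki", "itwiki", "ukwiki"):
        print(db, is_supported(db))
```

Since iThenticate only matches text within the same language, a check like this says whether a wiki is a candidate at all, not how good the matches would be.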

DannyH set the point value for this task to 3. (Aug 9 2016, 5:28 PM)
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
DannyH raised the priority of this task from Medium to High. (Aug 23 2016, 5:16 PM)
  • Translations
    • We have ~30 strings which need translations.
    • We can hook up our tool with Intuition (https://github.com/Krinkle/intuition) to enable translations via TranslateWiki. This should be quite straightforward.
  • Backend
    • Waiting to hear back from Eran about enabling Eranbot on French/Italian.
    • With Eranbot enabled on a new wiki, we’ll have a separate database where it logs the plagiarized records for that wiki
    • Wikiprojects:
      • If the project has wikiprojects, we will have to create a script to put together a database for it, unless one already exists. This might be better handled by enabling PageAssessments on that wiki first, which hands us the tables we need, ready to use.
  • Hosting
    • Option 1: same tool, different URL per wiki, e.g. tools.wmflabs.org/copypatrol/fr. (This will probably need some URL rewrite rules.)
    • Option 2: a separate tool per wiki (does not scale)
  • Code changes
    • Abstract the EnwikiDao into a generic Dao
      • This is more or less already done, since we pass the $wiki param to the constructor, but we still need to rename the file and test that it works
    • Make sure the User whitelist works on the new wiki (think about having a global User whitelist instead of one for every wiki)
    • Change the centralauth login to go to Meta instead of enwiki (to avoid confusion)
    • Change the date from which records are shown in the UI; this is currently hardcoded in CopyPatrol
    • Change ORES scores function to work with any given wiki (currently hardcoded for enwiki)
    • Make sure we handle drafts correctly (hardcoded namespace)
    • Replace all hardcoded strings (in UI as well as Controllers) by their translation keys
    • Change the AJAX request to Earwig’s tool (add lang parameter explicitly)
  • UI changes
    • A drop-down to pick which project you want to see records for
    • If the wiki doesn’t have any wikiprojects, we might want to do away with that column entirely, along with the search for it. A better way might be to replace wikiprojects with categories instead in that case. We’ll have to be careful to exclude sub-categories etc. to avoid flooding the UI with a large number of categories. (Or maybe come up with a clever way to display them)
  • Miscellaneous
    • Need to set up a documentation page on the wiki
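
To make the "Code changes" items more concrete, here is a minimal sketch of pulling the enwiki-specific values into one per-wiki config object. It is written in Python for brevity rather than in CopyPatrol's own code, and every name in it (`WikiConfig`, `CONFIGS`, `earwig_query`) is hypothetical. The dates and the frwiki draft namespace are placeholders, and the Earwig parameter names other than `lang` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class WikiConfig:
    """Settings that are currently hardcoded for enwiki (field names are hypothetical)."""
    lang: str                       # language code, e.g. "fr"
    draft_namespace: Optional[int]  # None if the wiki has no Draft namespace
    records_since: str              # earliest date shown in the UI (YYYY-MM-DD)

    @property
    def db_name(self) -> str:
        # Database name passed to the generic Dao, e.g. "frwiki"
        return f"{self.lang}wiki"


# Placeholder values for illustration; 118 is the Draft namespace on enwiki.
CONFIGS = {
    "en": WikiConfig(lang="en", draft_namespace=118, records_since="2016-06-01"),
    "fr": WikiConfig(lang="fr", draft_namespace=None, records_since="2016-09-01"),
}


def earwig_query(cfg: WikiConfig, title: str) -> str:
    """Build a query string that passes the language explicitly (parameter names assumed)."""
    return urlencode({"lang": cfg.lang, "project": "wikipedia", "title": title})


if __name__ == "__main__":
    fr = CONFIGS["fr"]
    print(fr.db_name, earwig_query(fr, "Tour Eiffel"))
```

With something like this, the Dao, the draft handling, the start date and the Earwig request would all read from the same object, so adding a wiki means adding one entry.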

Tentative next steps:

  1. Pick a wiki where CopyPatrol can work with minimal disruption, i.e. one that has wikiprojects, is in a language supported by iThenticate, is open to having such a tool, and is open to having PageAssessments enabled.
  2. Enable PageAssessments on the wiki
  3. Integrate CopyPatrol with Intuition
  4. Decide the hosting mechanism
  5. Enable Eranbot to work on the wiki
  6. Decide UI changes
  7. Make required code changes
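
For step 3 and the "replace hardcoded strings" item, the behaviour we want is a keyed lookup that falls back to English when a translation is missing. The sketch below shows only that idea with a plain dictionary. It does not use Intuition's API, and the message keys and strings are invented for illustration.

```python
# Invented message keys and strings, for illustration only.
MESSAGES = {
    "en": {"title": "CopyPatrol", "no-records": "No records found"},
    "fr": {"title": "CopyPatrol"},  # "no-records" not translated yet
}


def msg(key: str, lang: str, fallback: str = "en") -> str:
    """Look up a message by key, falling back to English, then to a visible marker."""
    for code in (lang, fallback):
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            return text
    # Make missing keys easy to spot in the UI instead of failing.
    return f"<{key}>"


if __name__ == "__main__":
    print(msg("title", "fr"))
    print(msg("no-records", "fr"))
    print(msg("unknown-key", "fr"))
```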

I heard back from Eran about enabling Eranbot to run on other wikis. Here's what he said:

  1. Copy User:EranBot/Copyright/Blacklist from enwiki to desired wiki
  2. Run: python /data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py -lang:LANG -blacklist:User:EranBot/Copyright/Blacklist -live:on -reportlogger (Step 2 should be run from a crontab, and the server is relaunched every 4 hours.) This could be run once as a test, and if it works we can add it to the crontab.

While this is sufficient for a first trial, some extra work may be needed for "production level" use: configuring the messages (https://github.com/valhallasw/plagiabot/blob/master/plagiabot.py#L69 ), which can help the bot ignore rolled-back edits.
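
To avoid retyping the command from step 2 for each wiki, it can be generated from the language code. This is a sketch: the path and flags are copied from Eran's message, while the helper names and the reading of "every 4 hours" as a cron schedule are assumptions.

```python
import shlex

# Path and flags as quoted in Eran's instructions above.
PLAGIABOT = "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py"
BLACKLIST = "User:EranBot/Copyright/Blacklist"


def plagiabot_command(lang: str) -> str:
    """Return the shell command that runs plagiabot for one language."""
    args = [
        "python",
        PLAGIABOT,
        f"-lang:{lang}",
        f"-blacklist:{BLACKLIST}",
        "-live:on",
        "-reportlogger",
    ]
    return shlex.join(args)  # requires Python 3.8+


def cron_line(lang: str, every_hours: int = 4) -> str:
    """Return a crontab entry; the four-hour schedule is an assumption."""
    return f"0 */{every_hours} * * * {plagiabot_command(lang)}"


if __name__ == "__main__":
    print(plagiabot_command("fr"))
    print(cron_line("fr"))
```

Running the first line by hand once for frwiki matches the "test once, then add to crontab" plan.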

Thank you for your investigation.
For frwiki, I have created pages Utilisateur:EranBot and Utilisateur:EranBot/Copyright/Blacklist (same as enwiki for now). Our Wikiproject page. Pull request for dict. Kind regards

Niharika, this is awesome info and good news. :) We can talk next week about next steps.