
Explore how CopyPatrol, EranBot, and Earwig's copyvio tool work
Open, Needs Triage · Public

Description

Questions:

  • What databases and APIs are used?
    • EranBot - a bot that checks CopyPatrol/Turnitin/iThenticate, then uses its copyviobot permission to post the results to PageTriage using the pagetriagetagcopyvio API.
    • CopyPatrol - Turnitin/iThenticate. Worth noting that this API isn't limited to submitted schoolwork; it also covers websites and archived websites.
    • Earwig's copyvio tool - Google Search API by default, Turnitin/iThenticate if selected
  • Who are the maintainers? Are they active?
    • EranBot - Eran aka ערן. Last edit to enwiki 3 months ago.
    • CopyPatrol - Unclear. MusikAnimal is the most active on the GitHub repo. However, the repo isn't very active.
    • Earwig's copyvio tool - The Earwig. Active.
  • Do we pay money for any of these APIs? Where does the money come from?
    • CopyPatrol/Turnitin/iThenticate - "credits" are mentioned in the Growth Team's documentation. These credits appear to be cheaper or more plentiful.
    • Earwig's copyvio tool/Google Search API - "credits" are mentioned in the Growth Team's documentation. These credits appear to be more expensive or less plentiful.
  • Is one tool superior to the others? Suggest testing with empirical data. That is, find 20 copyright violations in one tool, then check them in the other tools and see if they catch them.
    • From my personal experience, >15% in Earwig's tool usually indicates a violation. <15% is safe.
    • According to Growth Team's documentation here, both are about equally accurate.
  • Can these tools be run on demand?
    • CopyPatrol
      • No
    • Earwig's copyvio tool
      • Yes, but there's some kind of daily limit
  • How long does each tool take?
    • CopyPatrol
      • 30 minute lag time?
    • Earwig's copyvio tool
      • 30 second lag time?
  • What is the nature of the API limitations for Earwig's tool? I know it turns off for everyone if it is run more than X times a day. Is this because it is freemium and we are hitting the free limit for the day? Or is it because we are paying Y amount per day and exhausting that amount?
    • Could we apply for a WMF rapid grant or some other kind of support to raise the API limits for Earwig?
  • Could the Earwig tool be incorporated to run automatically into PageTriage somehow? If so, what is the best technical solution?
    • Novem's idea:
      • page creation hook, run as a background process after the output to the browser is complete, only run on first revision of the article.
      • save this data in the SQL table pagetriage_page_tags
      • display prominently to the NPP.
        • perhaps toggle one of the buttons on the page curation toolbar green/red
        • perhaps display a large green/red message in the "mark as reviewed" panel of PageTriage. "clean of copyvio" or ">15% copyvio detected" (with a link to run the tool on a fresh revision to investigate further)
  • If the tools are equally good, EranBot and CopyPatrol are already integrated into PageTriage, which opens the possibility of deprecating the use of Earwig's copyvio tool.
  • Properly document all these findings at https://www.mediawiki.org/wiki/Extension:PageTriage/Copyvio_detection or similar location
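
To make Novem's idea above a bit more concrete, here is a rough Python sketch of the flow: check only the first revision, store the result, and return the green/red message for the "mark as reviewed" panel. This is illustration only. PageTriage itself is a MediaWiki extension, and everything here is an assumption: the function names, the `fetch_similarity` callback standing in for a copyvio service, and the simplified `page_tags` table standing in for pagetriage_page_tags. The 15% cutoff is the personal rule of thumb from this task, not a setting of any tool. Running the check in the background after the response is sent is not modeled.

```python
import sqlite3

# Rule of thumb from this task, in percent. Not a setting of any tool.
COPYVIO_THRESHOLD = 15.0


def classify(similarity_percent):
    """Return the message proposed for the 'mark as reviewed' panel."""
    # The task only says ">15%" and "<15%"; treating exactly 15 as clean
    # is an arbitrary choice made for this sketch.
    if similarity_percent > COPYVIO_THRESHOLD:
        return ">15% copyvio detected"
    return "clean of copyvio"


def make_db():
    """Create an in-memory stand-in for the tag storage (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE page_tags ("
        "page_id INTEGER, tag TEXT, value TEXT, "
        "PRIMARY KEY (page_id, tag))"
    )
    return db


def on_page_created(db, page_id, rev_is_first, fetch_similarity):
    """Hypothetical hook body: check the first revision only, then store the score."""
    if not rev_is_first:
        return None
    percent = fetch_similarity(page_id)  # stand-in for a call to a copyvio service
    db.execute(
        "INSERT OR REPLACE INTO page_tags (page_id, tag, value) VALUES (?, ?, ?)",
        (page_id, "copyvio_percent", str(percent)),
    )
    db.commit()
    return classify(percent)
```

Storing the raw percentage rather than the green/red verdict would let the cutoff be tuned later without re-running the check.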

Feel free to edit this post with the results of our findings, and with links.
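
For the empirical comparison suggested above (take known violations and see which tools catch them), the results could be tallied with something like the sketch below. The function name and the input shapes are assumptions for illustration; the page titles and tool results would have to be collected by hand or from each tool's output.

```python
def detection_rates(known_violations, flagged_by_tool):
    """Return, per tool, the fraction of known violations that the tool flagged.

    known_violations: iterable of page titles confirmed as copyright violations.
    flagged_by_tool: dict mapping a tool name to the page titles it flagged.
    """
    known = set(known_violations)
    if not known:
        raise ValueError("need at least one known violation")
    return {
        tool: len(known & set(flagged)) / len(known)
        for tool, flagged in flagged_by_tool.items()
    }
```

One caveat with the method as described: if the sample is drawn from what one tool already caught, that tool scores 100% by construction, so it may be fairer to draw a sample from each tool in turn.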

Event Timeline

Restricted Application added a subscriber: Aklapper.

Thanks for that link Kosta. This one is buried in there but also looks relevant: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Articles_for_creation/AfC_Process_Improvement_May_2018/Copyvio_solutions_comparison_report

This also looks interesting. Looks like the last time devs looked at this issue, they envisioned further expansion of the system to use the pagetriagetagcopyvio API.

We are building the underlying architecture here so that other copyvio services could be plugged into it in the future (such as Earwig's / Google). Though using more than one service is out of scope for this project, the technical components will be in place to make it possible at some other point.