
[AOI] Investigation: Can we improve Copyvio Detector?
Closed, Resolved (Public)

Description

Per http://www.allourideas.org/wikimediagadgets/results and http://www.allourideas.org/wikimediaaccesorios/results?locale=es.

Please answer the following questions:

  • Are there high priority bugs or features that the Community Tech team could address in a short period of time?
  • If so, is the maintainer amenable to working with us and is the code publicly available?
  • Would this be a good tool to convert into a MediaWiki extension or add as functionality to an existing extension?

Event Timeline

kaldari raised the priority of this task to Needs Triage.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari subscribed.

I was initially thinking that the request was related to plagiabot/EranBot, but apparently this is a totally different tool.

kaldari set Security to None.
kaldari moved this task from New & TBD Tickets to Ready on the Community-Tech board.
kaldari renamed this task from [AOI] Spike: Improve Copyvio Detector to [AOI] Spike: Can we improve Copyvio Detector? (Aug 8 2015, 1:19 AM)
kaldari renamed this task from [AOI] Spike: Can we improve Copyvio Detector? to [AOI] Investigation: Can we improve Copyvio Detector? (Aug 12 2015, 1:28 AM)

Judging from the talk page, there's no obvious starting point--not much in the issue tracker, no real functionality requests on the talk page, bugs fixed quickly.

Left a comment on @Earwig's talk page asking if they're interested in working with us and if there are any dev plans: https://en.wikipedia.org/wiki/User_talk:The_Earwig#Interested_in_working_with_Community_Tech.3F

One idea might be to see if there's interest in adding Plagiabot's Turnitin-based detection to the Tool Labs copyvios app, which currently searches only through the Yahoo! search API.

Other plagiarism tools: Duplication Detector is down; EranBot/Plagiabot checks against student papers, textbooks, and journals; MadmanBot/CorenSearchBot also appears to check via Yahoo!, but its scope is not clear. Copyvios appears to be the only stand-alone tool.

Hi, everyone!

There is only one outstanding bug with the tool that comes to mind. I have a memory leak that I've been unable to get to the bottom of for about a year now. It happens so slowly and unpredictably that progress on it is difficult, especially given the lack of urgency and questions about why Python's internal memory management isn't working. I could probably fix it if I devoted enough time to extra debugging.

Regarding feature requests, I'm not aware of any aside from the Turnitin feature you mentioned. Perhaps we can work on that, but I don't know enough about Turnitin to say if it'd be useful.

The exclusions list needs frequent updating, more frequent than I am able to do while relying on individual reports from users. Other bots and tools have similar lists (EranBot's and Wikipedia:Mirrors and forks). I wonder if some centralized, perhaps cross-wiki, list of mirrors and public domain websites would be useful to have.

As a MediaWiki extension? Perhaps, but I don't think it's worth the effort. The tool has an API, so if some alternative front-end is desired, that can be worked on.

Another thought: the tool has no l10n support, so we can work on that if people think it's important. The interface is relatively simple, so I don't expect translation to be too difficult.

Thanks, @Earwig.

On exclusions, that seems like something that would be useful, but there would need to be different sub-lists for different uses--for instance, https://meta.wikimedia.org/wiki/Mirror_filter lists sites that are strictly Wikipedia mirrors, but my impression is that Copyvio Detector and Plagiabot also want to exclude sites that are partly mirrors but have some original content as well. I agree that it seems like a centralizable list, though.

Adding Turnitin would probably make the tool slower, but it would provide a broader base to search. I don't know how compatible the API is, though.

Potential tasks for Copyvio Detector:

  • Find and fix memory leak. Probably not high enough impact.
  • Add l10n/i18n support. Question: how effective is the tool for non-English content?
  • Integrate Turnitin, as used in Plagiabot.

A central mirror/public domain site list could be useful but it would need to be researched and scoped, and would depend on having volunteers interested in maintaining it. Likely out of scope for Community Tech.

Regarding l10n, the tool works fine for non-English content from a technical perspective (logs show many successful requests involving Korean etc wikis; people have added German and Russian mirrors...).

There might be some work to do on the matching engine as well. Check out:
https://tools.wmflabs.org/copyvios/?lang=en&project=wikipedia&title=Mary+Wollstonecraft&oldid=&action=search&use_engine=1&use_links=1

It says that it has 88% confidence that this is a copyright violation, but most of the matches are either quotations or long book titles and it doesn't look like there's any actual plagiarism.

@Earwig: How hard would it be to exclude quotations from the matching engine? Is the engine custom-written or some 3rd party library?

It is custom-written. You are right that the particular result there is poor; my first thought is to work on the confidence algorithm a bit so that it values large contiguous blocks more than lots of disjoint trigrams. For quotes, I'm not so sure; if that issue were fixed, I think quotes might not be so important. I can look into that.
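To make the "contiguous blocks versus disjoint trigrams" idea concrete, here is a rough sketch. It is not the tool's actual confidence algorithm: the function names, the word-level trigram definition, and the squared-run weighting are all assumptions chosen for illustration. It compares a plain overlap ratio with a score that rewards long runs of consecutive matching trigrams.

```python
from typing import List, Tuple


def trigrams(text: str) -> List[Tuple[str, ...]]:
    """Return the word trigrams of `text`, in order (hypothetical tokenization)."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def matched_runs(article: str, source: str) -> List[int]:
    """Lengths of consecutive runs of article trigrams that also occur in the source.

    Simplification: a run only requires each trigram to occur somewhere in the
    source, not that the trigrams are adjacent in the source as well.
    """
    source_set = set(trigrams(source))
    runs: List[int] = []
    current = 0
    for gram in trigrams(article):
        if gram in source_set:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def naive_score(article: str, source: str) -> float:
    """Fraction of article trigrams found in the source, ignoring contiguity."""
    total = len(trigrams(article))
    return sum(matched_runs(article, source)) / total if total else 0.0


def block_weighted_score(article: str, source: str, exponent: float = 2.0) -> float:
    """Score in [0, 1] that grows faster for long runs than for scattered hits.

    The exponent is an illustrative assumption, not a tuned value.
    """
    total = len(trigrams(article))
    if not total:
        return 0.0
    runs = matched_runs(article, source)
    return sum(r ** exponent for r in runs) / total ** exponent
```

With this weighting, an article containing three isolated matching trigrams (as with scattered book titles or short quotations) scores lower than one containing a single run of three consecutive matching trigrams, even though the plain overlap ratio is the same for both.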

@Earwig: Another useful thing to build for Copyvio Detector would be a testcase suite. These could be subpages under https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/ that are designed for benchmarking Copyvio Detector's confidence accuracy. That way, we could better measure whether any tweaks to the confidence algorithm are going in the right direction. (To avoid actually violating copyright, the "violating" pages could be based on old public domain sources.)

Yes, this is a good idea. I already use https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioExample and https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioPDFExample as basic sanity checks, but a more comprehensive suite would be much better.
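A benchmarking harness for such a suite could look roughly like the sketch below. Everything in it is hypothetical: `BenchmarkCase`, `evaluate`, the 0.5 threshold, and the `score_fn` callable (a stand-in for whatever would fetch a confidence value for a test page) are illustrative names, not part of the tool or its API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class BenchmarkCase:
    """One test page plus the verdict a human reviewer assigned to it."""
    title: str
    expected_violation: bool


def evaluate(
    cases: Iterable[BenchmarkCase],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, purely illustrative
) -> Dict[str, float]:
    """Compare scores from `score_fn` against the expected verdicts."""
    tp = fp = tn = fn = 0
    for case in cases:
        flagged = score_fn(case.title) >= threshold
        if flagged and case.expected_violation:
            tp += 1
        elif flagged and not case.expected_violation:
            fp += 1
        elif not flagged and case.expected_violation:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Running `evaluate` before and after a change to the confidence algorithm, over the same set of cases, would give a simple way to see whether the change moved precision and recall in the right direction.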

Regarding l10n, the tool works fine for non-English content from a technical perspective (logs show many successful requests involving Korean etc wikis; people have added German and Russian mirrors...).

Korean Wikipedia has a gadget that queries copyvios at the click of a button, so you should expect many queries from Korean users.

Results of this investigation were: T110144, T110743, T110124, and T110778.