[AOI] Investigation: Can we improve Copyvio Detector?
Closed, Resolved, Public

Authored By

	kaldari
	Aug 8 2015, 1:13 AM

Description

Per http://www.allourideas.org/wikimediagadgets/results and http://www.allourideas.org/wikimediaaccesorios/results?locale=es.

Please answer the following questions:

Are there high priority bugs or features that the Community Tech team could address in a short period of time?
If so, is the maintainer amenable to working with us and is the code publicly available?
Would this be a good tool to convert into a MediaWiki extension or add as functionality to an existing extension?

Related Objects

Mentioned In: T110778: [AOI] Create a test suite for Copyvio Detector
T110124: Add i18n support to Copyvio Detector [AOI]
Mentioned Here: T110124: Add i18n support to Copyvio Detector [AOI]
T110144: Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool [AOI]
T110743: Make plagiabot API output report id, matching urls, % match, and number of words as separate data
T110778: [AOI] Create a test suite for Copyvio Detector

Event Timeline

kaldari created this task. Aug 8 2015, 1:13 AM

kaldari set the priority of this task to Needs Triage.

kaldari updated the task description.

kaldari added a project: Community-Tech.

kaldari subscribed.

Restricted Application added a subscriber: Aklapper. Aug 8 2015, 1:13 AM

I was initially thinking that the request was related to plagiabot/EranBot, but apparently this is a totally different tool.

kaldari updated the task description. Aug 8 2015, 1:16 AM

kaldari set Security to None.

kaldari moved this task from New & TBD Tickets to Ready on the Community-Tech board.

kaldari renamed this task from "[AOI] Spike: Improve Copyvio Detector" to "[AOI] Spike: Can we improve Copyvio Detector?". Aug 8 2015, 1:19 AM

kaldari updated the task description. Aug 10 2015, 9:31 PM

Ricordisamoa subscribed. Aug 10 2015, 10:01 PM

kaldari renamed this task from "[AOI] Spike: Can we improve Copyvio Detector?" to "[AOI] Investigation: Can we improve Copyvio Detector?". Aug 12 2015, 1:28 AM

kaldari updated the task description. Aug 12 2015, 10:04 PM

Fhocutt claimed this task. Aug 14 2015, 2:00 AM

Fhocutt updated the task description.

Fhocutt moved this task from Ready to In Dev/Progress on the Community-Tech board. Aug 14 2015, 4:51 PM

Fhocutt added a subscriber: Earwig. Aug 14 2015, 11:01 PM

Judging from the talk page, there's no obvious starting point--not much in the issue tracker, no real functionality requests on the talk page, bugs fixed quickly.

Left a comment on @Earwig's talk page asking if they're interested in working with us and if there are any dev plans: https://en.wikipedia.org/wiki/User_talk:The_Earwig#Interested_in_working_with_Community_Tech.3F

One idea might be to see if there's interest in adding Plagiabot's Turnitin-based detection to the Tool Labs copyvios app, which only searches through the Yahoo! search API.

Other plagiarism tools: Duplication Detector is down. EranBot/Plagiabot checks against student papers, textbooks, and journals. MadmanBot/CorenSearchBot appears to also check via Yahoo!, but it's not clear what the scope is. Copyvios appears to be the only stand-alone tool.

Fhocutt moved this task from In Dev/Progress to Needs Review/Feedback on the Community-Tech board. Aug 14 2015, 11:21 PM

Hi, everyone!

There is only one outstanding bug with the tool that comes to mind. I have a memory leak that I've been unable to get to the bottom of for about a year now. It happens so slowly and unpredictably that progress on it is difficult, especially given the lack of urgency and questions about why Python's internal memory management isn't working. I could probably fix it if I devoted enough time to extra debugging.

Regarding feature requests, I'm not aware of any aside from the Turnitin feature you mentioned. Perhaps we can work on that, but I don't know enough about Turnitin to say if it'd be useful.

The exclusions list needs frequent updating, more frequent than I am able to do while relying on individual reports from users. Other bots and tools have similar lists (EranBot's and Wikipedia:Mirrors and forks). I wonder if some centralized, perhaps cross-wiki, list of mirrors and public domain websites would be useful to have.

As a MediaWiki extension? Perhaps, but I don't think it's worth the effort. The tool has an API, so if some alternative front-end is desired, that can be worked on.

Another thought: the tool has no l10n support, so we can work on that if people think it's important. The interface is relatively simple, so I don't expect translation to be too difficult.

Thanks, @Earwig.

On exclusions, that seems like something that would be useful, but there would need to be different sub-lists for different uses--for instance, https://meta.wikimedia.org/wiki/Mirror_filter lists sites that are strictly Wikipedia mirrors, but my impression is that Copyvio Detector and Plagiabot also want to exclude sites that are partly mirrors but have some original content as well. I agree that it seems like a centralizable list, though.
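To make the sub-list idea concrete, here is a minimal sketch of how a shared exclusion list could be split by category so each tool opts into only the categories it wants. The category names, the `EXCLUSIONS` structure, the `is_excluded` helper, and the `.example` domains are all hypothetical; neither Copyvio Detector nor Plagiabot is known to store its list this way.

```python
from typing import Iterable
from urllib.parse import urlsplit

# Hypothetical categories and placeholder domains, for illustration only.
EXCLUSIONS = {
    "full_mirror": {"mirror.example"},      # strictly Wikipedia mirrors
    "partial_mirror": {"mixed.example"},    # mirrors with some original content
    "public_domain": {"pd.example"},        # public domain sources
}


def is_excluded(url: str, categories: Iterable[str]) -> bool:
    """Return True if the URL's host falls under any of the chosen categories."""
    host = (urlsplit(url).hostname or "").lower()
    for category in categories:
        for domain in EXCLUSIONS.get(category, ()):
            # Match the domain itself or any subdomain of it.
            if host == domain or host.endswith("." + domain):
                return True
    return False
```

A tool that only wants to skip strict mirrors would pass `["full_mirror"]`, while one that also skips partial mirrors would pass both categories.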

Adding Turnitin would probably make it slower, but would be a broader base to search in. I don't know how compatible the API is, though.

Potential tasks for Copyvio Detector:

Find and fix memory leak. Probably not high enough impact.
Add l10n/i18n support. Question: how effective is the tool for non-English content?
Integrate Turnitin, as used in Plagiabot.

A central mirror/public domain site list could be useful but it would need to be researched and scoped, and would depend on having volunteers interested in maintaining it. Likely out of scope for Community Tech.

Fhocutt moved this task from Needs Review/Feedback to Done on the Community-Tech board. Aug 19 2015, 1:56 AM

Regarding l10n, the tool works fine for non-English content from a technical perspective (logs show many successful requests involving Korean etc wikis; people have added German and Russian mirrors...).

There might be some work to do on the matching engine as well. Check out:
https://tools.wmflabs.org/copyvios/?lang=en&project=wikipedia&title=Mary+Wollstonecraft&oldid=&action=search&use_engine=1&use_links=1

It says that it has 88% confidence that this is a copyright violation, but most of the matches are either quotations or long book titles and it doesn't look like there's any actual plagiarism.
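For anyone who wants to reproduce checks like this for other pages, the link above can be assembled from its query parameters. This is only a sketch: the parameter names are copied from that URL, their exact semantics are assumed rather than documented here, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://tools.wmflabs.org/copyvios/"


def build_search_url(title: str, lang: str = "en", project: str = "wikipedia") -> str:
    """Build a search link using the parameter names seen in the URL above."""
    params = {
        "lang": lang,
        "project": project,
        "title": title,
        "oldid": "",        # left empty, as in the link above
        "action": "search",
        "use_engine": 1,
        "use_links": 1,
    }
    return BASE_URL + "?" + urlencode(params)
```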

@Earwig: How hard would it be to exclude quotations from the matching engine? Is the engine custom-written or some 3rd party library?

It is custom-written. You are right that the particular result there is poor; my first thought is to work on the confidence algorithm a bit to value large contiguous blocks more than lots of disjoint trigrams. For quotes, I'm not so sure; if that issue was fixed I think it might not be so important. I can look into that.
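As a rough illustration of valuing contiguous blocks over disjoint trigrams, the sketch below scores an article by the lengths of its consecutive runs of matching word trigrams, raising each run length to a power so long runs dominate. This is not the tool's actual confidence algorithm; the function names and the `exponent` parameter are assumptions made for the example.

```python
from typing import List, Tuple


def trigrams(words: List[str]) -> List[Tuple[str, ...]]:
    """Return the word trigrams of a token list, in order."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def matched_runs(article: List[str], source: List[str]) -> List[int]:
    """Lengths of consecutive runs of article trigrams that also occur in source."""
    source_set = set(trigrams(source))
    runs, current = [], 0
    for tri in trigrams(article):
        if tri in source_set:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def block_weighted_score(article: List[str], source: List[str],
                         exponent: float = 2.0) -> float:
    """Score in [0, 1]; one long run scores higher than many short ones."""
    total = len(trigrams(article))
    if total == 0:
        return 0.0
    runs = matched_runs(article, source)
    return sum(r ** exponent for r in runs) / (total ** exponent)
```

With `exponent` above 1, three matching trigrams in a row count for more than three isolated ones, so scattered hits such as short quotations and book titles would contribute less than a copied passage.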

kaldari mentioned this in T110124: Add i18n support to Copyvio Detector [AOI]. Aug 24 2015, 10:47 PM

@Earwig: Another useful thing to build for Copyvio Detector would be a testcase suite. These could be subpages under https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/ that are designed for benchmarking Copyvio Detector's confidence accuracy. That way, we could better measure whether any tweaks to the confidence algorithm are going in the right direction. (To avoid actually violating copyright, the "violating" pages could be based on old public domain sources.)

Yes, this is a good idea. I already use https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioExample and https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioPDFExample as basic sanity checks, but a more comprehensive suite would be much better.
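A benchmark over such a suite could be as simple as the sketch below: each test page carries a ground-truth label, and the harness counts how often the detector's confidence lands on the right side of a threshold. The `BenchmarkCase` type, the `evaluate` function, and the `detector` callable are hypothetical stand-ins; a real harness would have to call the tool itself to obtain confidence values.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class BenchmarkCase(NamedTuple):
    title: str          # hypothetical test page title
    is_violation: bool  # ground truth chosen by the test author


def evaluate(cases: Iterable[BenchmarkCase],
             detector: Callable[[str], float],
             threshold: float = 0.5) -> Dict[str, int]:
    """Count correct and incorrect calls at the given confidence threshold."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for case in cases:
        flagged = detector(case.title) >= threshold
        if flagged and case.is_violation:
            counts["true_pos"] += 1
        elif flagged:
            counts["false_pos"] += 1
        elif case.is_violation:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts
```

Running the same cases before and after a change to the confidence algorithm would show whether false positives, such as pages that only share quotations, go down without losing true positives.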

kaldari mentioned this in T110778: [AOI] Create a test suite for Copyvio Detector. Aug 29 2015, 1:19 AM

In T108422#1552600, @Earwig wrote:

Regarding l10n, the tool works fine for non-English content from a technical perspective (logs show many successful requests involving Korean etc wikis; people have added German and Russian mirrors...).

Korean Wikipedia has a gadget that queries copyvios with a button, so you should expect many queries from Korean users.

Results of this investigation were: T110144, T110743, T110124, and T110778.
