[AOI] Create a test suite for Copyvio Detector
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	kaldari
	Aug 29 2015, 1:19 AM

Description

Per T108422, it would be nice to have a suite of pages that could be used to consistently test Copyvio Detector's violation confidence algorithm. These could be set up as subpages under https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/.

The test suite should include:

A page that consists almost entirely of plagiarism, like https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioExample
A page that includes a single paragraph of plagiarism, but the rest is unplagiarized
A page that includes numerous plagiarized sentences mixed with unplagiarized text
A page that is closely paraphrased from another source (a few words or phrases changed in each sentence)
A couple pages that are not plagiarized at all, like https://en.wikipedia.org/wiki/Mary_Wollstonecraft
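One way to make the list above checkable is to record, for each test page, whether the detector is expected to flag it. The following is only a rough sketch: the page names, the `SuitePage` type, the `failures` helper, and the 0.5 threshold are all hypothetical and are not taken from Copyvio Detector or from the actual test subpages.

```python
from dataclasses import dataclass

# Hypothetical labels; the task itself does not define any thresholds.
VIOLATION = "violation"
CLEAN = "clean"


@dataclass(frozen=True)
class SuitePage:
    name: str         # hypothetical subpage name
    description: str  # which case from the task description it covers
    expected: str     # VIOLATION or CLEAN


SUITE = [
    SuitePage("mostly-plagiarized", "almost entirely plagiarism", VIOLATION),
    SuitePage("single-paragraph", "one plagiarized paragraph", VIOLATION),
    SuitePage("mixed-sentences", "plagiarized sentences mixed in", VIOLATION),
    SuitePage("close-paraphrase", "closely paraphrased source", VIOLATION),
    SuitePage("clean-short", "not plagiarized", CLEAN),
    SuitePage("clean-long", "not plagiarized, very long", CLEAN),
]


def classify(confidence, threshold=0.5):
    """Map a confidence in [0, 1] to a label (the threshold is an assumption)."""
    return VIOLATION if confidence >= threshold else CLEAN


def failures(suite, confidences, threshold=0.5):
    """Return the names of pages whose label differs from the expectation."""
    return [
        page.name
        for page in suite
        if classify(confidences[page.name], threshold) != page.expected
    ]
```

Given a mapping from page name to the confidence the tool reports, `failures(SUITE, confidences)` would list the pages where the algorithm disagrees with the expected outcome, which makes regressions easy to spot after a change to the comparison engine.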

For pages that actually contain plagiarism, make sure they are plagiarized from public domain sources that are not listed at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Exclusions.

For pages based on existing Wikipedia articles, make sure that you attribute the articles to their original URL in the edit summary so that CC attribution isn't violated.

Related Objects

Mentioned In: T108422: [AOI] Investigation: Can we improve Copyvio Detector?
Mentioned Here: T110144: Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool [AOI]
T108422: [AOI] Investigation: Can we improve Copyvio Detector?

Event Timeline

kaldari created this task. Aug 29 2015, 1:19 AM

kaldari raised the priority of this task to Low.

kaldari updated the task description.

kaldari added a project: Community-Tech.

kaldari subscribed.

Restricted Application added a subscriber: Aklapper. Aug 29 2015, 1:19 AM

kaldari moved this task from New & TBD Tickets to Blocked on the Community-Tech board. Aug 29 2015, 1:19 AM

kaldari mentioned this in T108422: [AOI] Investigation: Can we improve Copyvio Detector?. Sep 9 2015, 8:27 PM

kaldari moved this task from Blocked to Older: Team Work on the Community-Tech board. Sep 11 2015, 5:22 PM

kaldari updated the task description. Sep 11 2015, 6:00 PM

kaldari set Security to None.

kaldari moved this task from Older: Team Work to Ready on the Community-Tech board.

kaldari added a project: Community-Tech-Sprint. Sep 11 2015, 7:21 PM

kaldari moved this task from Ready to Needs Discussion on the Community-Tech board.

kaldari updated the task description. Sep 12 2015, 4:49 PM

kaldari updated the task description.

kaldari updated the task description. Sep 14 2015, 6:21 PM

kaldari edited a custom field.

Niharika claimed this task. Sep 18 2015, 4:53 PM

Niharika moved this task from Ready to In Development on the Community-Tech-Sprint board.

Niharika moved this task from In Development to Needs Review/Feedback on the Community-Tech-Sprint board. Sep 21 2015, 5:01 PM

I will likely work on this on my own over the next couple of weeks. It'll be useful for other improvements that I plan to make to the comparison engine.

I did create some articles at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/ for this task.

@NiharikaKohli: It looks like Copyvio Detector doesn't know about the source used in tests 4 and 6 for some reason. Might want to use a different source for those. I also added a 7th test for a very long unplagiarized page (which increases the chance for lots of small matches).

@kaldari, okay. Although not knowing the source for a test page should be considered an error in Copyvios, right?

@NiharikaKohli: Well, more like an error in the search API it's using, which we can't really do much about (besides T110144).

This is very useful. Thanks!

The seventh test, a very long unplagiarized page, shows the main issue with the comparison engine that I need to work on. It's clear from the highlighting that there's no real connection, but the percentage is high simply because of the sheer number of individual trigrams.
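To illustrate the effect described above, here is a small sketch. It is not the detector's actual scoring code; `word_trigrams`, `shared_count`, and `longest_run` are hypothetical helpers, and the toy strings stand in for real pages. It shows that a raw count of shared word trigrams can be identical for a genuinely copied passage and for a longer page with only scattered, coincidental matches, while the longest unbroken run of shared trigrams separates the two.

```python
def word_trigrams(text):
    """Return the word trigrams of `text` in order of appearance."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def shared_count(article, source):
    """Count article trigrams that also occur anywhere in the source."""
    source_set = set(word_trigrams(source))
    return sum(1 for t in word_trigrams(article) if t in source_set)


def longest_run(article, source):
    """Length of the longest run of consecutive shared trigrams."""
    source_set = set(word_trigrams(source))
    best = run = 0
    for t in word_trigrams(article):
        run = run + 1 if t in source_set else 0
        best = max(best, run)
    return best


source = "a b c d e f g h i j"
copied = "a b c d e f g h"  # one contiguous copied passage
scattered = "a b c x x d e f x x g h i x x a b c x x d e f x x g h i"

print(shared_count(copied, source), longest_run(copied, source))        # 6 6
print(shared_count(scattered, source), longest_run(scattered, source))  # 6 1
```

Both inputs share six trigrams with the source, but only the copied passage has them in one continuous stretch. A measure that is sensitive to contiguity is one possible direction for reducing false positives on long pages; whether it suits the real comparison engine is an open question.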

@kaldari, I've updated pages 4 and 6 to use different sources. I think the issue was that I was using an article published on Medium as the source, which is technically a blog. I don't think blogs are checked as possible plagiarism sources.

@NiharikaKohli: Looks good to me. I also added {{NOINDEX}} templates to the pages so they don't get indexed by Google.

Also, I changed the URL to https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Test_suite so that it's a bit easier to discover.

kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board. Sep 22 2015, 7:38 PM

Now at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests. Did some cleanup and added a few new tests.

@NiharikaKohli, @kaldari: I've thought about this a little further. I am concerned about test cases 2-6 (the main violations). We're citing sources, yes, but this doesn't negate the fact that we're copying content from non-PD/CC-licensed websites. I don't think NOINDEX + an explanation + "it's for testing purposes" is good enough. I also don't think this would count as fair use.

@Earwig - If we change the text to be a quote from the respective article instead, does that still count as a violation?
Like, instead of writing X, we say, Author A from source B says ".....X...."
It should be good enough for testing.

I think Earwig is right. As the task description says, we need to be using non-copyrighted sources for the test cases (even though that is rather non-intuitive for testing a copyright violation detector). The policies on Wikipedia are clear about not hosting copyrighted content (even for testing purposes). And extended quotes are still considered copyright violations if they are not done with permission. My suggestion would be to try using sources from Wikisource. Just make sure that the Copyvio Detector actually detects them as copyright violations (even though technically, they aren't). Alternatively, if there's anything on the internet (outside of Wikipedia) that you've written yourself, you can always use that since you are free to release your rights.

kaldari reopened this task as Open. Sep 26 2015, 7:57 AM

Oh, good point on that last one. I can definitely use posts from my own blog. Will try that.

All done now.

Earwig closed this task as Resolved. Sep 27 2015, 9:02 AM

DannyH moved this task from Needs Discussion to Archive on the Community-Tech board. Nov 3 2015, 11:44 PM

DannyH removed a project: Community-Tech-Sprint. Nov 3 2015, 11:49 PM
