
[AOI] Create a test suite for Copyvio Detector
Closed, Resolved | Public | 2 Estimated Story Points

Description

Per T108422, it would be nice to have a suite of pages that could be used to consistently test Copyvio Detector's violation confidence algorithm. These could be set up as subpages under https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/.

The test suite should include:

For pages that actually contain plagiarism, make sure they are plagiarized from public domain sources that are not listed at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Exclusions.

For pages based on existing Wikipedia articles, make sure that you attribute the articles to their original URL in the edit summary so that CC attribution isn't violated.

Event Timeline

kaldari raised the priority of this task from to Low.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari subscribed.
kaldari set Security to None.
kaldari moved this task from Older: Team Work to Ready on the Community-Tech board.
kaldari updated the task description.
kaldari edited a custom field.

I will likely work on this on my own over the next couple of weeks. It'll be useful for other improvements that I plan to make to the comparison engine.

@NiharikaKohli: It looks like Copyvio Detector doesn't know about the source used in tests 4 and 6 for some reason. Might want to use a different source for those. I also added a 7th test for a very long unplagiarized page (which increases the chance for lots of small matches).

@kaldari, okay. But shouldn't not knowing the source for a test page be considered an error in Copyvios?

@NiharikaKohli: Well, more like an error in the search API it's using, which we can't really do much about (besides T110144).

This is very useful. Thanks!

The seventh test, a very long unplagiarized page, shows the main issue with the comparison engine that I need to work on. It's clear from the highlighting that there's no real connection, but the percentage is high simply because of the sheer number of individual trigrams.
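To illustrate the effect described above, here is a minimal sketch. It is not Copyvio Detector's actual scoring code; the function names, the sample texts, and the "longest run" measure are all assumptions made for illustration. It counts how many word trigrams from an article also appear in a source, and separately tracks the longest unbroken run of matching trigrams. A long page with many scattered, coincidental matches can out-score a short page that really is copied if only the raw count is considered.

```python
def word_trigrams(text):
    """Return the consecutive three-word sequences in text, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def match_stats(article, source):
    """Return (matched trigram count, longest unbroken run of matches).

    Hypothetical helper for illustration only; not the real engine's metric.
    """
    source_set = set(word_trigrams(source))
    matched = 0
    longest_run = 0
    run = 0
    for trigram in word_trigrams(article):
        if trigram in source_set:
            matched += 1
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0
    return matched, longest_run


# Made-up sample data.
source = "the city was founded in the year of the great flood"

# A long, unrelated page that happens to reuse one common phrase many times.
long_unrelated = " ".join(
    f"alpha{k} beta{k} in the year gamma{k}" for k in range(20)
)

# A short page that copies the source sentence outright.
short_copied = "historians say " + source

print(match_stats(long_unrelated, source))  # (20, 1)
print(match_stats(short_copied, source))    # (9, 9)
```

In this toy case the unrelated page has more matching trigrams (20 versus 9), but none of them are adjacent, while the copied page matches in one continuous run. Whether a run-length signal like this is the right fix for the comparison engine is an open question; the sketch only shows why a raw trigram count grows with page length.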

@kaldari, I've updated pages 4 and 6 to use different sources. I think the issue was that I was using an article published on Medium as the source, which is technically a blog. I don't think blogs are checked as possible plagiarism sources.

@NiharikaKohli: Looks good to me. I also added {{NOINDEX}} templates to the pages so they don't get indexed by Google.

Now at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests. Did some cleanup and added a few new tests.

@NiharikaKohli, @kaldari: I've thought about this a little further. I am concerned about test cases 2-6 (the main violations). We're citing sources, yes, but this doesn't negate the fact that we're copying content from non-PD/CC-licensed websites. I don't think NOINDEX + an explanation + "it's for testing purposes" is good enough. I also don't think this would count as fair use.

@Earwig - If we change the text to be a quote from the respective article instead, does that still count as a violation?
Like, instead of writing X, we say, Author A from source B says ".....X...."
It should be good enough for testing.

I think Earwig is right. As the task description says, we need to be using non-copyrighted sources for the test cases (even though that is rather non-intuitive for testing a copyright violation detector). The policies on Wikipedia are clear about not hosting copyrighted content (even for testing purposes). And extended quotes are still considered copyright violations if they are not done with permission. My suggestion would be to try using sources from Wikisource. Just make sure that the Copyvio Detector actually detects them as copyright violations (even though technically, they aren't). Alternatively, if there's anything on the internet (outside of Wikipedia) that you've written yourself, you can always use that since you are free to release your rights.

Oh, good point on that last one. I can definitely use posts from my own blog. Will try that.