
Include scores in CopyPatrol interface for each source URL
Closed, Resolved · Public · 2 Estimated Story Points

Description

Eranbot has scores associated with each source indicating how many words are duplicated and what percentage of the content that represents. The score appears in the form "XXX% XXX words" in the report HTML. We'll probably have to use a regex to parse this out of the HTML (unless Eran wants to expose this data directly in the API).
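A minimal sketch of what that regex parsing could look like, in Python. The surrounding markup is an assumption, since the report HTML isn't shown here; only the "XXX% XXX words" shape comes from the task description:

```python
import re

# Assumed format: the score appears literally as e.g. "91% 114 words"
# somewhere near each source link in the report HTML.
SCORE_RE = re.compile(r"(\d{1,3})%\s+(\d+)\s+words")

def parse_score(html_fragment):
    """Return (percent, words) as ints, or None if no score is present."""
    match = SCORE_RE.search(html_fragment)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Example: parse_score('<a href="...">source</a> (91% 114 words)')
# returns (91, 114).
```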

Once we have the scores, they should be visually associated with each source in the CopyPatrol interface.

[Attachment: copy patrol - alternate wikiproject display.jpg (wireframe, 96 KB)]

Event Timeline

kaldari triaged this task as Medium priority.Jul 1 2016, 10:38 PM
kaldari moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.

This was discussed in one of our recent meetings. The scores shown by Eranbot hardly ever match up with the actual scores produced by Copyvios. You can verify this by browsing through Copyright/rc and checking the scores Copyvios produces. It is very frequently 0.0% when Turnitin gives us 80-90%. That is mostly because Turnitin uses its own parsed version of the source page, which may be several years old.
I am not sure we should be showing data that is mostly incorrect to users.

Hmm, I wonder how these scores are calculated. We could at least show the number of words that are matching.

Those scores are probably correct, but they hold true for *some* version of the source page that Turnitin has cached. They mostly no longer hold true for the current version of the page.

When we do the query to Copyvios, it gives us back a match percentage. We could use that, except that we don't actually have it in hand until we run the query.
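For reference, a hedged sketch of such a query. The endpoint, parameters, and response fields are assumptions based on the public Copyvios API (action=compare) and may not match what CopyPatrol ends up using:

```python
import requests

# Assumed Copyvios endpoint; the tool has also lived at
# tools.wmflabs.org/copyvios in the past.
API = "https://copyvios.toolforge.org/api.json"

def copyvios_confidence(lang, oldid, url):
    """Ask Copyvios to compare a revision against a source URL and
    return the match confidence, or None on error."""
    params = {
        "version": 1,
        "action": "compare",
        "project": "wikipedia",
        "lang": lang,
        "oldid": oldid,
        "url": url,
    }
    resp = requests.get(API, params=params, timeout=30)
    data = resp.json()
    if data.get("error"):
        return None
    # "best.confidence" is an assumed field name for the match score.
    best = data.get("best") or {}
    return best.get("confidence")
```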

I don't think the exact accuracy of the scores really matters too much. The main thing the scores provide is a quick way to evaluate the list of sources. For example, if there are 3 sources given and they all have the same score of "91% 114 words", you can instantly know that all 3 sources have the same content and there is no reason to look at all 3 of them separately. This is actually a very common scenario. Also, a source having a score of "100% 1270 words" is obviously a more serious case than a source having a score of "52% 66 words", so it may warrant more attention. When there are hundreds of reports to sort through, this is really helpful information, even if sometimes it is out of date or not completely accurate, as it still gives a general idea of the severity of the plagiarism.

DannyH set the point value for this task to 2.Jul 5 2016, 5:53 PM
DannyH subscribed.

Okay, wireframe added above with the scores in front of the compare links.

MusikAnimal added a subscriber: Earwig.

Pull request at https://github.com/Niharika29/PlagiabotWeb/pull/23/files and running on plagiabot

Please note the comments regarding the revised regular expression. I'm not sure whether I broke anything, but everything seems to work.

With this work I also made quite a discovery... as we know, sometimes when you hit "Compare" nothing in the diff matches the source. This (most likely) is not because Turnitin used a different version of the source, or because Copyvios used the wrong version of the article; rather, it's the source website returning different content because Copyvios' user agent is bot-like. Apparently there are other scenarios where the source does not like the IP Copyvios is using, or something along those lines. I talked to @Earwig about this, and I guess there's not much we can do (correct me if I'm wrong). The good news, however, is that in my testing the scores and percentages added with this branch appear to be accurate. So as a reviewer, you can rely on them to find the most probable violations. We may wish to document why the comparison sometimes doesn't match up with the scores. cc @DannyH

@Earwig, @MusikAnimal: If it's mostly a matter of the user-agent, couldn't we just spoof the user-agent string of a browser, like we do with InternetArchiveBot? Is there more to it than that?
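(For clarity, spoofing here would just mean sending a browser-like User-Agent header with the fetch. A minimal Python sketch; the header value and URL are illustrative only, not what InternetArchiveBot actually sends:)

```python
import requests

# Example browser-like User-Agent string; purely illustrative.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0 Safari/537.36"
)

# Fetch the source page while presenting a browser-like identity.
resp = requests.get("https://example.com/source-page",
                    headers={"User-Agent": BROWSER_UA},
                    timeout=30)
```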

It wasn't a user-agent issue, but something else that's hard to explain. Anyway, I fixed it.