
Include scores in CopyPatrol interface for each source URL
Closed, Resolved · Public · 2 Estimated Story Points

Description

Eranbot has scores associated with each source indicating how many words are duplicated and what percentage of the content that represents. The score appears in the form "XXX% XXX words" in the report HTML. We'll probably have to use a regex to parse this out of the HTML (unless Eran wants to expose this data directly in the API).
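A minimal sketch of what that regex parsing could look like, in Python. The surrounding markup is an assumption, since the report HTML isn't shown here; only the "XXX% XXX words" shape comes from the task description:

```python
import re

# Assumed format: the score appears literally as e.g. "91% 114 words"
# somewhere near each source link in the report HTML.
SCORE_RE = re.compile(r"(\d{1,3})%\s+(\d+)\s+words")

def parse_score(html_fragment):
    """Return (percent, words) as ints, or None if no score is present."""
    match = SCORE_RE.search(html_fragment)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Example: parse_score('<a href="...">source</a> (91% 114 words)')
# returns (91, 114).
```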

Once we have the scores, they should be visually associated with each source in the CopyPatrol interface.

[Attachment: copy patrol - alternate wikiproject display.jpg (wireframe, 96 KB)]

Event Timeline

kaldari triaged this task as Medium priority.Jul 1 2016, 10:38 PM
kaldari moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.

This was discussed in one of our recent meetings. The scores shown by Eranbot hardly ever match up with the actual scores produced by Copyvios. You can verify this by browsing through Copyright/rc and checking the scores Copyvios produces. It is very frequently 0.0% when Turnitin gives us 80-90%. That is mostly because Turnitin uses its own parsed version of the source page, which may be several years old.
I am not sure we should be showing data that is mostly incorrect to users.

Hmm, I wonder how these scores are calculated. We could at least show the number of words that are matching.

Those scores are probably correct, but they hold true for *some* version of the source page that Turnitin has cached. They mostly no longer hold true for the current version of the page.

When we do the query to Copyvios, it gives us back a match percentage. We could use that, except that we don't actually have it in hand until we run the query.
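For reference, a hedged sketch of such a query. The endpoint, parameters, and response fields are assumptions based on the public Copyvios API (action=compare) and may not match what CopyPatrol ends up using:

```python
import requests

# Assumed Copyvios endpoint; the tool has also lived at
# tools.wmflabs.org/copyvios in the past.
API = "https://copyvios.toolforge.org/api.json"

def copyvios_confidence(lang, oldid, url):
    """Ask Copyvios to compare a revision against a source URL and
    return the match confidence, or None on error."""
    params = {
        "version": 1,
        "action": "compare",
        "project": "wikipedia",
        "lang": lang,
        "oldid": oldid,
        "url": url,
    }
    resp = requests.get(API, params=params, timeout=30)
    data = resp.json()
    if data.get("error"):
        return None
    # "best.confidence" is an assumed field name for the match score.
    best = data.get("best") or {}
    return best.get("confidence")
```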

I don't think the exact accuracy of the scores really matters too much. The main thing the scores provide is a quick way to evaluate the list of sources. For example, if there are 3 sources given and they all have the same score of "91% 114 words", you can instantly know that all 3 sources have the same content and there is no reason to look at all 3 of them separately. This is actually a very common scenario. Also, a source having a score of "100% 1270 words" is obviously a more serious case than a source having a score of "52% 66 words", so it may warrant more attention. When there are hundreds of reports to sort through, this is really helpful information, even if sometimes it is out of date or not completely accurate, as it still gives a general idea of the severity of the plagiarism.

DannyH set the point value for this task to 2.Jul 5 2016, 5:53 PM
DannyH subscribed.

Okay, wireframe added above with the scores in front of the compare links.

MusikAnimal added a subscriber: Earwig.

Pull request at https://github.com/Niharika29/PlagiabotWeb/pull/23/files and running on plagiabot

Please note the comments regarding the revised regular expression. I'm not sure whether I broke anything, but everything seems to work.

With this work I also made quite a discovery... as we know, sometimes when you hit "Compare" nothing in the diff matches the source. This (most likely) is not because Turnitin used a different version of the source, or because Copyvios used the wrong version of the article; rather, it's the source website returning different content because Copyvios' user agent is bot-like. Apparently there are other scenarios where the source does not like the IP Copyvios is using, or something along those lines. I talked to @Earwig about this, and I guess there's not much we can do (correct me if I'm wrong). The good news, however, is that in my testing the scores and percentages added with this branch appear to be accurate. So as a reviewer, you can rely on them to find the most probable violations. We may wish to document why the comparison sometimes doesn't match up with the scores. cc @DannyH

@Earwig, @MusikAnimal: If it's mostly a matter of the user-agent, couldn't we just spoof the user-agent string of a browser, like we do with InternetArchiveBot? Is there more to it than that?
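(For clarity, spoofing here would just mean sending a browser-like User-Agent header with the fetch. A minimal Python sketch; the header value and URL are illustrative only, not what InternetArchiveBot actually sends:)

```python
import requests

# Example browser-like User-Agent string; purely illustrative.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0 Safari/537.36"
)

# Fetch the source page while presenting a browser-like identity.
resp = requests.get("https://example.com/source-page",
                    headers={"User-Agent": BROWSER_UA},
                    timeout=30)
```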

It wasn't a user-agent issue, but something else that's hard to explain. Anyway, I fixed it.