Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool [AOI]
Closed, Resolved | Public | 8 Estimated Story Points

Description

Plagiabot provides an API for testing articles with the Turnitin engine (see https://en.wikipedia.org/wiki/Wikipedia:Turnitin). Here is an example of the API output for a specific article: http://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs&page_title=Rajesh_Khanna&report=1. (It returns an array of 1 or more potential violations.)
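A minimal sketch of how a client might build the request shown above and read the result. It assumes the endpoint returns JSON, which the task does not confirm (it only says an array of potential violations comes back). The helper names are hypothetical, and because the fields of each entry are not specified here, the code treats entries as opaque.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in the task description.
PLAGIABOT_API = "http://tools.wmflabs.org/eranbot/plagiabot/api.py"


def build_suspected_diffs_url(page_title, report=True):
    """Build a suspected_diffs query URL for one page title (hypothetical helper)."""
    params = {"action": "suspected_diffs", "page_title": page_title}
    if report:
        params["report"] = 1
    return PLAGIABOT_API + "?" + urlencode(params)


def fetch_suspected_diffs(page_title, timeout=10):
    """Fetch and decode the response. Assumes a JSON body (not confirmed here)."""
    with urlopen(build_suspected_diffs_url(page_title), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping URL construction separate from the network call makes the request easy to check without hitting the live service.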

It would be great if the Copyvio Detector tool (https://tools.wmflabs.org/copyvios/) had the option of using Turnitin as well as Yahoo BOSS for detecting possible copyright violations.

Acceptance criteria:

  • In the "Copyvio search" options, add a new option for "Use Turnitin" (off by default for now)
  • If "Use Turnitin" is checked, add an extra box to the output (between the generation-time div and the cv-result div) that shows the results from the Plagiabot query.
  • If there are no matches from the Plagiabot query, the div should use class=green-box and say something like "Turnitin found no matching sources."
  • If there are matches from the Plagiabot query, the div should use class=red-box and say something like "Turnitin found sources that may have been plagiarized. Please review them." It should then include output similar to the Source column at https://en.wikipedia.org/wiki/User:EranBot/Copyright/2#Added, but in a nicer format. Specifically, it should include a link to the full report followed by a tabular display of the source matches, confidence, etc. Try to make it fairly similar to the formatting of the existing Copyvio Detector tool output.
  • Do not feed the results from Plagiabot into the Copyvio Detector's list of sources to check.
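The display rules above could be sketched roughly as follows. This is only an illustration, not the tool's actual template code: the function name and the "url" and "confidence" keys are hypothetical, since the task does not give the Plagiabot field names. The function only renders output and never passes the matches on as sources to check.

```python
from html import escape


def render_turnitin_box(matches, report_url=None):
    """Return HTML for the Turnitin result box.

    `matches` is a list of dicts with hypothetical keys "url" and
    "confidence"; the real Plagiabot field names may differ.
    """
    if not matches:
        return '<div class="green-box">Turnitin found no matching sources.</div>'

    parts = [
        '<div class="red-box">',
        "Turnitin found sources that may have been plagiarized. "
        "Please review them.",
    ]
    if report_url:
        # Link to the full report comes before the table of matches.
        parts.append('<a href="%s">Full report</a>' % escape(report_url, quote=True))
    parts.append("<table><tr><th>Source</th><th>Confidence</th></tr>")
    for match in matches:
        url = escape(str(match.get("url", "")), quote=True)
        confidence = escape(str(match.get("confidence", "")))
        parts.append(
            '<tr><td><a href="%s">%s</a></td><td>%s</td></tr>'
            % (url, url, confidence)
        )
    parts.append("</table></div>")
    return "\n".join(parts)
```

Escaping every value matters here because source URLs come from an external service and end up inside HTML attributes.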

Source code for Copyvio Detector tool: https://github.com/earwig/copyvios

Event Timeline

kaldari raised the priority of this task from to Needs Triage.
kaldari updated the task description.
kaldari subscribed.

The Plagiabot API source code is at https://github.com/valhallasw/plagiabot/tree/master/webservice (in case it needs to be modified or extended).

kaldari renamed this task from Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool to [AOI] Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool. Aug 25 2015, 12:28 AM
kaldari moved this task from New & TBD Tickets to Blocked on the Community-Tech board.
kaldari updated the task description.
kaldari added a subscriber: Earwig.
kaldari updated the task description.
kaldari updated the task description.

From the sprint kick-off meeting:
Support: Medium
Feasibility: Good
Impact: Medium
Risk: Medium/Low

Priority: Normal

kaldari triaged this task as Medium priority. Aug 25 2015, 6:13 PM
kaldari moved this task from Blocked to Ready on the Community-Tech board.

@eranroz: Are all of the sources flagged by Turnitin publicly accessible? In other words, is it always possible to feed them into the Copyvio Detector tool's diff engine?

This comment was removed by kaldari.

@kaldari: Not all Turnitin sources are publicly accessible. People usually copy from publicly accessible sources, but sometimes from closed ones (we had a few cases where people copied from a closed source and we caught it only with the Turnitin tool).

@eranroz When the matched source is not publicly accessible, what does the plagiabot API return for source URL? Nothing? The URL of the paywall gate? The actual source URL which then redirects to a paywall gate? Does it vary depending on the source?

So it looks like in some cases, the URL returned by the plagiabot API is actually a redirect to a paywall gate (for example http://dx.doi.org/10.1016/j.jbiomech.2010.09.033) which does not contain the source text. In those cases, we would not be able to feed the text into Copyvio Detector (See for example, https://tools.wmflabs.org/copyvios/?lang=en&project=wikipedia&title=Tendinosis&oldid=&action=compare&url=http%3A%2F%2Fdx.doi.org%2F10.1016%2Fj.jbiomech.2010.09.033). Unfortunately, it doesn't look like the API warns us when that is the case, so we have to assume that any URLs returned by plagiabot are potentially URLs to paywall gates. That means that this task as currently defined is not feasible. I'm going to rewrite it according to @Fhocutt's suggestion.

@Earwig: How do the new acceptance criteria in the task description sound? Does that seem like a good solution? Any suggestions for tweaking it?

Sounds fine. I'm not sure about putting the Turnitin results above the main result summary, but that's a nitpick.

@Earwig: I'm definitely open to other suggestions. I just couldn't think of anywhere better to put it.

DannyH renamed this task from [AOI] Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool to Integrate Turnitin (as used in Plagiabot) into Copyvio Detector tool [AOI]. Oct 28 2015, 7:05 PM

@Earwig: What's the best way for me to contribute to this? Is there a development repository, is there a good way to run it locally, should I get a copy set up on the community-tech-tools project? Also, thank you for the thorough documentation on Copyvios and EarwigBot. I ran into some issues but it's still much easier so far than some other tools I've worked with.

To be honest, I'm struggling with free time right now. Not sure the best way for you to approach this.

For one, probably best to do most of the work in the main copyvios repo rather than earwigbot (even though the current copyvio detection code is mostly in there). I suggest the standard GitHub fork-and-pull model, unless that doesn't work for you.

No development repo other than the ones you know about already. Sorry, but I don't know what the community-tech-tools project is exactly.

Let me know if you have further issues getting components running. It's a bit of a hodgepodge since I hadn't planned on others doing everything from scratch like this too often. It should work offline with flask alone (i.e. just python app.py from wherever you have the copyvios repo cloned), but I haven't tested that in a while, so there might be some problems. You also won't be able to do regular search engine checks without a key from Yahoo. That shouldn't be a major issue.

All right! Thanks for taking the time to reply here.

GitHub fork-and-pull is fine, I wanted to make sure there wasn't something you were already doing. The community-tech-tools project is our team's project on Labs for our dev work; if I need to have this on Labs rather than localhost, that's where I'll put it.

It's sort of up and running on localhost--the front page loads, and it will tell me that a nonexistent article does not exist. I didn't get uglifycss working; I copied and renamed the files. I don't think the dropdown to choose the wiki works--everything goes to the one set in the config--and when I put in an article title that does exist I get database errors:

Traceback (most recent call last):
  File "./app.py", line 38, in inner
    return func(*args, **kwargs)
  File "./app.py", line 104, in index
    query = do_check()
  File "/home/name/bots/copyvios/copyvios/checker.py", line 37, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/home/name/bots/copyvios/copyvios/checker.py", line 63, in _get_results
    conn = get_db()
  File "/home/name/bots/copyvios/copyvios/misc.py", line 45, in get_db
    args = cache.bot.config.wiki["_copyviosSQL"]
  File "/home/name/bots/copyvios/earwigbot/earwigbot/config/node.py", line 41, in __getitem__
    return self._data[key]
KeyError: '_copyviosSQL'

But my .earwigbot/config.yml file does have this:

_copyviosSQL:
    host: 127.0.0.1
    db: u_earwig_afc_copyvios
    user: root
    password: ''

I'll keep poking at that, but if there's an obvious solution, please let me know.

You probably didn't put it in the "wiki" section.
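A small illustration of why the placement matters, using plain dicts as a stand-in for the parsed config.yml (these are not EarwigBot's actual config classes, and the helper name is hypothetical). The traceback shows the lookup going through the "wiki" section, so a top-level `_copyviosSQL` key is never found.

```python
# Simplified stand-ins for the parsed config.yml; plain dicts only.
SQL_SETTINGS = {
    "host": "127.0.0.1",
    "db": "u_earwig_afc_copyvios",
    "user": "root",
    "password": "",
}

# Key placed at the top level: the lookup below fails.
misplaced = {"wiki": {}, "_copyviosSQL": SQL_SETTINGS}

# Key nested under "wiki": the lookup below succeeds.
nested = {"wiki": {"_copyviosSQL": SQL_SETTINGS}}


def copyvios_sql_settings(config):
    """Mirror the lookup from the traceback: config.wiki["_copyviosSQL"]."""
    return config["wiki"]["_copyviosSQL"]
```

Calling `copyvios_sql_settings(misplaced)` raises `KeyError: '_copyviosSQL'`, matching the error in the traceback, while `copyvios_sql_settings(nested)` returns the settings.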

That was it, thanks. Now I'm getting a KeyError in _get_search_engine (stack trace below in case it has extra info)--it looks like I do need to specify an engine and credentials for any search to work. What's the best way to do that? Can I use the BOSS creds, or is there another alternative?

Traceback (most recent call last):
  File "./app.py", line 38, in inner
    return func(*args, **kwargs)
  File "./app.py", line 104, in index
    query = do_check()
  File "/home/name/bots/copyvios/copyvios/checker.py", line 37, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/home/name/bots/copyvios/copyvios/checker.py", line 78, in _get_results
    short_circuit=not query.noskip)
  File "/home/name/bots/copyvios/earwigbot/earwigbot/wiki/copyvios/__init__.py", line 116, in copyvio_check
    searcher = self._get_search_engine()
  File "/home/name/bots/copyvios/earwigbot/earwigbot/wiki/copyvios/__init__.py", line 65, in _get_search_engine
    engine = self._search_config["engine"]
KeyError: 'engine'

Hmm... I guess you can ask @coren for a BOSS key for testing? Alternatively, disable part of EarwigBot: in earwigbot/wiki/copyvios/__init__.py, comment out line 116 and change line 133 to "if True:". That should make it just report "no match" for everything. I might add a more graceful fallback in the future.
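One possible shape for such a fallback, purely as a sketch: this is not EarwigBot's code, and both function names and the `search` callable are hypothetical. The idea is to treat a missing "engine" setting as "no search sources" rather than letting the KeyError propagate.

```python
def get_engine_name(search_config):
    """Return the configured engine name, or None when none is set."""
    return search_config.get("engine")


def find_sources(search_config, search):
    """Run `search(engine_name)` if an engine is configured.

    With no engine configured, return an empty list, which a caller
    could report as "no match".
    """
    engine = get_engine_name(search_config)
    if engine is None:
        return []
    return search(engine)
```

For example, `find_sources({}, search)` returns `[]` without calling `search` at all.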

Sorry Coren, I didn't really mean to add you as a subscriber...!

@Fhocutt: Is it necessary to get the search feature working for this particular task? We should be able to run the Turnitin/Plagiabot request without hitting Yahoo BOSS at all.

@kaldari Still useful to test how the results look when combined with the regular BOSS hits, I guess?

Hmm... I guess you can ask @coren for a BOSS key for testing?

Can you clarify that a bit, @Earwig? Does @coren specifically have a BOSS key that can be handed to @Fhocutt? Did you want him to use his root powers to give @Fhocutt access to the tool?

@yuvipanda

Okay, so Coren's been the point of contact in the past between me and the WMF with regards to managing the Yahoo! BOSS API keys that are necessary to use that service. As far as I know, he still has that role. I was suggesting that he could create a new key for Fhocutt for developing/testing this new feature (since sharing of keys doesn't sound like a good idea, although we could do that too, I guess).

Nothing about root powers here; I could add her myself if I wanted to, right? I don't think we need that for now.

Ah, alright. That's clearer, thank you!

@Earwig: thanks for the dev fixes! That's working now.

This is... done, I think. I want to hack on the visual output further, but it works.