Make plagiabot API output report id, matching urls, % match, and number of words as separate data
Closed, Declined | Public | 5 Estimated Story Points

Description

See https://github.com/valhallasw/plagiabot/issues/2.

Current API output looks like:

[{'lang': 'en', 'page_ns': '0', 'diff_timestamp': '20150816171215', 'ithenticate_id': '19033870', 'project': 'wikipedia', 'report': '<div class="mw-ui-button">[tools.wmflabs.org/eranbot/ithenticate.py?rid=19033870 report]</div>\n* I 64% 61 words at [http://8xm.tv/birthday-of-a-legend-and-his-daughter/ http://8xm.tv/birthday-of-a-legend-and-his-daughter/] <div class="mw-ui-button">[tools.wmflabs.org/copyvios?lang={{subst:CONTENTLANG}}&project={{lc:{{ns:Project}}}}&title=&oldid=676384764&action=compare&url=http://8xm.tv/birthday-of-a-legend-and-his-daughter/ Compare]</div>', 'diff': '676384764'}]

Example API request: http://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs&page_title=Rajesh_Khanna&report=1
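
For reference, a minimal sketch of how a client might build that request URL. The helper name `suspected_diffs_url` is hypothetical; the base URL and the `action`, `page_title` and `report` parameters are taken only from the example request above, and nothing else about the API is assumed.

```python
from urllib.parse import urlencode

# Base URL copied from the example request in this task.
API_BASE = "http://tools.wmflabs.org/eranbot/plagiabot/api.py"


def suspected_diffs_url(page_title):
    """Build a suspected_diffs request URL (hypothetical helper).

    Only the parameters visible in the example request are used.
    """
    params = {
        "action": "suspected_diffs",
        "page_title": page_title,
        "report": 1,
    }
    return API_BASE + "?" + urlencode(params)


url = suspected_diffs_url("Rajesh_Khanna")
```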

Current code lives at: https://github.com/valhallasw/plagiabot/blob/master/webservice/api.py

We should modify the API to also output report id, matching urls, % match, and number of words as separate data. This will likely require some complicated regexes on the raw results. If there is any way to get more granular data from the Turnitin/iThenticate API (http://www.ieee.org/documents/iThenticateAPIGuide.pdf), we may want to use that instead. The existing raw 'report' node in the plagiabot API should be preserved for backwards compatibility.
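
A rough sketch of what the regex approach could look like, based only on the single sample output quoted above. The function names (`parse_report`, `add_parsed_fields`) and the output keys (`report_id`, `sources`, `percent`, `words`, `url`) are made up for illustration and are not part of the existing plagiabot code. The pattern assumes every source line has the shape `* <token> NN% NN words at [url ...`, and it skips the token before the percentage (`I` in the sample) because its meaning is not documented here.

```python
import re

# Report id appears in the link to ithenticate.py, e.g. "...ithenticate.py?rid=19033870".
RID_RE = re.compile(r"ithenticate\.py\?rid=(\d+)")

# One wikitext bullet per matching source, e.g.
# "* I 64% 61 words at [http://example.org/ http://example.org/] ..."
SOURCE_RE = re.compile(
    r"^\*\s*\w+\s+(\d+)%\s+(\d+)\s+words\s+at\s+\[(\S+)",
    re.MULTILINE,
)


def parse_report(report):
    """Extract report id and per-source match data from the raw 'report' string."""
    rid = RID_RE.search(report)
    sources = [
        {"percent": int(percent), "words": int(words), "url": url}
        for percent, words, url in SOURCE_RE.findall(report)
    ]
    return {"report_id": rid.group(1) if rid else None, "sources": sources}


def add_parsed_fields(row):
    """Return a copy of an API row with parsed fields added; 'report' is kept as-is."""
    out = dict(row)
    out.update(parse_report(row.get("report", "")))
    return out


# The 'report' value from the sample output in this task.
SAMPLE_REPORT = (
    '<div class="mw-ui-button">[tools.wmflabs.org/eranbot/ithenticate.py?rid=19033870 report]</div>\n'
    '* I 64% 61 words at [http://8xm.tv/birthday-of-a-legend-and-his-daughter/ '
    'http://8xm.tv/birthday-of-a-legend-and-his-daughter/] '
    '<div class="mw-ui-button">[tools.wmflabs.org/copyvios?lang={{subst:CONTENTLANG}}'
    '&project={{lc:{{ns:Project}}}}&title=&oldid=676384764&action=compare'
    '&url=http://8xm.tv/birthday-of-a-legend-and-his-daughter/ Compare]</div>'
)

parsed = parse_report(SAMPLE_REPORT)
```

Because `add_parsed_fields` copies the row and only adds keys, the raw 'report' node stays untouched for existing consumers. Reports whose source lines differ from the sample would simply yield an empty `sources` list rather than raise an error.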

Event Timeline

kaldari raised the priority of this task from to Needs Triage.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari subscribed.
kaldari triaged this task as Medium priority. Aug 28 2015, 9:37 PM
kaldari moved this task from New & TBD Tickets to Ready on the Community-Tech board.
kaldari set Security to None.

Moving this task to In Analysis pending an answer to the question about the public accessibility of Turnitin sources for diff analysis: T110144#1573302

kaldari updated the task description.
DannyH renamed this task from [AOI] Make plagiabot API output report id, matching urls, % match, and number of words as separate data to Make plagiabot API output report id, matching urls, % match, and number of words as separate data [AOI]. Oct 28 2015, 7:05 PM

After talking with Frances, we agreed that it would probably make more sense to implement this on the Copyvio Detector tool side, rather than the API side, since the Copyvio Detector tool is the only thing that needs this and it's going to have to be implemented as a RegEx hack either way.

Reopening, since this will be useful for the new copy and paste bot interface as well.

kaldari moved this task from Archive to Up Next (June 3-21) on the Community-Tech board.
kaldari added a subscriber: Fhocutt.
kaldari edited subscribers, added: eranroz; removed: Fhocutt.
DannyH renamed this task from Make plagiabot API output report id, matching urls, % match, and number of words as separate data [AOI] to Make plagiabot API output report id, matching urls, % match, and number of words as separate data. Apr 18 2016, 10:41 PM
DannyH subscribed.

Ryan, since we're making use of direct DB access in the app, is this task still useful?
Should we do this in the CopyPatrol app instead of trying to modify Plagiabot code?

@Niharika: Yeah, I think doing this directly in the CopyPatrol app is fine. I'll reclose.