In preparation for T193782, we need to think through how to apply copyright violation predictions to pages awaiting review via [[ https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Articles_for_creation | AfC ]] and [[ https://en.wikipedia.org/wiki/Wikipedia:New_pages_patrol | NPP ]], specifically in the [[ https://en.wikipedia.org/wiki/Special:NewPagesFeed | New Pages Feed ]] interface.
There are several existing or former tools with varying methods of determining and surfacing likelihoods of copyright violation:
* [[ https://tools.wmflabs.org/copyvios/ | Earwig's Copyvio Detector ]]: can use Google or Turnitin. Has an [[ https://tools.wmflabs.org/copyvios/api | API ]].
* [[ https://tools.wmflabs.org/copypatrol/en | CopyPatrol ]]: uses Turnitin. Built by Community Tech team.
* [[ https://en.wikipedia.org/wiki/User:CorenSearchBot | CorenSearchBot ]]: defunct bot that automatically checked new pages for copyright violations and tagged suspected violations with templates. Used Yahoo search. More information [[ https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/CorenSearchBot | here ]].
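As a rough illustration of what integrating with one of these tools might look like, here is a sketch of querying Earwig's Copyvio Detector API for a page's copyvio confidence score. The parameter names (`version`, `action`, `project`, `lang`, `title`) and the `best.confidence` response field are assumptions based on the API page linked above and should be verified against the live documentation before relying on them:

```python
import json
from urllib.parse import urlencode

# Assumed JSON endpoint for Earwig's Copyvio Detector; verify against
# https://tools.wmflabs.org/copyvios/api before use.
API_URL = "https://tools.wmflabs.org/copyvios/api.json"

def build_copyvio_query(title, lang="en", project="wikipedia"):
    """Build a request URL asking the detector to score one page.

    Parameter names are taken from the tool's API documentation and
    may change; treat them as illustrative.
    """
    params = {
        "version": 1,
        "action": "search",
        "project": project,
        "lang": lang,
        "title": title,
    }
    return API_URL + "?" + urlencode(params)

def extract_confidence(response_text):
    """Pull the best-match confidence score (0.0-1.0) from a response body.

    Returns None if no match was reported. Field names are assumptions.
    """
    data = json.loads(response_text)
    best = data.get("best") or {}
    return best.get("confidence")

# Example with a mocked response body (no network call):
sample = '{"status": "ok", "best": {"url": "http://example.com", "confidence": 0.91}}'
print(build_copyvio_query("Sandbox"))
print(extract_confidence(sample))  # 0.91
```

Earwig's tool can also score a specific revision rather than a title, which would matter for keeping scores in sync with the latest edit.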
The following are relevant user stories that we should plan for, as of 2018-05-03:
* As a reviewer, I need to be able to filter the New Pages Feed to likely copyright violations, corresponding to some pre-set cutoff of the copyvio likelihood score.
* As a reviewer, I need to be able to sort the New Pages Feed by the copyvio score.
* As a reviewer, I need a page's copyvio score to be displayed with its entry in the New Pages Feed list.
* As a reviewer, I need to be able to click through to see more information about a page's copyvio score, specifically the likely violating text and its source, similar to how Earwig's Copyvio Detector and CopyPatrol work today.
* As a reviewer, I need all pages listed in the New Pages Feed to be sortable and filterable by copyvio score, regardless of namespace.
* As a reviewer, I need copyvio scores to be up-to-date with the latest revision of a page at all times.
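The filter and sort stories above can be sketched in a few lines. This is a minimal illustration with a hypothetical data model (the real New Pages Feed is backed by the PageTriage extension, and the field names and cutoff value here are placeholders, not product decisions):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeedEntry:
    """Hypothetical New Pages Feed record, for illustration only."""
    title: str
    namespace: int
    copyvio_score: Optional[float]  # None until the latest revision is scored

COPYVIO_CUTOFF = 0.75  # illustrative pre-set cutoff

def likely_copyvios(entries: List[FeedEntry]) -> List[FeedEntry]:
    """Filter to entries at or above the cutoff, in any namespace,
    then sort highest score first so the likeliest violations surface on top."""
    flagged = [
        e for e in entries
        if e.copyvio_score is not None and e.copyvio_score >= COPYVIO_CUTOFF
    ]
    return sorted(flagged, key=lambda e: e.copyvio_score, reverse=True)

feed = [
    FeedEntry("Draft:Widget Co", 118, 0.92),
    FeedEntry("User:Example/sandbox", 2, 0.80),
    FeedEntry("New article", 0, 0.10),
    FeedEntry("Unscored page", 0, None),
]
for e in likely_copyvios(feed):
    print(e.title, e.copyvio_score)
```

Note the unscored entry is excluded rather than treated as zero, which connects to the story about scores staying current with the latest revision.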
If I (Marshall) can be helpful communicating with any of the external services (e.g. Google and Turnitin) to do this investigation, please let me know.
Some technical considerations that have been brought up so far in discussion of these user stories (though there are likely many more):
* Will we run into usage limits for external services, like Google and Turnitin?
* Will results be returned quickly enough for reviewers to use the feed without noticeable delay?
* What will happen if we run copyvio scoring on the User namespace, which is currently accessible in the New Pages Feed?
Next steps for this investigation:
* Create a list of Phabricator tickets outlining a rough implementation plan
* Identify and document any dependencies and risks
* Answer the technical consideration questions above