As the investigation on the parent task continues, it is becoming clear that a copyvio solution will rely on either Google or Turnitin to detect violations. To help with the decision, a valuable data point would be a benchmark of the two services against each other in terms of their approximate precision and recall. In other words, the questions are:
- How many copyright violations detected by Google are also found by Turnitin?
- How many copyright violations detected by Turnitin are also found by Google?
- How common are false positives in each of the two services?
- Nice-to-have: what is the response time like for these two services?
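The first three questions above reduce to set overlap and standard precision/recall once we have per-page results and a human-reviewed ground truth. A minimal sketch, using hypothetical page titles and made-up flagging results (the real inputs would come from running each service on the sample):

```python
# Hypothetical per-page results; real data would come from running each
# service on the sampled pages.
google_flagged = {"Page A", "Page B", "Page C"}
turnitin_flagged = {"Page B", "Page C", "Page D"}
# Ground truth from human review: pages confirmed as actual violations.
true_violations = {"Page A", "Page B", "Page C"}

def precision_recall(flagged, truth):
    """Precision: share of flagged pages that are real violations.
    Recall: share of real violations that were flagged."""
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Cross-service overlap answers the first two questions directly.
print("Flagged by both:", sorted(google_flagged & turnitin_flagged))
print("Google-only:", sorted(google_flagged - turnitin_flagged))
print("Turnitin-only:", sorted(turnitin_flagged - google_flagged))

# Comparing each service against the ground truth answers the third.
print("Google precision/recall:", precision_recall(google_flagged, true_violations))
print("Turnitin precision/recall:", precision_recall(turnitin_flagged, true_violations))
```

False positives fall out of the same data: they are the flagged pages not in the ground-truth set, so a low precision score directly indicates a high false-positive rate.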
To do this, I recommend running both services on a sample of pages drawn from both New Page Review and submitted AfC drafts. This sample could contain 250 random pages from each of those two sources. Perhaps the APIs for Earwig's Copyvio Detector or CopyPatrol can help.
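Drawing the sample itself is straightforward once we have candidate lists from the two sources. A sketch, using placeholder titles in place of the real New Page Review and AfC feeds, and a fixed seed so the sample is reproducible:

```python
import random

random.seed(42)  # fixed seed so the sample can be reproduced

# Placeholder candidate pools; in practice these would be fetched from
# the New Pages feed and the pending AfC submissions.
npp_candidates = [f"NPP page {i}" for i in range(5000)]
afc_candidates = [f"AfC draft {i}" for i in range(1200)]

# 250 random pages from each source, per the proposal above.
sample = random.sample(npp_candidates, 250) + random.sample(afc_candidates, 250)
print(len(sample))  # 500 pages total to score with both services
```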
After the pages are scored, we'll want to assemble the results into a short report, with both visuals and text, that we can post on-wiki for the reviewing communities to read.