
Investigation: benchmark Google and Turnitin for copyvio
Closed, Resolved (Public)

Description

As the investigation on the parent task continues, it is becoming clear that a copyvio solution will use either Google or Turnitin to look for violations. To help in the decision-making process, a valuable data point would come from benchmarking the two services against each other in terms of their approximate precision and recall. In other words, the questions are:

  • How many copyright violations detected by Google are also found by Turnitin?
  • How many copyright violations detected by Turnitin are also found by Google?
  • How common are false positives in the two services?
  • Nice-to-have: what is the response time like for these two services?
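As a rough sketch of how the first three questions could be answered once every sampled page has been scored by both services and checked by a human: the record layout below (the `google`, `turnitin`, and `confirmed` flags) and the function names are assumptions made for illustration, not the output format of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageResult:
    title: str
    google: bool     # hypothetical: the Google-based check flagged the page
    turnitin: bool   # hypothetical: Turnitin flagged the page
    confirmed: bool  # hypothetical: a human reviewer confirmed a violation


def share(numerator, denominator):
    """Return numerator / denominator, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def compare(results):
    """Summarize overlap and false-positive shares for the two services."""
    g_flagged = [r for r in results if r.google]
    t_flagged = [r for r in results if r.turnitin]
    g_true = [r for r in g_flagged if r.confirmed]
    t_true = [r for r in t_flagged if r.confirmed]
    return {
        # Of the confirmed violations Google flagged, how many did Turnitin also flag?
        "google_also_found_by_turnitin": share(sum(r.turnitin for r in g_true), len(g_true)),
        # Of the confirmed violations Turnitin flagged, how many did Google also flag?
        "turnitin_also_found_by_google": share(sum(r.google for r in t_true), len(t_true)),
        # Share of each service's flags that the reviewer rejected.
        "google_false_positive_share": share(sum(not r.confirmed for r in g_flagged), len(g_flagged)),
        "turnitin_false_positive_share": share(sum(not r.confirmed for r in t_flagged), len(t_flagged)),
    }
```

The false-positive share here is measured against each service's own flagged pages, so it needs a human verdict only for pages that at least one service flagged.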

To do this, I recommend that we use both services on a sample of pages from both New Page Review and from submitted AfC drafts. This sample could contain 250 random pages from each of those two sources. Perhaps the APIs related to Earwig's Copyvio Detector or CopyPatrol can help.


After the pages are scored, we'll want to assemble the results into a short report with both visuals and text that we can post on wiki for the reviewing communities to read.

Event Timeline

MaxSem removed a subscriber: MaxSem. Jun 27 2018, 9:46 PM

Your benchmark is strictly about efficiency. Do you plan to have one about communities' perception of those services?

@Trizek-WMF: We already know the community favors using Google for copyvio detection. There was a lot of discussion about this a couple of years ago when we had to replace Yahoo as the search API used by Earwig's Copyvio Detector (since Yahoo killed its search API). We discussed using Bing, Yandex, Turnitin, and Google, and the community heavily favored using Google.

Thank you for documenting it.

I was also asking because of privacy. Privacy matters a lot to users. That question may be asked, just as it was raised when we discussed using Yandex as a translation engine in CX.

Thanks for bringing this up, @Trizek-WMF and for the background, @kaldari. One other element I'll add to the mix is the discussion amongst the NPP and AfC reviewing communities. We discussed the community preferences of Turnitin and Google directly on this page, and it seems that most reviewers don't have experience with both -- only with one or the other. That's one of the reasons we want to generate this comparison data, because the discussion was inconclusive.

Privacy matters a lot to users.

And to us! :) The API is behind our own web proxy, and in this case only the backend is using it anyway (not client-side scripting), and the data we're sending it (article content) is public.

kostajh updated the task description. Jul 9 2018, 9:27 PM

How common do false positives appear in the two services?

How do you propose to determine that? You can't really tell if it's a false positive until a human has looked at it. One possible way to do this is to utilize the already reviewed records in CopyPatrol. All records in the copyright_diffs table with the status as false are essentially false positives.
But it's made a little tricky by the fact that CopyPatrol looks at edits alone and Earwig's tool looks at the entire page.

@Niharika -- thanks for bringing that up. Yes, my plan was only to look at the results myself to get a sense of the false positive rate. But that's a good idea to use the info in the copyright_diffs table. Maybe @Catrope can incorporate that.

@MMiller_WMF: You can also select "Reviewed cases" in the CopyPatrol interface if you want to comb through the resolved cases manually. I would not recommend trying to determine copyvio status yourself. It sounds easy, but is surprisingly hard.

@kostajh -- below is a SQL statement that finds drafts awaiting AfC review. This will find all of them, though I think you should only use a random sample of them for this task (maybe 100).

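-- Counts pending AfC drafts; select tp.page_id, tp.page_title instead to list them.
SELECT count(*)
FROM page tp
JOIN categorylinks cl
ON tp.page_id = cl.cl_from
WHERE cl.cl_to = 'Pending_AfC_submissions'
AND tp.page_is_redirect = 0
AND tp.page_namespace = 118; -- 118 is the Draft namespace on English Wikipedia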
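For the random sample suggested above, a minimal sketch could look like the following. It assumes the query has been adapted to return page titles rather than a count and that those titles are already loaded into a Python list; the function name, the default of 100, and the fixed seed are illustrative choices, not part of any existing tool.

```python
import random


def sample_pages(titles, k=100, seed=2018):
    """Return up to k distinct titles chosen uniformly at random.

    A fixed seed keeps the sample reproducible, so the same pages can be
    sent to both services and re-checked later.
    """
    # De-duplicate, then sort so the result depends only on the seed and
    # the set of titles, not on the order the database returned them in.
    unique = sorted(set(titles))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))
```

Using the same sampled list for both Google and Turnitin keeps the comparison page-for-page rather than comparing two different samples.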

We also want to run this test for new articles that are unreviewed in New Page Patrol, because we will be checking them for copyvio, too. I do not, however, know how to write SQL to find them. I think other teammates can help with that.

Kosta sent me the results from running 200 pages (AfC drafts and new articles) through both Google and Turnitin. I'll take this task now and update it with the results of the benchmarking.

I know that's an English Wikipedia project, but will this investigation cover other languages at some point? I'm pretty sure that other communities will ask to have NPP as well.

@Trizek-WMF -- just to follow up on this, we know that Google can handle more languages than Turnitin. That said, CopyPatrol (our tool that uses Turnitin) checks revisions in English, Czech, Spanish, and French. But there are two things that are good news with respect to other languages in the future:

  • We will be architecting our copyvio service so that third parties other than Turnitin (such as Google) could be used in the future.
  • The iThenticate website (iThenticate is part of Turnitin) says it supports 12 languages: http://www.ithenticate.com/products/faqs

Thank you for the update, @MMiller_WMF.

MMiller_WMF updated the task description.
MMiller_WMF added a subscriber: Nettrom.

@Nettrom -- this is the Phabricator task for the Google/Turnitin comparison. I'll send you the spreadsheet separately. It would be great if you could take a look at the data and generate some conclusions of your own. Then I will draft a report that you can review. I would like to post this report for the community next week (week of August 13).

Nettrom moved this task from Triage to Doing on the Product-Analytics board. Aug 16 2018, 8:18 PM
Nettrom closed this task as Resolved. Aug 21 2018, 8:46 PM
Nettrom reassigned this task from Nettrom to nettrom_WMF.

The report is now posted as a sub-page of the AfC Process Improvement page on enwiki. Marking this as resolved and reassigning it so I can track it there in case it gets reopened.