
[Timebox 12hr] Investigation: applying copyvio for page review prioritization
Closed, ResolvedPublic

Description

In preparation for T193782, we need to think through how to apply copyright violation predictions to pages awaiting review via AfC and NPP, specifically in the New Pages Feed interface.

There are several existing or former tools with varying methods of determining and surfacing the likelihood of copyright violation:

  • Earwig's Copyvio Detector: can use Google or Turnitin. Has an API.
  • CopyPatrol: uses Turnitin. Built by Community Tech team.
  • CorenSearchBot: defunct bot that automatically checked new pages for copyright violations and applied templates to violators. Used Yahoo search. More information here.

The following are relevant user stories that we should plan for, as of 2018-05-03:

  • As a reviewer, I need to be able to filter the New Pages Feed to likely copyright violations, corresponding to some pre-set cutoff of the copyvio likelihood score.
  • As a reviewer, I need to be able to sort the New Pages Feed by the copyvio score.
  • As a reviewer, I need a page's copyvio score to be displayed with its entry in the New Pages Feed list.
  • As a reviewer, I need to be able to click through to see more information about a copyright score, specifically the likely violating text and its source, similarly to how Earwig's Copyvio Detector and CopyPatrol currently work.
  • As a reviewer, I need all pages listed in the New Pages Feed to be sortable/filterable with copyvio, regardless of namespace.
  • As a reviewer, I need copyvio scores to be up-to-date with the latest revision of a page at all times.

If I (Marshall) can be helpful communicating with any of the external services (e.g. Google and Turnitin) to do this investigation, please let me know.


Some technical considerations that have been brought up so far in discussion of these user stories (though there are likely many more):

  • Will we run into usage limits for external services, like Google and Turnitin?
  • Will results be returned quickly enough to be reasonable?
  • What will happen if we run these models on the User namespace, which is currently accessible in the New Pages Feed?

Deliverables
  • Create list of Phab tickets for a rough implementation plan
  • Identify & document any dependencies and risks
  • Answer technical consideration questions above.

Related: T194541

Event Timeline

MMiller_WMF updated the task description. (Show Details)
TBolliger renamed this task from Investigation: applying copyvio for page review prioritization to [Timebox 12hr] Investigation: applying copyvio for page review prioritization.May 8 2018, 11:31 PM
TBolliger updated the task description. (Show Details)

@MMiller_WMF stray thought about this - will it help reduce the clickthroughs to Earwig's tool if we also include a link to the page which Earwig's tool tells us is the copyvio source?
As it stands now, the 10k-per-day limit is frequently reached: T193559#4172586. @MusikAnimal I can get you more data about current usage of the Google API if you need.

Thanks! Yeah we were talking about that. The pagetriage_page_tags table has a ptrpt_value column where we could store the URL. It is limited to 255 characters (bytes?), which should be fine in most cases. We will probably want the percentage of duplication, too, which means yet another related tag...
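To illustrate the storage idea above, here is a minimal sketch of building the tag values, assuming two hypothetical tag names ("copyvio" for the score and "copyvio_source" for the URL); the 255-byte limit on ptrpt_value is enforced defensively before storage:

```python
MAX_TAG_VALUE_BYTES = 255  # limit of pagetriage_page_tags.ptrpt_value


def truncate_to_bytes(value: str, limit: int = MAX_TAG_VALUE_BYTES) -> str:
    """Truncate a string so its UTF-8 encoding fits within `limit` bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # Slice the bytes, then drop any partial trailing character on decode.
    return encoded[:limit].decode("utf-8", errors="ignore")


def build_copyvio_tags(score: float, source_url: str) -> dict:
    """Return tag-name -> value pairs to store for a page (names are assumptions)."""
    return {
        "copyvio": f"{score:.1f}",
        "copyvio_source": truncate_to_bytes(source_url),
    }
```

Truncating by bytes rather than characters matters if the column limit is in fact bytes, since a URL with percent-encoded or multi-byte characters could otherwise overflow the column.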

We still have the issue that reviewers are going to want a diff of the article and sourced web page. I believe in general we don't link to Toolforge from anything MediaWiki, so I'm thinking we'll have a system message to make this configurable, and have it link to the Duplication Detector, since that won't use the Google API and is much faster. Guaranteed some folks will still prefer Earwig's tool, though.

I'm able to see our current API usage and yes it looks worrying. As part of this investigation I was going to try to come up with some rough numbers on how much that would increase, but frankly it will be a wild guess. We're not even sure how much quota would currently be used beyond the 10,000 query limit.

Thinking about when I've done this in the past, I have these two thoughts:

  • It's useful to have a reasonably accurate/updated number, so we can find the drafts of interest. (Maybe some sort of "if n% has changed, then re-score" tool?)
  • But in practice, when I'm already on a page that interests me, the most useful thing is to get the score right now, on the exact version that I'm looking at. "There have been 2 edits since the copyvio score was last updated. Click here to get a current copyvio score" would be very helpful.
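The "if n% has changed, then re-score" idea could be sketched roughly like this, using a simple similarity ratio between the old and new revision text (the 5% threshold is an assumption for illustration):

```python
import difflib

RESCORE_THRESHOLD = 0.05  # assumption: re-score when more than 5% of the text changed


def fraction_changed(old_text: str, new_text: str) -> float:
    """Rough fraction of the page that differs between two revisions."""
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return 1.0 - similarity


def should_rescore(old_text: str, new_text: str,
                   threshold: float = RESCORE_THRESHOLD) -> bool:
    """Decide whether an edit is large enough to trigger a new copyvio check."""
    return fraction_changed(old_text, new_text) >= threshold
```

A real implementation would likely compare revision sizes or diffs server-side rather than full text, but the decision logic would be the same shape.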

Some preliminary findings:

The biggest pain point is determining that content was copied, and from what website (which is, obviously, the most important part :). Using Toolforge APIs isn't kosher for extensions, I believe, so we may have to reimplement what Copyvios already does. In addition, we'd need a community-maintained whitelist of sites that are known Wikipedia mirrors, just like Copyvios has with User:EarwigBot/Copyvios/Exclusions. I really hate that we'd have to reimplement the whole system; Copyvios has some complex logic that wouldn't be easy to just port to PHP. Maybe, just maybe, given that PageTriage is already mostly specific to enwiki, code reviewers could turn a blind eye and let us use the Copyvios API? It works very well and, as far as I know, has minimal downtime (though maybe that wouldn't hold true with the extra load we'd put on it).

EDIT: It's possible that if we use Turnitin, we won't need to do this giant chunk of work, see below T193809#4206480

Will we run into usage limits for external services, like Google and Turnitin?

Yes. We are already hitting it (for Google). Currently there are some 1,000 non-redirect mainspace pages created a day, and ~250 draft pages a day. That alone isn't much, but we want to continually update the copyvio scores with subsequent edits. To give a good answer on expected usage, we'd need some rough idea of how many edits are made to pages before they are reviewed, and the average time frame.
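As a back-of-envelope illustration of why quota is a concern: the thread gives ~1,000 new mainspace pages and ~250 drafts per day, and Earwig's tool uses up to 8 Google queries per check. The average number of pre-review edits triggering re-checks is an outright guess here:

```python
DAILY_QUOTA = 10_000             # current Google Search API query limit
QUERIES_PER_CHECK = 8            # Earwig's tool makes up to 8 queries per page
NEW_PAGES_PER_DAY = 1_000 + 250  # ~1,000 mainspace + ~250 draft creations/day
AVG_RECHECKS_PER_PAGE = 3        # assumption: edits before review that trigger re-scores


def estimated_daily_queries(rechecks: int = AVG_RECHECKS_PER_PAGE) -> int:
    """Rough daily Google query load: one initial check plus `rechecks` re-scores."""
    checks_per_page = 1 + rechecks
    return NEW_PAGES_PER_DAY * checks_per_page * QUERIES_PER_CHECK
```

Notably, even a single initial check per page (1,250 pages × 8 queries = 10,000 queries) would consume the entire current quota before any re-checks happen.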

I'll note that while we do desire copyvio scores for non-AfC drafts, they are at least safely tucked away in a namespace that doesn't get indexed. AfC submissions are important, I think, because when they get accepted they will be marked as patrolled (if the user has the rights) and become indexed. Similarly, unreviewed pages in the mainspace will get indexed after 30 days of being unreviewed (or maybe it's 90 days, can't remember). If we find quota limitations are a major problem, I'd recommend limiting copyvio checks only to AfC submissions and newly created mainspace pages.

Hopefully the result of T194541 will mean we have a little more quota to work with. We should also be mindful that the popularity of the Copyvios tool is going to grow, especially if/when it gets localized and is more widely used on other projects (https://github.com/earwig/copyvios/issues/34).

Will results be returned quickly enough to be reasonable?
As a reviewer, I need copyvio scores to be up-to-date with the latest revision of a page at all times.

This is the other major hurdle. In order to achieve our goals, we need the copyvio scores to be precomputed. Just as with T193807#4198661, I think some sort of cronjob is the best way to go. We can't run the copyvio checks in real time; that would surely be too slow. Related:

Thinking about when I've done this in the past, I have these two thoughts:

  • It's useful to have a reasonably accurate/updated number, so we can find the drafts of interest. (Maybe some sort of "if n% has changed, then re-score" tool?)
  • But in practice, when I'm already on a page that interests me, the most useful thing is to get the score right now, on the exact version that I'm looking at. "There have been 2 edits since the copyvio score was last updated. Click here to get a current copyvio score" would be very helpful.

These are great ideas. I especially like "if n% has changed then rescore"; this seems ideal, and we could use the same comparison for deciding when to update ORES scores. The "N edits since the copyvio score was last updated" part may be a bit challenging. There is a pagetriage_page.ptrp_tags_updated field, but this tells us when any tag was updated, not specifically the copyvio score. So I think we'd need a dedicated tag in pagetriage_page_tags.

What will happen if we run these models on the User namespace, which is currently accessible in the New Pages Feed?

User pages often do contain copyright violations, so I think in general this is a good idea. However on enwiki, the userspace is never indexed (unless they manually put __INDEX__, but its usage is patrolled with Special:AbuseFilter/840). That being said, if we have quota issues with the Google Search API (or Turnitin), I'd deprioritize doing these checks on the userspace.

As a reviewer, I need to be able to filter the New Pages Feed to likely copyright violations, corresponding to some pre-set cutoff of the copyvio likelihood score.
As a reviewer, I need to be able to sort the New Pages Feed by the copyvio score.
As a reviewer, I need a page's copyvio score to be displayed with its entry in the New Pages Feed list.
As a reviewer, I need all pages listed in the New Pages Feed to be sortable/filterable with copyvio, regardless of namespace.

Similar to ORES and other stats, we'll store the copyvio score and source URL in the pagetriage_page_tags table. This is a little different because we have two values here instead of one, so I guess we need two separate tags. Even so, I think we'll be able to show and sort by copyvio score efficiently. Note we may want a third tag for the time since the copyvio score was last generated (see above).

As a reviewer, I need to be able to click through to see more information about a copyright score, specifically the likely violating text and its source, similarly to how Earwig's Copyvio Detector and CopyPatrol currently work.

We can provide an empty system message (MediaWiki namespace) that passes in the URL of the source webpage. That message can be created to link to the Copyvios "URL comparison" feature, providing the URL and the wiki page title. This will show a raw diff, and hence won't run any Google Search API queries. Maybe this will reduce reviewer-initiated Google Search API requests (since the work is already done for them). The other concern here is that sometimes the copyright violations are from multiple web pages. What I recommend is that if the score is, say, >= 75%, we simply link to the Copyvios URL comparison feature. If it's less than 75%, we link to the full Copyvios report so that the reviewer can check each of the search results. This would mean we'd need to pass the copyvio score to the system message, too, which is no problem.
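The proposed link-selection logic could look like the sketch below. The 75% cutoff comes from the comment above; the base URL and query parameters are illustrative assumptions, not the tool's actual API:

```python
from urllib.parse import urlencode

COPYVIOS_BASE = "https://copyvios.toolforge.org"  # assumed tool URL, for illustration
COMPARISON_CUTOFF = 75.0  # at/above this score, link straight to the URL comparison


def copyvio_detail_link(page_title: str, score: float, source_url: str) -> str:
    """Pick a detail link: direct source comparison for high scores, full report otherwise."""
    if score >= COMPARISON_CUTOFF:
        # Direct comparison: diff the page against the one known source URL.
        # No new Google Search API queries are needed for this view.
        params = {"action": "compare", "title": page_title, "url": source_url}
    else:
        # Full report: show the complete check with all candidate sources.
        params = {"action": "search", "title": page_title}
    return f"{COPYVIOS_BASE}/?{urlencode(params)}"
```

In the extension itself this branch would live in the system message's parameters rather than in code, but the decision is the same.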

We could alternatively explore using Turnitin instead of Google. I think Turnitin may actually identify the URLs the content was copied from, and gives us the copyvio score (example). If this is true, it is probably the better option because we won't have to reimplement what Copyvios does, saving us a lot of work. The drawback is that Turnitin isn't as comprehensive (limited to scholarly papers and websites used for research, as I understand it). I will look more into this and comment back here.

Using Turnitin would also avoid the faux pas of depending on a Toolforge tool. Turnitin is a commercial service, so it should in theory have reliable uptime. There is precedent for extensions interfacing with third-party APIs; for example, UploadWizard can import images from Flickr via the Flickr API. You would need to talk to Ops about how to store and access the API key non-publicly. I don't remember how to do it. @MaxSem might know as well. (And we would probably need to talk to Turnitin and get sign-off from Legal.)

Using Turnitin would also avoid the faux pas of depending on a Toolforge tool. Turnitin is a commercial service, so it should in theory have reliable uptime.

Hmm, I'd say Google is more reliable than Turnitin. Turnitin has a history of unexpected "maintenance" outages, as we've seen with CopyPatrol. They also require us to ask them for more credits every few months, and there is no guarantee that they wouldn't refuse to do that at some point in the future.

Just a note to confirm that we're leaving this investigation open because it is not yet complete.

@MMiller_WMF — as the Growth team will tackle copyvio, can we remove this task from the CommTech sprint?

@TBolliger -- yes, I moved it out of the sprint. I kept it assigned to Community Tech until we officially transfer it over to Growth, which I think we'll do closer to the end of June.


@Catrope and I spoke with @eranroz this morning to learn more about EranBot, which uses the Turnitin API to back CopyPatrol. Here are some of the important notes:

General process

  1. Bot consumes the recent changes stream.
  2. Applies logic to filter to only revisions from Main and Draft space.
  3. Applies logic to filter out revisions that are too small (approximately under 1500 bytes).
  4. Sends revisions in batches of 10 -- though this can be changed to any batch size.
  5. Sends over to iThenticate in XML.
  6. Polls to find out when the report is ready. This can take seconds or minutes.
  7. Compares results to a blacklist of sites that mirror Wikipedia.
  8. Only stores in the database those results that are above a 50% score threshold.
  9. Results get written to a database in Toolforge. Only the positive results are stored -- not the negative ones.
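The steps above can be sketched as a small pipeline. All helper names are made up; the real bot consumes the recent-changes stream and talks to the iThenticate API, both of which are out of scope here:

```python
MIN_REVISION_BYTES = 1500    # approximate size filter from step 3
BATCH_SIZE = 10              # default batch size from step 4
SCORE_THRESHOLD = 50.0       # only results above 50% are stored (step 8)
ALLOWED_NAMESPACES = {0, 118}  # assumption: Main and Draft namespace IDs on enwiki


def filter_revisions(revisions):
    """Steps 2-3: keep only sufficiently large Main/Draft revisions."""
    return [r for r in revisions
            if r["namespace"] in ALLOWED_NAMESPACES
            and r["size"] >= MIN_REVISION_BYTES]


def batch(items, size=BATCH_SIZE):
    """Step 4: group revisions into fixed-size batches for the API."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def store_results(results, mirror_blacklist):
    """Steps 7-8: drop known Wikipedia mirrors, keep only scores above threshold."""
    return [r for r in results
            if r["source"] not in mirror_blacklist
            and r["score"] > SCORE_THRESHOLD]
```

The batching in step 4 is what introduces the seconds-to-minutes latency noted below: a revision can sit waiting for the batch to fill before it is even sent.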

Other notes

  • Because the revisions wait for a batch of ten, and because of how long the API takes to return, it can end up being seconds or minutes before a result is ready.
  • Bot could be modified to send a whole page over instead of just a revision.
  • Bot supports English and French.

@Catrope -- feel free to add other notes.

@Catrope and I spoke to @Earwig on July 3 to learn more about Earwig's Copyvio Detector, which uses the Google search API. Here are some of the important notes:

  • We should be wary of both false positives and false negatives. The percentage score given by the tool is not scientifically calibrated.
  • Recommends that in future implementations of this or similar functionality, we don't provide such granular scores, but instead provide chunkier categories, which will help people understand that the results are approximate.
  • The tool caches results for revisions for a few days.
  • The tool's current API may not scale to the amount of traffic that the New Pages Feed would require.
  • There is core internal logic in the tool for splitting articles into Google-able pieces that might not be that hard to re-implement in another language.
  • Agrees that Google probably has more coverage than Turnitin.
  • If we do end up writing a service for general copyvio detecting, he could imagine altering Earwig's Copyvio Detector to use it.

@Catrope -- feel free to add other notes.

I'm recording here some additional things we've learned from @Earwig and from some of our testing. FYI @SBisson and @kostajh.

  • The tool makes a maximum of 8 Google queries per page. In our testing, even small stubs could use six or seven queries. But it will use fewer if the page is really short, or if a high confidence hit is found quickly.
  • Each of those queries is a sample of about 128 characters, and they are distributed roughly equally around the page. The queries find likely matching sites, and then the tool compares all the text on the matching site to all the text on the page.
  • Results are cached for three days. This limit could be adjusted.
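The sampling strategy described above (up to 8 queries of roughly 128 characters, spread evenly across the page) could be approximated like this; the real tool's chunking logic is more sophisticated, so this only illustrates the distribution idea:

```python
MAX_QUERIES = 8     # maximum Google queries per page
SAMPLE_CHARS = 128  # approximate characters per query sample


def query_samples(text, max_queries=MAX_QUERIES, sample_chars=SAMPLE_CHARS):
    """Take up to `max_queries` samples of `sample_chars` chars, spread across the text."""
    if len(text) <= sample_chars:
        return [text] if text else []
    # Use fewer samples when the page is short (e.g. small stubs).
    n = min(max_queries, max(1, len(text) // sample_chars))
    # Space the sample start offsets evenly from the beginning to the end.
    step = (len(text) - sample_chars) / (n - 1) if n > 1 else 0
    return [text[int(i * step): int(i * step) + sample_chars] for i in range(n)]
```

This matches the observation that even small stubs can consume six or seven queries: the sample count scales with page length and tops out quickly at eight.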

I consider this investigation to have been done thoroughly and to be complete. We now understand the nature of our business relationships with Google and Turnitin, the way that both CopyPatrol and Earwig's tool work, and the pros and cons of how well those services do at detecting copyvio. We have decided to build our tool using Turnitin and linking to CopyPatrol.