
Copyvio: make it so that all new pages are scanned, regardless of size
Closed, Declined (Public)

Description

EranBot currently has rules that determine which revisions to check, and one of those rules limits checks to edits over a certain number of bytes. It's possible for new pages (the first revision) to have a really low number of bytes -- a new page may just be a stub with one short sentence.

For the purposes of the New Pages Feed, we want ALL new pages to be scanned (the first revision), regardless of size.

Event Timeline

@eranroz -- we were just discussing this task with our team, and we wanted to ask you about it.

We filed this task because we wanted to make sure that all pages in the New Pages Feed got checked for copyvio, even if they didn't meet the size threshold. In other words, we want users of the New Pages Feed to know that everything in the feed has been checked. But then a reviewer brought up that very small pages, like one-sentence stubs, would be very likely to have their content found somewhere on the internet by Turnitin. And so checking those small pages with Turnitin might not be useful.

Do you have thoughts on this? Is this something you considered?

We can't scan all new pages regardless of size, for two main reasons: many of the edits are minor, so scanning them is a waste of credits, and we would likely get too many false positives.

The current threshold is set to 500 bytes (after removing wikitext code) and at least 20 spaces. These thresholds are arbitrary, and we can consider changing them, either based on data (precision and recall for small texts) or by gradually decreasing them until the results get too junky.
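
For illustration only, the size rule described above could be sketched roughly as follows. This is not EranBot's actual code: the `strip_wikitext` and `should_check` names are hypothetical, the markup removal is a deliberately crude stand-in, and treating the limits as inclusive minimums (`>=`) applied to the added text is an assumption.

```python
import re

MIN_BYTES = 500   # byte threshold quoted in the comment above
MIN_SPACES = 20   # space-count threshold quoted in the comment above


def strip_wikitext(text: str) -> str:
    """Crude, hypothetical markup removal; not the bot's real cleaner."""
    # Drop paired <ref>...</ref> references (self-closing refs fall to the next rule).
    text = re.sub(r"<ref[^>/]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop any remaining HTML-like tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Drop simple, non-nested templates such as {{stub}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Reduce [[target|label]] and [[target]] to their visible text.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop bold/italic quote markup.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def should_check(added_text: str) -> bool:
    """Return True if the text meets both thresholds (assumed inclusive)."""
    plain = strip_wikitext(added_text)
    enough_bytes = len(plain.encode("utf-8")) >= MIN_BYTES
    enough_spaces = plain.count(" ") >= MIN_SPACES
    return enough_bytes and enough_spaces
```

Under these assumptions, a one-sentence stub such as `'''Foo''' is a [[town]] in [[Bar|Baz]].` falls well below both limits and would be skipped, which is the case this task is about. Lowering `MIN_BYTES` or `MIN_SPACES` gradually would correspond to the second tuning approach mentioned above.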

MMiller_WMF closed this task as Declined. Sep 4 2018, 8:47 PM
MMiller_WMF added a subscriber: Insertcleverphrasehere.

Thanks, @eranroz.

Given this opinion, and the point from @Insertcleverphrasehere that a short stub would likely be tagged as copyvio and be a false positive, I am going to decline this task. CopyPatrol will continue to not check new pages if they are too small, and therefore the New Pages Feed will have no copyvio flag on very small pages.