
Create an ML model to score new files in Commons for copyvio issues
Open, Needs Triage · Public

Description

To help with copyvio patrolling in Commons, it would be helpful to have a model that scores a file for the probability of being a copyright violation (copyvio). The model should score new files for possible copyvios, helping patrollers focus on new files that require extra attention.

A rule-based model could use signals such as a file uploaded by a non-trusted user (e.g. not auto-patrolled), a file that lacks EXIF data, etc.
An ML-based model can do the same as a rule-based one, but with learned features and weights, and it may take advantage of other properties we don't explicitly take into account.

As an ML-based model, it SHOULD NOT depend on an external commercial search system (Bing, Baidu, Google, etc.) and should be free to use (no paying $$$ for commercial systems). Its output may then be used by patrollers to manually search using their own favourite search engine, or by other tool(s) that interact with commercial systems.

Event Timeline

eranroz created this task. · Aug 15 2019, 3:47 PM
Restricted Application added a project: Community-Tech. · Aug 15 2019, 3:47 PM
eranroz claimed this task. · Aug 15 2019, 3:51 PM
Yann added a comment. · Aug 16 2019, 5:08 AM

Yes, good idea. I suggest the following:

  • Uploaded by a new user (less than 20 uploads): +1
  • Uploaded by an account without any rights (not autopatroller): +1
  • File without complete EXIF data: +1
  • File size less than 200 KB: +1
  • File size less than 100 KB: +2
  • File dimensions less than 1 Mpixel: +1
  • File depicting a person: +1
  • File depicting a logo: +1

Files with 5 points or more have a very high chance of being copyright violations.
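
A minimal sketch of this heuristic in Python. The metadata field names are assumptions for illustration, and the two size rules are read as cumulative here (a file under 100 KB also matches the under-200 KB rule, so it gets +3 for size in total):

```python
def copyvio_score(info):
    """Rule-based copyvio score following the point list above.

    `info` is a hypothetical metadata dict; the field names are
    assumptions, not an existing API.
    """
    score = 0
    if info["uploader_upload_count"] < 20:              # new user
        score += 1
    if "autopatroller" not in info["uploader_rights"]:  # account without rights
        score += 1
    if not info["has_complete_exif"]:                   # incomplete EXIF data
        score += 1
    if info["size_bytes"] < 200 * 1024:                 # smaller than 200 KB
        score += 1
    if info["size_bytes"] < 100 * 1024:                 # smaller than 100 KB
        score += 2
    if info["width"] * info["height"] < 1_000_000:      # under 1 Mpixel
        score += 1
    if info["depicts_person"]:
        score += 1
    if info["depicts_logo"]:
        score += 1
    return score

# Flag files scoring 5 points or more for patroller attention.
example = {
    "uploader_upload_count": 3, "uploader_rights": [],
    "has_complete_exif": False, "size_bytes": 80 * 1024,
    "width": 600, "height": 400,
    "depicts_person": False, "depicts_logo": True,
}
print(copyvio_score(example), copyvio_score(example) >= 5)  # -> 8 True
```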

eranroz added a subscriber: AlexisJazz. · Edited · Aug 18 2019, 11:45 AM

As a proof of concept, during the Wikimania-Hackathon-2019 I wrote a script that scans new images with a rule-based approach, based on an earlier discussion at https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2019/07#Copypatrol_for_images

Process:

  1. New files are processed based on the rules suggested in @AlexisJazz's comment (though the script doesn't implement all the suggested rules yet).
  2. It then uses a simple random forest model to provide an attention score (training and test sets are based on https://commons.wikimedia.org/wiki/Category:Copyright_violations plus randomly selected files; the features are all EXIF data, all templates, and the uploader's user groups), as sketched after this list.
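
A minimal sketch of that kind of classifier with scikit-learn; the toy feature dicts below (EXIF keys, templates, uploader groups) are illustrative assumptions, not the hackathon script's actual feature set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Each file is a dict of sparse boolean features: which EXIF keys are
# present, which templates the file page uses, and which user groups
# the uploader belongs to. The feature names here are made up.
train_files = [
    {"exif:Make": 1, "exif:DateTimeOriginal": 1,
     "template:Own_work": 1, "group:autopatrolled": 1},
    {"exif:Make": 1, "template:Information": 1, "group:autopatrolled": 1},
    {"template:Logo": 1},              # no EXIF, no user rights
    {"template:Information": 1},
]
# 1 = sampled from Category:Copyright_violations, 0 = randomly selected file
train_labels = [0, 0, 1, 1]

vec = DictVectorizer()
X = vec.fit_transform(train_files)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, train_labels)

# The predicted probability of the copyvio class is the attention score.
new_file = {"template:Logo": 1, "group:new_user": 1}
attention = clf.predict_proba(vec.transform([new_file]))[0, 1]
print(f"attention score: {attention:.2f}")
```

Note that DictVectorizer silently drops features it never saw during training (like group:new_user above), which is how such a model would degrade on EXIF keys or templates absent from the training set.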

Next steps would be:

  1. community evaluation of the generated reports and fine-tuning the script with more rules
  2. running it periodically on the toolserver

Once it gets mature enough, we can connect it to Google search, selecting the top X ranked images for search, where X will be defined by the credits we have for Google searches (T31793).