
Create an ML model to score new files in Commons for copyvio issues
Closed, Duplicate · Public

Description

To help with copyvio patrolling on Commons, it would be helpful to have a model that scores each file for its probability of being a copyright violation. The model should score new files for possible copyvios, helping patrollers focus on new files that require extra attention.

A rule-based model could use signals such as whether the file was uploaded by a non-trusted user (e.g. one who is not auto-patrolled), whether the file lacks EXIF data, etc.
An ML-based model can use similar signals, but with learned features and weights, and may take advantage of other properties a hand-written rule set would not capture.

As an ML-based model, it SHOULD NOT depend on an external commercial search system (Bing, Baidu, Google, etc.) and should be free to use (no paying $$$ for commercial systems). Its output may then be used by patrollers searching manually with their own favourite search engine, or by other tool(s) that interact with commercial systems.
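
To make the "no commercial systems" constraint concrete, here is a minimal sketch (in Python; function names and structure are illustrative, not from this task) of how the signals mentioned above can be pulled from the free MediaWiki API alone:

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"

def file_signals(title):
    """Fetch size, dimensions, EXIF metadata and uploader for one file."""
    r = requests.get(API, params={
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "size|metadata|user",
        "format": "json",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    info = page["imageinfo"][0]
    return {
        "size": info["size"],                    # bytes
        "megapixels": info["width"] * info["height"] / 1e6,
        "has_exif": bool(info.get("metadata")),  # any EXIF/metadata present?
        "uploader": info["user"],
    }

def uploader_info(username):
    """Fetch the uploader's user groups (e.g. autopatrolled) and edit count."""
    r = requests.get(API, params={
        "action": "query",
        "list": "users",
        "ususers": username,
        "usprop": "groups|editcount",
        "format": "json",
    }).json()
    user = r["query"]["users"][0]
    return user.get("groups", []), user.get("editcount", 0)
```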

Event Timeline

Yes, good idea. I suggest the following:

  • Uploaded by a new user (less than 20 uploads): +1
  • Uploaded by an account without any rights (not an autopatroller): +1
  • File without complete EXIF data: +1
  • File size less than 200 KB: +1
  • File size less than 100 KB: +2
  • File dimensions less than 1 Mpixel: +1
  • File depicting a person: +1
  • File depicting a logo: +1

Files with 5 points or more have a very high chance of being copyright violations.
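
A minimal sketch of this point scheme in code. The person/logo signals are left as stub inputs since the comment does not say how they would be detected, and the two size rules are read here as cumulative (a file under 100 KB scores 3 size points):

```python
# Direct translation of the point scheme above. "autopatrolled" is the
# Commons group name; upload_count would need its own query (e.g.
# list=logevents with letype=upload), so it is an input here.
def copyvio_points(signals, upload_count, groups,
                   depicts_person=False, depicts_logo=False):
    points = 0
    if upload_count < 20:              # new user (< 20 uploads): +1
        points += 1
    if "autopatrolled" not in groups:  # no autopatrol right: +1
        points += 1
    if not signals["has_exif"]:        # missing/incomplete EXIF: +1
        points += 1
    if signals["size"] < 200 * 1024:   # < 200 KB: +1
        points += 1
    if signals["size"] < 100 * 1024:   # < 100 KB: +2 (on top of the above)
        points += 2
    if signals["megapixels"] < 1:      # < 1 Mpixel: +1
        points += 1
    if depicts_person:                 # stub: detection method unspecified
        points += 1
    if depicts_logo:                   # stub: detection method unspecified
        points += 1
    return points                      # >= 5: very likely a copyvio
```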

As a proof of concept, during the Wikimania Hackathon 2019 I wrote a script that scans new images with a rule-based approach, based on the earlier discussion at https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2019/07#Copypatrol_for_images

Process:

  1. New files are processed based on the rules suggested in @AlexisJazz's comment above (though the script does not implement all of the suggested rules).
  2. The script then uses a simple random forest model to produce an attention score (the training and test sets are based on https://commons.wikimedia.org/wiki/Category:Copyright_violations plus randomly selected files; the features are all EXIF fields, all templates, and the user groups of the uploader). A sketch of this step follows below.
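
A rough sketch of what that random-forest step could look like with scikit-learn. The actual script is not attached to this task, so the names, parameters, and the sample layout (one triple of EXIF field names, templates, and uploader groups per file) are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

def to_feature_dict(exif_fields, templates, groups):
    """One-hot encode the raw signals of one file as a sparse dict."""
    features = {}
    features.update({f"exif:{name}": 1 for name in exif_fields})
    features.update({f"tpl:{t}": 1 for t in templates})
    features.update({f"group:{g}": 1 for g in groups})
    return features

def train(samples, labels):
    """samples: list of (exif_fields, templates, groups); labels: 1=copyvio."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([to_feature_dict(*s) for s in samples])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    return vectorizer, model

# Scoring a new file: model.predict_proba(vectorizer.transform([d]))[0, 1]
# is its "needs attention" probability.
```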

Next steps would be:

  1. Community evaluation of the generated reports, and fine-tuning of the script with more rules
  2. Running the script periodically on the toolserver

Once it gets mature enough, we can connect it to Google search, selecting the top X ranked images for searching, where X will be defined by the search credits we have with Google (T31793).
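
The hand-off itself would be simple; a hedged sketch (the actual credit budget and the search client are outside this task's scope):

```python
# Only the X highest-scoring new files are sent to the external search,
# where X is whatever daily credit budget is available.
def select_for_external_search(scored_files, daily_credits):
    """scored_files: iterable of (title, score); returns the top-X titles."""
    ranked = sorted(scored_files, key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in ranked[:daily_credits]]
```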

@eranroz: Hi, are you still working on this? Should this task still remain open?

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 6th, 2022 (and T295729).

Please assign this task to yourself again if you still realistically plan to work on it - that would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

Closing, as this task was created years after the CWS one and is mostly about a specific potential implementation of T120453. Future discussion about training a model can occur in the 2015 task. Also note that training ML models can itself be costly.