
Create an ML model to score new files in Commons for copyvio issues
Closed, Duplicate · Public

Description

To help with copyvio patrolling on Commons, it would be helpful to have a model that scores each file for its probability of being a copyright violation. The model should score new files for possible copyvios, helping patrollers focus on new files that require extra attention.

A rule-based model could use signals such as whether the file was uploaded by a non-trusted user (e.g. one who is not auto-patrolled), whether the file lacks EXIF data, etc.
An ML-based model can use similar signals, but with learned features and weights, and may take advantage of other properties a hand-written rule set would not capture.

As an ML-based model, it SHOULD NOT depend on an external commercial search system (Bing, Baidu, Google, etc.) and should be free to use (no paying $$$ for commercial systems). Its output may then be used by patrollers searching manually with their own favourite search engine, or by other tool(s) that interact with commercial systems.
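
To make the "no commercial systems" constraint concrete, here is a minimal sketch (in Python; function names and structure are illustrative, not from this task) of how the signals mentioned above can be pulled from the free MediaWiki API alone:

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"

def file_signals(title):
    """Fetch size, dimensions, EXIF metadata and uploader for one file."""
    r = requests.get(API, params={
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "size|metadata|user",
        "format": "json",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    info = page["imageinfo"][0]
    return {
        "size": info["size"],                    # bytes
        "megapixels": info["width"] * info["height"] / 1e6,
        "has_exif": bool(info.get("metadata")),  # any EXIF/metadata present?
        "uploader": info["user"],
    }

def uploader_info(username):
    """Fetch the uploader's user groups (e.g. autopatrolled) and edit count."""
    r = requests.get(API, params={
        "action": "query",
        "list": "users",
        "ususers": username,
        "usprop": "groups|editcount",
        "format": "json",
    }).json()
    user = r["query"]["users"][0]
    return user.get("groups", []), user.get("editcount", 0)
```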

Event Timeline

Yes, good idea. I suggest the following:

  • Uploaded by a new user (less than 20 uploads): +1
  • Uploaded by an account without any rights (not an autopatroller): +1
  • File without complete EXIF data: +1
  • File size less than 200 KB: +1
  • File size less than 100 KB: +2
  • File dimensions less than 1 Mpixel: +1
  • File depicting a person: +1
  • File depicting a logo: +1

Files with 5 points or more have a very high chance of being copyright violations.
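
A minimal sketch of this point scheme in code. The person/logo signals are left as stub inputs since the comment does not say how they would be detected, and the two size rules are read here as cumulative (a file under 100 KB scores 3 size points):

```python
# Direct translation of the point scheme above. "autopatrolled" is the
# Commons group name; upload_count would need its own query (e.g.
# list=logevents with letype=upload), so it is an input here.
def copyvio_points(signals, upload_count, groups,
                   depicts_person=False, depicts_logo=False):
    points = 0
    if upload_count < 20:              # new user (< 20 uploads): +1
        points += 1
    if "autopatrolled" not in groups:  # no autopatrol right: +1
        points += 1
    if not signals["has_exif"]:        # missing/incomplete EXIF: +1
        points += 1
    if signals["size"] < 200 * 1024:   # < 200 KB: +1
        points += 1
    if signals["size"] < 100 * 1024:   # < 100 KB: +2 (on top of the above)
        points += 2
    if signals["megapixels"] < 1:      # < 1 Mpixel: +1
        points += 1
    if depicts_person:                 # stub: detection method unspecified
        points += 1
    if depicts_logo:                   # stub: detection method unspecified
        points += 1
    return points                      # >= 5: very likely a copyvio
```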

As a proof of concept, during the Wikimania Hackathon 2019 I wrote a script that scans new images with a rule-based approach, based on the earlier discussion at https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2019/07#Copypatrol_for_images

Process:

  1. New files are processed based on the rules suggested in @AlexisJazz's comment above (though the script does not implement all of the suggested rules).
  2. The script then uses a simple random forest model to produce an attention score (the training and test sets are based on https://commons.wikimedia.org/wiki/Category:Copyright_violations plus randomly selected files; the features are all EXIF fields, all templates, and the user groups of the uploader). A sketch of this step follows below.
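
A rough sketch of what that random-forest step could look like with scikit-learn. The actual script is not attached to this task, so the names, parameters, and the sample layout (one triple of EXIF field names, templates, and uploader groups per file) are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

def to_feature_dict(exif_fields, templates, groups):
    """One-hot encode the raw signals of one file as a sparse dict."""
    features = {}
    features.update({f"exif:{name}": 1 for name in exif_fields})
    features.update({f"tpl:{t}": 1 for t in templates})
    features.update({f"group:{g}": 1 for g in groups})
    return features

def train(samples, labels):
    """samples: list of (exif_fields, templates, groups); labels: 1=copyvio."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([to_feature_dict(*s) for s in samples])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    return vectorizer, model

# Scoring a new file: model.predict_proba(vectorizer.transform([d]))[0, 1]
# is its "needs attention" probability.
```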

Next steps would be:

  1. Community evaluation of the generated reports, and fine-tuning of the script with more rules
  2. Running the script periodically on the toolserver

Once it gets mature enough, we can connect it to Google search, selecting the top X ranked images for searching, where X will be defined by the search credits we have with Google (T31793).
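
The hand-off itself would be simple; a hedged sketch (the actual credit budget and the search client are outside this task's scope):

```python
# Only the X highest-scoring new files are sent to the external search,
# where X is whatever daily credit budget is available.
def select_for_external_search(scored_files, daily_credits):
    """scored_files: iterable of (title, score); returns the top-X titles."""
    ranked = sorted(scored_files, key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in ranked[:daily_credits]]
```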

@eranroz: Hi, are you still working on this? Should this task still remain open?

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 6th, 2022 (and T295729).

Please assign this task to yourself again if you still realistically plan to work on it - that would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

Closing, as this task was created years after the CWS one and is mostly about a specific potential implementation of T120453. Future discussion about training a model can occur in the 2015 task. Also note that training ML models can itself be costly.