Automatically check Commons uploads for possible copyright violations
Closed, Duplicate · Public

Description

We are exploring some options for automatic detection of files uploaded from somewhere else. This bot will only run on Commons, on files where the source is "own work", and where the uploader is not "trusted" (what exactly that means is still to be determined).

Approach to detect possible copyright violations:

  • T31793 - Google search for newly uploaded files (or other providers?)
  • T230561 - create a model to score files that require additional attention for copyvio aspects

Event Timeline

MarkTraceur claimed this task.
MarkTraceur raised the priority of this task from to Medium.
MarkTraceur updated the task description.
MarkTraceur subscribed.

We are exploring some options for automatic detection of files uploaded from somewhere else.

@MarkTraceur: Has that exploration happened in the last five months, and what was the outcome?
If not, does anyone plan to work on this soon (or should the priority be changed)?
Thanks!

MarkTraceur lowered the priority of this task from Medium to Low. Jun 14 2016, 1:42 PM

@Aklapper Sorry about that, yeah, this is on the back burner because we have no good leads on partnerships with services that could provide image checking for us. Google has said in no uncertain terms that they don't have an open API for this, and I think partnerships are being negotiated with some other system, but are not finalized. In any case, it's blocked internally and low priority for us currently.

Noted on AN: https://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&diff=198996743&oldid=198967192

I suggest a different approach that does not depend on an external service (it could be complementary):

  • Train an ML classifier for "copyright violation"
    • Define features relevant to guessing whether a file is at high risk of being a copyright violation (metadata: user properties/history, image description, EXIF, a logo-like mark in the corner, etc.)
    • Get data from Commons on many files to gather statistics on those features and train the classifier
  • Predict/score new uploads
    • Score new uploads based solely on the upload itself (rather than on an external service)

@Halfak @Ladsgroup is this something that is already planned in ORES?
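
To make the classifier proposal above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `Upload` fields, the feature thresholds, and the toy training rows are all hypothetical and not taken from Commons data or from ORES, and a plain logistic regression stands in for whatever model would actually be chosen.

```python
import math
from dataclasses import dataclass

# All field names, thresholds, and training rows below are hypothetical
# illustrations; none of them come from Commons or ORES.

@dataclass
class Upload:
    uploader_edit_count: int  # user history
    has_exif: bool            # camera metadata present?
    description_length: int   # characters in the file description
    width: int
    height: int

def features(u: Upload) -> list[float]:
    """Turn one upload into a binary feature vector (first entry is the bias)."""
    return [
        1.0,
        1.0 if u.uploader_edit_count < 10 else 0.0,      # new account
        0.0 if u.has_exif else 1.0,                      # EXIF missing
        1.0 if u.description_length < 20 else 0.0,       # terse description
        1.0 if u.width * u.height < 1_000_000 else 0.0,  # web-sized image
    ]

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights: list[float], x: list[float]) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(rows, epochs: int = 500, lr: float = 0.5) -> list[float]:
    """Fit logistic regression by stochastic gradient descent."""
    weights = [0.0] * len(features(rows[0][0]))
    for _ in range(epochs):
        for upload, label in rows:
            x = features(upload)
            error = label - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

def score(weights: list[float], upload: Upload) -> float:
    """Estimated probability that an upload needs copyvio review."""
    return predict(weights, features(upload))

# Made-up training rows: (upload, 1 = was a copyright violation, 0 = was fine).
TOY_ROWS = [
    (Upload(2, False, 5, 640, 480), 1),
    (Upload(0, False, 12, 800, 600), 1),
    (Upload(5, True, 8, 500, 500), 1),
    (Upload(5000, True, 200, 4000, 3000), 0),
    (Upload(300, True, 80, 3000, 2000), 0),
    (Upload(50, False, 150, 2000, 1500), 0),
]
WEIGHTS = train(TOY_ROWS)
```

Calling `score(WEIGHTS, some_upload)` returns a value between 0 and 1 that could be used to rank new uploads for human review. Image-content features such as a logo in the corner would need an actual image model and are left out of this sketch.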

ABorbaWMF moved this task from Needs QA to Desired epics on the Multimedia board.
ABorbaWMF subscribed.

Dragged and dropped by mistake. Putting this ticket back.

eranroz renamed this task from Create a bot to automatically check Commons uploads for possible copyright violations to Automatically check Commons uploads for possible copyright violations. Aug 15 2019, 3:49 PM
eranroz raised the priority of this task from Low to High.
eranroz updated the task description.

Updated the description to fit the other subtasks (and aligned the priority with T31793).

Removing CopyPatrol as it is about text edits, not media.

Not necessarily a check suitable for all uploads, but comparing a new upload against the hashes of files removed by an 'office' action might be useful. You could compute a SHA-1 of a new upload and compare it against those of previously removed files (as is done when checking for duplicate uploads). This could prevent accidental re-upload of previously removed material, with a suitable warning to the uploader.
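
A minimal Python sketch of that hash comparison, assuming the SHA-1 digests of removed files are available as a set of hex strings. The function names and the `removed_hashes` set are hypothetical; where such a list would actually be stored is not specified in this task.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the file contents as a hex string."""
    return hashlib.sha1(data).hexdigest()

def is_previously_removed(data: bytes, removed_hashes: set[str]) -> bool:
    """True if this exact file was removed before (hypothetical check)."""
    return sha1_hex(data) in removed_hashes

# Illustrative data only: pretend this file was removed earlier.
removed_hashes = {sha1_hex(b"bytes of a previously removed file")}
```

An exact hash only matches byte-identical files, so a resized or re-encoded copy of removed material would not be caught by this check.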

Closing: the Multimedia team was disbanded in 2019, this task is largely similar, if not identical, to the CWS one, and it was created one year after T120453.