
Investigation: Copyvio tools for Commons
Closed, ResolvedPublic3 Estimated Story Points

Description

This is an investigation to see what we could do for wish #26, Copyvio tools for Commons.

This task: look at the existing tickets, and at the proposal and discussion (including the endorsements conversation in the green box).

Output for this task: A little proposal of the problem/use case that we're solving, and how we could handle it, for team discussion.

Tickets:
T120453: Copyright violation detection tool for Commons
T31793: Check uploaded images with Google image search to find copyright violations
T123517: Automatically check Commons uploads for possible copyright violations

Proposal/discussion:
https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Copyvio_tools_for_Commons


The requirement is to create a system that brings to the community's attention files that have a higher likelihood of being copyright violations. Discussion is mostly around photographs, and raster images in general; however, other images and other files can be found using some of the same techniques.

There are two main groups of criteria for identifying potentially-copyrighted images: what the image itself is, and what metadata exists around it and the activity of uploading it.

1. Image analysis

The first approach (likely to be the more difficult and/or financially costly) is to analyze the image data itself and compare it with a collection of known files. The two key parts of this are (a) the method of analysis and (b) the database to compare against. Google and TinEye seem to be the two main commercial services that could be used (cf. T31793), although Google doesn't actually expose its similar-image search via an API.

  1. Google search API can find "similar images", where images are of the same thing or similar in other characteristics (e.g. mostly blue, high contrast, etc.).
  2. TinEye's MatchEngine API is geared towards finding variants of the same image file, where it's been rotated, colorized, obscured, etc. Requires uploading images to their database.
  3. A third option (T121797) in this vein is to run one of the open-source image-matching engines in-house. This may be a brilliant thing to do for other reasons (such as clever searching within Commons) but it doesn't help with identifying things that shouldn't be uploaded because we wouldn't have a database of copyright images to compare against.
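To illustrate what an in-house image-matching engine (option 3) would do at its core, here is a minimal sketch of a perceptual "average hash" (aHash), the same family of technique as the imagehash library mentioned below. It is written against a plain 2D list of grayscale values so it needs no image library; a real implementation would first decode and downscale the file (e.g. to 8×8 pixels).

```python
from statistics import mean

def average_hash(pixels):
    """Compute a simple average hash (aHash) of a grayscale image.

    `pixels` is a 2D list of 0-255 grayscale values, assumed to be
    already downscaled (e.g. to 8x8). Each bit of the hash records
    whether a pixel is brighter than the image's mean brightness,
    so small edits (recompression, slight brightening) usually
    leave the hash unchanged or nearly so.
    """
    flat = [p for row in pixels for p in row]
    avg = mean(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits; a small distance means likely the same image."""
    return bin(h1 ^ h2).count("1")

# Two tiny "images": identical except for slight brightness changes.
a = [[10, 200], [30, 220]]
b = [[12, 198], [33, 221]]
assert hamming_distance(average_hash(a), average_hash(b)) == 0
```

As the task notes, this only helps if there is a database of hashes of copyrighted images to compare against; without one, it is useful for within-Commons duplicate detection rather than copyvio detection.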

2. Image metadata and user characteristics

The second group of signals for identifying copyright violations is generally easier to access and more intuitive to compare.

Roughly in order of usefulness, it is worth looking more closely at files:

  • that have been previously deleted (exact checksum match only?)
  • from users with multiple recent deletions (or high deletion/keep ratio)
  • from new users, or users with few uploads — newbie-uploads tool
  • where source is {{own}} or similar
  • where licence is {{custom license}} or similar
  • that are small and raster (because small vector files aren't as liable to be low-quality)
  • with no EXIF metadata, or missing key common fields (which can indicate that a photo has been copied from e.g. social media) T121869
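For the first criterion (previously deleted, exact checksum match), MediaWiki already stores file checksums as base-36 SHA-1 digests (the img_sha1 column, and fa_sha1 for deleted files in filearchive), so an exact-match check is a straight lookup. A sketch of producing that digest format:

```python
import hashlib

def mediawiki_sha1(data: bytes) -> str:
    """SHA-1 of the file contents, encoded in base 36.

    This is the format MediaWiki stores in img_sha1 / fa_sha1;
    comparing a new upload's digest against the filearchive table
    finds byte-identical re-uploads of deleted files.
    """
    n = int(hashlib.sha1(data).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    # MediaWiki left-pads the digest to a fixed 31 base-36 digits.
    return out.rjust(31, "0")
```

This catches only exact duplicates; transformed re-uploads need the perceptual-hash approach discussed under image analysis.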

Workflow

After some set of files has been isolated, it needs to be easy to do some things with each of them. Crucially:

  • Nominate for deletion
    • notify the uploader on their (globaluserinfo) home wiki
    • perhaps add to [[Category:Undelete in year Y]] (cf. the dark archive ideas)
  • Mark as vetted (how?)

And some other nice things could be:

  • ability to whitelist and blacklist certain users/files/categories/sources
  • ability to add past uploads, perhaps by category (include all descendents?)
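Notifying the uploader on their home wiki relies on the Action API's meta=globaluserinfo module (available via CentralAuth, e.g. on Meta), whose response includes a "home" field identifying the account's home wiki. A sketch of building that lookup request (the actual HTTP fetch and notification delivery are left out):

```python
from urllib.parse import urlencode

def globaluserinfo_query_url(username: str) -> str:
    """Build the Action API request that looks up a user's home wiki.

    The response's query.globaluserinfo.home field names the wiki
    where the deletion notification should be delivered.
    """
    params = {
        "action": "query",
        "meta": "globaluserinfo",
        "guiuser": username,
        "format": "json",
    }
    return "https://meta.wikimedia.org/w/api.php?" + urlencode(params)
```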

Interface

Part of MediaWiki, an extension, or a tool? Probably all three! How useful is it to non-WMF wikis? Probably quite useful, so some aspects could be packaged as a non-WMF-specific extension (though not, for example, the policy-specific deletion-request features).

If a tool, can it be the same tool as CopyPatrol? Probably too different, but it should at least be built in much the same way, for the benefit of developer familiarity.

Special:NewFiles is the most basic starting point (cf. T121870). It shows: thumbnail, filename, uploader's username, datetime, and filesize. A new interface would show these and more, but wouldn't really be a "new files" list, because it would (hopefully!) leave out many files where there's good reason to trust them. So it could perhaps make sense as a separate special page.

Desired features, in three groups:

  • MediaWiki core (as part of Special:NewFiles or elsewhere):
    • Previously deleted (by exact match only)
    • User's upload/deletion ratio
    • exclude certain users or categories
  • MediaWiki extension(s):
    • Is duplicate of a previously deleted image. Possible, if T120759 is done first? This is about duplication, not similarity; e.g. imagehash could be useful.
  • Commons specific features (a tool à la CopyPatrol):
    • EXIF missing / suspicious? (might not be general enough to be required on other wikis)
    • Matches an external image-database similarity search (TinEye etc.)
    • File size, type
    • Source and licence metadata
    • Flag for deletion / keep
    • Home wiki notification
    • Add all files from specific category trees on request
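The "upload/deletion ratio" feature in the core group above can be sketched as a simple heuristic. In practice the counts would come from the image and logging tables; the threshold and minimum-upload values here are arbitrary placeholders for illustration:

```python
def deletion_ratio(uploads: int, deletions: int) -> float:
    """Fraction of a user's uploads that ended up deleted."""
    return deletions / uploads if uploads else 0.0

def is_suspect(uploads: int, deletions: int,
               threshold: float = 0.3, min_uploads: int = 5) -> bool:
    """Flag users whose deletion ratio exceeds the threshold.

    Users with very few uploads are flagged unconditionally,
    matching the "new users, or users with few uploads" criterion:
    there isn't enough history to trust them yet.
    """
    if uploads < min_uploads:
        return True
    return deletion_ratio(uploads, deletions) >= threshold
```

Files from flagged users would then be surfaced in the review interface rather than automatically actioned, consistent with the overall goal of bringing files to the community's attention.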

Event Timeline

DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.
DannyH set the point value for this task to 3.

Is duplicate of a previously deleted image. Possible, if T120759 is done first? This is about duplication, not similarity; e.g. imagehash could be useful.

Not sure I understand this one. How is it different from "Previously deleted (by exact match only)"? Are you talking about images that are partial duplications of deleted images, but not exact matches? If so, that seems like an expensive edge case to cover.

Yes, it refers to duplicate matching even when the image has been transformed in some manner. It'd be expensive, but if such a thing were to be done for images in general then it would perhaps not be too hard to extend it to index deleted images as well.

Google search API can find "similar images"...

@Samwilson: I couldn't find any documentation about this at Google, either for the Custom Search API or the Vision API. Can you point me to the documentation for this?

@kaldari: it looks like I was wrong, sorry. I'd seen a few people doing this, but it seems they were scraping web results rather than using any API (and so were in contravention of the ToS). I didn't dig deep enough.

Bing is the same, which pretty much means there is in fact no commercial image-similarity service available other than TinEye.

@Samwilson: Thanks for digging deeper. That's too bad that Google doesn't provide an API. I'll try to ask them about it when I meet with their rep on Friday (at least to let them know we would be interested in such a service). In the meantime, I guess that means that TinEye is the only realistic possibility.


FYI they used to have an image search API which was deprecated in 2011.

FYI, there has been continued interest in this feature at this thread. Cc'ing related tickets T31793, T123517, T121797, T167947, T230561, T251026, and T120453, and @Quiddity who helped point to them.