
Investigation: Copyvio tools for Commons
Closed, ResolvedPublic3 Estimated Story Points

Description

This is an investigation to see what we could do for wish #26, Copyvio tools for Commons.

This task: look at the existing tickets, and at the proposal and discussion (including the endorsements conversation in the green box).

Output for this task: A little proposal of the problem/use case that we're solving, and how we could handle it, for team discussion.

Tickets:
T120453: Copyright violation detection tool for Commons
T31793: Check uploaded images with Google image search to find copyright violations
T123517: Automatically check Commons uploads for possible copyright violations

Proposal/discussion:
https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Copyvio_tools_for_Commons


The requirement is to create a system that brings to the community's attention files that have a higher likelihood of being copyright violations. Discussion is mostly around photographs, and raster images in general; however, other images and other files can be found using some of the same techniques.

There are two main groups of criteria for identifying potentially-copyrighted images: what the image itself is, and what metadata exists around it and the activity of uploading it.

1. Image analysis

The first approach (likely to be the more difficult and/or financially costly) is to analyze the image data itself and compare it with a collection of known files. The two key parts of this are (a) the method of analysis and (b) the database to compare against. Google and TinEye seem to be the two main commercial services that could be used (cf. T31793), although Google doesn't actually expose its similar-image search via an API.

  1. Google search API can find "similar images", where images are of the same thing or similar in other characteristics (e.g. mostly blue, high contrast, etc.).
  2. TinEye's MatchEngine API is geared towards finding variants of the same image file, where it's been rotated, colorized, obscured, etc. Requires uploading images to their database.
  3. A third option (T121797) in this vein is to run one of the open-source image-matching engines in-house. This may be a brilliant thing to do for other reasons (such as clever searching within Commons) but it doesn't help with identifying things that shouldn't be uploaded because we wouldn't have a database of copyright images to compare against.
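To illustrate what an in-house image-matching engine (option 3) would do at its core, here is a minimal sketch of a perceptual "average hash" (aHash), the same family of technique as the imagehash library mentioned below. It is written against a plain 2D list of grayscale values so it needs no image library; a real implementation would first decode and downscale the file (e.g. to 8×8 pixels).

```python
from statistics import mean

def average_hash(pixels):
    """Compute a simple average hash (aHash) of a grayscale image.

    `pixels` is a 2D list of 0-255 grayscale values, assumed to be
    already downscaled (e.g. to 8x8). Each bit of the hash records
    whether a pixel is brighter than the image's mean brightness,
    so small edits (recompression, slight brightening) usually
    leave the hash unchanged or nearly so.
    """
    flat = [p for row in pixels for p in row]
    avg = mean(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits; a small distance means likely the same image."""
    return bin(h1 ^ h2).count("1")

# Two tiny "images": identical except for slight brightness changes.
a = [[10, 200], [30, 220]]
b = [[12, 198], [33, 221]]
assert hamming_distance(average_hash(a), average_hash(b)) == 0
```

As the task notes, this only helps if there is a database of hashes of copyrighted images to compare against; without one, it is useful for within-Commons duplicate detection rather than copyvio detection.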

2. Image metadata and user characteristics

The second group of signals for identifying copyright violations is generally easier to access and more intuitive to compare.

Roughly in order of usefulness, it is worth looking more closely at files:

  • that have been previously deleted (exact checksum match only?)
  • from users with multiple recent deletions (or high deletion/keep ratio)
  • from new users, or users with few uploads — newbie-uploads tool
  • where source is {{own}} or similar
  • where licence is {{custom license}} or similar
  • that are small and raster (because small vector files aren't as liable to be low-quality)
  • with no EXIF metadata, or missing key common fields (which can indicate that a photo has been copied from e.g. social media) T121869
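For the first criterion (previously deleted, exact checksum match), MediaWiki already stores file checksums as base-36 SHA-1 digests (the img_sha1 column, and fa_sha1 for deleted files in filearchive), so an exact-match check is a straight lookup. A sketch of producing that digest format:

```python
import hashlib

def mediawiki_sha1(data: bytes) -> str:
    """SHA-1 of the file contents, encoded in base 36.

    This is the format MediaWiki stores in img_sha1 / fa_sha1;
    comparing a new upload's digest against the filearchive table
    finds byte-identical re-uploads of deleted files.
    """
    n = int(hashlib.sha1(data).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    # MediaWiki left-pads the digest to a fixed 31 base-36 digits.
    return out.rjust(31, "0")
```

This catches only exact duplicates; transformed re-uploads need the perceptual-hash approach discussed under image analysis.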

Workflow

After some set of files has been isolated, it needs to be easy to do some things with each of them. Crucially:

  • Nominate for deletion
    • notify the uploader on their (globaluserinfo) home wiki
    • perhaps add to [[Category:Undelete in year Y]] (cf. the dark archive ideas)
  • Mark as vetted (how?)

And some other nice things could be:

  • ability to whitelist and blacklist certain users/files/categories/sources
  • ability to add past uploads, perhaps by category (include all descendents?)
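Notifying the uploader on their home wiki relies on the Action API's meta=globaluserinfo module (available via CentralAuth, e.g. on Meta), whose response includes a "home" field identifying the account's home wiki. A sketch of building that lookup request (the actual HTTP fetch and notification delivery are left out):

```python
from urllib.parse import urlencode

def globaluserinfo_query_url(username: str) -> str:
    """Build the Action API request that looks up a user's home wiki.

    The response's query.globaluserinfo.home field names the wiki
    where the deletion notification should be delivered.
    """
    params = {
        "action": "query",
        "meta": "globaluserinfo",
        "guiuser": username,
        "format": "json",
    }
    return "https://meta.wikimedia.org/w/api.php?" + urlencode(params)
```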

Interface

Part of MediaWiki, an extension, or a tool? Probably all three! How useful is it to non-WMF wikis? Probably quite useful, so some aspects could be packaged as a non-WMF-specific extension (though not, for example, the policy-specific deletion-request features).

If a tool, can it be the same tool as CopyPatrol? Probably too different, but it should at least be built in much the same way, for the benefit of developer familiarity.

Special:NewFiles is the most basic starting point (cf. T121870). It shows: thumbnail, filename, uploader's username, datetime, and filesize. A new interface would show these and more, but wouldn't really be a "new files" list, because it would (hopefully!) leave out many files where there's good reason to trust them. So it could perhaps make sense as a separate special page.

Desired features, in three groups:

  • MediaWiki core (as part of Special:NewFiles or elsewhere):
    • Previously deleted (by exact match only)
    • User's upload/deletion ratio
    • exclude certain users or categories
  • MediaWiki extension(s):
    • Is duplicate of a previously deleted image. Possible, if T120759 is done first? This is about duplication, not similarity; e.g. imagehash could be useful.
  • Commons specific features (a tool à la CopyPatrol):
    • EXIF missing / suspicious? (might not be general enough to be required on other wikis)
    • Matches an external image-database similarity search (TinEye etc.)
    • File size, type
    • Source and licence metadata
    • Flag for deletion / keep
    • Home wiki notification
    • Add all files from specific category trees on request
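The "upload/deletion ratio" feature in the core group above can be sketched as a simple heuristic. In practice the counts would come from the image and logging tables; the threshold and minimum-upload values here are arbitrary placeholders for illustration:

```python
def deletion_ratio(uploads: int, deletions: int) -> float:
    """Fraction of a user's uploads that ended up deleted."""
    return deletions / uploads if uploads else 0.0

def is_suspect(uploads: int, deletions: int,
               threshold: float = 0.3, min_uploads: int = 5) -> bool:
    """Flag users whose deletion ratio exceeds the threshold.

    Users with very few uploads are flagged unconditionally,
    matching the "new users, or users with few uploads" criterion:
    there isn't enough history to trust them yet.
    """
    if uploads < min_uploads:
        return True
    return deletion_ratio(uploads, deletions) >= threshold
```

Files from flagged users would then be surfaced in the review interface rather than automatically actioned, consistent with the overall goal of bringing files to the community's attention.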

Event Timeline

DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.
DannyH set the point value for this task to 3.

Is duplicate of a previously deleted image. Possible, if T120759 is done first? This is about duplication, not similarity; e.g. imagehash could be useful.

Not sure I understand this one. How is it different from "Previously deleted (by exact match only)"? Are you talking about images that are partial duplications of deleted images, but not exact matches? If so, that seems like an expensive edge case to cover.

Yes, it refers to duplicate matching even when the image has been transformed in some manner. It'd be expensive, but if such a thing were to be done for images in general then it would perhaps not be too hard to extend it to index deleted images as well.

Google search API can find "similar images"...

@Samwilson: I couldn't find any documentation about this at Google, either for the Custom Search API or the Vision API. Can you point me to the documentation for this?

@kaldari: it looks like I was wrong, sorry. I'd seen a few people doing this, but it seems they were scraping web results rather than using any API (and so were in contravention of the ToS). I didn't dig deep enough.

Bing is the same, which pretty much means there is in fact no commercial image-similarity service available other than TinEye.

@Samwilson: Thanks for digging deeper. That's too bad that Google doesn't provide an API. I'll try to ask them about it when I meet with their rep on Friday (at least to let them know we would be interested in such a service). In the meantime, I guess that means that TinEye is the only realistic possibility.


FYI they used to have an image search API which was deprecated in 2011.

FYI, there has been continued interest in this feature at this thread. Cc'ing related tickets T31793, T123517, T121797, T167947, T230561, T251026, and T120453, and @Quiddity who helped point to them.