Change Details

This is an investigation to see what we could do for wish #26, Copyvio tools for Commons.

This task: look at the existing tickets, and at the proposal and discussion (including the endorsements conversation in the green box).

Output for this task: A short proposal describing the problem/use case that we're solving, and how we could handle it, for team discussion.

Tickets: {T120453} {T31793} {T123517}

Proposal/discussion: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Copyvio_tools_for_Commons

----

The requirement is to create a system that brings to the community's attention files that have a higher likelihood of being copyright violations. Discussion is mostly around photographs, and raster images in general. However, other images and other files will be found using some of the same techniques.

There are two main groups of criteria for identifying potentially copyrighted images: what the image itself is, and what metadata exists around it and the activity of uploading it.

== 1. Image analysis ==

The first approach (likely the more difficult and/or financially costly one) is to analyze the image data itself and compare it with a collection of known files. The two key parts to this are a) the method of analysis; and b) the database to compare with. Google and TinEye seem to be the two main commercial services that could be used (cf. T31793).

# Google search API can find "similar images", where images are of the same thing or similar in other characteristics (e.g. mostly blue, high contrast, etc.).
# [[http://www.tineye.com/|TinEye's]] MatchEngine API is geared towards finding variants of the //same// image file, where it's been rotated, colorized, obscured, etc. Requires uploading images to their database.
# A third option (T121797) in this vein is to run one of the open-source image-matching engines in-house.
This may be a brilliant thing to do for other reasons (such as clever searching //within// Commons) but it doesn't help with identifying things that shouldn't be uploaded, because we wouldn't have a database of copyrighted images to compare against.

== 2. Image metadata and user characteristics ==

The second group of signals for detecting copyright violations is generally easier to access and more intuitive to compare. Roughly in order of usefulness, it's worth looking closer at files:

* that have been previously deleted (exact checksum match only?)
* from users with multiple recent deletions (or high deletion/keep ratio)
* from new users, or users with few uploads (see the [[https://tools.wmflabs.org/newbie-uploads/|newbie-uploads tool]])
* where source is {{own}} or similar
* where licence is {{custom license}} or similar
* that are small and raster (because small vector files aren't as liable to be low-quality)
* with no EXIF metadata, or missing key common fields (can indicate that a photo has been copied from, e.g., social media); see T121869

== Workflow ==

After some set of files has been isolated, it needs to be easy to act on each of them. Crucially:

* Nominate for deletion
** notify the uploader on their (globaluserinfo) home wiki
** perhaps add to [[Category:Undelete in year Y]] (cf. the dark archive ideas)
* Mark as vetted (how?)

And some other nice things could be:

* ability to whitelist and blacklist certain users/files/categories/sources
* ability to add past uploads, perhaps by category (include all descendants?)

== Interface ==

Part of MediaWiki, an extension, or a tool? Probably all three! How useful is it to non-WMF wikis? Probably quite, and so aspects of it could be considered as a non-WMF-specific extension. Not the policy-specific deletion request stuff though (for example). If a tool, can it be the same tool as CopyPatrol? Probably too different, but it should at least be built in much the same way, for the benefit of developers' familiarity.
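Whatever form the interface takes, it needs some way to decide which files to show. As a rough sketch only, the metadata and user criteria from section 2 could be combined into a single score used to rank uploads for review. Everything here is an assumption for illustration: the `Upload` record, its field names, the weights, and the thresholds (fewer than 10 uploads, more than half deleted, smaller than 640x480) are hypothetical and not taken from any existing tool. The weights only mirror the "roughly in order of usefulness" ordering.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    """Hypothetical record of the section 2 signals for one uploaded file."""
    previously_deleted: bool  # exact checksum match with a deleted file
    uploader_deleted: int     # uploader's files that were later deleted
    uploader_kept: int        # uploader's files that are still live
    source_is_own: bool       # source is {{own}} or similar
    custom_license: bool      # licence is {{custom license}} or similar
    is_raster: bool
    width: int
    height: int
    has_exif: bool


def suspicion_score(u: Upload) -> int:
    """Sum illustrative weights; a higher score means 'look at this sooner'."""
    score = 0
    if u.previously_deleted:
        score += 50
    total = u.uploader_deleted + u.uploader_kept
    # Assumed threshold: more than half of the user's uploads were deleted.
    if total > 0 and u.uploader_deleted / total > 0.5:
        score += 25
    # Assumed threshold: fewer than 10 uploads counts as a new/low-volume user.
    if total < 10:
        score += 15
    if u.source_is_own:
        score += 8
    if u.custom_license:
        score += 6
    # Assumed threshold: a raster image smaller than 640x480 counts as small.
    if u.is_raster and u.width * u.height < 640 * 480:
        score += 4
    if not u.has_exif:
        score += 2
    return score
```

A list of new files could then be sorted by this score in descending order, with whitelisted users or categories filtered out beforehand. Real weights would need tuning against past deletion decisions.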
Special:NewFiles is the most basic starting point (cf. T121870). It shows: thumbnail, filename, uploader's username, datetime, and filesize. A new interface would have these and more, but wouldn't really be a "new files" list, because it'd be leaving out lots (hopefully!) where there's good reason to trust the files. So it could perhaps make sense for it to be a separate special page?

Desired features, in three groups:

* **MediaWiki core** (as part of Special:NewFiles or elsewhere):
** Previously deleted (by exact match only)
** User's upload/deletion ratio
** Exclude certain users or categories
* **MediaWiki extension(s):**
** Is duplicate of a previously deleted image. Possible, if T120759 is done first? This is about duplication, //not// similarity; e.g. [[https://github.com/jenssegers/imagehash|imagehash]] could be useful.
* **Commons-specific features** (a tool à la CopyPatrol):
** EXIF missing / suspicious? (might not be general enough to be required on other wikis)
** Matches an external image-database similarity search (TinEye etc.)
** File size, type
** Source and licence metadata
** Flag for deletion / keep
** Home wiki notification
** Add all files from specific category trees on request
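To illustrate the difference between the "exact match" and "duplicate" checks above, here is a minimal sketch in Python. It is not the API of the linked imagehash library (which is PHP), only a generic difference hash under stated assumptions: the grayscale input is assumed to have already been downscaled to 8 rows by 9 columns by some image library, the sets of known hashes are hypothetical in-memory stand-ins for whatever store of deleted files is used, and `max_distance=4` is an arbitrary threshold.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Checksum of the raw file bytes, used for the exact-match check."""
    return hashlib.sha1(data).hexdigest()


def dhash(gray) -> int:
    """Difference hash of an 8x9 grayscale grid (assumed already downscaled).

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, giving 8 rows x 8 comparisons = 64 bits.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def match_deleted(data, gray, deleted_sha1, deleted_hashes, max_distance=4):
    """Return 'exact', 'duplicate', or None for a newly uploaded file.

    deleted_sha1 and deleted_hashes are hypothetical collections of the
    checksums and difference hashes of previously deleted files.
    """
    if sha1_hex(data) in deleted_sha1:
        return "exact"
    h = dhash(gray)
    if any(hamming(h, d) <= max_distance for d in deleted_hashes):
        return "duplicate"
    return None
```

The exact check only catches byte-identical re-uploads. The difference hash is intended to also catch copies that have been re-encoded or uniformly brightened, since those changes tend to preserve which neighbouring pixel is brighter. It is still a duplicate test rather than a similarity search.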
