
Check uploaded images with Google image search to find copyright violations
Open, High, Public

Description

Idea: Check images uploaded with the wizard against the Google image search to identify potential copyright violations. Example: http://goo.gl/XbNPB (hopefully the link inside this shortened link is stable).

If Google finds an identical image, add the newly uploaded image to a hidden category to be processed by Commons admins.
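
The tagging step proposed above could look roughly like the following. This is only a sketch under stated assumptions: the category name and the `Match` result shape are hypothetical (the task names neither), and the reverse image search itself is left out because no supported API is known.

```python
from dataclasses import dataclass

# Hypothetical name; the task does not say what the hidden category would be called.
REVIEW_CATEGORY = "Category:Uploads with reverse image search matches"


@dataclass
class Match:
    """One result from a reverse image search (hypothetical shape)."""
    url: str
    identical: bool  # True when the search backend reports an exact copy


def needs_review(matches):
    """Return True if at least one result is an identical copy."""
    return any(m.identical for m in matches)


def tag_for_review(wikitext, matches):
    """Append the review category to the file page text when a copy was found."""
    link = f"[[{REVIEW_CATEGORY}]]"
    if not needs_review(matches) or link in wikitext:
        return wikitext  # nothing found, or the page is already tagged
    return wikitext.rstrip("\n") + "\n" + link + "\n"
```

Results that are merely similar leave the page untouched here, which matches the "identical image" wording of the proposal.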


Version: unspecified
Severity: enhancement

Details

Reference
bz29793

Event Timeline

bzimport raised the priority of this task to Lowest. · Nov 21 2014, 11:34 PM
bzimport added a project: UploadWizard.
bzimport set Reference to bz29793.
bzimport added a subscriber: Unknown Object (MLST).
Raymond created this task. · Jul 9 2011, 9:10 PM

neilk wrote:

That's a great idea, but should it be in the UploadWizard or should a bot be doing that to everything uploaded? Things can be uploaded via API too.

In principle, for every upload. But with an integration into the UploadWizard, the user can be warned before they finish the upload.

test5555 wrote:

What I like about this search is that it finds similar images, but doesn't it always find some? Sometimes they are similar in ways one hadn't thought of, but they are unlikely to be copies of the one I started out with.

neilk wrote:

Yeah, I'm not sure what it would mean if we found similar images. Does that mean it's bad?

In any case, I don't see any supported API for Google's image similarity search, and certainly not one that returns some sort of similarity rating.

Also, who is the target audience here? Someone who is determined to copyvio will do it anyway. Perhaps there are some users who might abandon their upload once they realized we didn't want copyvio images (having missed every other warning)....

It's a neat idea, but I'm not seeing an easy way to make it work. Deferring for now.

carlb613 wrote:

Is this search finding images with the same content or merely images with the same title?

If (as I suspect) it's the latter, it may be better to look into something like TinEye.com - again not ideal, as it merely detects the same image to be on a hundred other sites without indicating the original license for any of them. It'd keep a few very tired memes and visual Internet clichés off the site, but that's about it.

Then again, under the current system I could grab a camera, take a photo of some non-notable elementary school that someone requested on some WikiProject, upload it with no tags and a textual description of "I took this photo twenty minutes ago; do what you want with it, I don't care." and rest assured that some obnoxious robot would delete the image as a copyvio before the week is done.

That's what happens when this sort of thing is entrusted to entirely-automated processes.

I have to question this.

Now, it could be interesting to pre-fill a source URL this way, but I'm not sure it's worth a call to an API that may or may not exist... Anyway, I digress from the original point.

If we tried to use this API (which may not exist) to detect whether the image exists, a large portion of the traffic would be turned away, as I understand it. Many of the images uploaded through UW are uploaded from another source, which is perfectly legal if the license is right. Since I don't see any way we could detect the license of the image from a Google Images search, with or without an API, we are in the soup.

I could maybe see this working with an image host's API, because I think they might store licensing information in a simple format, and that would allow us to pre-fill a lot of information (original source, author name, EXIF data possibly, licensing information), but it's a pretty small chance that the image exists on the image host. Maybe. Of course, this is all contingent on the image hosts' ability to search by image contents, which could be tough.

Maybe this is the sort of thing that we could consider implementing as a super-extra feature for communities that extensively use a specific image host and disable it for Commons, etc., where the images come from all over.

Just a thought!

Thehelpfulonewiki wrote:

Reassigning to wikibugs-l per bug 37789

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. · Sep 4 2015, 6:14 PM
Restricted Application added subscribers: Steinsplitter, Matanya, Aklapper. · Sep 4 2015, 6:14 PM

OK, so, to disagree with my past self...

This seems like a thing that WMF could negotiate with Google (or TinEye) and run on new uploads to Commons that are marked as "own work". Don't get too excited, those conversations still need to happen.

I've written most of a bot to do the browsing and the flagging, and I'll be polishing it when we get more information about an agreement and an API from one of the two above.

TinEye pros:

  • Easy API
  • Catch more innocuous, but more high-value, copyvios

Cons:

  • Miss a large portion of potential copyvios
  • Potential financial cost (nearly negligible, esp. given the benefit we get in return)

Google pros:

  • Catch almost every possible copyvio if it was copied from the Internet
  • Free(?)
  • Improve relationship between Google and this iteration of the Multimedia team

Cons:

  • Miss some non-Internet possible copyvios, maybe?
  • Potential false positives with the "similar image" functionality - still need human checking
  • API seems non-existent
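
The flow described in this comment (pick uploads marked as "own work", run a reverse image search, flag hits for humans) might be sketched as below. This is not the bot mentioned above. The `search` callable is a stand-in for whichever service ends up being available, and detecting "own work" via the `{{own}}` template is an assumption.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Assumption: "own work" uploads carry the {{own}} template in their description.
OWN_WORK = re.compile(r"\{\{\s*own\s*\}\}", re.IGNORECASE)


def is_own_work(wikitext: str) -> bool:
    """Return True if the page text contains the {{own}} template."""
    return OWN_WORK.search(wikitext) is not None


def flag_candidates(
    uploads: Iterable[Tuple[str, str]],
    search: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Return (title, matching URLs) for own-work uploads that the search finds elsewhere.

    `uploads` yields (file title, page wikitext); `search` is a hypothetical
    reverse image search that returns the URLs of copies it found.
    """
    flagged = []
    for title, wikitext in uploads:
        if not is_own_work(wikitext):
            continue  # only uploads claimed as own work are checked
        hits = search(title)
        if hits:
            flagged.append((title, hits))
    return flagged
```

The returned list is meant as a review queue, not a deletion list: as the cons above note, matches still need human checking.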
MarkTraceur raised the priority of this task from Lowest to High. · Jan 11 2016, 1:56 PM
Gunnex added a subscriber: Gunnex. · Feb 24 2016, 6:52 PM
In reply to T31793#1924052 (@MarkTraceur, Jan 11 2016, 1:56 PM, quoted in full above):

Just some rapid comments:
--> Forget TinEye: it might be okay for album covers and related popular stuff, but it fails at more in-depth copyvio checks.
--> Google API: that's the approach I had already thought about --> but: I have personally already received some strikes for using Google Images because of the high traffic I caused = captcha ("are you a machine or human?"). In other words: there would most likely have to be a special agreement between Google <--> WMF, clearing a special IP [range] to do the scans. Btw, sorry to say: without Google Images --> Commons = imgur.com/imageshack.us/etc.

Google pros:

  • Catch almost every possible copyvio if it was copied from the Internet

Forget it: Flickr (and other sites) + most of the social networks, especially Facebook, are barely visible through Google Images. I would say that +/- 70 % of uploads with a 960 or 720 resolution (the old FB limit until 2010) were grabbed from the Internet [Facebook]. Currently FB allows up to 2048 px. Same for Instagram: 612 px, 640 px, and since 2015 1080 px. For Instagram, the situation is better (also because Instagram is mirrored by several tools). Twitter is similar (better indexed) etc... but for all of them: social media copyvios are quite hard to find.
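
The resolution hint in this comment can be turned into a simple check. The sketch below uses only the pixel values quoted above. Comparing them against the longest edge of the image is an assumption, and a match is only a reason for a closer look, not evidence of a copyvio.

```python
# Size limits quoted in the comment above, in pixels.
SOCIAL_MEDIA_EDGES = {
    "Facebook": {720, 960, 2048},
    "Instagram": {612, 640, 1080},
}


def possible_sources(width, height):
    """Return the sites whose quoted size limit equals the image's longest edge."""
    longest = max(width, height)
    return sorted(
        site for site, edges in SOCIAL_MEDIA_EDGES.items() if longest in edges
    )
```

For example, `possible_sources(960, 720)` names Facebook, while an image with a 3000 px longest edge matches nothing.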

Curiously, Google Images also depends on the location. Often, I got better results using the .com.br (Brazil) Google Images (logged out) instead of my standard .de Google Images (logged in). Btw, currently, Google Images is failing a bit: Google shows similar [but different] photos from a site where, on opening the site, the searched photo does in fact also appear. Confusing...

Anyway: an API/bot which double-checks fresh uploads against Google Images would immensely help Commons' goal of being an online repository of free-use images, sound, and other media files, a goal I do not see being met, especially after enabling uploads through the local VisualEditor (cross-wiki uploads; see also T120867, the interim result from pt.wikipedia.org = User:Gunnex/Cross-wiki uploads from pt.wikipedia.org, and e.g. a freshly detected "cross-wiki upload case" [here from th.wiki] = flood of IT-related copyvios).

Commons is constantly understaffed [admins + users] to handle roughly 10,000 uploads a day and is already turning, IMHO, into a license-washing machine...
It has to be stopped, now. If I had the power to unplug the cross-wiki uploads... well... the approach via quantity [more uploads] is definitely wrong.

In other words: Commons urgently needs some automated tools to filter all the random/spontaneous uploads (feel free to check some newbie uploads), because the available human power in Commons is NOT able to handle these uploads. The available human power in Commons is additionally NOT able to check every upload against the copyright laws of +/- 194 different countries.

But an API/bot in cooperation with Google, double-checking all these spontaneous [cross-wiki] uploads of actors & actresses, film + album + game covers, footballers, celebrities, models, politicians, socialites, images grabbed from Panoramio/official sites/blogs etc. [uploads of the type "where I am living"], and so on... would free up time for the small Commons [power] user base to check all the more complicated cases concerning FOP/PD-Old/PD[own-work]-whatever.

My 2 cents [as I said above: "rapid"]

Why is this in UploadWizard? The upload checker should be generic for any upload method and it's not necessary for it to be part of MediaWiki even. A Python bot or web tool would be equally good, for instance.

Restricted Application added a subscriber: pywikibot-bugs-list. · Feb 25 2016, 8:05 PM
Nemo_bis added a subscriber: Fae. · Feb 25 2016, 8:06 PM

I'm tentatively moving this into Pywikibot territory, as MarkTraceur above mentioned having a Python script and @Fae also has one IIRC. However, this is mostly something to be done by Wikimedia people, and maybe a MediaWiki extension turns out to be easier, so I'm leaving some tag ambiguity for now.

Do you think that Python bots scale better than MediaWiki extensions? Think again! interwiki.py vs Wikidata...

Nemo_bis added a comment. · Edited · Mar 2 2016, 9:37 AM

Do you think that Python bots scale better than MediaWiki extensions? Think again! interwiki.py vs Wikidata

Pywikibot vs. Wikidata is certainly a false dichotomy. Bots still handle interwikis, just in a different place. In general, it depends on what "scale" you have in mind. Certainly a pywikibot solution scales better than an UploadWizard solution, as UploadWizard is only a fraction of all the uploads on MediaWiki wikis.

Why is this in UploadWizard? The upload checker should be generic for any upload method and it's not necessary for it to be part of MediaWiki even. A Python bot or web tool would be equally good, for instance.

Well, as stated above: if the uploading user could be warned while attempting to upload a copyvio, this could prevent some of those uploads from happening in the first place.