
Implement perceptual/visual image hashing/fingerprinting in MediaWiki for detection of non-exact duplicate uploads
Open, Low, Public

Description

It would be wonderful to implement some perceptual/visual image hashing/fingerprinting mechanism in MediaWiki for detection of non-exact duplicate uploads.

This came up a few times recently in relation to copyvio uploads (copyright violation), in particular on T120867 and on-wiki discussion of T120453. In this context, it would be critical to be able to find similar images cross-wiki.

I'm pretty sure this is still an area of active academic research… but then, that means there are probably papers written by smart people available for us to use! Quite a lot of content with big words like transforms and wavelets comes up for this topic.

It's clearly possible, even on a large set of images, as evidenced by Google Images and TinEye doing this. (Their algorithms actually look pretty different: Google Images often produces results that are similar in the sense of containing the same kind of object, while TinEye seems to give images that were actually derived from the same source image, whether cropped, resized or otherwise modified.) There's a Windows freeware tool called VisiPics which also does this (I found it pretty good; closed source, unfortunately) and a Linux one called GQview (primarily an image viewer, but it can detect duplicates).

The last one, GQview, is actually open source (http://gqview.sourceforge.net/view-down.html) and uses a pretty simple algorithm (src/similar.c): it essentially just resizes all images to 32x32px and uses that as a fingerprint to compare them. Images are considered similar if the fingerprints differ by no more than 5%. Not sure if we would be able to do that kind of comparison in a SQL query (but the power of SQL keeps surprising me).
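To make the idea concrete, here is a minimal sketch of that kind of fingerprint comparison. It is not GQview's code: it assumes images are already decoded to grayscale grids (lists of rows with values 0..255), uses plain box averaging for the resize, and measures "differ by no more than 5%" as the mean absolute pixel difference. The names `shrink`, `difference` and `is_similar` are made up for illustration.

```python
from typing import List

Gray = List[List[float]]  # rows of grayscale values in 0..255 (assumed input format)


def shrink(img: Gray, size: int = 32) -> Gray:
    """Box-average a grayscale image down to a size x size fingerprint."""
    h, w = len(img), len(img[0])
    out = []
    for by in range(size):
        y0 = by * h // size
        y1 = max((by + 1) * h // size, y0 + 1)  # always cover at least one row
        row = []
        for bx in range(size):
            x0 = bx * w // size
            x1 = max((bx + 1) * w // size, x0 + 1)  # and at least one column
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def difference(a: Gray, b: Gray) -> float:
    """Mean absolute difference of two equal-sized fingerprints, as a fraction of 255."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]) * 255.0)


def is_similar(img_a: Gray, img_b: Gray, threshold: float = 0.05) -> bool:
    """Treat two images as similar if their fingerprints differ by at most `threshold`."""
    return difference(shrink(img_a), shrink(img_b)) <= threshold
```

A fingerprint like this survives resizing and mild recompression, but comparing it against every stored image is a linear scan, which is the part that does not map well onto an indexed SQL lookup.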

Event Timeline

matmarex created this task. Dec 17 2015, 8:33 PM
matmarex raised the priority of this task from to Needs Triage.
matmarex updated the task description.
matmarex added subscribers: matmarex, Bawolff.
Restricted Application added a project: Multimedia. Dec 17 2015, 8:33 PM
Restricted Application added subscribers: StudiesWorld, Steinsplitter, Aklapper.
Restricted Application added a subscriber: Matanya. Dec 17 2015, 9:11 PM
Tgr added a subscriber: Tgr. Dec 17 2015, 11:08 PM

There are fairly powerful open-source AI frameworks for object recognition and such, but doing something like that seems like a very ambitious project, especially given that there isn't a lot of machine learning expertise in the Foundation. At the same time, while such functionality has very interesting uses (see e.g. T49492: Automatically propose/suggest a category for images), it does not seem particularly useful for copyvio detection.

As for fingerprinting, Commons Machinery has an algorithm + database they use in Elog.io to recognize unattributed reuse of Commons files. They participate in MediaWiki discussions sometimes; you should definitely reach out to them.

Note the somewhat related Community Wishlist tasks T120435: Improve the plagiarism detection bot and T120759: Image searches based on image recognition.

Sort-of-duplicate from olden times: T31793: Check uploaded images with Google image search to find copyright violations

I think there was a CirrusSearch task about this with some useful discussion and feedback from Nik about what kind of functionality is easily available in Elasticsearch, but I was unable to find it.

DarTar moved this task from Staged to Radar on the Research board. Dec 17 2015, 11:36 PM
zhuyifei1999 moved this task from Incoming to Backlog on the Commons board.
zhuyifei1999 added a subscriber: zhuyifei1999.
Jay8g added a subscriber: Jay8g. Dec 18 2015, 11:03 PM
MarkTraceur triaged this task as Low priority. Dec 21 2015, 8:56 PM
MarkTraceur added a subscriber: MarkTraceur.

There are fairly powerful open-source AI frameworks for object recognition and such, but doing something like that seems like a very ambitious project, especially given that there isn't a lot of machine learning expertise in the Foundation. At the same time, while such functionality has very interesting uses (see e.g. T49492: Automatically propose/suggest a category for images), it does not seem particularly useful for copyvio detection.

Yeah, that's a different thing. Machine learning could be good at finding images containing the same subject (like Google Images is), while some solid algorithm would be better at finding modified copies of the same image (like TinEye).

As for fingerprinting, Commons Machinery has an algorithm + database they use in Elog.io to recognize unattributed reuse of Commons files. They participate in MediaWiki discussions sometimes; you should definitely reach out to them.

Ooooh, this is neat. I didn't contact them, but I read their FAQs and they're using http://blockhash.io/ (https://github.com/commonsmachinery/blockhash) for hashing and the HmSearch algorithm (paper (PDF); https://github.com/commonsmachinery/hmsearch) to find similar hashes efficiently.
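For readers unfamiliar with this family of hashes, here is a rough sketch of a block-mean hash. It is a simplification written for illustration, not the actual blockhash algorithm or its reference implementation: it assumes a grayscale grid at least `grid` pixels wide and tall, and thresholds every block against one global median. The function name `block_mean_hash` is hypothetical.

```python
from statistics import median
from typing import List


def block_mean_hash(img: List[List[float]], grid: int = 8) -> int:
    """Return a grid*grid-bit integer: one bit per block, set when the
    block's mean brightness is above the median of all block means."""
    h, w = len(img), len(img[0])  # assumed: h >= grid and w >= grid
    means = []
    for by in range(grid):
        y0, y1 = by * h // grid, (by + 1) * h // grid
        for bx in range(grid):
            x0, x1 = bx * w // grid, (bx + 1) * w // grid
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(cells) / len(cells))
    mid = median(means)
    bits = 0
    for m in means:  # first block ends up in the most significant bit
        bits = (bits << 1) | (1 if m > mid else 0)
    return bits
```

Because each bit only records whether a block is brighter or darker than the median, the hash does not change under uniform brightness shifts or clean rescaling, and small edits flip only a few bits. That is what makes "find hashes within a small Hamming distance" the natural query for it.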

Yann added a subscriber: Yann. Dec 24 2015, 2:21 PM
leila set Security to None.

Not sure if we would be able to do that kind of comparison in a SQL query (but the power of SQL keeps surprising me).

I looked into this a little while back. Usually for these types of hashes, you want to find everything within a certain Hamming distance of the current hash, which is very difficult to index for in SQL.
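For reference, the query itself is trivial; it is only the indexing that is hard. A minimal sketch, assuming hashes are stored as integers and using made-up names (`hamming`, `near_matches`):

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def near_matches(query: int, hashes: Dict[str, int], max_distance: int) -> List[str]:
    """Linear scan: names of all stored hashes within max_distance of the query."""
    return [name for name, h in hashes.items() if hamming(query, h) <= max_distance]
```

In MySQL or MariaDB the same test can be written with `BIT_COUNT(hash ^ ?) <= k`, but as far as I know an ordinary B-tree index cannot help with that condition, so it still has to look at every row.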

Solutions I've heard of (but not sure how practical):

  • putting multiple permutations of the hash in the DB, with the hope that regardless of which bits differ, all the hashes of interest will be sequentially near at least one of the permutations
  • trying to use error-correcting codes in reverse: the hash is treated as a codeword, so that all hashes within a ball of Hamming radius x are converted to a single representation, and then at query time you do a couple of queries, one for each such ball that overlaps your ball of interest. (That may have been explained poorly.)
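A closely related trick, which I believe is in the same spirit as the first bullet, is the pigeonhole approach: split each hash into k+1 chunks, and any hash within Hamming distance k of the query must agree exactly on at least one chunk. Exact chunk matches are easy to index, and the candidates are then verified with a real distance check. The sketch below is only an illustration under those assumptions (64-bit hashes, an in-memory dict standing in for indexed columns, hypothetical names `chunks` and `ChunkIndex`); it is not HmSearch.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

BITS = 64  # assumed hash width


def chunks(h: int, parts: int) -> List[Tuple[int, int]]:
    """Split a hash into `parts` equal-width chunks, each tagged with its position."""
    width = BITS // parts
    mask = (1 << width) - 1
    return [(i, (h >> (i * width)) & mask) for i in range(parts)]


class ChunkIndex:
    """Exact-match index on hash chunks; finds everything within max_distance bits."""

    def __init__(self, max_distance: int) -> None:
        self.max_distance = max_distance
        self.parts = max_distance + 1  # pigeonhole: one chunk must be untouched
        self.buckets: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
        self.hashes: Dict[str, int] = {}

    def add(self, name: str, h: int) -> None:
        self.hashes[name] = h
        for key in chunks(h, self.parts):
            self.buckets[key].add(name)

    def query(self, h: int) -> List[str]:
        # Collect candidates sharing at least one chunk, then verify the distance.
        candidates: Set[str] = set()
        for key in chunks(h, self.parts):
            candidates |= self.buckets.get(key, set())
        return sorted(
            n for n in candidates
            if bin(self.hashes[n] ^ h).count("1") <= self.max_distance
        )
```

In SQL terms each chunk could be its own indexed column, so a lookup becomes a handful of equality queries plus a distance filter on the (hopefully small) candidate set. How well this works depends on how evenly the hash bits are distributed.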

Some people also make custom things using metric trees to index hashes. I remember manybubbles saying something about Elasticsearch being usable for this sort of thing.

Also check out http://www.phash.org/

MarkTraceur moved this task from Untriaged to Triaged on the Multimedia board. Dec 6 2016, 3:50 PM
Restricted Application added a subscriber: Poyekhali. Dec 6 2016, 3:50 PM
Kjetil removed a subscriber: Kjetil. Aug 7 2017, 6:45 PM
Perhelion added a subscriber: Perhelion.