Page MenuHomePhabricator

Implement perceptual/visual image hashing/fingerprinting in MediaWiki for detection of non-exact duplicate uploads
Open, LowPublic

Description

It would be wonderful to implement some perceptual/visual image hashing/fingerprinting mechanism in MediaWiki for detection of non-exact duplicate uploads.

This came up a few times recently in relation to copyvio uploads (copyright violation), in particular on T120867 and on-wiki discussion of T120453. In this context, it would be critical to be able to find similar images cross-wiki.

I'm pretty sure this is still an area of active academic research… but then, that means there are probably papers written by smart people available for us to use! Quite a lot of content with big words like transforms and wavelets comes up for this topic.

It's clearly possible, even on a large set of images, as evidenced by Google Images and TinEye doing this. (Their algorithms actually look pretty different; Google Images often produces results that are similar in the sense of containing the same kind of object, while TinEyes seems to give images that were actually derived from the same source image, cropped or resized or otherwise). There's a Windows freeware tool called Visipics which also does this (I found it pretty good; closed source, unfortunately) and a Linux one called GQview (primarily an image viewer, but can detect duplicates).

The last one, GQview, is actually open source (http://gqview.sourceforge.net/view-down.html) and uses a pretty simple algorithm (src/similar.c): it essentially just resizes all images to 32x32px and uses that as a fingerprint to compare them. Images are considered similar if the fingerprints differ by no more than 5%. Not sure if we would be able to do that kind of comparison in a SQL query (but the power of SQL keeps surprising me).

Event Timeline

matmarex raised the priority of this task from to Needs Triage.
matmarex updated the task description. (Show Details)
matmarex added subscribers: matmarex, Bawolff.
Restricted Application added subscribers: StudiesWorld, Steinsplitter, Aklapper. · View Herald Transcript

There are fairly powerful opensource AI frameworks for object recognition and such, but doing something like that seems like a very ambitious project, especially given that there isn't a lot of machine learning expertise in the Foundation. At the same time, while such functionality has very interesting uses (see e.g. T49492: Automatically propose/suggest a category for images), it does not seem particularly useful for copyvio detection.

As for fingerprinting, Commons Machinery has an algorithm + database they use in Elog.io to recognize unattributed reuse of Commons files. They participate in MediaWiki discussions sometimes; you should definitely reach out to them.

Note the somewhat related Community Wishlist tasks T120435: Improve the plagiarism detection bot and T120759: Image searches based on image recognition.

Sort-of-duplicate from olden times: T31793: Check uploaded images with Google image search to find copyright violations

I think there was a CirrusSearch task about this with some useful discussion and feedback from Nik about what kind of functionality is easily available in ElasticSearch, but I was unable to find it.

zhuyifei1999 moved this task from Incoming to Backlog on the Commons board.
zhuyifei1999 added a subscriber: zhuyifei1999.
MarkTraceur added a subscriber: MarkTraceur.

There are fairly powerful opensource AI frameworks for object recognition and such, but doing something like that seems like a very ambitious project, especially given that there isn't a lot of machine learning expertise in the Foundation. At the same time, while such functionality has very interesting uses (see e.g. T49492: Automatically propose/suggest a category for images), it does not seem particularly useful for copyvio detection.

Yeah, that's a different thing. Machine learning could be good at finding images containing the same subject (like Google Images is), while some solid algorithm would be better at finding modified copies of the same image (like TinEye).

As for fingerprinting, Commons Machinery has an algorithm + database they use in Elog.io to recognize unattributed reuse of Commons files. They participate in MediaWiki discussions sometimes; you should definitely reach out to them.

Ooooh, this is neat. I didn't contact them, but I read their FAQs and they're using http://blockhash.io/ (https://github.com/commonsmachinery/blockhash) for hashing and the HmSearch algorithm (paper (PDF); https://github.com/commonsmachinery/hmsearch) to find similar hashes efficiently.

Not sure if we would be able to do that kind of comparison in a SQL query (but the power of SQL keeps surprising me).

I looked into this a little while back. Usually for these types of hashes, you want to find everything within a certain hamming distance of the current hash, which is very difficult to index for in sql.

Solutions ive heard of (but not sure how practical):

  • putting multiple permutations of the hash in the db, with the hope that regardless of which bits differ, all the hashes of interest will be sequentially near at least one of the permutations
  • trying to use error correcting codes but in reverse, where the hadh is treated as an error correcting code , where all hashes in a certain ball of hamming radius x are converted to a single representation, and then during query, you do a couple queries, for all hamming radius balls that are within your interest ball. (That may have been explained poorly)

Some people also make custom things using metric trees to index hashes. I remember manybubbles saying something about elasticsrarch being usable for this sort of thing.

Also Check out http://www.phash.org/

I just tested Phash (ImageHash.py :s implementation) and for detecting duplicates it would be huge improvement compared to SHA1 even if it would be used for testing the exact matches which would be fast using SQL. There would be false negatives though.

Not sure if we would be able to do that kind of comparison in a SQL query (but the power of SQL keeps surprising me).

I looked into this a little while back. Usually for these types of hashes, you want to find everything within a certain hamming distance of the current hash, which is very difficult to index for in sql.

One solution for this could be generating shorter hashes for indexing which would lead that there is lot of hash collisions between similar images, but not so many false negatives and then filter out the correct ones from those. In SQL it would be something like this

SELECT
   p1.id as photo1 , 
   p2.id as photo2,
   BIT_COUNT(p1.64bithash ^ p2.64bithash) as bitdiff
FROM
  photos as p1, photos as p2 
WHERE 
  p1.id<p2.id 
  AND p1.16bithash=p2.16bithash
  AND BIT_COUNT(p1.64bithash ^ p2.64bithash)<5

And with 15k photos and max bitdiff < 3 it found 90% of the results compared to query without 16bithash and bitdiff < 7 it found 66% of the results. Speed difference was 30s vs <0.5s.