
Allow searching for similar images on Commons via perceptual hashes
Open, Low, Public


After some initial rewarding experiments using standard difference hashes and perception hashes, I'm creating this task to start a wider discussion on whether and how image hashes could be implemented globally on Commons.

The benefit of an image hash is that very near duplicates can be found, for example where the EXIF data has changed, where images have been altered through saturation changes or other minor visual enhancements, or where the same image exists on Commons at different resolutions. A globally available and searchable image hash would help battle copyright violations, and a system for finding close image hashes would make it possible to cluster related images intelligently. For experiments in finding duplicates and close matches, refer to

Using my local machine, it takes around 2 seconds to both download a 320px thumbnail and create the 64-bit hash. When run on the servers it should be an order of magnitude faster, making it realistic to generate hashes in real time, at the same time as the SHA-1 values are created for Commons files.
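For anyone unfamiliar with how these 64-bit hashes come about, here is a minimal sketch of a difference hash (dHash). It assumes the thumbnail has already been decoded and shrunk to a 9×8 grayscale grid of pixel values; in practice a library such as Pillow would handle the decoding and resizing, so the grid input here is a simplification.

```python
# Sketch of a 64-bit difference hash (dHash). Assumes the 320px
# thumbnail has already been decoded and resized to a 9x8
# grayscale grid; a library such as Pillow would do that step.

def dhash64(pixels):
    """pixels: 8 rows of 9 grayscale values (0-255).
    Each bit records whether a pixel is brighter than its
    right-hand neighbour, giving 8 x 8 = 64 bits in total."""
    h = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            h = (h << 1) | (1 if left > right else 0)
    return h
```

Because the hash is built from brightness gradients rather than exact pixel values, minor alterations such as saturation tweaks or re-scaling tend to leave most bits unchanged, which is what makes near-duplicate matching possible.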

There are implementation questions, such as which hash(es) to implement based on their benefits, and whether the standard 64-bit hash would be sufficient; these would need more experiments and testing to make the best choices. This wider project should be a WMF-supported venture: though unpaid volunteers can do some interesting things, a comprehensive project needs coordination and a modest amount of investment to get right.

Event Timeline

A bigger question is where to store the hashes. Usually perceptual hashes are compared using Hamming distance, which is rather inefficient to do in a traditional MySQL database. I remember years ago that manybubbles talked about how it would make sense to use Elasticsearch as the storage backend for image similarity, so it'd probably be useful to look in that direction.
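For reference, the Hamming distance mentioned above is simply the number of differing bits between two hashes, which is trivial to compute but hard for a conventional SQL index to accelerate across millions of rows:

```python
def hamming(h1: int, h2: int) -> int:
    """Number of bit positions where two 64-bit hashes differ:
    XOR the hashes, then count the set bits."""
    return bin(h1 ^ h2).count("1")

# Identical hashes have distance 0; near-duplicates have a small
# distance. (The right cutoff for "near" depends on the hash
# algorithm and the corpus, and would need experimentation.)
```

The storage question arises because a B-tree index can answer "distance == 0" (an equality lookup) but not "distance <= k" without scanning, which is why specialised backends come up in this discussion.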

Bawolff renamed this task from Create an ImageHash for all Commons images to Allow searching for similar images on Commons via perceptual hashes.Jun 15 2017, 11:52 AM

This is indeed a resurrection of the two-year-old T121797; however, that got waylaid by the same "bigger question" of creating an independent database to return general Hamming distances. If this proposal to make image hashes available (whether perception, difference or others) is to get anywhere, we must at least take the first step of being able to return the image hash for an image via an API request or database query. This minimal change does not require much smart programming or creative design.

With the hashes available, anyone can immediately search for hash matches, and if they wish to compare Hamming distances for non-matches, they can write separate scripts or tools to do so far more easily, the bit-wise difference being extremely simple. In my experiments with greater-than-zero distances, the results have much narrower potential utility, leading me to believe that such comparisons suit analysing rather specialized collections and questions, which means only having to process a constrained sample space. Simple matches, where the Hamming distance is zero, across all Commons images offer immediate benefits: finding duplicates, and detecting copyright violations by matching new uploads against the hashes of already deleted images rather than only comparing against the SHA-1 cryptographic hash.
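The zero-distance case argued for above needs nothing more than an exact lookup, which any indexed database column or in-memory set can serve. A minimal sketch, where the `deleted_hashes` set and its values are hypothetical stand-ins for a real table of hashes retained from deleted uploads:

```python
# Sketch of zero-Hamming-distance matching: an exact lookup
# against previously computed hashes. The set below is a
# hypothetical stand-in; in production it would be an indexed
# database column or an API query. Hash values are made up.
deleted_hashes = {
    0xD1C4B8F0E2A69C3D,
    0x00FFCC3399AA5566,
}

def flag_possible_reupload(upload_hash: int) -> bool:
    """Return True if a new upload's perceptual hash exactly
    matches the hash of an already-deleted image."""
    return upload_hash in deleted_hashes
```

An exact match is an equality test, so unlike general Hamming-distance search it needs no special backend; a standard unique-style index on the hash column is enough.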

Right now, just by straightforward comparisons of hashes we have identified over 6,000 duplicate images which were virtually impossible to discover using normal queries.

Shall we get on with a simple solution and then, in an Agile way, add more interesting functionality later on? Nobody thinks that moving forward with image hashes is a bad idea, and it looks like everyone who sees real examples of usage wants to see it become a feature of Commons.

dr0ptp4kt moved this task from Untriaged to Triaged on the Multimedia board.