
Improve Wikimedia Commons image hashing
Open, Needs Triage, Public

Description

The plan is to index all Wikimedia Commons photos using perceptual hashing and difference hashing. These hashes can be used for finding visually identical images even if the files aren't the same, i.e. when the images have been scaled or compressed.
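To illustrate why such hashes survive scaling, here is a minimal sketch of difference hashing in plain Python. It is not the code used for the actual indexing: the function names (`downscale`, `dhash`, `hamming`) are hypothetical, the input is assumed to be a grayscale image given as a list of rows, and nearest-neighbour resizing stands in for the better resampling a real image library would use.

```python
def downscale(pixels, width, height):
    """Nearest-neighbour resize of a 2D grayscale grid (list of rows)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    The image is shrunk to (hash_size + 1) x hash_size so that each row
    yields hash_size comparisons, giving hash_size * hash_size bits.
    """
    small = downscale(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            # Bit is 1 when brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Illustrative input: a 90x80 gradient and the same gradient at 2x size.
img = [[255 - x for x in range(90)] for _ in range(80)]
big = [[255 - x // 2 for x in range(180)] for _ in range(160)]
print(hamming(dhash(img), dhash(big)))  # the two sizes hash identically
```

Two files are then treated as candidates for the same picture when the Hamming distance between their hashes is small; the threshold is a tuning choice and is not specified in this task.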

Tasks that need to be solved:

Image hashing speed should be improved. The current speed is approximately 10M images per month; the target would be 30M+ images per month. One way to increase the speed would be to detect which images have already been downscaled and are in the cache, and to index those. However, there is no reliable method to check whether an image is already in the cache.
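For a sense of scale, the monthly figures above can be converted into sustained per-second rates. This is a back-of-the-envelope sketch that assumes a 30-day month and continuous operation; the helper name `images_per_second` is hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month, no downtime


def images_per_second(images_per_month):
    """Sustained rate needed to reach a monthly image count."""
    return images_per_month / SECONDS_PER_MONTH


current = images_per_second(10_000_000)
target = images_per_second(30_000_000)
print(f"current: {current:.2f} images/s")  # about 3.86
print(f"target:  {target:.2f} images/s")   # about 11.57
```

Under these assumptions the target means sustaining roughly three times the current rate, i.e. about 11.6 images per second around the clock.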

Publishing the imagehash database.

One solution would be to save the hashes to SDC (Structured Data on Commons), but that would take years if the hashes are saved one by one. The question is whether there is a faster way to do it. Also, if the hashes are stored to SDC by a bot, a community discussion will be needed before doing it. Another option could be to set up our own SPARQL server (such as Ontop) and publish the image hashes through that.