
Warn the user if they're uploading a non-exact duplicate image via UploadWizard
Open · Needs Triage · Public · Feature

Description

We'll need to

  • create an implementation of the pHash algorithm that can be run inside MediaWiki (or find an open source implementation we can use)
  • create a new hash and store it for every new upload (not just via UW)
  • maybe also do this for all existing images, and store the hashes

The above is covered by T167947, and will allow the community to create tools to find duplicates.

If we also gathered deleted images and generated and stored hashes for those, this would be even more useful. Storing hashes for images on other wikis would be useful too: if those images are fair-use then they're likely to be copyrighted.

Once that's done, we'll also need to generate the hash for stashed uploads in UW, match it against the existing stored hashes, and alert the user if there's a match (and warn them if that match has already been deleted).


Some implementation details

  • Trying to figure out how to store hashes that can be searched by Hamming distance is what ran a previous version of this aground (see T121797). According to User:Fæ's work on this, we'll catch ~90% of duplicates just by checking for identical hashes, so let's just check for identical hashes
  • Hashes will need to be updated if a new version of an image is uploaded
  • We can possibly store the hashes in a mysql table with a flag to indicate if the original image has been deleted (that gets updated when the image is deleted/undeleted)
    • Or better still we'd store the pHash in the new file table, see T28741
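The storage idea above (exact-hash lookup plus a deleted flag) can be sketched in a few lines of Python. This is a minimal in-memory stand-in for the proposed table; the class and method names are invented for illustration, and lookup is exact-match only, per the "identical hashes" decision:

```python
# Hypothetical sketch of the proposed hash store: phash -> file record,
# with a flag that is flipped when the original image is deleted/undeleted.
class PerceptualHashStore:
    def __init__(self):
        self._by_hash = {}  # 64-bit int phash -> {"title": str, "deleted": bool}

    def add(self, phash, title):
        # Called for every new upload (and again when a new file version replaces it)
        self._by_hash[phash] = {"title": title, "deleted": False}

    def mark_deleted(self, phash, deleted=True):
        # Called when the image is deleted or undeleted
        self._by_hash[phash]["deleted"] = deleted

    def find_duplicate(self, phash):
        # Exact-match lookup only, per the ~90% figure above; returns None if no match
        return self._by_hash.get(phash)

store = PerceptualHashStore()
store.add(0xd1c2b3a495867788, "File:Example.jpg")
match = store.find_duplicate(0xd1c2b3a495867788)
# match is the stored record; a deleted match would trigger the stronger warning
```

In the real implementation this would be a MariaDB table (or a column in the new file table per T28741) rather than a dict, but the upload-time flow is the same: hash the stashed file, look it up, and branch on the deleted flag.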

Event Timeline

Cparle updated the task description.

It would be interesting to have such an API, for instance for mass uploading tools like OpenRefine. This could potentially enable us to check for close duplicates before upload.
It would be ideal if the hashing algorithm used could be sort of standardized, meaning that it could also be implemented client side. But I understand if this is too much of a stability guarantee to offer.

Basically, the ideal situation for us would be:

  • we compute the perceptual hash locally (with a Java implementation of the hashing algorithm)
  • we query Commons to see if this perceptual hash already exists there
  • if it does, we report it as a duplicate to the user, before they even upload the file to Commons via OpenRefine
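A client-side lookup along these lines could look as follows (in Python rather than Java, for consistency with the rest of this task). Note that the `phashsearch` list name and its parameter are entirely invented — no such Commons API exists yet — so this only illustrates the shape of the request the workflow above would need:

```python
from urllib.parse import urlencode

def build_lookup_url(phash_hex):
    # Hypothetical endpoint: hash computed locally, then checked against
    # Commons *before* the file is uploaded via OpenRefine.
    base = "https://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "phashsearch",   # invented module name, for illustration only
        "phash": phash_hex,
        "format": "json",
    }
    return base + "?" + urlencode(params)

url = build_lookup_url("d1c2b3a495867788")
# The client would fetch this URL and report any hit as a duplicate
```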

Some comments and thoughts

The currently common libraries (e.g., Python Imagehash, PHP Imagehash, OpenCV, JImageHash) use the pHash function to generate their hash values, but the values are library-specific: differences arise because implementation details and bit lengths vary. The performance and accuracy of the implementations may also differ.

Ideally, the target would be to write or document a reference pHash version that could be easily ported to other programming languages. The hash values could also be used outside of the MediaWiki core, and hash values would be metadata for Wikimedia Commons images. This would allow external apps to calculate hash values and compare them to values in Wikimedia Commons without transferring actual photos. Pywikibot and OpenRefine are clear candidates for utilizing hashes, but use cases are similar for all uploading tools, and there is a need to confirm similarities in machine learning using secondary methods, too.

Other hashes than pHash

The requirement for the image-hashing function is that it be not just accurate but also generate unique hashes. For example, Blockhash is designed to be portable, but it is not very resistant to scaling/compression, so it would require a Hamming-distance search instead of a simple check for identical hashes. Wavelet hashing has the same accuracy as pHash, but it produces many more hash collisions, so it doesn't work with 100M photos. dHash has a low hash-collision rate, and its accuracy is at the same level as pHash when it is calculated in both directions, doubling the bit length of the hash. One consideration is that more than 64 bits of hash resolution is needed to detect differences between nearly identical but unique photos (for example, a time series with small movements between frames). However, it is unknown how increasing the pHash length would affect the false-negative rate.
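The "both directions" idea for dHash can be sketched in pure Python. This assumes the image has already been converted to grayscale and resized to an 8x9 (horizontal) or 9x8 (vertical) grid of pixel values; the function names are illustrative, not any library's API:

```python
def dhash_bits(pixels, horizontal=True):
    # pixels: 2-D list of grayscale values. Each bit records whether the
    # gradient between two adjacent pixels is positive, which is what makes
    # dHash robust to uniform brightness changes.
    bits = []
    if horizontal:
        for row in pixels:                      # compare left -> right
            for x in range(len(row) - 1):
                bits.append(1 if row[x] < row[x + 1] else 0)
    else:
        for y in range(len(pixels) - 1):        # compare top -> bottom
            for x in range(len(pixels[0])):
                bits.append(1 if pixels[y][x] < pixels[y + 1][x] else 0)
    return bits

def dhash_both_directions(pixels_h, pixels_v):
    # Concatenating the horizontal and vertical hashes doubles the bit
    # length (2 x 64 = 128 bits for the usual 8x8 output), as described above.
    return dhash_bits(pixels_h, True) + dhash_bits(pixels_v, False)
```

The Python imagehash library exposes the same idea as separate `dhash` and `dhash_vertical` functions.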

Querying hashes in Mariadb

Imagehash.toolforge.org uses the Python imagehash library's 64-bit pHash + 64-bit dHash combination, so the values fit in MariaDB's unsigned BIGINT field. The search is done like this:

SELECT i1.page_id, i2.page_id
FROM imagehash AS i1, imagehash AS i2
WHERE i1.page_id = PAGE_ID
AND   i1.page_id != i2.page_id
AND ( ( i1.phash = i2.phash AND BIT_COUNT(i1.dhash ^ i2.dhash) < 4 )
   OR ( i1.dhash = i2.dhash AND BIT_COUNT(i1.phash ^ i2.phash) < 4 ) )
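The `BIT_COUNT(a ^ b)` expression in the query is a 64-bit Hamming distance. The same check can be sketched in Python (reading the two parenthesised groups as alternatives: one hash identical, the other within distance 4):

```python
def hamming64(a, b):
    # Equivalent of MariaDB's BIT_COUNT(a ^ b) on unsigned BIGINT hash values:
    # XOR leaves a 1 bit wherever the hashes differ, then count the 1 bits.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

def is_near_duplicate(phash1, dhash1, phash2, dhash2, max_dist=4):
    # Sketch of the query's predicate: pHash identical and dHash close,
    # or dHash identical and pHash close.
    return ((phash1 == phash2 and hamming64(dhash1, dhash2) < max_dist)
            or (dhash1 == dhash2 and hamming64(phash1, phash2) < max_dist))
```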

After comparing Imagehash.py and the OpenCV imagehash

There are at least the following differences in implementation:

1.) Resizing order difference

  • Imagehash converts the image to grayscale and then resizes it
  • OpenCV resizes the image and then converts it to grayscale

2.) Grayscale conversion has a float-to-integer rounding difference

  • Imagehash floors the values to integers
  • OpenCV rounds the values to the nearest integer. To get the same behaviour in OpenCV as in Imagehash:

import cv2
import numpy

cv_image = cv2.imread('test.jpg')
# Convert to float before the grayscale conversion so nothing is rounded yet
float32_image = cv_image.astype(numpy.float32)
grayscale_float32 = cv2.cvtColor(float32_image, cv2.COLOR_BGR2GRAY)
resized_grayscale_float32 = cv2.resize(grayscale_float32, (32, 32), interpolation=cv2.INTER_NEAREST_EXACT)
# Casting to uint8 truncates (floors) the values, matching Imagehash
resized_grayscale_uint8 = resized_grayscale_float32.astype(numpy.uint8)
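The practical effect of floor vs round can be shown without any imaging library, using the ITU-R BT.601 luma weights that both Pillow and OpenCV use for grayscale conversion:

```python
import math

def luma(r, g, b):
    # ITU-R BT.601 grayscale weights, as used by both Pillow ('L' mode)
    # and OpenCV (COLOR_BGR2GRAY).
    return 0.299 * r + 0.587 * g + 0.114 * b

v = luma(1, 0, 4)   # 0.755: a value where the two strategies disagree
floored = math.floor(v)  # Imagehash-style truncation -> 0
rounded = round(v)       # OpenCV-style rounding -> 1
```

A single such one-level pixel difference can flip a comparison against the midpoint and so flip a bit of the final hash, which is why the two libraries' hashes diverge.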

3.) Resizing image method

  • Imagehash uses Image.Resampling.LANCZOS (Image.Resampling.NEAREST would be the same as in OpenCV)
  • OpenCV uses INTER_LINEAR_EXACT

4.) Discrete Cosine Transform

  • OpenCV's cv2.dct function and SciPy's scipy.fftpack.dct use different normalisation methods. The workaround in Imagehash would be to use orthogonal normalisation, so the result is nearly the same as in OpenCV:
dat = fftpack.dct(fftpack.dct(pixels, axis=0, norm='ortho'), axis=1, norm='ortho')
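For reference, the orthonormal DCT-II that norm='ortho' selects can be written out in pure Python. This is a naive O(N²) sketch for checking values, not something to run on real images:

```python
import math

def dct2_ortho(x):
    # Orthonormal 1-D DCT-II, matching scipy.fftpack.dct(x, norm='ortho').
    # The 2-D version above applies this along each axis in turn.
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
        # Orthonormal scaling: sqrt(1/N) for the DC term, sqrt(2/N) otherwise
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * s)
    return out
```

Because the transform is orthonormal it preserves energy (Parseval), which is a quick way to check any port against this reference.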

5.) When calculating the midpoint of the value range, OpenCV uses the average and Imagehash uses the median.

6.) OpenCV sets pixel (0,0) to zero before calculating the midpoint.

There is also a difference in how the result bitstring is generated from the 8x8 DCT array and the midpoint. However, up to this point it was possible to modify the OpenCV and Imagehash algorithms so that the data behind the printed value is identical.
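The shared final step — thresholding the low-frequency 8x8 DCT block against a midpoint — can be sketched as follows (illustrative helper, not either library's code; point 5 above is exactly the choice between median and average here):

```python
def bits_from_dct(dct8x8, use_median=True):
    # dct8x8: the top-left 8x8 block of low-frequency DCT coefficients.
    # Each output bit records whether a coefficient is above the midpoint;
    # Imagehash uses the median, OpenCV the average.
    flat = [v for row in dct8x8 for v in row]
    if use_median:
        mid = sorted(flat)[len(flat) // 2]
    else:
        mid = sum(flat) / len(flat)
    return [1 if v > mid else 0 for v in flat]
```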

iNaturalistReviewer currently uses the Python imagehash library's phash function to perform fuzzy comparison, with a maximum Hamming distance of 4. I found this gave adequate results in detecting different versions of the same image without false positives, though I haven't done a significant amount of tuning. In practice I've found it's effective against scaling, but has very little crop tolerance.

libfastimagehash is an imagehash replacement for C/C++ ... The methods used to compute the image hashes are identical to the imagehash Python library; however, due to some slight differences in the way OpenCV vs Pillow resize images, the final image hashes are not always exactly the same.

I tested the JavaScript version and it generated the hash in reversed order, i.e. it needed to be printed using hash.toHexStringReversed() to get the same format as with the Python library. There were a couple of bits' difference when comparing to the Python hashes, because of scaling differences.

So it would require an implementation of Pillow's Lanczos to make it exact. Code for Pillow's Lanczos was linked on Reddit, if somebody wants to try writing a separate standalone resizing function.

Aklapper changed the subtype of this task from "Task" to "Feature Request".Mar 18 2025, 4:11 PM