
Detect duplicate images via perceptual hashes before upload
Open, High, Public


In the spirit of, use to generate perceptual hashes for all pictures.

Compare these hashes to find duplicate images before upload, and always upload the best one.
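A minimal sketch of the comparison step, assuming hashes are stored as non-negative BigInteger values of a known bit length. The class name PhashCompare and the 0.1 threshold are hypothetical and only serve the illustration; a real threshold would need tuning on actual images.

```java
import java.math.BigInteger;

final class PhashCompare {

    // Hypothetical cut-off: below this fraction of differing bits,
    // two images are treated as probable duplicates.
    static final double DUPLICATE_THRESHOLD = 0.1;

    /** Fraction of bits that differ between two non-negative hashes. */
    static double normalizedHammingDistance(BigInteger a, BigInteger b, int bitLength) {
        // XOR leaves a 1 wherever the two hashes disagree.
        return a.xor(b).bitCount() / (double) bitLength;
    }

    static boolean isProbableDuplicate(BigInteger a, BigInteger b, int bitLength) {
        return normalizedHammingDistance(a, b, bitLength) < DUPLICATE_THRESHOLD;
    }
}
```

Choosing "the best one" among probable duplicates (for example the highest resolution) is a separate decision that this sketch does not cover.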

Event Timeline

Don-vip triaged this task as High priority. Apr 25 2020, 2:45 PM
Don-vip moved this task from Backlog to In progress on the Tool-spacemedia board.

Related to T167947 / T121797

Hi, is it possible to get some pHash values for pictures in Wikimedia Commons? I would like to check whether JImageHash generates hash values compatible with Python's ImageHash, but I expect that it doesn't.

@Zache here you go:

id	asset_url	sha1	title	full_res_asset_url	full_res_sha1	url	phash	full_res_phash
"22416295"	""	"5818ba0c066294d45817dc66ec87f6b6d3b83245"	"Tanezrouft Basin"	""	"94cb65bde10f03c5c8a6e00a02000171fbcae0f5"	""	"27hbf17dvaxj6h1hyngps0pv0peqkg06qhymmyb0mfuiy503q7"	"27hfyk4w36cz145z5itihm4o61ecom7g9a7jrcvssnm13wb9b3"

"22415247"	""	"99c5be75c36a862c2bbbaac333a203921a09238c"	"Spain’s chilly blanket"	""	"112b150778ceb71a7629df5ad65f6ccbdf17fde3"	""	"480codgruluugbs5n0decd72dz4h0njabnt3v1illbnhutoe2d"	"4md6a9qgq9iq32x0oebqmgmx8u41mm1tytfga1hxuiauf0q71x"

"22412552"	""	"86a21ff1a2d0d2c4eedbde45d428039007186054"	"Madrid snowbound"	""	"bd26888f1589c29b952a34077937687f17874438"	""	"10tyopydpwrzsyjpvxj760d71zvh9fwjv1omdcitwxi5lw7qcy"	"3gzkqznxs4apzp396ecl5zojiz11kl668hmx1jmhaznx598qk2"

"22406576"	""	"43882d762da9c22a2a56e8099f2790a41e052b86"	"Frosty scenes in martian summer"	""	"8f77b34fac670ccd0fb5b0f143d9f6db49a23919"	""	"2wysf9uu85b9ppj21cn4xpi5jnrvm46hla39cpsdo4zww6x1dq"	"2wysf9uua71ywya1lz49u64u60q0xlapchcya0hhmx4m7puuke"

"22401942"	""	"d976c3ba9a414e803c6da8d907e1807936346a1d"	"A-68A iceberg loses chunk of ice"	\N	\N	""	"gzdx70uqzim8k8lqz65mgtwxjjbtlpuuecd4sag9visdnxner"	\N

"22399213"	""	"65353abe687d3f36c90513e58fb76109ba2c8b3d"	"A-68A iceberg breaks off"	\N	\N	""	"2oq8104flqjocebxkfnu1uf8qws5f7aj1xi29ubfdxdjqipiu5"	\N

"22397816"	""	"6693ff3845b7c9f087f9c014076b1ff4bdc3939d"	"Mountains of snow"	\N	\N	""	"515ithqpf9w8py9rqxzs70t3pqknhjlp0lhjf5ct93g50jrt7w"	\N

"22394509"	""	"6aad89d0f585b22a39bc36709e86e85a7883ea16"	"Rovaniemi, Lapland"	""	"65d270c6253561ca2f94c1fe9464eb5de937af34"	""	"wqubr8aevmnxvoj944ajkozudi6y8vsagsf2hy30nw3lq6u6x"	"x6zr2d3bd8saypzjqm344foayeo6im0jzbypdbiw5ze3fmb15"

"22393189"	""	"30b999585fe9f2c6e65c4f1ace56fa0ee1456431"	"A festive scene near Mars’ south pole – in 3D"	""	\N	""	"5gpf65hzrwdmpxa81g5mov58hdlesk4o1dm44uxukbs7ezemf"	\N

"22393148"	""	"8d00d0d1b23acd8dd263408b4c1bcdcad3f1d65d"	"Perspective view: A heart on Mars"	""	"454dd1d6f938a55fb73071d96e44f72001be9d8e"	""	"14ukfuxfhsw8hedccrang4boc7txtgzl41ixi81bks3eglepjy"	"4bp5l90jx2hp6arvhlpci2b54eqmft458stdt3irj6snbw5jha"

Java code:

import java.awt.image.BufferedImage;
import java.math.BigInteger;

import com.github.kilianB.hash.Hash;
import com.github.kilianB.hashAlgorithms.HashingAlgorithm;
import com.github.kilianB.hashAlgorithms.PerceptiveHash;

public final class HashHelper {

    // Radix used to store hashes as compact strings
    private static final int PHASH_RADIX = 36;

    // Hash length in bits
    private static final int BIT_RESOLUTION = 256;

    private static final HashingAlgorithm ALGORITHM = new PerceptiveHash(BIT_RESOLUTION);

    private static final int ALGORITHM_ID = ALGORITHM.algorithmId();

    public static BigInteger computePerceptualHash(BufferedImage image) {
        return ALGORITHM.hash(image).getHashValue();
    }

    public static double similarityScore(BigInteger phash1, String phash2) {
        return similarityScore(phash1, decode(phash2));
    }

    // Despite the name, this is a normalized Hamming distance: 0 means identical hashes
    public static double similarityScore(BigInteger phash1, BigInteger phash2) {
        return newHash(phash1).normalizedHammingDistanceFast(newHash(phash2));
    }

    private static Hash newHash(BigInteger phash) {
        return new Hash(phash, BIT_RESOLUTION, ALGORITHM_ID);
    }

    // Parses a base-36 string back into the hash value
    public static BigInteger decode(String phash) {
        return phash != null ? new BigInteger(phash, PHASH_RADIX) : null;
    }

    // Serializes the hash value as a base-36 string
    public static String encode(BigInteger phash) {
        return phash != null ? phash.toString(PHASH_RADIX) : null;
    }
}
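To line the base-36 values in the table up against hashes from another tool, it may help to re-encode them as hex. The sketch below depends only on the JDK; the class name PhashCodec and the toHex helper are hypothetical, and it assumes the other tool prints its hashes as hex strings (Python's ImageHash appears to do so).

```java
import java.math.BigInteger;

final class PhashCodec {

    static final int RADIX = 36;   // same radix as the phash column above
    static final int BITS = 256;   // same bit resolution as the Java code above

    static BigInteger decode(String phash) {
        return new BigInteger(phash, RADIX);
    }

    static String encode(BigInteger value) {
        return value.toString(RADIX);
    }

    /** Hex form of the hash, left-padded with zeros to BITS / 4 digits. */
    static String toHex(BigInteger value) {
        String hex = value.toString(16);
        StringBuilder padded = new StringBuilder();
        for (int i = hex.length(); i < BITS / 4; i++) {
            padded.append('0');
        }
        return padded.append(hex).toString();
    }
}
```

For example, PhashCodec.toHex(PhashCodec.decode("27hbf17dvaxj6h1hyngps0pv0peqkg06qhymmyb0mfuiy503q7")) yields a 64-digit hex string for the first row of the table.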

OK, thanks. By default, the pHash length in Python's ImageHash library is 64 bits, and even if I make it longer it doesn't generate the same hashes. So it is confirmed that ImageHash's pHash and JImageHash's PerceptiveHash don't generate the same hashes.