
Detect duplicate images via perceptual hashes before upload
Open, High, Public

Description

In the spirit of https://commons.wikimedia.org/wiki/User:Fae/Imagehash, use https://github.com/KilianB/JImageHash to generate perceptual hashes for all pictures.

Compare these hashes to find duplicate images before upload, and always upload the best one.
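A minimal sketch of the comparison step, under stated assumptions: the task does not define "best" or a similarity threshold, so this example treats "best" as the candidate with the most pixels and takes the threshold as a parameter. The `DuplicateSketch` and `Candidate` names are hypothetical and are not part of JImageHash or tool-spacemedia.

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class DuplicateSketch {

    /** Hypothetical candidate image: a name, its perceptual hash, and its pixel dimensions. */
    static final class Candidate {
        final String name;
        final BigInteger phash;
        final int width;
        final int height;

        Candidate(String name, BigInteger phash, int width, int height) {
            this.name = name;
            this.phash = phash;
            this.width = width;
            this.height = height;
        }

        long pixels() {
            return (long) width * height;
        }
    }

    /** Fraction of differing bits between two non-negative hashes of the given bit length. */
    static double distance(BigInteger a, BigInteger b, int bits) {
        return a.xor(b).bitCount() / (double) bits;
    }

    /**
     * Returns the candidate with the most pixels among those whose hash lies within
     * maxDistance of the reference hash, or empty if none is close enough.
     * "Most pixels" is an assumed definition of "best".
     */
    static Optional<Candidate> bestDuplicate(BigInteger reference, List<Candidate> candidates,
            int bits, double maxDistance) {
        return candidates.stream()
                .filter(c -> distance(reference, c.phash, bits) <= maxDistance)
                .max(Comparator.comparingLong(Candidate::pixels));
    }
}
```

A suitable `maxDistance` would have to be tuned against real data; a value that is too high merges distinct images, and one that is too low misses re-encoded copies.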

Event Timeline

Don-vip triaged this task as High priority. Apr 25 2020, 2:45 PM
Don-vip moved this task from Backlog to In progress on the Tool-spacemedia board.

Related to T167947 / T121797

Hi, is it possible to get some pHash values for pictures in Wikimedia Commons? I would like to see whether JImageHash generates hash values compatible with Python's ImageHash, but I expect that it doesn't.

@Zache here you go:

id	asset_url	sha1	title	full_res_asset_url	full_res_sha1	url	phash	full_res_phash
"22416295"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/tanezrouft_basin/22416285-1-eng-GB/Tanezrouft_Basin.jpg"	"5818ba0c066294d45817dc66ec87f6b6d3b83245"	"Tanezrouft Basin"	"https://esamultimedia.esa.int/img/2021/01/Saharan_fractal_Sentinel-2_12January2020_432_enhanced_ML_8bit.tif"	"94cb65bde10f03c5c8a6e00a02000171fbcae0f5"	"https://www.esa.int/ESA_Multimedia/Images/2021/01/Tanezrouft_Basin"	"27hbf17dvaxj6h1hyngps0pv0peqkg06qhymmyb0mfuiy503q7"	"27hfyk4w36cz145z5itihm4o61ecom7g9a7jrcvssnm13wb9b3"

"22415247"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/spain_s_chilly_blanket/22415236-1-eng-GB/Spain_s_chilly_blanket.jpg"	"99c5be75c36a862c2bbbaac333a203921a09238c"	"Spain’s chilly blanket"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/spain_s_chilly_blanket/22415237-1-eng-GB/Spain_s_chilly_blanket.tif"	"112b150778ceb71a7629df5ad65f6ccbdf17fde3"	"https://www.esa.int/ESA_Multimedia/Images/2021/01/Spain_s_chilly_blanket"	"480codgruluugbs5n0decd72dz4h0njabnt3v1illbnhutoe2d"	"4md6a9qgq9iq32x0oebqmgmx8u41mm1tytfga1hxuiauf0q71x"

"22412552"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/madrid_snowbound/22412541-4-eng-GB/Madrid_snowbound.jpg"	"86a21ff1a2d0d2c4eedbde45d428039007186054"	"Madrid snowbound"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/madrid_snowbound/22412542-1-eng-GB/Madrid_snowbound.tif"	"bd26888f1589c29b952a34077937687f17874438"	"https://www.esa.int/ESA_Multimedia/Images/2021/01/Madrid_snowbound"	"10tyopydpwrzsyjpvxj760d71zvh9fwjv1omdcitwxi5lw7qcy"	"3gzkqznxs4apzp396ecl5zojiz11kl668hmx1jmhaznx598qk2"

"22406576"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/frosty_scenes_in_martian_summer/22406565-1-eng-GB/Frosty_scenes_in_martian_summer.jpg"	"43882d762da9c22a2a56e8099f2790a41e052b86"	"Frosty scenes in martian summer"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2021/01/frosty_scenes_in_martian_summer/22406566-1-eng-GB/Frosty_scenes_in_martian_summer.png"	"8f77b34fac670ccd0fb5b0f143d9f6db49a23919"	"https://www.esa.int/ESA_Multimedia/Images/2021/01/Frosty_scenes_in_martian_summer"	"2wysf9uu85b9ppj21cn4xpi5jnrvm46hla39cpsdo4zww6x1dq"	"2wysf9uua71ywya1lz49u64u60q0xlapchcya0hhmx4m7puuke"

"22401942"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/a-68a_iceberg_loses_chunk_of_ice/22401931-1-eng-GB/A-68A_iceberg_loses_chunk_of_ice.jpg"	"d976c3ba9a414e803c6da8d907e1807936346a1d"	"A-68A iceberg loses chunk of ice"	\N	\N	"https://www.esa.int/ESA_Multimedia/Images/2020/12/A-68A_iceberg_loses_chunk_of_ice"	"gzdx70uqzim8k8lqz65mgtwxjjbtlpuuecd4sag9visdnxner"	\N

"22399213"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/a-68a_iceberg_breaks_off/22399203-1-eng-GB/A-68A_iceberg_breaks_off.gif"	"65353abe687d3f36c90513e58fb76109ba2c8b3d"	"A-68A iceberg breaks off"	\N	\N	"https://www.esa.int/ESA_Multimedia/Images/2020/12/A-68A_iceberg_breaks_off"	"2oq8104flqjocebxkfnu1uf8qws5f7aj1xi29ubfdxdjqipiu5"	\N

"22397816"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/mountains_of_snow/22397805-1-eng-GB/Mountains_of_snow.jpg"	"6693ff3845b7c9f087f9c014076b1ff4bdc3939d"	"Mountains of snow"	\N	\N	"https://www.esa.int/ESA_Multimedia/Images/2020/12/Mountains_of_snow"	"515ithqpf9w8py9rqxzs70t3pqknhjlp0lhjf5ct93g50jrt7w"	\N

"22394509"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/rovaniemi_lapland/22394499-1-eng-GB/Rovaniemi_Lapland.jpg"	"6aad89d0f585b22a39bc36709e86e85a7883ea16"	"Rovaniemi, Lapland"	"https://esamultimedia.esa.int/img/2020/12/TIFFLapland_S1_Feb28-11Mar-04Apr_multitemp_ML.tif"	"65d270c6253561ca2f94c1fe9464eb5de937af34"	"https://www.esa.int/ESA_Multimedia/Images/2020/12/Rovaniemi_Lapland"	"wqubr8aevmnxvoj944ajkozudi6y8vsagsf2hy30nw3lq6u6x"	"x6zr2d3bd8saypzjqm344foayeo6im0jzbypdbiw5ze3fmb15"

"22393189"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/a_festive_scene_near_mars_south_pole_in_3d/22393178-1-eng-GB/A_festive_scene_near_Mars_south_pole_in_3D.jpg"	"30b999585fe9f2c6e65c4f1ace56fa0ee1456431"	"A festive scene near Mars’ south pole – in 3D"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/a_festive_scene_near_mars_south_pole_in_3d/22393179-1-eng-GB/A_festive_scene_near_Mars_south_pole_in_3D.tif"	\N	"https://www.esa.int/ESA_Multimedia/Images/2020/12/A_festive_scene_near_Mars_south_pole_in_3D"	"5gpf65hzrwdmpxa81g5mov58hdlesk4o1dm44uxukbs7ezemf"	\N

"22393148"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/perspective_view_a_heart_on_mars/22393137-1-eng-GB/Perspective_view_A_heart_on_Mars.jpg"	"8d00d0d1b23acd8dd263408b4c1bcdcad3f1d65d"	"Perspective view: A heart on Mars"	"https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2020/12/perspective_view_a_heart_on_mars/22393138-1-eng-GB/Perspective_view_A_heart_on_Mars.tif"	"454dd1d6f938a55fb73071d96e44f72001be9d8e"	"https://www.esa.int/ESA_Multimedia/Images/2020/12/Perspective_view_A_heart_on_Mars"	"14ukfuxfhsw8hedccrang4boc7txtgzl41ixi81bks3eglepjy"	"4bp5l90jx2hp6arvhlpci2b54eqmft458stdt3irj6snbw5jha"

Java code:

import java.awt.image.BufferedImage;
import java.math.BigInteger;

import com.github.kilianB.hash.Hash;
import com.github.kilianB.hashAlgorithms.HashingAlgorithm;
import com.github.kilianB.hashAlgorithms.PerceptiveHash;

public final class HashHelper {

    private static final int PHASH_RADIX = 36;

    private static final int BIT_RESOLUTION = 256;

    private static final HashingAlgorithm ALGORITHM = new PerceptiveHash(BIT_RESOLUTION);

    private static final int ALGORITHM_ID = ALGORITHM.algorithmId();

    public static BigInteger computePerceptualHash(BufferedImage image) {
        return ALGORITHM.hash(image).getHashValue();
    }

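    // Despite the name, this returns a normalized Hamming distance:
    // 0 means the hashes are identical, larger values mean less similar.
    public static double similarityScore(BigInteger phash1, String phash2) {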
        return similarityScore(phash1, decode(phash2));
    }

    public static double similarityScore(BigInteger phash1, BigInteger phash2) {
        return newHash(phash1).normalizedHammingDistanceFast(newHash(phash2));
    }

    private static Hash newHash(BigInteger phash) {
        return new Hash(phash, BIT_RESOLUTION, ALGORITHM_ID);
    }

    public static BigInteger decode(String phash) {
        return phash != null ? new BigInteger(phash, PHASH_RADIX) : null;
    }

    public static String encode(BigInteger phash) {
        return phash != null ? phash.toString(PHASH_RADIX) : null;
    }
}
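To try the values from the table above without pulling in JImageHash, the base-36 encoding and the distance can be approximated with `java.math.BigInteger` alone. This is a sketch, not the tool's code: it assumes the normalized distance is the number of differing bits divided by the hash length, and it assumes a length of 256 bits as in `BIT_RESOLUTION`. The class name `PhashCodecSketch` is made up for illustration.

```java
import java.math.BigInteger;

final class PhashCodecSketch {

    private static final int RADIX = 36;

    /** Parses a base-36 string, as stored in the phash columns, into a BigInteger. */
    static BigInteger decode(String phash) {
        return new BigInteger(phash, RADIX);
    }

    /** Renders a hash value back to its lowercase base-36 string. */
    static String encode(BigInteger phash) {
        return phash.toString(RADIX);
    }

    /** Differing bits divided by hash length; assumes non-negative hash values. */
    static double normalizedHammingDistance(BigInteger a, BigInteger b, int bits) {
        return a.xor(b).bitCount() / (double) bits;
    }

    public static void main(String[] args) {
        // phash and full_res_phash of row 22406576 ("Frosty scenes in martian summer")
        String preview = "2wysf9uu85b9ppj21cn4xpi5jnrvm46hla39cpsdo4zww6x1dq";
        String fullRes = "2wysf9uua71ywya1lz49u64u60q0xlapchcya0hhmx4m7puuke";

        BigInteger a = decode(preview);
        BigInteger b = decode(fullRes);

        // The encoding round-trips because the string is lowercase with no leading zero.
        System.out.println(encode(a).equals(preview));
        // Distance between the preview hash and the full-resolution hash.
        System.out.println(normalizedHammingDistance(a, b, 256));
    }
}
```

Comparing `phash` with `full_res_phash` of the same row is a quick sanity check, since both columns describe the same picture at different resolutions.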

OK, thanks. By default, Python's ImageHash library produces a 64-bit pHash, and even when I increase the hash size it doesn't generate the same hashes. So it is confirmed that its pHash and JImageHash's PerceptiveHash don't generate the same hashes.

I'm working on it right now: https://github.com/toolforge/tool-spacemedia/commits/develop-0.4.x
I've already computed more than 4 million hashes. "Only" ~73 million to go; it will take a few months.

Aklapper assigned this task to Don-vip.