
Duplicate images go undetected if they have different metadata
Closed, Duplicate, Public

Description

It seems that images uploaded by Riksbot, at least those from Digitalarkivet, somehow have a different sha1sum than the same images available on Digitalarkivet's website. As an example, take Louis Armstrong at their site (also available at Commons) and compare it to Louis Armstrong at Commons.

Sha1sum

  • b0c82538e81f17eeb340f87dc622a1b3d4818c11 louis1.jpg (from foto.digitalarkivet.no)
  • 50b08d0ac86e7a7fb2c231dde2679d8c942d29dc louis2.jpg (from commons)

md5sum

  • 87d04775b160e4ee73609abb501ab1c2 louis1.jpg (from foto.digitalarkivet.no)
  • 3446eb3be7fdf5f48035b8fb263deb56 louis2.jpg (from commons)

sha256sum

  • 7b9fd274742a6ac0e4c708ad3d3c80536dc54939f69b9bd039830ddcc2fff012 louis1.jpg (from foto.digitalarkivet.no)
  • c420cd14d1f984d3f07558bf6018772a8b3b113948c917b7ee1441268c9ee4c8 louis2.jpg (from commons)

A few failing images wouldn't be a big deal, but this is not just a few images. The category for the magazine NÅ! contains 6500 images.

It could be a problem with the bot (Riksbot is a non-standard bot; perhaps it adds invisible metadata?), it could be a problem with the software at Arkivverket (PhotoStation; perhaps watermarking?), or it could be something weird going on at Commons (the test images were downloaded from Commons).

If nothing else works, then proper fingerprinting of the images at Commons should be implemented.

The downloaded test images are too large for Phabricator, so they must be downloaded from Commons. Ask me if you need my files.

I have informed the operator of Riksbot about the problems with the files.

Event Timeline

jeblad renamed this task from Duplicate images goes undetected to Duplicate images goes undetected if previously uploaded by Riksbot.Aug 4 2017, 2:42 PM
jeblad updated the task description.

I have checked with compare louis1.jpg louis2.jpg -compose src diff.png, and there is no visual difference between the images.

The images (louis1.jpg and louis2.jpg) have differences in their metadata. Once I strip the metadata, I get the same md5sum:

kjetil@titus:~/temp$ md5sum louis1.jpg
87d04775b160e4ee73609abb501ab1c2  louis1.jpg
kjetil@titus:~/temp$ md5sum louis2.jpg
3446eb3be7fdf5f48035b8fb263deb56  louis2.jpg
kjetil@titus:~/temp$ cp louis1.jpg louis1-no-metadata.jpg
kjetil@titus:~/temp$ exiv2 rm louis1-no-metadata.jpg
kjetil@titus:~/temp$ cp louis2.jpg louis2-no-metadata.jpg
kjetil@titus:~/temp$ exiv2 rm louis2-no-metadata.jpg
kjetil@titus:~/temp$ md5sum louis1-no-metadata.jpg
8b1dd87aef32bd85c1d67c53285473eb  louis1-no-metadata.jpg
kjetil@titus:~/temp$ md5sum louis2-no-metadata.jpg
8b1dd87aef32bd85c1d67c53285473eb  louis2-no-metadata.jpg

Looks like MediaWiki's algorithm for detecting duplicates needs to strip metadata before comparing if we want these images to be detected as duplicates.
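To illustrate the idea, here is a minimal Python sketch of hashing a JPEG with its metadata segments skipped. This is not how MediaWiki computes its hash and not exactly what exiv2 rm does; the function names and the choice of markers to drop (APP1 for Exif/XMP, APP13 for IPTC, COM for comments) are assumptions made for the example, and the parser ignores rarer cases such as fill bytes between segments.

```python
import hashlib
import struct

# Assumption: these markers carry only metadata.
# 0xE1 = APP1 (Exif/XMP), 0xED = APP13 (IPTC/Photoshop), 0xFE = COM.
METADATA_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return the JPEG bytes with the assumed metadata segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is image data.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"bad segment length at offset {pos}")
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no SOS marker found")


def content_sha1(data: bytes) -> str:
    """SHA-1 over the JPEG with metadata segments skipped."""
    return hashlib.sha1(strip_jpeg_metadata(data)).hexdigest()
```

Calling content_sha1() on the bytes of louis1.jpg and louis2.jpg should give the same digest if the files really differ only in those segments, which is what the exiv2 test above suggests.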

zhuyifei1999 renamed this task from Duplicate images goes undetected if previously uploaded by Riksbot to Duplicate images goes undetected if if they have different metadata.Aug 4 2017, 3:00 PM
zhuyifei1999 subscribed.

Neither image has a thumbnail; I just checked with convert louis1.jpg thumbnail:thumb1.jpg and convert louis2.jpg thumbnail:thumb2.jpg

FWIW: simple binary diff of the two files:

$ diff <(xxd L0062_965Fo30141701300043.jpg) <(xxd Louis_Armstrong_til_Oslo_og_konserter_-_L0062_965Fo30141701300043.jpg)
774,776c774,776
< 00003050: 7449 443e 786d 702e 6469 643a 3534 3141  tID>xmp.did:541A
< 00003060: 4133 4443 3337 3839 3446 4431 2038 3635  A3DC37894FD1 865
< 00003070: 4337 3435 4132 3345 3235 3844 313c 2f78  C745A23E258D1</x
---
> 00003050: 7449 443e 786d 702e 6469 643a 4637 4443  tID>xmp.did:F7DC
> 00003060: 3039 3845 4145 4345 3437 3546 2039 4242  098EAECE475F 9BB
> 00003070: 3934 4131 4337 3336 3832 4235 413c 2f78  94A1C73682B5A</x
780,782c780,782
< 000030b0: 7449 443e 786d 702e 6469 643a 3534 3141  tID>xmp.did:541A
< 000030c0: 4133 4443 3337 3839 3446 4431 2038 3635  A3DC37894FD1 865
< 000030d0: 4337 3435 4132 3345 3235 3844 313c 2f78  C745A23E258D1</x
---
> 000030b0: 7449 443e 786d 702e 6469 643a 4637 4443  tID>xmp.did:F7DC
> 000030c0: 3039 3845 4145 4345 3437 3546 2039 4242  098EAECE475F 9BB
> 000030d0: 3934 4131 4337 3336 3832 4235 413c 2f78  94A1C73682B5A</x
786,788c786,788
< 00003110: 6549 443e 786d 702e 6969 643a 3435 3645  eID>xmp.iid:456E
< 00003120: 3033 3144 3141 4530 3432 4246 2041 4531  031D1AE042BF AE1
< 00003130: 4434 3343 3942 3530 4233 4545 313c 2f78  D43C9B50B3EE1</x
---
> 00003110: 6549 443e 786d 702e 6969 643a 4434 4234  eID>xmp.iid:D4B4
> 00003120: 4545 4437 3732 4433 3446 4242 2038 4136  EED772D34FBB 8A6
> 00003130: 4242 4143 4646 3138 4138 3638 363c 2f78  BBACFF18A8686</x
799c799
< 000031e0: 372d 3038 2d30 3454 3131 3a32 323a 3532  7-08-04T11:22:52
---
> 000031e0: 372d 3035 2d33 3054 3131 3a35 303a 3232  7-05-30T11:50:22
803,804c803,804
< 00003220: 3e32 3031 372d 3038 2d30 3454 3131 3a32  >2017-08-04T11:2
< 00003230: 323a 3532 2b30 323a 3030 3c2f 786d 703a  2:52+02:00</xmp:
---
> 00003220: 3e32 3031 372d 3035 2d33 3054 3131 3a35  >2017-05-30T11:5
> 00003230: 303a 3232 2b30 323a 3030 3c2f 786d 703a  0:22+02:00</xmp:
807,808c807,808
< 00003260: 7461 4461 7465 3e32 3031 372d 3038 2d30  taDate>2017-08-0
< 00003270: 3454 3131 3a32 323a 3532 2b30 323a 3030  4T11:22:52+02:00
---
> 00003260: 7461 4461 7465 3e32 3031 372d 3035 2d33  taDate>2017-05-3
> 00003270: 3054 3131 3a35 303a 3232 2b30 323a 3030  0T11:50:22+02:00
819,820c819,820
< 00003320: 693e 3230 3137 2d30 382d 3034 5431 313a  i>2017-08-04T11:
< 00003330: 3232 3a35 322b 3032 3a30 303c 2f72 6466  22:52+02:00</rdf
---
> 00003320: 693e 3230 3137 2d30 352d 3330 5431 313a  i>2017-05-30T11:
> 00003330: 3530 3a32 322b 3032 3a30 303c 2f72 6466  50:22+02:00</rdf
823c823
< 00003360: 2d30 382d 3034 5431 313a 3232 3a35 332b  -08-04T11:22:53+
---
> 00003360: 2d30 352d 3330 5431 313a 3530 3a32 322b  -05-30T11:50:22+
826,827c826,827
< 00003390: 6466 3a6c 693e 3230 3137 2d30 382d 3034  df:li>2017-08-04
< 000033a0: 5431 313a 3232 3a35 332b 3032 3a30 303c  T11:22:53+02:00<
---
> 00003390: 6466 3a6c 693e 3230 3137 2d30 352d 3330  df:li>2017-05-30
> 000033a0: 5431 313a 3530 3a32 322b 3032 3a30 303c  T11:50:22+02:00<
956,957c956,957
< 00003bb0: 3343 3941 4446 4535 3741 3644 4444 3041  3C9ADFE57A6DDD0A
< 00003bc0: 3436 3244 3245 4445 3035 4345 3430 4337  462D2EDE05CE40C7
---
> 00003bb0: 4435 4541 3243 4334 3534 3736 3135 4645  D5EA2CC4547615FE
> 00003bc0: 3345 3831 4330 4536 4537 3039 3844 3434  3E81C0E6E7098D44
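
For anyone without xxd at hand, the same kind of comparison can be sketched in Python. The helper name differing_runs is made up for this example; it only reports the byte ranges where two buffers differ and does not interpret the XMP fields.

```python
def differing_runs(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges, end-exclusive, where a and b differ."""
    runs = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, common))
    if len(a) != len(b):
        # Any tail present in only one buffer counts as a difference.
        runs.append((common, max(len(a), len(b))))
    return runs
```

Reading both files with open(path, "rb").read() and passing the results in should list ranges that fall inside the offsets shown in the diff above (0x3050 to 0x3bcf), if the files are otherwise identical.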

Stripping exif could be a simple fix, but I'm tempted to say that a proper image hash should be used.
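
As a rough idea of what an image hash could look like, here is a minimal Python sketch of an average hash, one well-known kind of perceptual hash. Nothing in this task says which hash MediaWiki should use, so the algorithm, the function names, and the 8x8 size are all assumptions. The sketch takes already decoded grayscale pixels as a list of rows; decoding the JPEG is assumed to happen elsewhere with an image library.

```python
def average_hash(pixels: list[list[int]], hash_size: int = 8) -> int:
    """Average hash of a grayscale image given as rows of 0-255 values."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    # Downscale to hash_size x hash_size by averaging blocks of pixels.
    cells = []
    for cy in range(hash_size):
        y0 = cy * height // hash_size
        y1 = (cy + 1) * height // hash_size
        for cx in range(hash_size):
            x0 = cx * width // hash_size
            x1 = (cx + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")
```

Two uploads would then be treated as likely duplicates when the Hamming distance between their hashes is zero or below some small threshold. Unlike a checksum of the stripped file, this also matches re-encoded or slightly altered copies, at the cost of possible false positives.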