
Store the hash of uploaded files to allow duplicate checking, etc.
Closed, Resolved (Public)

Description

I have been regularly checking and tagging the file uploads at eswiki (my main
project), so I have seen almost all of the images uploaded recently.

I realized that we don't have any good way to compare files. Sometimes you see a
file you're pretty sure has been uploaded before, but you have no way to find
it. If it's been uploaded with the same name you can assume it's the same
(though you can't be sure!), but usually it has another name. Even if you saw it
in the same session, you'd need to look through the images to find it.

The same goes when an uploaded file says "from X wikipedia": you need to go
there, download the file and look at it to see whether it matches.

In conclusion, I decided we needed a file hash for uploads. Then a new question
came up: where do I store it?

The image table seems a good place, creating a new (indexable) field for it.

This has two problems:
- We need to change the table fields.
- We don't record the hash of deleted images, so no information on re-upload :(

The complete solution would be a new relation table, but I didn't want to make
drastic changes to the table design.

So I tried to keep it simple and just put the MD5 hash in the logs. Pros: it's
a minor change. Cons: it's not a big change, so we can't use all the power this
feature could give us, BUT it's more than nothing. :-)

The patch I made for it (against r1495) is attached. A new $md5desc variable is
defined to hold the description with the hash to stamp on the logs. That applies
to Special:Log and the page history. It also truncates the description which
appears in these logs if it's too long. It was also previously truncated, but I
couldn't find where; was it truncated by the db?
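
For illustration, the core of that approach looks roughly like this (a sketch
only, not the attached patch; the variable names $desc and $tempName and the
exact "Md5: ..." stamp format are assumptions):

  # Sketch of the idea: hash the uploaded file and stamp the digest into the
  # log/history comment. $desc and $tempName stand for the description and
  # the uploaded file's path in the existing upload code.
  global $wgGetImageMd5;
  $md5desc = $desc;
  if ( $wgGetImageMd5 ) {
      $hash = md5_file( $tempName );                # hex MD5 of the file on disk
      $md5desc = $desc . ' (Md5: ' . $hash . ')';   # comment used for Special:Log and history
  }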

Note that since Special:Log can't be searched by description, we can't do a
full log search to see whether an image was previously uploaded, but we can use
the browser's find-in-page feature to check recent uploads, and we could also
have a bot logging them to make them easier to look up.

Bots could also use this data for comparing images across wikis. On the TODO
list (awkward to do as currently implemented) there would also be a check
against the previous version's hash, rejecting a new version that is identical
to the previous one.

Hashing was also requested in the SoC proposals, though as part of a more extensive plan.

P.S. While writing this, an image was uploaded that I'm sure had been uploaded
before. Searching, it had in fact been deleted three days ago (with the same
name; of course, only my memory can attest that it's the same file).


Version: unspecified
Severity: enhancement

Details

Reference
bz5763

Related Objects

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:12 PM
bzimport set Reference to bz5763.
bzimport added a subscriber: Unknown Object (MLST).

My patch for it

Patch against r1495. Needs $wgGetImageMd5 to be defined in LocalSettings.php.

Attached:
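
For reference, the flag would be turned on with a single line (only the
variable name comes from the patch; the boolean value shown is an assumption):

  # LocalSettings.php
  $wgGetImageMd5 = true;   # enable hash stamping in upload log entries (value assumed)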

robchur wrote:

I would urge adding an img_hash field, and storing such a hash of the file
there. This would facilitate use in other locations, e.g. the aforementioned
Summer of Code idea.

Besides the MD5 hash, other hashing methods like SHA-1 are also worth considering. :)

I agree, Rob Church, but as I explained, I tried to keep it simple. If you make
changes to the tables, you'll have to change more classes and maybe do some kind
of schema migration.

Plus, you'd be more rejealous of it, and i could have made more bugs.

I did the first step, but there's still a long way to go on this.

Shinjiman, agreed. You can see I said "decided we needed a file hash". I used
MD5 as it's more common (47,800,000 Google hits for md5 vs 18,700,000 for sha1),
but if you swap md5_file() for sha1_file() in the code, it'll give you the SHA-1
instead :)
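
To make that concrete, the swap really is a one-call change ($path stands for
whatever path the upload code already has for the file):

  $hash = md5_file( $path );    # 32 hex characters (MD5)
  $hash = sha1_file( $path );   # 40 hex characters (SHA-1) instead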

robchur wrote:

(In reply to comment #4)

I agree, Rob Church, but as I explained, I tried to keep it simple. If you make
changes to the tables, you'll have to change more classes and maybe do some kind
of schema migration.

So? Functionality should never be unreasonably sacrificed for the sake of
performance or workload.

Plus, you'd be more rejealous of it, and i could have made more bugs.

What the hell does this mean?

(In reply to comment #5)

What the hell does this mean?

a) More complexity => easier to make errors (bugs), plus I'm no expert in
MediaWiki coding.
b) You are trusted enough by the community; I need to get my patch reviewed and
accepted.

So? Functionality should never be unreasonably sacrificed for the sake of
performance or workload.

I expect the above has made my reasons clearer. Take into account that this is
my first code submission.

If you think adding an img_hash field is urgent, you could add it and have it do
nothing at first (unused). Then I can try to build on it so it actually works.

Note that even the approach is open to discussion, as it wouldn't cover
searching deleted images ^^ Maybe this should be discussed elsewhere?

Add an img_hash field. Change ImagePage.php to display the hash from img_hash
where appropriate. Add the MD5 hash to the log comment on upload, not to
img_description. Use "MD5" in the user interface, not "Md5", and put such
strings in the language file, don't hard-code them. An indexed hash for deleted
images can wait until we have a deleted image archive.
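
A rough sketch of what the ImagePage.php and language-file side of that could
look like (the message key and the accessor for img_hash are made-up names for
illustration, not actual MediaWiki code):

  # Language file: a message key instead of a hard-coded label (name made up)
  'filehash-md5' => 'MD5 hash',

  # ImagePage.php: display the stored hash where appropriate
  global $wgOut;
  $hash = $this->img->getHash();   # hypothetical accessor for the img_hash field
  if ( $hash ) {
      $wgOut->addWikiText( wfMsg( 'filehash-md5' ) . ': ' . $hash );
  }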

Apparently MD5 collisions can now be found in under a minute on a desktop PC,
with any chosen IV, since March 2006. There is public source code available to
generate these collisions. It's probably time we started migrating away from it.
The author of the March 2006 paper seems to think that SHA-1 and SHA-2 may be
similarly vulnerable, but nonetheless they might be the most practical
alternatives for the time being.

We're looking for a way to detect that the same image has been uploaded, not to
defend against deliberately engineered hashes. I doubt that the
collision-generating code still produces valid images, but it's worth knowing
about. Any link?

PHP provides a sha1_file() function too, so no problem there. There's no
sha2_file() function; there are extensions that provide it, but we probably
don't want to require more PHP extensions than strictly necessary.
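
For what it's worth, the function names involved (availability depends on the
PHP version in use):

  $sha1   = sha1_file( $path );            # built into PHP since 4.3
  $sha256 = hash_file( 'sha256', $path );  # hash extension, bundled from PHP 5.1.2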

Tim, I guess you're listing the steps to take. Again, how is a new field added?
I could alter my own table, but that would break everyone else's ;)

Yes, I know about language files. If I dared to hard-code it, that was because I
don't think there are _translated_ names for it, and also because it was a bit
simpler. ;)

robchur wrote:

(In reply to comment #9)

PHP provides a sha1_file() function too, so no problem there. There's no
sha2_file() function; there are extensions that provide it, but we probably
don't want to require more PHP extensions than strictly necessary.

So use the sha1_file() function.

Tim, I guess you're listing the steps to take. Again, how is a new field added?
I could alter my own table, but that would break everyone else's ;)

  1. Update the table definitions in the maintenance folder (all of them)
  2. Add a patch file in SQL format to the archive folder
  3. Alter maintenance/updaters.inc and add the new field as demonstrated there

This means that "everyone else" can run the update scripts and expect it to work.
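
To make steps 1-3 a bit more concrete, the registration would look roughly like
this (the field name, patch file name and exact updaters.inc entry format are
assumptions, not a reviewed change):

  # maintenance/updaters.inc: register the new field so update.php adds it
  # (entry format from memory of that era; treat as a sketch)
  array( 'image', 'img_hash', 'patch-img_hash.sql' ),

  # maintenance/archives/patch-img_hash.sql would then contain something like:
  #   ALTER TABLE /*$wgDBprefix*/image ADD img_hash CHAR(32) NOT NULL DEFAULT '';
  #   CREATE INDEX img_hash ON /*$wgDBprefix*/image (img_hash);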

Yes, I know about language files. If I dared to hard-code it, that was because I
don't think there are _translated_ names for it, and also because it was a bit
simpler. ;)

A poor excuse. Just add the message and leave the translators to decide if their
language has a word for it. 'MD5' and 'SHA1' don't sound like the sort of thing
that would do, however.

If you're going to do it, do it properly, otherwise it's useless.

OK, I think I should [[Wikipedia:Be Bold]] and try it.

robchur wrote:

*** Bug 1459 has been marked as a duplicate of this bug. ***

phil.ganchev wrote:

Is it better to expose the hashes to the user, or use them only internally so
that the user only knows that images are being compared?

ayg wrote:

Another question: what about normalization? If you're using this for image
comparison, it's unnecessarily limiting to only permit comparisons between
identical formats, identical sizes, and identical compression levels. A logical
baseline would be a smallish, low-quality JPEG (obviously stripped of metadata),
since the compression artifacts would matter when comparing JPEGs to lossless
formats. More hits are better than fewer, of course, given that we aren't
looking to *prevent* anyone from saving a duplicate, just giving them the option
of cancelling and/or superseding the other image(s).

(In reply to comment #13)

Is it better to expose the hashes to the user, or use them only internally so
that the user only knows that images are being compared?

May as well expose them, unless you're going to have some kind of encryption
step using a private key (which seems more than slightly paranoid). This is an
open-source project, after all; anyone could just make the hashes themselves.

plugwash wrote:

Best to expose the hashes; it's much easier to copy a hash from one wiki and use
it to search on another than to save and re-upload the file everywhere.

ayg wrote:

(In reply to comment #15)

The hash will be the filename.

Storing a normalized hash for further comparison would remain useful.

A "normalized hash" doesn't sound very practical when it comes to
images. It is possible to compare similar images, but that's
going to be something totally unrelated.

ayg wrote:

(In reply to comment #18)

A "normalized hash" doesn't sound very practical when it comes to
images. It is possible to compare similar images, but that's
going to be something totally unrelated.

I mean "hash of a normalized image". If you normalize the image to low-quality
fixed-size JPEG before saving, you'll be able to catch a lot of matches that
wouldn't otherwise show up due to different formats, sizes, compression levels,
even metadata. Still not perfect, but what is?

Not just not perfect, but totally impractical. You're not going
to get cryptographic hashes to match that way, at all. It simply
wouldn't work, as any 1-bit difference will give you a hugely
different value.

ayg wrote:

You're right, normalization needs to be much more extreme than just converting
to low-quality JPEG. I achieved it with two test images, [[Image:Libertatis
Aequilibritas GFDL.jpg]] and [[Image:Libertatis Aequilibritas GFDL.png]], by
reducing both to 10-pixel-wide monochrome with no dithering, after converting
transparency to white; they were then identical except that for some reason they
were negatives of each other, presumably an artifact of the algorithm used.
This would give a 2^-100 probability of a chance match, which isn't much worse
than the probability of a random MD5 match.
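
For anyone who wants to reproduce that kind of normalization, here is a sketch
that shells out to ImageMagick (the convert options, the temporary-file
handling and the final MD5 are assumptions about one way to do it, and the
negative/polarity issue mentioned above is not handled):

  # Normalize an image to a tiny bilevel bitmap, then hash the result.
  # All option choices are assumptions; this is not the tested procedure itself.
  function normalizedImageHash( $path ) {
      $tmp = tempnam( sys_get_temp_dir(), 'norm' );
      $cmd = 'convert ' . escapeshellarg( $path ) .
             ' -background white -flatten' .        # flatten transparency onto white
             ' -resize 10' .                        # 10 pixels wide, height to scale
             ' -colorspace Gray -threshold 50%' .   # monochrome without dithering
             ' ' . escapeshellarg( 'pbm:' . $tmp ); # write as a portable bitmap
      shell_exec( $cmd );
      $hash = md5_file( $tmp );
      unlink( $tmp );
      return $hash;
  }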

gmaxwell wrote:

Some useful background for anyone else looking at this issue:

Fuzzy image matching, also called image indexing, perceptual hashing, or image fingerprinting, is an area under active research.

A paper you might want to read is: http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/mikolajczyk_pami2004.pdf

Beyond the difficulty of finding good descriptors, fast lookup also tends to be a problem. Good image descriptors tend to be high-dimensional, and traditional tree-based (e.g. kd-tree) approaches fail to produce fast nearest-match lookups on high-dimensional data.

For the Wikimedia projects we store SHA1s for deleted images. There is now a set of IRC bots in (#commons-image-uploads2, #wikipedia-en-image-uploads) which check all new uploads against the deleted image SHA1s. They are catching a fair number of reuploads of deleted images.

I'm hoping to add first-pass fuzzy matching support in the next couple of weeks. I'm not sure how a fuzzy 'similar images' feature could be integrated into MediaWiki proper.
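
Not the bots themselves, but the core of that kind of check is tiny (the
hash-list file name and its one-hash-per-line format are invented for this
example; the real bots get the deleted-image SHA-1s from the wiki):

  # Compare a new upload against a list of SHA-1s of deleted images.
  $deleted = array_flip( file( 'deleted-sha1s.txt', FILE_IGNORE_NEW_LINES ) );
  $hash = sha1_file( $newUploadPath );   # path of the file just uploaded
  if ( isset( $deleted[$hash] ) ) {
      echo "Matches a previously deleted file (SHA-1 $hash)\n";
  }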

SHA1 hash field got added a while ago. Yay!