I have been regularly checking and tagging file uploads on eswiki (my main
project), so I have seen almost all of the images uploaded recently.
I realized that we don't have any good way to compare files. Sometimes you see a
file you're pretty sure has been uploaded before, but you have no way to
find it. If it was uploaded with the same name, you can guess it's the same file
(you can't be sure!). But usually it has a different name. Even if you saw it in
the same session, you'd need to look through the images to find it.
The same goes when an uploaded file says: "from X wikipedia". You need to go
there, download the file and look at it to see if it matches.
In conclusion, I decided we needed a file hash for uploads. Then a new question
came up: where do I store it?
The image table seems a good place, with a new (indexable) field created for it.
This has two problems:
-We need to change the table fields.
-We don't record the hash of deleted images. No information on reupload :(
The final solution could be a new, related table, but I didn't want to
make drastic changes to the table design.
So I tried to keep it simple and just put the MD5 hash in the logs. Pros: It's
a minor change. Cons: It's not a big change, so we can't use the full power this
feature gives us, BUT it's better than nothing. :-)
The patch I made for it (against r1495) is attached. A new $md5desc variable is
defined to hold the description with the hash to stamp on the logs. That applies
to Special:Log and the 'page history'. It also truncates the description
that appears in these logs if it's too long. It was already being truncated
before, but I couldn't find where. Was it truncated by the db?
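As a rough illustration of the idea (not the attached patch, which is PHP), the
stamping and truncation could look like the Python sketch below. The function
name, the "[md5: ...]" tag format and the 255-character limit are all
assumptions made up for the example; the real limit depends on the log comment
field.

```python
import hashlib

# Assumed limit for a log comment; the real value depends on the db field.
MAX_COMMENT_LEN = 255


def md5_description(file_bytes, description, max_len=MAX_COMMENT_LEN):
    """Return the upload description with the file's MD5 hash appended.

    The description is shortened first so that the hash tag always fits
    within max_len characters (counted as characters, not bytes).
    """
    digest = hashlib.md5(file_bytes).hexdigest()
    tag = f" [md5: {digest}]"  # hypothetical tag format
    room = max_len - len(tag)
    if len(description) > room:
        # Leave space for an ellipsis marking the truncation.
        description = description[: max(room - 3, 0)] + "..."
    return description + tag
```

Truncating the description rather than the hash keeps the hash intact, which is
the part that must survive for later comparison.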
Note that, since Special:Log can't search by description, we can't do a complete
log search to see if an image was previously uploaded, but we can use the
browser's search feature to check the recent ones, and also have a bot logging
them to make them easier to fetch.
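A bot along those lines could be sketched as follows. This assumes the hash is
stamped in the log comment as "[md5: <32 hex digits>]" (a format chosen only
for this example) and that the bot has already fetched the log entries as
(title, comment) pairs; how it fetches them is left out.

```python
import re
from collections import defaultdict

# Hypothetical tag format for the hash inside a log comment.
MD5_TAG = re.compile(r"\[md5: ([0-9a-f]{32})\]")


def index_by_hash(log_entries):
    """Group file titles by the MD5 hash found in their log comments.

    log_entries is an iterable of (title, comment) pairs; entries whose
    comment carries no hash tag are skipped.
    """
    index = defaultdict(list)
    for title, comment in log_entries:
        match = MD5_TAG.search(comment)
        if match:
            index[match.group(1)].append(title)
    return index


def find_duplicates(index):
    """Return only the hashes that appear under more than one log entry."""
    return {h: titles for h, titles in index.items() if len(titles) > 1}
```

Given such an index, checking whether a new upload was seen before is a single
dictionary lookup by hash.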
Bots could also use this data for interwiki image comparison. On the TODO list
(hard to do the way it's currently implemented) there would also be a check
against the previous image hash, to reject an upload if the new version is the
same as the previous one.
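The check itself would be small once the previous hash is available. A minimal
Python sketch, assuming the previous version's hash can be looked up somehow
(which is exactly the hard part) and using a made-up function name:

```python
import hashlib


def is_identical_reupload(new_bytes, previous_md5):
    """Tell whether a new version has the same MD5 as the previous one.

    previous_md5 is the hex digest recorded for the previous version, or
    None when no earlier version (or no recorded hash) exists.
    """
    if previous_md5 is None:
        return False
    return hashlib.md5(new_bytes).hexdigest() == previous_md5.lower()
```

The upload code would reject the file when this returns True. Equal MD5 hashes
are treated here as "same file", which is good enough for catching accidental
reuploads.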
Hashing was also requested in the SoC proposals, though as part of a more
extensive plan.
P.S. While writing this, an image was uploaded that I'm sure had been uploaded
before. Searching, I found it had, in fact, been deleted three days ago (with
the same name; of course, only my memory can attest that it's the same).
Version: unspecified
Severity: enhancement