
Restore a tool for nowcommons.py to find duplicated images based on file hashes
Open, Low · Public

Description

There used to be a Toolserver tool that generates a list of potentially duplicated images between a local site and the repository (Commons) site based on the match of hash values derived from the image bytes. We would like to get the function of the tool restored, either by re-launching the code or reimplementing it.
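
For illustration, the kind of hash value involved can be sketched in Python (the language of nowcommons.py). This sketch assumes the hash is a SHA-1 of the raw file bytes, written in base 36 and zero-padded to 31 characters, which is how MediaWiki's img_sha1 column is understood to be stored. The function name sha1_base36 is made up for this example and is not part of Pywikibot or the original tool.

```python
import hashlib

# Digits used for base-36 output (lowercase).
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes) -> str:
    """Return the SHA-1 of `data` as a 31-character base-36 string.

    Assumption: this mirrors the format of MediaWiki's img_sha1 column.
    """
    number = int(hashlib.sha1(data).hexdigest(), 16)
    digits = ""
    while number:
        number, remainder = divmod(number, 36)
        digits = BASE36_DIGITS[remainder] + digits
    # A 160-bit value always fits in 31 base-36 digits; pad on the left.
    return digits.rjust(31, "0")
```

Two files with identical bytes give the same string, so comparing these values between a local wiki and Commons finds exact byte-for-byte copies only, not resized or re-encoded versions.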


Original task description:

https://toolserver.org/~multichill/nowcommons.php?language=it&page=2&filter= is no longer available
Is there another URL yet?

Event Timeline

Xqt created this task. Apr 10 2016, 8:32 PM
Xqt triaged this task as Low priority. Apr 10 2016, 8:33 PM
Restricted Application added a subscriber: Avicennasis. Apr 11 2016, 11:54 AM

Change 308427 had a related patch set uploaded (by Xqt):
[bugfix] remove -hash option from support

https://gerrit.wikimedia.org/r/308427

whym added a subscriber: whym. Sep 27 2016, 1:20 PM

@Multichill: any chance you can comment on the status of the webservice? Do we understand it correctly that it isn't/won't be restored?

I don't have any plans to restore it. I do have the original code somewhere if someone feels like forking it.

whym added a comment. Oct 3 2016, 11:22 AM

Thanks for the clarification.

I am guessing that its function was to generate a list of potentially duplicated images between a local site and the repository (Commons) site based on the match of hash values derived from the image bytes. If that is correct, I think it is a useful tool to restore.

May I suggest changing this task's title to something like "Restore/recreate a tool to find duplicated images based on file hashes"?

And then nowcommons.py can have a message that redirects users here. Some of them might be interested in reviving it.

whym renamed this task from "webservice page is no longer valid for nowcommons.py" to "Restore a tool for nowcommons.py to find duplicated images based on file hashes". Oct 8 2016, 1:51 AM
whym updated the task description.

Change 308427 merged by jenkins-bot:
[bugfix] remove -hash option from support

https://gerrit.wikimedia.org/r/308427

scfc added a subscriber: scfc. Feb 18 2017, 2:37 PM

The query should be very simple:

MariaDB [enwiki_p]> SELECT enwiki_p.image.img_name, commonswiki_p.image.img_name FROM enwiki_p.image JOIN commonswiki_p.image USING (img_sha1) LIMIT 10;
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| img_name                                                                           | img_name                                                                           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|                                                                                    |                                                                                    |
| Clergymandaugher.jpg                                                               | Clergymandaugher.jpg                                                               |
| STS_Pallada_-_August_2011.jpg                                                      | STS_Pallada_-_August_2011.jpg                                                      |
| New_Kadampa_Tradition.png                                                          | New_Kadampa_Tradition.png                                                          |
| BanburyTramway_estate_shed_2.jpg                                                   | BanburyTramway_estate_shed_2.jpg                                                   |
| Victoria_county_with_mariposa_highlighted.png                                      | Victoria_county_with_mariposa_highlighted.png                                      |
| Short_line_coupler_unequal_impedance.svg                                           | Short_line_coupler_unequal_impedance.svg                                           |
| OgMandinoInspirationalwriter.jpg                                                   | OgMandinoInspirationalwriter.jpg                                                   |
| Daisybell.jpg                                                                      | Daisybell.jpg                                                                      |
| Roses_growing_in_front_of_graves,_Menin_Road_South_Military_cemetery_977687052.jpg | Roses_growing_in_front_of_graves,_Menin_Road_South_Military_cemetery_977687052.jpg |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
10 rows in set (0.31 sec)

MariaDB [enwiki_p]>

In practice, some files need to be excluded from the result. On enwiki, for example, these are the pages in Category:Wikipedia files on Wikimedia Commons for which a local copy has been requested to be kept.
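
A minimal sketch of the join plus the exclusion described above, run against a toy in-memory SQLite database rather than the real replicas. The table names local_image, commons_image and keep_local, and all rows, are invented for this example; on the replicas the exclusion list would have to come from the category membership instead. Skipping empty img_sha1 values is an added precaution, not something the query above does.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up stand-ins for the image tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE local_image (img_name TEXT, img_sha1 TEXT);
        CREATE TABLE commons_image (img_name TEXT, img_sha1 TEXT);
        -- Hypothetical table: local files that must be kept despite a Commons copy.
        CREATE TABLE keep_local (img_name TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO local_image VALUES (?, ?)",
        [
            ("A.jpg", "h1"),
            ("B.png", "h2"),
            ("Kept.jpg", "h3"),
            ("Nohash.jpg", ""),
            ("Local_only.svg", "h5"),
        ],
    )
    conn.executemany(
        "INSERT INTO commons_image VALUES (?, ?)",
        [
            ("A.jpg", "h1"),
            ("B_on_Commons.png", "h2"),
            ("Kept.jpg", "h3"),
            ("Other.jpg", ""),
        ],
    )
    conn.execute("INSERT INTO keep_local VALUES ('Kept.jpg')")
    return conn


def find_duplicates(conn: sqlite3.Connection) -> list:
    """Return (local name, Commons name) pairs whose hashes match."""
    query = """
        SELECT l.img_name, c.img_name
        FROM local_image AS l
        JOIN commons_image AS c ON l.img_sha1 = c.img_sha1
        WHERE l.img_sha1 <> ''  -- precaution: do not pair rows with no hash
          AND l.img_name NOT IN (SELECT img_name FROM keep_local)
        ORDER BY l.img_name
    """
    return conn.execute(query).fetchall()
```

Calling find_duplicates(build_toy_db()) lists A.jpg and B.png with their Commons counterparts, and leaves out Kept.jpg, Nohash.jpg and Local_only.svg.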

I would suggest making a database report or its local equivalent out of the finished query; IMHO there is a bigger motivation to fix a list of things that is already on your wiki than to periodically browse some tool's website.

Xqt added a subscriber: XXN.