Page MenuHomePhabricator

Implement image tracking in the monuments database
Open, LowPublic

Description

The monuments database should contain a table with all the images at Commons which have a valid identifier.

The code to extract the identifiers is already available in the unused images bot (https://fisheye.toolserver.org/browse/erfgoed/erfgoedbot/unused_monument_images.py?hb=true).

The bot should get valid template/tracker categories from the configuration (https://fisheye.toolserver.org/browse/erfgoed/erfgoedbot/monuments_config.py?hb=true)

Loop over these and for each source get all the images + metadata.


Version: unspecified
Severity: enhancement

Details

Reference
bz37455

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 12:24 AM
bzimport set Reference to bz37455.
bzimport added a subscriber: Unknown Object (MLST).

alexxw83 wrote:

The database should at least contain:

  • Filename
  • Monuments ID as in templates
  • Uploader
  • Upload date
  • Is there {{Wiki Loves Monuments yyyy}} and year

Optional:

  • coordinates
  • categories
  • image resolution
  • file size

Probably best to start with a minimal implementation where wikitext parsing is not needed. Existing tools can be converted to make use of this table. Later this information can be extended.

Did a first implementation.

CREATE TABLE image (

`country` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
`id` varchar(25) NOT NULL DEFAULT '0',
`img_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
PRIMARY KEY (`country`,`id`,`img_name`),
KEY `country_id` (`country`,`id`),
KEY `img_name` (`img_name`)

) ENGINE=MyISAM DEFAULT CHARSET=utf8

mysql> SELECT COUNT(*) FROM image;
+----------+

COUNT(*)

+----------+

878926

+----------+
1 row in set (0.00 sec)

Playing around with api at http://toolserver.org/~multichill/monapi/api.php?action=images&country=pl&id=MA/A-1028&format=json&width=2000&limit=999999

alexxw83 wrote:

Great start, thx Maarten!

Beside the additional wished fields I noticed some errors, most should be no problem for you ;)

  • IDs are filled with zeroes (28985 appears as 00028985)
  • IDs got uppercase (ArD-9-006 appears as ARD-9-006)
  • some IDs are the uppercase image name (ie 'WIGANDG 29.JPG')
  • Pictures with more than one ID-template are just once in the table, but should be as often as there are templates (ie http://commons.wikimedia.org/wiki/File:Murtalbahnbr%C3%BCcke_1.JPG)

I think I've fixed this zero problem already, just haven't updated the database yet. Same for the lowercase/uppercase thing. I'm struggling a bit with how to do the padding on Commons. Currently it's padded with '0', but this is causing problems in the USA. The NRHP uses the first two characters for the year so all the nominations in 2000 (00xxxx) get their two zero's chopped off.

I use the categorylinks table for the id's. Multiple templates adding the same category just gives one entry, so that's what I'm using.

Restricted Application added subscribers: JeanFred, Aklapper. · View Herald TranscriptSep 4 2015, 2:29 PM

I think this is done, worked for a while and broke down. We have T165086 for the brokenness and T55814 for improving it. Any point in leaving this one open?