
Wikimania Hackathon 2019 proposal: metadata standard for map georectification
Closed, Resolved (Public)

Event Timeline

Hi @bert, thanks for taking the time to report this and welcome to Wikimedia Phabricator!

Do you plan to attend the Wikimania Hackathon? Is this a proposal for a session? Is this a proposal for a code project?

Yes, I'm planning to attend the hackathon, and the document attached is a proposal for a code project.

Thanks for the clarification! Assigning this task to you as I assume that you plan to work on this.

Thanks for posting this task! Anything we can do to help prepare?

FYI, this is a (limited) first set of maps uploaded as a pilot project for StructuredDataOnCommons : https://commons.wikimedia.org/wiki/Category:MTN25_printed and, while quite recent, perhaps some of these maps may be interesting to work with.

I don't know too much about storing data like this in Wikimedia Commons or Wikidata, so if you can point me to some information about this, that would be useful!

Dropping this resource here:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Maps/Historical_map_properties

It might need an update but should still be relevant.

Great, thanks for linking! I think I could go through the page to spot any outdated data. There may be additional, newer properties that we could use.

On another note, I think we could try to identify challenges in data storage / exchange before the hackathon, and invite related people to this ticket.

@bert I think this is a great proposal!

When defining a format (or standard, or API...) that is of interest to a significant user group, ease of use needs to be balanced against the number of features in the format. In any case, as shown in the proposal, I think the possibility of including metadata for multiple images in the same file is a good idea since, for example, a single historical map could consist of multiple images. I think the list shared by @Abbe98 is also useful for finding many possible features to include.

Historical maps that are rectified likely do not suffer from the geographical location inaccuracy caused by the use of EPSG:4326. Also, support for coordinate reference systems other than WGS 84 has been removed from the GeoJSON standard because of interoperability issues. On the other hand, even though GeoJSON does not officially support anything other than EPSG:4326, other projections are also used with it. For example, Oskari can use EPSG:3067 in GeoJSON files. In any case, even though GDAL also supports various projections in georectification, I think one thing to consider here is whether the metadata format for georectification is meant to become a standard at some point in the future.

@bert Interesting proposal. But for me it raises some issues. Firstly, about the business case for it. Secondly, regarding implementation.

Business case first. At the moment we store this information within the Commons MapWarper app. Our reasons should be clarified for why this is unsatisfactory or suboptimal, and what the new structure would aim to achieve over what exists at present. More visibility, more transparency, more obvious accessibility might be things that would be on that list. Are there other ways that we hope what we would build would improve on what presently exists?

Secondly, implementation. If we want to store the information in a WikiCommons environment, then it needs to be built on two things: Commons structured-data statements, and Commons data objects. Currently two forms of Commons data objects are defined: tabular data objects ( https://www.mediawiki.org/wiki/Help:Tabular_Data ) and map data (geoshape) objects ( https://www.mediawiki.org/wiki/Help:Map_Data ); but additional formats could possibly be added.

A question is: what information should go where.

Currently we have a file and then a link to a georeferencer app, so eg:

More maps with georeferencing can be found in sub-categories of https://commons.wikimedia.org/wiki/Category:Maps_with_georeferencing

Bert's proposal sounds like a suggestion for an additional type of Commons data object, with a specified JSON structure. This would likely require edits to the MediaWiki code itself, which might take some time coming; and would it necessarily be storing the data where we wanted it?

For maximum visibility, and accessibility through SPARQL queries, an alternative approach would be to store much of the georeferencing metadata as structured-data (SDC) statements directly on the metadata page for the file. To group everything together, one would probably want to have a single master-statement, with further information added as qualifiers. A couple of options suggest themselves for the master-statement. One might be for it to give a link to a geo-rectified version of the map, stored statically as an image on Commons. Qualifiers would then be used to state the mask and other parameters used to generate the transformation. Multiple master-statements could be used to link to different re-projections of different parts of the map.

Alternatively, it might be more flexible to make the master-statement a definition of a particular part of the map, with the link to a geo-rectified version then one of the qualifiers. This might fit better with syntax to annotate particular regions of an image -- so stage 1 of a process might be to say that part of an image depicted, say, Orkney and was an inset map, stage 2 might be to identify a detailed mask or outline to that sub-part of the image, stage 3 might be to add georeferencing to it; potentially with several months separating each stage. The preferred data-model for annotating part of an image in SDC hasn't really been thrashed out. But it may be that the top-level master statement would be what that region of the image depicts, with a qualifier saying where in the image it is (perhaps by box, perhaps by mask), and then further qualifiers specifying metadata about the georeferencing. One limitation of SDC is that currently one can't have a qualifier on a qualifier, so if one wanted to note additional things about any of the qualifiers one couldn't. To some extent it may be possible to work round this, but this may be a limitation of the SDC model that will ultimately have to be re-visited.
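
To make the shape of that second option concrete, here is a minimal sketch in Python of a "depicts" master statement carrying single-level qualifiers. It is an illustration only, not the Wikibase JSON format: the `Statement` and `Qualifier` classes, the qualifier names marked "hypothetical", the file name and the region string are all made up for this example. Only P180 ("depicts") is an existing property.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualifier:
    prop: str   # property ID, or a placeholder label for a property that does not exist yet
    value: str  # simplified: every value is a plain string in this sketch
    # Deliberately no "qualifiers" attribute: a qualifier cannot carry its own qualifiers.


@dataclass
class Statement:
    prop: str
    value: str
    qualifiers: List[Qualifier] = field(default_factory=list)


# Master statement: what a region of the image depicts.
# All qualifier names and values below are hypothetical placeholders.
region = Statement(
    prop="P180",  # "depicts"
    value="Orkney",
    qualifiers=[
        Qualifier("region of image (hypothetical)", "box: x=60%, y=5%, w=35%, h=30%"),
        Qualifier("georectified version (hypothetical)", "File:Example map (Orkney inset, rectified).tif"),
        Qualifier("transformation type (hypothetical)", "affine"),
    ],
)


def qualifier_props(stmt: Statement) -> List[str]:
    """Return the property names of all qualifiers on a statement."""
    return [q.prop for q in stmt.qualifiers]
```

Because `Qualifier` has no nested list of its own, anything one might want to say about, for instance, the region box (who drew it, which revision it refers to) has nowhere to go in this model, which is the limitation described above.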

The detailed data for control points (the GCPS data) is not suitable for storage as structured data statements. From experience with some of the BL georeferencing, some maps can have up to 200 control points or more added. Putting all these in structured data would make the data page unreadable. Instead, the best solution for these might be to store them as a Commons tabulated-data object, with a qualifier on the master statement pointing to the Commons file representing the tabulated data object.
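
As a rough illustration of that idea, the sketch below packs a list of control points into the JSON layout used by Commons tabular data pages (`license`, `description`, `schema.fields`, `data`). The column names (`pixel_x`, `pixel_y`, `lon`, `lat`) and the function name are assumptions made for this example, not an agreed schema.

```python
import json


def gcps_to_tabular_data(gcps, description):
    """Build a dict in the Commons tabular-data layout from control points.

    gcps: iterable of (pixel_x, pixel_y, lon, lat) tuples.
    The column names below are hypothetical, chosen only for illustration.
    """
    columns = [
        ("pixel_x", "Pixel X"),
        ("pixel_y", "Pixel Y"),
        ("lon", "Longitude"),
        ("lat", "Latitude"),
    ]
    return {
        "license": "CC0-1.0",
        "description": {"en": description},
        "schema": {
            "fields": [
                {"name": name, "type": "number", "title": {"en": title}}
                for name, title in columns
            ]
        },
        # One row per control point, in the same order as the schema fields.
        "data": [list(point) for point in gcps],
    }


example = gcps_to_tabular_data(
    [(120, 340, -3.05, 59.01), (980, 310, -2.61, 59.02)],
    "Ground control points for an example map",
)
page_text = json.dumps(example, indent=2)
```

A map with 200 control points simply becomes 200 short rows in `data`, and the SDC statement on the file only needs to point at the data page.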

Looking at the Klokan georeferencer v.2 data structure (click "view source" on the georectified map page), there are some additional metadata fields there that we may wish to consider. In the GCPS data, Klokan notes the source layer that the point was georeferenced against, and also its zoom level. It also notes the zoom level of the original map. There are scenarios where this information might be useful - for example, if one of the source layers turned out to be rather badly georeferenced, so that points georeferenced against it ought to be re-done; also, perhaps, to track which source layers at which zoom levels are most useful for georeferencing. (Also some sources package together different georeferenced layers at different zoom levels, so it might be a particular layer that was badly georeferenced). This information may not be crucial, but if it is available (and eg all BL crowdsourced data is licensed CC0), then we may wish to represent it. I haven't dug into the Klokan v4 spec in any detail to see whether it has additional forms of pointwise data. In terms of global data, the Klokan v2 format also includes the co-ordinate bounding boxes and centre co-ordinates, descriptions of the image source, timestamps, user stamps, versioning info, etc -- all of which we may wish to think about, even if some of this the wiki might store for us for free.

Versioning is a potential challenge we should think about, since updates to the georeferencing might change all of: the GCPS block, the georectified image, and summary statements in the original image SDC. To look back to a previous version of the georeferencing, one would need to keep each of these in sync. So statements linking eg from the original image to the GCPS block or the georectified image (and vice versa) possibly need to be linking to a specific version of a file, as it was at a particular time, otherwise "undo"s of part of the data may cause real difficulties. This may be something we need to think about -- or perhaps it will be enough if the georeferencer app is aware of the issue, to keep look-backs synchronised.

Above I have suggested a georectified map as a materialised file in its own right on Commons. This is another thing we may need to think about. Traditionally I think the warped versions of images may have been generated on-the-fly by the warper. But for presentation on wiki pages, or use in external applications, it may be useful to have the geo-rectified image cached as an actual overlay image that can then be used directly, more easily, by all manner of 3rd party software or add-ons. The question of additional storage space vs additional processing demand may need to be assessed. As an additional consideration, georeferencer apps have typically offered multiple warping options -- eg affine transformation, global polynomial interpolation, or local spline interpolation. One might even want to offer additional options -- eg perhaps the option of an angle-preserving rotation-scaling transformation; or the option of projection estimation and direct inversion, perhaps with the option of additional interpolation on top of that. In such cases should one materialise and offer multiple georectified alternatives? Or just one? Or allow them to be displayed in the georeferencer app, with the user then having to specifically "save" one to change the one preserved? An additional complication, but I still think worth thinking about.
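
For readers unfamiliar with the simplest of those warping options, here is a minimal sketch of a least-squares affine fit from control points, in plain Python with no dependencies. It is not how MapWarper, GDAL or Klokan implement it; the function names are invented for this example, and it assumes each control point is given as (pixel_x, pixel_y, lon, lat).

```python
def solve3(m, v):
    """Solve a 3x3 linear system m * p = v by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented copy, inputs untouched
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system (e.g. collinear control points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    p = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        rest = sum(a[r][c] * p[c] for c in range(r + 1, 3))
        p[r] = (a[r][3] - rest) / a[r][r]
    return p


def fit_affine(gcps):
    """Fit lon = a*x + b*y + c and lat = d*x + e*y + f by least squares.

    gcps: list of (pixel_x, pixel_y, lon, lat); at least three non-collinear points.
    Returns ((a, b, c), (d, e, f)).
    """
    if len(gcps) < 3:
        raise ValueError("an affine fit needs at least three control points")
    rows = [(px, py, 1.0) for px, py, _, _ in gcps]
    # Normal equations: (A^T A) p = A^T t, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

    def solve_for(targets):
        atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
        return tuple(solve3(ata, atb))

    return solve_for([g[2] for g in gcps]), solve_for([g[3] for g in gcps])


def apply_affine(coef, px, py):
    """Map a pixel position to (lon, lat) with coefficients from fit_affine."""
    (a, b, c), (d, e, f) = coef
    return a * px + b * py + c, d * px + e * py + f


# Four corner points of a made-up 100x100 pixel image.
coef = fit_affine([
    (0, 0, 10.0, 50.0),
    (100, 0, 11.0, 50.0),
    (0, 100, 10.0, 49.0),
    (100, 100, 11.0, 49.0),
])
centre = apply_affine(coef, 50, 50)
```

The six coefficients are small enough to store as part of the metadata, whereas polynomial or spline warps need the full control-point set, which is one argument for keeping the GCPs and the chosen transformation type rather than only a fitted result.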

A final thought: If we want to be able to produce demos of different data modelling, it will be useful to be able to create new properties for a test instance of Commons SDC at will at Wikimania. Does that mean attaching it to a fully-loaded (at least as regards properties) test instance of Wikidata? That may be something to think about before the event, and not just for this project. (eg if there were workshops to develop different potential SDC modellings of GLAM metadata, they might want to be able to create test-versions of properties, too. Pinging @SandraF_WMF ).

@Jheald, thanks for your detailed response; I'll reply Monday or Tuesday, when I have more time!

(@Jheald: it took a little longer finding time to reply, I'm on my way to the conference in Stockholm, by bicycle. Between all the riding, it's been difficult to find time to finally reply to your comment! Will you be at the hackathon, by the way? I'd love to talk more about all of this with you in person.)

I had this idea before I knew I was going to attend the hackathon. I've worked with georectified maps from different sources and different software tools over the last years. I've always wished there was more interoperability of map metadata between all these sources/tools. And I think it's time for a search tool that can find historical maps from all important repositories of georectified historical maps.

Bert's proposal sounds like a suggestion for an additional type of Commons data object, with a specified JSON structure. This would likely require edits to the MediaWiki code itself, which might take some time coming; and would it necessarily be storing the data where we wanted it?

In my proposal I'm using a JSON structure as an example, just as a way to sum up all the fields needed to capture a complete map georectification. I'd like to use the hackathon to come up with a data structure that is compatible with, for example, Map Warper, Wikidata, Klokan and IIIF. The data structure could be a JSON structure, together with transformers to make it compatible with Commons structured-data statements, and Commons data objects.

I have made mockups for a georectification tool for Wikidocumentaries at https://wikidocumentaries.wmflabs.org/wiki/Rectifying_maps. They are based on layouts for the Wikimaps tools that were never realized, and adjusted for the Wikidocumentaries architecture. It does not include masking yet, maybe we could reflect on what is created at the hackathon. I will be eager to work on the best practises for storing rectification information in Wikimedia projects.

A final thought: If we want to be able to produce demos of different data modelling, it will be useful to be able to create new properties for a test instance of Commons SDC at will at Wikimania. Does that mean attaching it to a fully-loaded (at least as regards properties) test instance of Wikidata? That may be something to think about before the event, and not just for this project. (eg if there were workshops to develop different potential SDC modellings of GLAM metadata, they might want to be able to create test-versions of properties, too. Pinging @SandraF_WMF ).

@matthiasmullie will be at the hackathon too, and may be able to give insights here.

@bert I'll be there; I'm coming in on the Tuesday afternoon flight from Edinburgh, and then I'll be at the Comfort Hotel Xpress Stockholm Central.

My focus so far has been trying to identify good Commons categories for maps based on bounding boxes for the georeferencing. I've now got reasonable code to do that for the UK & Ireland -- see the pages here for preparations, and hoping to get some uploading going soon.

The rest of the world outside the UK is more work-in-progress, and I could use some input from people with local knowledge to try to achieve appropriate similar refinement.

I am or will be doing Wikidata matching for all the fields I can, but initially with a view to writing to Wikidata-driven field templates in conventional file description pages, rather than SDC statements. These should be easy enough to translate into matching SDC statements when the time comes, but at the moment my intention would be to wait until QuickStatements is available for bulk writing of statements, and a SPARQL service for systematic retrieval, before any systematic creation of SDC statements.

I am interested in IIIF, but don't know enough about it. Later versions of the Klokan georeferencer use IIIF for all their tile serving (ie including map layers and the image to be georeferenced). The GLAM I am working with would very much like to move its images from Klokan's IIIF service to an IIIF service provided by WikiCommons, and use that to feed the Klokan georeferencer; but that would need enterprise-level throughput, robustness and reliability, which the present WMF Labs prototype just doesn't deliver. There are some indications that a proper IIIF service might be provided by an upcoming revision of the Wiki Multimedia service, but at the moment no hard timeline or guarantees.

The other aspect of IIIF would be how to expose georeferencing information as IIIF annotations. I really know *very* little about the IIIF annotation syntax, other than that it exists. But if this looked plausible (and could be made compatible with offerings from other services, eg Klokan and Recogito), it could be quite interesting.

By the way, I have a Wikidata property proposal suggested for external georeferencer URL, since a basic thing we will want to record is whether an external service has georeferencing for an image, and if so what the relevant URL is.

The proposal has been up a couple of weeks, but so far hasn't attracted any discussion.

So, when thinking about moving the georeference data from Wikimaps Warper to Commons or Wikidata a few years back, the main options were:

  • Wikidata statements for individual control points
  • One Wikidata statement with a geometry data type (new at the time)
  • Using the "Data" namespace at Commons (CSV or GeoJSON)

I think the overall conclusion was that the Data namespace was the best option, but we never implemented it.

Quick update for those following from home.

@bert and others have been going gangbusters pushing this forward, and (I think) it's starting to look really good.

A first design was to try to package all the georeferencing information in a single Commons geoshape object, for example this revision

This was impressive, but it was a bit of a nasty hack, and packaging all of the data up to look like a geoshape creates a data-structure that's pretty unnatural and not immediately intuitive.

So a better and more straightforward approach seems to be to create three different objects in the Commons data namespace for each georectification:

It's now hoped to develop a micro-service that can read these objects and serve a Tiled Map Service (TMS) layer containing the georectified map, with caching for responsiveness.
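
One small piece such a service would need is the mapping from a TMS tile address to the geographic area it covers. The sketch below assumes the common global Web Mercator tiling and is not taken from the hackathon code; the function name is made up for this example. The main thing it shows is the row flip: TMS counts tile rows from the south, while the widespread XYZ scheme counts from the north.

```python
import math


def tms_tile_bounds(z, x, y_tms):
    """Return (west, south, east, north) in degrees for a TMS tile.

    Assumes the global Web Mercator tiling with 2**z tiles per axis.
    """
    n = 2 ** z
    y_xyz = n - 1 - y_tms  # TMS rows start at the south edge, XYZ rows at the north edge

    def lon(col):
        return col / n * 360.0 - 180.0

    def lat(row):
        # Inverse Web Mercator for the northern edge of XYZ row "row".
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return lon(x), lat(y_xyz + 1), lon(x + 1), lat(y_xyz)


world = tms_tile_bounds(0, 0, 0)
south_west_quarter = tms_tile_bounds(1, 0, 0)
```

With the bounds known, the service can work out which part of the georectified map (if any) falls inside a requested tile, render that part, and cache the result under the (z, x, y) key.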

In parallel to this, great progress has been made developing importers to bring in data where there is existing georeferencing for Commons maps on external sites. It should now be possible to import from Commons MapWarper, NYPL MapWarper, BL Georeferencer (Klokan v2), and the David Rumsey collection (Klokan v4). Special MVP award to @thisismattmiller for some of this.

Now is a good moment for discussion / consideration / review, but I have to say, myself, I am blown away by how much and how quickly in the last 48 hours people have managed to achieve on this.

To link the original file to these objects, three new SDC properties are proposed:

Due to the limitations inherent in the Wikibase statement structure (in particular, the limit of a single level of qualifiers), and the wish to potentially add qualifiers to these statements, eg relating them to a particular file revision, or to add information to distinguish between different live variant versions of the files, the most straightforward approach would seem to be for these to stand as independent full statements, related to a particular part of the image using qualifier P518 "applies to part". Additionally a new proposed qualifier "based on tabular data" (proposal) would link the mask geoshape statement to the particular files with the tabular data (out of many that might be associated with the image) used to generate it.

SDC property proposals and space for discussion at: https://www.wikidata.org/wiki/Wikidata:Property_proposal/georeferencing_data

I've created 3 GitHub repositories during the hackathon:

Closing this as the hackathon is over and all relevant links seem to be present here.