
Investigate best source for GeoPoint coordinates
Closed, Resolved · Public · 3 Estimated Story Points

Description

Possible data sources are:

  • Processed, cached OSM dump (this is how geoshape works)
  • Processed, cached Wikidata dump (P625). This would be new development.
  • SPARQL query by QID.
  • wbgetentities query

Questions:

  • What are advantages and disadvantages of each?
    • Wikidata
      • Pro: can be edited by any wiki account.
      • Pro: contains 9x more geolocated items than OSM
      • Con: we'll need a new mechanism to cache or query this source.
    • OSM
      • Pro: High visibility means more eyes to correct errors.
      • Pro: import mostly exists already.
      • Con: import is not quite configured for points; it would require adjustments.
  • Which data set is more complete? (Determine how many OSM entries are associated with a QID vs. how many Wikidata items have coordinates.)
    • 9M geolocated Wikidata items
    • 1M OSM entries linked to a Wikidata item
  • Find out how the dataflow (Wikidata -> OSM) is currently working (for GeoShapes): Do they have the same data? In which direction does the data flow? Does it happen automatically, and if so on what schedule?
    • Every 12 hours a job pulls from OSM and imports the data into Postgres/PostGIS.

Event Timeline

I would like to highlight a few options:

  1. When the user's input is a SPARQL query, I would expect them to provide a ?geo column directly from the query (a minimal example follows this list). This is also how the query service does it. All we basically do is convert the response from the query service into GeoJSON.
  2. When the user's SPARQL query doesn't give us coordinates but only Q-IDs, we could try to resolve these to coordinates. However, I'm not sure this is even necessary. Why not go with the first option above and have a ?geo field in the query?
  3. When the input is just a list of Q-IDs, one way to resolve these to coordinates is as follows:
SELECT ?item ?geo WHERE {
  VALUES ?item { wd:Q64 wd:Q1055 }
  ?item wdt:P625 ?geo
}
  • This is as fast as it can get. I think this kind of lookup directly via the primary key causes effectively zero load on the query service.
  • All the code is already there. The only difference is that this is not a user-provided query, but one we auto-generate in our code. From there on the code path is the same.
  • While OSM has some concept of "admin_centre", it doesn't help much. As the name suggests it's only for "administrative" units like cities. For many other things the Q-ID can give us a shape, but no center point. Extrapolating some kind of center point from a shape is problematic and should usually not be done. Wikidata on the other hand can give us both.
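
To illustrate option 1: a minimal sketch of what a user-provided query with a ?geo column could look like (Q39715 "lighthouse" is just an arbitrary example class):

# Hypothetical user query: items of some class together with their P625 coordinate.
# The ?geo column is what we would convert into GeoJSON.
SELECT ?item ?geo WHERE {
  ?item wdt:P31 wd:Q39715 ;
        wdt:P625 ?geo .
}
LIMIT 100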

Another option is the wbgetentities API. However, this is quite a beast. It doesn't look like it's possible to limit the output to the single coordinate property we need.

awight set the point value for this task to 3.

How many Wikidata items have a coordinate?: 9,060,013

SELECT (count(*) as ?count) WHERE {
  ?item wdt:P625 ?geo
}

Check whether there are more items with boundary coordinates but no main coordinate P625: 47,437

SELECT (count(*) as ?count) WHERE {
  { ?item wdt:P1332 ?geo. }
  union { ?item wdt:P1333 ?geo. }
  union { ?item wdt:P1334 ?geo. }
  union { ?item wdt:P1335 ?geo. }
  minus { ?item wdt:P625 ?geo . }
}

OSM import is done by imposm3, which is configured using various imposm_mapping.yml files. I haven't tracked down the production mappings, if any. This tool is optimized for polygon geometries, and the wikidata_relation_* tables seem to be filtered to only the types of rows that might have shape geometry. It seems possible to expand this import to include point entities.

Wikidata items linking to OSM: 206,226
https://www.wikidata.org/wiki/Property_talk:P402
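
That number should be roughly reproducible on the query service with a count over P402 (OSM relation ID), along the lines of:

SELECT (count(*) as ?count) WHERE {
  ?item wdt:P402 ?osm
}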

OSM relations tagged with a Wikidata QID: 1,031,124
Sophox

SELECT (count(*) as ?count) WHERE {
  ?osmid osmt:wikidata ?wd .
}

https://wikitech.wikimedia.org/wiki/Maps/OSM explains production synchronization. The production mapping configuration looks mostly the same as the kartodock file. The schedule seems to be in a messy state: it's set to h=*/12, m=*/30, which means it runs four times a day, at 00:00, 00:30, 12:00, and 12:30. (I think minute=0 is what was intended.)

awight removed awight as the assignee of this task. Apr 1 2022, 2:03 PM
awight updated the task description.
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2022-03-30 board.
awight subscribed.

One more investigation query we came up with: take the union of all items with a coordinate in either OSM or Wikidata, then subtract the ones with a coordinate in Wikidata. This gives us a count of the items which *only* have a coordinate in OSM.
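
A sketch of how that count could be obtained on Sophox, assuming its osmt:wikidata values resolve to wd: entity URIs and that federation to the Wikidata Query Service is available:

SELECT (count(DISTINCT ?wd) as ?count) WHERE {
  ?osmid osmt:wikidata ?wd .
  MINUS {
    SERVICE <https://query.wikidata.org/sparql> {
      ?wd wdt:P625 ?geo .
    }
  }
}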

Some more details in addition to what I wrote above:

  • The wbgetentities API is expensive. Example: https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbgetentities&format=json&ids=Q64%7CQ1055&props=claims&formatversion=2. While it allows querying multiple ids, it doesn't allow a lot of filtering. All statements are returned, including all references and qualifiers. That's potentially megabytes of data, multiplied by the number of ids. While the database access appears to be somewhat optimized (there is a prefetch step and some kinds of caches involved), it can't do any relevant optimization like "indexed access by item id + property id". That's what a triple store like the query service is for.
  • There is a wbgetclaims API on wikidata.org. Example: https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbgetclaims&format=json&entity=Q64&property=P625&props=&formatversion=2. While it sounds promising, it's effectively unusable for what we need: a) It can only read 1 entity at a time. We would need to call it up to 100 times in a loop. b) It also reads the entire entity. There is almost no benefit over wbgetentities, except that it saves network traffic because it drops the majority of data it just read.
  • In OSM, there is no guarantee for a Q-id to be unique, and that's probably fine on OSM. A town, for example, is often represented both as the surrounding shape and as a point (where the name of the town should appear). A quick way to quantify this is sketched after this list.
  • Already mentioned above: Even if you find a Q-id, there is no guarantee it's a point. It might be a line, shape, anything.
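
As a rough way to quantify the duplication mentioned above, a Sophox query comparing total wikidata-tag usages against distinct items might look like this:

SELECT (count(?wd) as ?tags) (count(DISTINCT ?wd) as ?items) WHERE {
  ?osmid osmt:wikidata ?wd .
}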

Overall I think we are stuck with the Wikidata-Query-Service, unless the Wikidata folks have something else we missed so far.

I also looked at the Linked Data Fragments endpoint. Our simple use case seems to match this description: "cheaply and efficiently browse triple data where one or two components of the triple is known and you need to retrieve all triples that match this template". However, I don't think it allows multiple subjects to be queried at once.

We have some good options for actually ingesting the Wikidata tuples in near-real-time, caching the coordinates locally as we do for OSM. This should be kept on the list of alternatives that we're considering.

our simple use case seems to match this description […]

It goes on explaining: "This service is implemented on the top of Blazegraph database, so it will have the same lag as the Query Service." It's certainly worth having a closer look. But from my brief investigation it looks like this just executes a SPARQL query. I can't tell, though, if it uses a different Blazegraph instance or the same one the query service uses.

WMDE-Fisch subscribed.

Will discuss results and follow-ups in story time.