
Investigation: Keep simplified geoshapes in maps database
Closed, Resolved · Public · 5 Estimated Story Points

Description

The geoshapes service currently runs a transformation every time a client requests geoshapes:

-- $1~ is the source table name, $2:csv the list of requested Wikidata IDs,
-- and $3 the simplification granularity (default 0.001).
SELECT id, ST_AsGeoJSON(ST_Transform(ST_Simplify(geometry, $3*sqrt(ST_Area(ST_Envelope(geometry)))), 4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM $1~
         WHERE wikidata IN ($2:csv)
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq

This seems like an expensive query, and it might be more efficient to run it only once, at import time. The dynamic granularity "$3" might never be changed from the default in production requests.
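
If the simplified GeoJSON were precomputed, the per-request work would shrink to an indexed lookup. A minimal sketch of such a query, assuming a precomputed table shaped like the public.simplified_geojson table referenced in the outcome below (table and column names are illustrative, not a final schema):

-- Hypothetical per-request lookup against a precomputed table;
-- $1:csv would be the list of requested Wikidata IDs.
SELECT wikidata_id AS id, data
  FROM public.simplified_geojson
 WHERE wikidata_id IN ($1:csv)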

  • Profile this transformation to determine the resource demands for these queries, and the resulting data size.
  • Profile the ST_AsGeoJSON step separately—this probably expands the data quite a bit. Is this an expensive call? How much bigger is the data?
  • Verify that Kartographer requests are never changing the query selection (currently simplifyarea, see also config.allowUserQueries), or the granularity constant (currently given as arg1 in the query, 0.001). Deprecate this feature either way, unless it's proven to be useful.
  • Look at the imposm job to see how we can hook into it, by processing either rows during or after the synchronization. We only want to simplify the changed entities.
  • Implement this simplification enhancement in the synchronization job. Create a new table with a unique index on wikidata_id, holding the resulting geojson or binary geodata (see the sketch after this list).
  • Update the service to use the simplified column (old and new data may overlap during the migration period).
  • Fully refresh the master database to have simplified data for all wikidata entities.
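
As a sketch of what the new table and refresh could look like (all names are illustrative; the source table wikidata_relation_polygon and the default granularity 0.001 are taken from elsewhere in this task, and the upsert syntax assumes PostgreSQL 9.5+):

-- Hypothetical precomputed-shape table; the primary key doubles as the
-- unique index on wikidata_id mentioned above.
CREATE TABLE IF NOT EXISTS simplified_geojson (
    wikidata_id text PRIMARY KEY,
    data        text NOT NULL   -- simplified shape, serialized as GeoJSON
);

-- Full refresh (or a per-entity upsert after an imposm sync), reusing the
-- existing transformation with the granularity fixed at the 0.001 default.
INSERT INTO simplified_geojson (wikidata_id, data)
SELECT id,
       ST_AsGeoJSON(
           ST_Transform(
               ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))),
               4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM wikidata_relation_polygon
         WHERE wikidata IS NOT NULL
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq
ON CONFLICT (wikidata_id) DO UPDATE SET data = EXCLUDED.data;

For the incremental case, the same statement could be restricted to the entities touched by the sync run, matching the requirement above to only simplify the changed entities.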

Outcome

  • simplified geojson takes ~50% of the space of the high-quality data:
      SELECT sum(pg_column_size(geometry)) FROM wikidata_relation_polygon;
      SELECT sum(length(data)), avg(length(data)) FROM public.simplified_geojson;
  • average size of a simplified geojson value: 2727
  • querying for the simplified version of every wikidata item in our example datasets (tested with ~1600) took ~2 seconds; normally only one wikidata item is queried at a time, which takes around 1 millisecond (see the timing sketch below)

--> adding a new table might not improve much, because the query is already pretty fast
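
For reference, one way the per-item timing could be reproduced (the exact commands behind the numbers above are not recorded here; 'Q64' is just an illustrative Wikidata ID, and the table name and granularity come from the queries above):

-- Measure the cost of the existing on-the-fly transformation for a single item.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id,
       ST_AsGeoJSON(
           ST_Transform(
               ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))),
               4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM wikidata_relation_polygon
         WHERE wikidata IN ('Q64')
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq;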

Event Timeline

lilients_WMDE set the point value for this task to 5.

So this means that raster tiles too would be generated based on inaccurate data? I doubt this is desirable. Don't we also need unsimplified data so that T155919 could be improved in some form?

I'm not sure this is related. As far as I'm aware, we don't generate "raster tiles" with geoshapes, only small-scale thumbnail maps (called "snapshots" in the code). These can't be scaled.

But I see that even a static thumbnail might reveal the inaccuracies of a simplified shape when the coordinates and zoom factor are chosen so that only a small part of the shape is visible.

One option is to keep both the original and simplified shapes in the database and use one or the other, depending on the situation.

Perhaps Pikne is talking about Tegola; I think it's correct that we need to keep the original, high-resolution map data. In that case the optimization described by this task would take *extra* database storage space, but could still result in big DB CPU and disk-access savings.

My understanding of how this all works is vague. If simplified data is kept in addition to unsimplified data then I suppose I got the wrong idea what this task is about.

The questions are much appreciated! I'm also vague on whether the change will be helpful, and in what ways; hopefully the investigation can answer that. The main point is that an expensive-looking query is executed for every shape request to build the geojson, but this query seems to be constant and to return the same simplified shape every time. Our initial estimate is that a query fetching pre-processed geojson indexed by Wikidata ID would be at least one order of magnitude cheaper for the DB.

Saving on storage was another potential benefit, but I think you're right that Tegola should have access to the high-resolution shape data in its original format. Thank you for suggesting this before we did any work to deprecate the data :-D. Keeping the original data also gives us a simple way to run both algorithms side by side, to roll back, and to avoid any nastiness from unexpected incompatibilities between the simplified data and other consumers.

lilients_WMDE updated the task description. (Show Details)
lilients_WMDE subscribed.
awight renamed this task from Keep simplified geoshapes in maps database to Investigation: Keep simplified geoshapes in maps database. Nov 8 2022, 8:07 AM