The geoshapes service currently runs a transformation every time a client requests geoshapes:
SELECT id, ST_AsGeoJSON(ST_Transform(ST_Simplify(geometry, $3*sqrt(ST_Area(ST_Envelope(geometry)))), 4326)) AS data
FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
    FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
        FROM $1~
        WHERE wikidata IN ($2:csv) AND GeometryType(geometry) != 'POINT'
    ) combq
    GROUP BY id
) subq
This seems like an expensive query, and it might be more efficient to run the transformation only once, at import time. The dynamic granularity "$3" might never be changed from its default in production requests.
- Profile this transformation to determine the resource demands of these queries and the size of the resulting data (a profiling sketch follows this list).
- Profile the ST_AsGeoJSON step separately; it probably expands the data quite a bit. Is this an expensive call? How much bigger is the data? (A size-comparison sketch follows this list.)
- Verify that Kartographer requests never change the query selection (currently simplifyarea, see also config.allowUserQueries) or the granularity constant (currently given as arg1 in the query, 0.001). In either case, deprecate this feature unless it proves to be useful.
- Look at the imposm job to see how we can hook into it, processing rows either during or after the synchronization. We only want to simplify the changed entities.
- Implement this simplification as part of the synchronization job. Create a new table with a unique index on wikidata_id and a column holding the resulting geojson or binary geodata (a schema sketch follows this list).
- Update the service to read the simplified column (possibly keeping old and new data available in parallel during the migration period); a query sketch follows this list.
- Fully refresh the master database so that simplified data exists for all wikidata entities (a backfill sketch follows this list).
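A possible way to profile the transformation is to run the per-request query under EXPLAIN ANALYZE directly against the source table. This is only a sketch: wikidata_relation_polygon and the 0.001 granularity are taken from elsewhere in this task, and the Q-ids are placeholder examples.

-- Per-step timing and row counts for the on-the-fly transformation
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, ST_AsGeoJSON(ST_Transform(ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))), 4326)) AS data
FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
    FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
        FROM wikidata_relation_polygon
        WHERE wikidata IN ('Q64', 'Q1055')  -- placeholder ids, substitute real test entities
          AND GeometryType(geometry) != 'POINT'
    ) combq
    GROUP BY id
) subq;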
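To isolate the ST_AsGeoJSON expansion, one rough approach is to compare the stored geometry size with the length of its GeoJSON text per entity (again assuming wikidata_relation_polygon as the source table):

-- Stored geometry size vs. size of the GeoJSON text, per entity
SELECT wikidata,
       pg_column_size(geometry)                            AS geometry_bytes,
       length(ST_AsGeoJSON(ST_Transform(geometry, 4326)))  AS geojson_chars
FROM wikidata_relation_polygon
WHERE wikidata IS NOT NULL
LIMIT 100;

Running this with \timing enabled in psql would also give a first impression of whether the ST_AsGeoJSON call itself is expensive.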
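A minimal schema sketch for the pre-computed table, using the simplified_geojson name that appears in the outcome below; column names and types are assumptions:

CREATE TABLE simplified_geojson (
    wikidata_id text NOT NULL,  -- Wikidata Q-id
    data        text NOT NULL   -- pre-computed, simplified GeoJSON (a geometry column would work for binary geodata instead)
);
CREATE UNIQUE INDEX simplified_geojson_wikidata_id ON simplified_geojson (wikidata_id);

The full refresh could then be a one-off backfill that applies the same transformation to every existing entity:

-- One-off backfill of simplified data for all wikidata entities
INSERT INTO simplified_geojson (wikidata_id, data)
SELECT id, ST_AsGeoJSON(ST_Transform(ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))), 4326))
FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
    FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
        FROM wikidata_relation_polygon
        WHERE wikidata IS NOT NULL
          AND GeometryType(geometry) != 'POINT'
    ) combq
    GROUP BY id
) subq;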
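With the pre-computed table in place, the per-request query in the service would shrink to a plain indexed lookup, roughly like the following (keeping the pg-promise-style placeholder for the id list):

-- Per-request lookup against the pre-simplified data
SELECT wikidata_id AS id, data
FROM simplified_geojson
WHERE wikidata_id IN ($1:csv);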
Outcome
- simplified geojson takes ~50% of the space of the original high-quality geometry data, measured with:
SELECT sum(pg_column_size(geometry)) FROM wikidata_relation_polygon;
SELECT sum(length(data)), avg(length(data)) FROM public.simplified_geojson;
- average size of the simplified geojson: 2727 characters
- querying the simplified version of every wikidata item in our example datasets (~1600 items) took ~2 seconds; normally only one wikidata item is queried at a time, which takes around 1 millisecond
--> adding a new table might not improve things much, because the existing query is already quite fast