
Investigation: Keep simplified geoshapes in maps database
Closed, Resolved · Public · 5 Estimated Story Points

Description

The geoshapes service currently runs a transformation every time a client requests geoshapes:

-- $1~ is the source table name, $2:csv the list of requested Wikidata IDs,
-- and $3 the simplification granularity (default 0.001).
SELECT id, ST_AsGeoJSON(ST_Transform(ST_Simplify(geometry, $3*sqrt(ST_Area(ST_Envelope(geometry)))), 4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM $1~
         WHERE wikidata IN ($2:csv)
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq

This seems like an expensive query, and it might be more efficient to run it only once, at import time. The dynamic granularity "$3" might never be changed from the default in production requests.
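
If the simplified GeoJSON were precomputed, the per-request work would shrink to an indexed lookup. A minimal sketch of such a query, assuming a precomputed table shaped like the public.simplified_geojson table referenced in the outcome below (table and column names are illustrative, not a final schema):

-- Hypothetical per-request lookup against a precomputed table;
-- $1:csv would be the list of requested Wikidata IDs.
SELECT wikidata_id AS id, data
  FROM public.simplified_geojson
 WHERE wikidata_id IN ($1:csv)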

  • Profile this transformation to determine the resource demands for these queries, and the resulting data size.
  • Profile the ST_AsGeoJSON step separately—this probably expands the data quite a bit. Is this an expensive call? How much bigger is the data?
  • Verify that Kartographer requests are never changing the query selection (currently simplifyarea, see also config.allowUserQueries), or the granularity constant (currently given as arg1 in the query, 0.001). Deprecate this feature either way, unless it's proven to be useful.
  • Look at the imposm job to see how we can hook into it, by processing either rows during or after the synchronization. We only want to simplify the changed entities.
  • Implement this simplification enhancement in the synchronization job. Create a new table with a unique index on wikidata_id, holding the resulting geojson or binary geodata (see the sketch after this list).
  • Update the service to use the simplified column (old and new data may overlap during the migration period).
  • Fully refresh the master database to have simplified data for all wikidata entities.
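
As a sketch of what the new table and refresh could look like (all names are illustrative; the source table wikidata_relation_polygon and the default granularity 0.001 are taken from elsewhere in this task, and the upsert syntax assumes PostgreSQL 9.5+):

-- Hypothetical precomputed-shape table; the primary key doubles as the
-- unique index on wikidata_id mentioned above.
CREATE TABLE IF NOT EXISTS simplified_geojson (
    wikidata_id text PRIMARY KEY,
    data        text NOT NULL   -- simplified shape, serialized as GeoJSON
);

-- Full refresh (or a per-entity upsert after an imposm sync), reusing the
-- existing transformation with the granularity fixed at the 0.001 default.
INSERT INTO simplified_geojson (wikidata_id, data)
SELECT id,
       ST_AsGeoJSON(
           ST_Transform(
               ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))),
               4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM wikidata_relation_polygon
         WHERE wikidata IS NOT NULL
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq
ON CONFLICT (wikidata_id) DO UPDATE SET data = EXCLUDED.data;

For the incremental case, the same statement could be restricted to the entities touched by the sync run, matching the requirement above to only simplify the changed entities.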

Outcome

  • simplified geojson takes ~50% of the space of the high-quality data:
      SELECT sum(pg_column_size(geometry)) FROM wikidata_relation_polygon;
      SELECT sum(length(data)), avg(length(data)) FROM public.simplified_geojson;
  • average size of a simplified geojson value: 2727
  • querying for the simplified version of every wikidata item in our example datasets (tested with ~1600) took ~2 seconds; normally only one wikidata item is queried at a time, which takes around 1 millisecond (see the timing sketch below)

--> adding a new table might not improve much, because the query is already pretty fast
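
For reference, one way the per-item timing could be reproduced (the exact commands behind the numbers above are not recorded here; 'Q64' is just an illustrative Wikidata ID, and the table name and granularity come from the queries above):

-- Measure the cost of the existing on-the-fly transformation for a single item.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id,
       ST_AsGeoJSON(
           ST_Transform(
               ST_Simplify(geometry, 0.001 * sqrt(ST_Area(ST_Envelope(geometry)))),
               4326)) AS data
  FROM (
    SELECT id, ST_Multi(ST_Collect(geometry)) AS geometry
      FROM (
        SELECT wikidata AS id, (ST_Dump(geometry)).geom AS geometry
          FROM wikidata_relation_polygon
         WHERE wikidata IN ('Q64')
           AND GeometryType(geometry) != 'POINT'
      ) combq
     GROUP BY id
  ) subq;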

Event Timeline

lilients_WMDE set the point value for this task to 5.

So this means that raster tiles too would be generated based on inaccurate data? I doubt this is desirable. Don't we also need unsimplified data so that T155919 could be improved in some form?

I'm not sure this is related. As far as I'm aware, we don't generate "raster tiles" with geoshapes, only small-scale thumbnail maps (called "snapshots" in the code). These can't be scaled.

But I see that even a static thumbnail might reveal the inaccuracies of a simplified shape when the coordinates and zoom factor are chosen so that only a small part of the shape is visible.

One option is to keep both the original and simplified shapes in the database and use one or the other, depending on the situation.

Perhaps Pikne is talking about Tegola; I think it's correct that we need to keep the original, high-resolution map data. In that case the optimization described by this task would take *extra* database storage space, but could still result in big DB CPU and disk-access savings.

My understanding of how this all works is vague. If simplified data is kept in addition to unsimplified data then I suppose I got the wrong idea what this task is about.

The questions are much appreciated! I'm also vague on whether the change will be helpful, and in what ways; hopefully the investigation can answer that. The main point is that an expensive-looking query is executed for every shape request to build the geojson, but this query seems to be constant and to return the same simplified shape every time. Our initial estimate is that a query fetching pre-processed geojson indexed by Wikidata ID would be at least one order of magnitude cheaper for the DB.

Saving on storage was another potential benefit, but I think you're right that Tegola should have access to the high-resolution shape data in its original format. Thank you for suggesting this before we did any work to deprecate the data :-D. Keeping the original data also gives us a simple way to run both algorithms side by side, to roll back, and to avoid any nastiness from unexpected incompatibilities between the simplified data and other consumers.

lilients_WMDE updated the task description. (Show Details)
lilients_WMDE subscribed.
awight renamed this task from Keep simplified geoshapes in maps database to Investigation: Keep simplified geoshapes in maps database. Nov 8 2022, 8:07 AM