Page MenuHomePhabricator

Adding tags hstore GIN indexes to the OSM database on osmdb.eqiad.wmnet for performance
Open, Needs TriagePublic

Description

Short version

I'd like somebody to add these indexes to osmdb.eqiad.wmnet

CREATE INDEX planet_osm_polygon_tags_index ON planet_osm_polygon USING gin (tags)
CREATE INDEX planet_osm_point_tags_index ON planet_osm_point USING gin (tags)
CREATE INDEX planet_osm_line_tags_index ON planet_osm_line USING gin (tags)

It would speed up database queries based on OSM tags.

Long version

I'd like to switch to using the Wikimedia OSM tile server database running on osmdb.eqiad.wmnet for my OSM/Wikidata matcher.

I don't have the resources to maintain my own OSM tile server database, so currently my code uses the Overpass API to download the requested area and load it with osm2pgsql into a temporary database, then use that database to look for matches.

This works but it is can take a few minutes to download data from the Overpass API. My code splits large areas into smaller chunks for download to avoid triggering a timeout. I don't import roads because it would make my database too big, so I'm missing some matches.

Using the Wikimedia tile server database means I can avoid the OSM data loading stage of the matching process. This makes the software simpler, there is less to go wrong, and it finds more matches.

The problem with using the Wikimedia tile server database is that the tags hstore field isn't indexed. My queries to look for matches are slow. Most queries take a second, some can take 10 seconds. To process an area with 5,000 matches would take over 80 minutes. This is slower than my existing implementation.

When I download from Overpass and import the OSM data into PostGIS I pass a parameter to osm2pgsql so it creates indexes on the tags hstore field. My database queries are much quicker when the tags field is indexed.

It would be great if osmdb.eqiad.wmnet could have the tags field indexed, so querying by tag is fast. The existing database can be modified to add the indexes like this:

CREATE INDEX planet_osm_polygon_tags_index ON planet_osm_polygon USING gin (tags)
CREATE INDEX planet_osm_point_tags_index ON planet_osm_point USING gin (tags)
CREATE INDEX planet_osm_line_tags_index ON planet_osm_line USING gin (tags)

This will hold a write lock on the database table, it might be a good idea to pause the process that loads the planet diffs from OSM. The tables are still available to read while the indexes are created.

A different approach would be to create a new database from a planet dump using osm2pgsql. If the parameter --hstore-add-index is passed to osm2pgsql it will create the appropriate indexes.

If the database is being recreated with osm2pgsql it might also be good to add the --hstore-all parameter. This change would make the database bigger because some data would be duplicated, but it makes the database simpler to query because all tag information is stored in the tags hstore field. For example with this change the name tag would be stored in the name column and the tags field.

Thanks for taking the time to read my change request.

Event Timeline

OSM is PostreSQL which is not maintained by the DBA team.

This should be pretty easy in theory for the 'fresh' script but...... @MSantos eeh.. any idea how to puppetize a database scheme change like this ?

In Kartotherian we add indexes like that during initial import.

You can also add SQL to the files used by OSM databases and create a script to execute the index creations as we do to ensure grants.

Someone with the production credentials still need to run the script as root/postgres or any DB user with the DB level permission to create indexes, and the script should be added to the same script that created the database in the first place. I'm not aware if it's using the same settings available for the kartotherian stack.