
Failed to run OSM planet import on eqiad
Closed, ResolvedPublic

Description

Here is a quick description of the current status:

  • Because of https://phabricator.wikimedia.org/T296021 we manually fixed the borders, and we decided to re-import everything to fix potential inconsistencies across the whole planet
  • We disabled OSM sync and kartotherian to free up some resources
  • We triggered the osm-initial-import
  • It failed with some PG errors (a ticket with the findings still needs to be filed)

Related logs from postgres master on maps1009:

2022-01-13 18:34:52 GMT [28131]: [3-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 WARNING:  terminating connection because of crash of another server process
2022-01-13 18:34:52 GMT [28131]: [4-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-01-13 18:34:52 GMT [28131]: [5-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-01-13 18:34:52 GMT [28155]: [3-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 WARNING:  terminating connection because of crash of another server process
2022-01-13 18:34:52 GMT [28155]: [4-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-01-13 18:34:52 GMT [28155]: [5-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-01-13 18:34:52 GMT [28314]: [3-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 WARNING:  terminating connection because of crash of another server process
2022-01-13 18:34:52 GMT [28314]: [4-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly

Imposm failure:

Jan 13 18:34:53 [2022-01-13T18:34:53Z] 20:34:49 pq: the database system is in recovery mode
Jan 13 18:34:54 imposm3 failed to complete initial import

Also some failures from kartotherian (geoshapes hitting PG):

error: remaining connection slots are reserved for non-replication superuser connections
    at Connection.parseE (/srv/deployment/kartotherian/deploy-cache/revs/65895c017dbd85ceddbb950b89a25a159b551212/node_modules/pg/lib/connection.js:539:11)
    at Connection.parseMessage (/srv/deployment/kartotherian/deploy-cache/revs/65895c017dbd85ceddbb950b89a25a159b551212/node_modules/pg/lib/connection.js:366:17)
    at Socket.<anonymous> (/srv/deployment/kartotherian/deploy-cache/revs/65895c017dbd85ceddbb950b89a25a159b551212/node_modules/pg/lib/connection.js:105:22)
    at Socket.emit (events.js:198:13)
    at Socket.EventEmitter.emit (domain.js:448:20)
    at addChunk (_stream_readable.js:288:12)
    at readableAddChunk (_stream_readable.js:269:11)
    at Socket.Readable.push (_stream_readable.js:224:10)
    at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)

From grafana related to pg connections:
https://grafana.wikimedia.org/goto/J77_mTJ7k

Things to investigate:

  • How do we deal with the failing OSM import? This is the main issue, and the import needs to happen soon.
  • How can we avoid PG connection starvation on OSM masters?
  • Is this issue somehow related to PG replication?
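On the connection-starvation question: the kartotherian error above ("remaining connection slots are reserved for non-replication superuser connections") is what PostgreSQL returns once non-superuser clients have used up max_connections minus superuser_reserved_connections slots. A minimal sketch of that arithmetic (the actual values configured on maps1009 are not shown in this task, so the numbers below are PostgreSQL's defaults, used for illustration):

```python
def available_nonsuperuser_slots(max_connections: int,
                                 superuser_reserved: int,
                                 current_connections: int) -> int:
    """Slots left for ordinary (non-superuser) roles such as kartotherian.

    PostgreSQL holds back `superuser_reserved_connections` slots out of
    `max_connections` for superusers, so ordinary clients are refused
    before `max_connections` is actually reached.
    """
    return max(0, (max_connections - superuser_reserved) - current_connections)


# With PostgreSQL defaults (max_connections=100, superuser_reserved_connections=3),
# ordinary clients start being refused at 97 concurrent connections:
print(available_nonsuperuser_slots(100, 3, 97))   # 0 -> new kartotherian connections fail
print(available_nonsuperuser_slots(100, 3, 50))   # 47 slots still free
```

This is why running the geoshapes pool, imposm3, and replication against the same master can starve the non-superuser slots even while superusers can still connect.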

Event Timeline

Potential next steps to improve the situation:

  • Disable kartotherian on maps1009, maps2009 so we don't get any connections to maps PG masters
  • Disable Cassandra, which is unused anyway, so it doesn't consume a significant amount of memory for no reason
    • Not sure if/how this could contribute to the issue, but better to be on the safe side
  • Check if/how PG replication affects the imposm import
  • Re-run the osm-initial-import script

It might also be worth revisiting the following, given that we now have much more memory available:

  • Max PG connections allowed
  • PG buffer size
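For reference, these two knobs live in postgresql.conf. The values below are purely illustrative (not the current maps1009 configuration) and would need to be sized against the hosts' actual RAM:

```
# postgresql.conf -- illustrative values only, not the maps1009 settings
max_connections = 300                # raise if non-superuser slots keep running out
superuser_reserved_connections = 3   # slots held back for superusers
shared_buffers = 16GB                # often sized at roughly 25% of system RAM
```

Note that each connection carries per-backend memory overhead, so raising max_connections and shared_buffers together needs to stay within the available memory.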

Change 757427 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/kartotherian@master] Make geoshapes connection pool size configurable

https://gerrit.wikimedia.org/r/757427

This seems relevant: https://github.com/omniscale/imposm3/issues/170

Import is all done in a single transaction (with a COPY FROM data stream). If the connection is dropped, the transaction is lost and Imposm can't continue with the import.

This seems rather problematic for an operation that can sometimes take around 50 hours to complete.
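The failure mode can be illustrated with any transactional database; the sketch below uses Python's stdlib sqlite3 as a stand-in for PostgreSQL (file path and table name are hypothetical). Everything written inside an uncommitted transaction vanishes when the connection dies, which is why imposm3 cannot resume a partial import:

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch database standing in for the gis database.
path = os.path.join(tempfile.mkdtemp(), "planet_demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE osm_point (id INTEGER)")
conn.commit()

# Simulate the long-running import: all rows go in via one transaction
# (sqlite3 opens it implicitly on the first INSERT).
for i in range(1000):                 # stands in for ~50 hours of COPY FROM
    conn.execute("INSERT INTO osm_point VALUES (?)", (i,))
conn.close()                          # connection dies before COMMIT

# On reconnect the transaction has been rolled back: zero rows survive,
# so the import cannot resume and must start over from scratch.
conn = sqlite3.connect(path)
print(conn.execute("SELECT count(*) FROM osm_point").fetchone()[0])  # 0
```

So any disturbance on the master during the import window (here, the postmaster restarting after a backend crash) forces a full restart, which makes keeping the masters as quiet as possible during the import all the more important.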

Change 757427 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Make geoshapes connection pool size configurable

https://gerrit.wikimedia.org/r/757427

Jgiannelos claimed this task.