
Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues.
Closed, Resolved (Public)

Description

While @MSantos continues refining imposm etc., let's do a fresh OSM import on the maps servers to fix the replication issue.

After a detailed session with Guillaume (thanks, Guillaume), I discovered this process is more complicated than I envisaged.

Current problem:
Replication started failing around 2019-10-27 (the date is approximate, see https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&fullscreen&panelId=11&from=now-90d&to=now). This means tilerator activity has stopped since then and no new tile list has been produced, so all our servers have tiles only up to 2019-10-27.

Way forward:
What we are proposing is to start the OSM import from a point before replication started failing, i.e. before 2019-10-27. This will ensure tiles are up to date. To achieve this in our current state, we should disable tilerator on all the slave nodes before starting the OSM import and ensure it stays disabled during and after the import. Then we should re-init postgres on each slave, one at a time, and then re-enable tilerator to start tile generation from an updated postgres.

More problem (not serious)
I had already started the OSM import without all these considerations and had to stop it. So maps1004 is out and we are currently running eqiad on three nodes. My bad :(

Steps/Processes:
At each DC (starting from eqiad) maps1004

  • Downtime tilerator checks on icinga for all slaves.
  • Stop tilerator on all slaves and disable the systemd unit. Make sure it stays disabled. Ideally, we should mask this service but not sure if puppet will honour this.
  • Depool maps1004 and disable puppet on maps1004(master).
  • Reset postgres on maps1004. Make sure tilerator is running. Disable osm-replicate crontask. (This should be enabled after osm-import is complete)
  • Start osm-import script using a dump before 2019-10-27 and an even older state file.
  • Continue to monitor the import while making sure tilerator on slaves stays dead.
  • When osm-import is completed, make sure tiles are being generated. This can be checked via tileratorUI.
  • Enable replicate-osm crontask
  • Pool maps1004(master)
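
The "make sure tilerator on slaves stays dead" step can be checked periodically rather than by hand. The following is only a minimal sketch, assuming the systemd unit is named `tilerator` and that the hosts are reachable over plain `ssh`; both are assumptions, and the helper names are hypothetical. It relies on `systemctl is-active`, which prints the unit state on stdout.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

# A runner takes a command (argv list) and returns its stdout as a string.
Runner = Callable[[List[str]], str]


def ssh_runner(cmd: List[str]) -> str:
    """Run a command and return its stripped stdout.

    `systemctl is-active` exits non-zero for units that are not active,
    so a non-zero exit status is deliberately not treated as an error.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip()


def tilerator_states(
    hosts: Iterable[str],
    run: Runner = ssh_runner,
    unit: str = "tilerator",  # assumed unit name
) -> Dict[str, str]:
    """Return {host: state} as reported by `systemctl is-active` on each host."""
    return {host: run(["ssh", host, "systemctl", "is-active", unit]) for host in hosts}


def hosts_still_running(states: Dict[str, str]) -> List[str]:
    """Hosts where the unit is active or starting, i.e. where it must be stopped again."""
    return sorted(h for h, s in states.items() if s in ("active", "activating"))
```

Injecting the runner keeps the check testable without touching real hosts. In practice the list returned by `hosts_still_running` would be the hosts to look at while the import is in progress.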

On slaves (maps1001):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

On slaves (maps1002):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

On slaves (maps1003):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

Codfw:

  • Downtime tilerator checks on icinga for all slaves.
  • Stop tilerator on all slaves and disable the systemd unit. Make sure it stays disabled. Ideally, we should mask this service but not sure if puppet will honour this.
  • Depool maps2004 and disable puppet on maps2004(master).
  • Reset postgres on maps2004. Make sure tilerator is running. Disable osm-replicate crontask. (This should be enabled after osm-import is complete)
  • Start osm-import script using a dump before 2019-10-27 and an even older state file.
  • Continue to monitor the import while making sure tilerator on slaves stays dead.
  • When osm-import is completed, make sure tiles are being generated. This can be checked via tileratorUI.
  • Enable replicate-osm crontask
  • Pool maps2004(master)

On slaves (maps2001):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

On slaves (maps2002):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

On slaves (maps2003):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.

Event Timeline

A few more comments inline.

Steps/Processes:
At each DC (starting from eqiad)

  • Downtime tilerator checks on icinga for all slaves.

Also disable the postgres lag checks on slaves; those will alert once the master is down.

  • Stop tilerator on all slaves and disable the systemd unit. Make sure it stays disabled. Ideally, we should mask this service but not sure if puppet will honour this.

Stop puppet on all servers during the operation.

  • Depool maps1004 and disable puppet on maps1004(master).
  • Reset postgres on maps1004. Make sure tilerator is running. Disable osm-replicate crontask. (This should be enabled after osm-import is complete)

Disabling the cron tasks is a puppet change and needs to be done before disabling puppet.

  • Start osm-import script using a dump before 2019-10-27 and an even older state file.

The state file and the dump should be synchronized on the same date.
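
A quick pre-flight check for this point could look like the sketch below. It assumes the state file is an osmosis-style `state.txt` in Java properties format, where the `timestamp=` value has its colons escaped as `\:`. The function names are hypothetical, and "synchronized" is interpreted here simply as "same UTC day", which is an assumption about what is acceptable.

```python
from datetime import date, datetime, timezone


def parse_state_timestamp(state_text: str) -> datetime:
    """Extract the `timestamp=` value from an osmosis-style state.txt."""
    for line in state_text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            # Java properties format escapes colons as "\:", so undo that first.
            raw = line.split("=", 1)[1].replace("\\:", ":")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state file")


def state_matches_dump(state_text: str, dump_date: date) -> bool:
    """True when the state file timestamp falls on the same UTC day as the dump."""
    return parse_state_timestamp(state_text).date() == dump_date
```

Running `state_matches_dump` against the downloaded state file and the date of the chosen dump before starting osm-import would catch a mismatched pair early.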

  • Continue to monitor the import while making sure tilerator on slaves stays dead.
  • When osm-import is completed, make sure tiles are being generated. This can be checked via tileratorUI.
  • Enable replicate-osm crontask
  • Pool maps1004(master)

On slaves (one at a time):

  • Depool the slave
  • downtime all alerts on the host and disable puppet.
  • Re-init postgres.

These processes can be done via the postgres reinit cookbook.

  • Enable tilerator after re-initialization is completed.
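
Putting the steps and the inline comments together, the ordering for one DC can be written down as data, which makes the constraints (cron change before puppet is disabled, master before slaves, slaves strictly one at a time) easy to review. This is only an illustrative sketch: the function name is hypothetical, the step texts are paraphrases of this task, and the final "pool" step for each slave is inferred rather than stated in the list above.

```python
from typing import List


def import_plan(master: str, replicas: List[str]) -> List[str]:
    """Return the ordered steps for one datacenter as plain-text descriptions."""
    all_replicas = ", ".join(replicas)
    steps = [
        f"downtime tilerator and postgres lag checks on {all_replicas}",
        f"stop and disable tilerator on {all_replicas}",
        # The cron change goes through puppet, so it must land before puppet is disabled.
        "merge the puppet change that disables the osm replication cron",
        f"disable puppet on {', '.join([master] + replicas)}",
        f"depool {master}",
        f"reset postgres on {master}",
        f"run osm-import on {master} with a dump and state file from the same date",
        "check tile generation, then re-enable the replication cron (also a puppet change)",
        f"pool {master}",
    ]
    for replica in replicas:  # strictly one slave at a time
        steps += [
            f"depool {replica}",
            f"downtime all alerts on {replica}",
            f"re-init postgres on {replica}",
            f"enable tilerator on {replica}",
            f"pool {replica}",  # inferred step, not in the original list
        ]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(import_plan("maps1004", ["maps1001", "maps1002", "maps1003"]), 1):
        print(f"{number:2d}. {step}")
```

The same function can be called with `maps2004` and `maps200[1-3]` for codfw once eqiad is done.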

Change 554860 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] maps: disable replication cron

https://gerrit.wikimedia.org/r/554860

Change 554860 merged by Gehel:
[operations/puppet@production] maps: disable replication cron

https://gerrit.wikimedia.org/r/554860

Mentioned in SAL (#wikimedia-operations) [2019-12-05T14:51:55Z] <onimisionipe> disable tilerator on maps100[1-3].eqiad.wmnet - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-05T14:52:30Z] <onimisionipe> disable puppet on maps100[1-3].eqiad.wmnet - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-05T15:16:14Z] <onimisionipe> run osm-import on maps1004 - T239728

I'm getting:

CREATE FUNCTION
executing SQL in: /srv/deployment/kartotherian/deploy/node_modules/@kartotherian/geoshapes/sql
  executing: /srv/deployment/kartotherian/deploy/node_modules/@kartotherian/geoshapes/sql/create-indexes.sql
DO
DO
DO
ERROR:  Relate Operation called with a LWGEOMCOLLECTION type.  This is unsupported.
HINT:  Change argument 2: 'GEOMETRYCOLLECTION(POINT(2336431.35890444 7641045.27589611),LINESTRING(233584...'
CONTEXT:  PL/pgSQL function populate_admin() line 29 at IF

before replicate-osm starts. @MSantos can you look into this

Mentioned in SAL (#wikimedia-operations) [2019-12-09T19:01:20Z] <onimisionipe> continue osm-import on maps1004 - T239728

Change 556525 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] maps: enable replication cron

https://gerrit.wikimedia.org/r/556525

Change 556525 merged by Gehel:
[operations/puppet@production] maps: enable replication cron

https://gerrit.wikimedia.org/r/556525

Mentioned in SAL (#wikimedia-operations) [2019-12-12T14:30:05Z] <onimisionipe> pool maps1004 osm-import is complete - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-12T14:40:59Z] <onimisionipe> depool maps1001 for postgres reinitialization - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-12T20:02:55Z] <onimisionipe> pool maps1001 - postgres re-init is complete - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-13T07:56:01Z] <onimisionipe> depool maps1002 for postgres init. - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-13T14:37:06Z] <onimisionipe> pool maps1002 after postgres init - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-13T14:51:12Z] <onimisionipe> depool maps1003 after postgres init - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-13T21:29:54Z] <onimisionipe> disabled tilerator on maps200[1-3] - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-13T21:31:52Z] <onimisionipe> depool maps2004 for osm initial import - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-16T06:59:48Z] <onimisionipe> pool maps2004. osm import is complete - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-16T07:09:22Z] <onimisionipe> depool maps2001 for postgres reinit - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:04:42Z] <onimisionipe> depool maps2002 for postgres init - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:48:01Z] <onimisionipe> pool maps2002. Postgres init is complete - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:50:38Z] <onimisionipe> depool maps2003 for postgres init - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-17T15:14:25Z] <onimisionipe> pool maps2003 after postgres init - T239728

Mentioned in SAL (#wikimedia-operations) [2019-12-18T06:45:41Z] <onimisionipe> running replicate-osm on maps1004 after failed osm sync - T239728

Change 559158 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/puppet@production] Reduce osmosis maxInterval in half

https://gerrit.wikimedia.org/r/559158

Change 559158 merged by Gehel:
[operations/puppet@production] Reduce osmosis maxInterval in half

https://gerrit.wikimedia.org/r/559158

Mentioned in SAL (#wikimedia-operations) [2019-12-19T09:15:01Z] <onimisionipe> running maps osm-replicate process manually on maps1004 - T239728

Change 559442 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] Increase replication frequency

https://gerrit.wikimedia.org/r/559442

Change 559442 merged by Gehel:
[operations/puppet@production] maps: Increase replication frequency

https://gerrit.wikimedia.org/r/559442

Change 559581 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] maps: Use correct puppet cron syntax

https://gerrit.wikimedia.org/r/559581

Change 559581 merged by Gehel:
[operations/puppet@production] maps: Use correct puppet cron syntax

https://gerrit.wikimedia.org/r/559581

Change 560459 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] maps: Enable osm replication after state file update.

https://gerrit.wikimedia.org/r/560459

Change 560459 merged by Gehel:
[operations/puppet@production] maps: Enable osm replication after state file update.

https://gerrit.wikimedia.org/r/560459

Thanks to @Mathew.onipe, @Gehel and all others who helped in completing this task, I am able to get my recent OSM edits reflected on Wikipedia.

This comment was removed by Pikne.

Change 572313 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Fix a few invocations of osm::planet_sync

https://gerrit.wikimedia.org/r/572313

Change 572313 merged by Andrew Bogott:
[operations/puppet@production] Fix a few invocations of osm::planet_sync

https://gerrit.wikimedia.org/r/572313

Change 572316 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] osm::planet_sync:hours is a list of ints, not of strings

https://gerrit.wikimedia.org/r/572316

Change 572316 merged by Andrew Bogott:
[operations/puppet@production] osm::planet_sync:hours is a list of ints, not of strings

https://gerrit.wikimedia.org/r/572316