While @MSantos continues refining imposm etc., let's do a fresh OSM import on the maps servers to fix the replication issue.
After a detailed session with Guillaume (thanks, Guillaume), I discovered that this process is more complicated than I had envisaged.
Current problem:
As far as we can tell (not exactly), replication started failing around 2019-10-27 (https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&fullscreen&panelId=11&from=now-90d&to=now). This means tilerator activity has stopped since then and no new tile list has been produced, so all our servers have tiles only up to 2019-10-27.
Way forward:
What we are proposing is to start the OSM import from a point before replication started failing, i.e. before 2019-10-27. This will ensure tiles are up to date. To achieve this from our current state, we should disable tilerator on all the slave nodes before starting the OSM import and ensure it stays disabled during and after the import. Then we should re-init postgres on each slave, one at a time, and re-enable tilerator on it so that tile generation starts from an updated postgres.
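The ordering proposed above can be sketched as a small plan generator. This is only an illustration of the sequence, not a tool we actually run: the function name `import_plan` and the step wording are made up for this sketch, and the host names are the eqiad ones from this task.

```python
def import_plan(master, slaves):
    """Return the ordered list of (host, action) steps for one DC.

    Hypothetical helper: it only encodes the ordering proposed above
    and does not touch any host.
    """
    steps = []
    # 1. Tilerator must be off on every slave before the import starts.
    for host in slaves:
        steps.append((host, "stop and disable tilerator"))
    # 2. Fresh import on the master, from a dump older than the failure.
    steps.append((master, "run osm-import"))
    # 3. Slaves are re-initialised one at a time; tilerator comes back
    #    on a slave only after its postgres has been re-initialised.
    for host in slaves:
        steps.append((host, "re-init postgres"))
        steps.append((host, "enable tilerator"))
    return steps


if __name__ == "__main__":
    for host, action in import_plan("maps1004", ["maps1001", "maps1002", "maps1003"]):
        print(f"{host}: {action}")
```

The point of writing it this way is that no slave gets tilerator back before its own postgres has been re-initialised, and no slave is re-initialised before the master import step.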
Another problem (not serious):
I had already started an OSM import without all these considerations and had to stop it. So maps1004 is out and we are currently running eqiad on three nodes. My bad :(
Steps/Processes:
At each DC (starting with eqiad), on the master (maps1004):
[x] Downtime tilerator checks on icinga for all slaves.
[x] Stop tilerator on all slaves and disable the systemd unit. Make sure it stays disabled. Ideally, we should mask this service, but we are not sure whether puppet will honour this.
[x] Depool maps1004 and disable puppet on maps1004(master).
[x] Reset postgres on maps1004. Make sure tilerator is running. Disable the osm-replicate crontask. (This should be re-enabled after osm-import is complete.)
[x] Start osm-import script using a dump before 2019-10-27 and an even older state file.
[] Continue to monitor the import while making sure tilerator on the slaves stays dead.
[x] When osm-import is completed, make sure tiles are being generated. This can be checked via tileratorUI.
[x] Enable replicate-osm crontask
[x] Pool maps1004(master)
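The "dump before 2019-10-27 and an even older state file" requirement in the steps above can be written down as a quick sanity check. This is a minimal sketch: `check_import_inputs` is a hypothetical helper, and the dates would have to be read off the dump and state file by hand.

```python
from datetime import date

# Approximate date replication started failing, taken from this task.
REPLICATION_FAILURE = date(2019, 10, 27)


def check_import_inputs(dump_date, state_date, failure_date=REPLICATION_FAILURE):
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = []
    # The dump must predate the point where replication broke.
    if dump_date >= failure_date:
        problems.append("dump is not older than the replication failure")
    # The state file must be older than the dump, so that replication
    # replays everything the dump might be missing.
    if state_date >= dump_date:
        problems.append("state file is not older than the dump")
    return problems
```

For example, with made-up dates, `check_import_inputs(date(2019, 10, 21), date(2019, 10, 20))` reports no problems, while a state file newer than the dump is flagged.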
On slaves (maps1001):
[x] Depool the slave
[x] Downtime all alerts on the host and disable puppet.
[x] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[x] Enable tilerator after re-initialization is completed.
On slaves (maps1002):
[x] Depool the slave
[x] Downtime all alerts on the host and disable puppet.
[x] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[x] Enable tilerator after re-initialization is completed.
On slaves (maps1003):
[x] Depool the slave
[x] Downtime all alerts on the host and disable puppet.
[x] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[x] Enable tilerator after re-initialization is completed.
Codfw:
[x] Downtime tilerator checks on icinga for all slaves.
[x] Stop tilerator on all slaves and disable the systemd unit. Make sure it stays disabled. Ideally, we should mask this service, but we are not sure whether puppet will honour this.
[x] Depool maps2004 and disable puppet on maps2004(master).
[x] Reset postgres on maps2004. Make sure tilerator is running. Disable the osm-replicate crontask. (This should be re-enabled after osm-import is complete.)
[x] Start osm-import script using a dump before 2019-10-27 and an even older state file.
[x] Continue to monitor the import while making sure tilerator on the slaves stays dead.
[x] When osm-import is completed, make sure tiles are being generated. This can be checked via tileratorUI.
[x] Enable replicate-osm crontask
[x] Pool maps2004(master)
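For the "make sure tilerator on the slaves stays dead" step, a check like the following could be run on each slave while the import is going. It is a rough sketch under assumptions: the systemd unit is assumed to be called `tilerator`, the function names are made up, and it only looks at the local host via `systemctl is-active`.

```python
import subprocess


def unit_state(unit, run=subprocess.run):
    """Return the state printed by `systemctl is-active` for a unit."""
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # is-active exits non-zero when the unit is not active
    )
    return result.stdout.strip()


def tilerator_is_down(run=subprocess.run, unit="tilerator"):
    """True only when the (assumed) tilerator unit reports inactive or failed."""
    return unit_state(unit, run=run) in ("inactive", "failed")
```

The `run` parameter exists only so the check can be exercised without systemd. Anything other than `inactive` or `failed` (for example `activating`) is treated as "not down", which errs on the side of raising a false alarm rather than missing a restarted tilerator.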
On slaves (maps2001):
[] Depool the slave
[] Downtime all alerts on the host and disable puppet.
[] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[] Enable tilerator after re-initialization is completed.
On slaves (maps2002):
[] Depool the slave
[] Downtime all alerts on the host and disable puppet.
[] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[] Enable tilerator after re-initialization is completed.
On slaves (maps2003):
[] Depool the slave
[] Downtime all alerts on the host and disable puppet.
[] Re-init postgres.
These processes can be done via the postgres reinit cookbook.
[] Enable tilerator after re-initialization is completed.