
Increase frequency of OSM replication
Closed, Resolved · Public

Description

Currently, the importer causes one long spike per day. I propose that we change it to several shorter spikes. Hourly seems like a good compromise, or maybe every 15 minutes?

Event Timeline

Change 312241 had a related patch set uploaded (by Gehel):
maps - increase osm replication frequency to hourly

https://gerrit.wikimedia.org/r/312241

Change 312241 merged by Gehel:
maps - increase osm replication frequency to hourly

https://gerrit.wikimedia.org/r/312241

Replication frequency is set to 1 hour on the maps-test cluster. We can see that the server load average and IO peak every hour and barely have time to go back down before the next replication. We can also see that PostgreSQL replication often lags by more than 10 minutes. I have no idea what the cause of that is at the moment, but it looks like something that needs to be fixed before we enable hourly replication on production servers.

Better metrics and a better dashboard are required to have visibility into what is happening.
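
One rough way to put a number on "barely has time to go back down" is to compare how long each replication run keeps the load elevated against the interval between runs. The sketch below is illustrative only: the function name and the example durations are assumptions, not measurements from maps-test.

```python
def idle_headroom(interval_minutes: float, busy_minutes: float) -> float:
    """Return the fraction of each replication interval during which the server is idle.

    A value close to 0 means the load barely recovers before the next run;
    a negative value means consecutive runs would overlap.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return (interval_minutes - busy_minutes) / interval_minutes


# Hypothetical numbers: an hourly run that keeps load elevated for 50 minutes.
print(f"{idle_headroom(60, 50):.2f}")
```

With the same hypothetical 50 minutes of elevated load, a 15-minute interval would give a negative result, which is one reason to shorten each run before shortening the interval.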

Tilerator notification is failing regularly on the maps-test cluster, which is the cluster where hourly updates are enabled. This is correlation, not causation. Still, we should make sure the problem isn't related (my suspicion: it actually is).

Yurik edited projects, added Maps (Tilerator); removed Maps.

Based on T159631 (Tasmania is covered with water at z10+), we should switch to hourly diffs even if we don't change how often we update.

debt added a subscriber: debt.

Moving to prioritized as it's on our list of things that do need doing.

Still something we're interested in doing, but not sufficiently high-priority for Maps-Sprint.

Mholloway raised the priority of this task from Medium to High. Aug 14 2018, 11:11 PM
Mholloway added a project: Maps-Sprint.

Replication frequency is set to 1 hour on the maps-test cluster. We can see that the server load average and IO peak every hour and barely have time to go back down before the next replication. We can also see that PostgreSQL replication often lags by more than 10 minutes. I have no idea what the cause of that is at the moment, but it looks like something that needs to be fixed before we enable hourly replication on production servers.

Tilerator notification is failing regularly on the maps-test cluster, which is the cluster where hourly updates are enabled. This is correlation, not causation. Still, we should make sure the problem isn't related (my suspicion: it actually is).

@Gehel Are both of these still the case? (I found a Grafana dashboard for the production cluster[1], but not the maps-test cluster.)

[1] https://grafana.wikimedia.org/dashboard/db/maps-performances

Edit: Of course I found the control to switch to maps-test two seconds after posting...

It looks like, in the course of dealing with T239728 (Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues), replication was increased to twice daily on 2019/12/19 [ 1 ], then again to hourly on 2019/12/24 [ 1, 2 ], then switched off on 2020/01/24 due to T243609 (Maps master servers running out of space).

Mholloway changed the task status from Open to Stalled. Mar 4 2020, 5:30 PM
Mholloway moved this task from Tracking to Backlog on the Product-Infrastructure-Team-Backlog board.

Moving this out of Tracking and to the Backlog since PI engineers are actually involved with it.

Changing the status to Stalled to reflect reality.

MSantos changed the task status from Stalled to Open. Apr 1 2020, 1:06 PM
MSantos added a subscriber: MSantos.

Next step is:

  • Tweak hourly replication rate and monitor disk usage /+/581636
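
Monitoring disk usage could start from a periodic check like the sketch below. It uses only Python's standard `shutil.disk_usage`; the path and the 80% threshold are placeholders chosen for illustration, not values taken from the maps servers.

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_over_threshold(path: str, threshold: float = 0.8) -> bool:
    """Return True when usage exceeds `threshold`. The 0.8 default is an arbitrary placeholder."""
    return disk_used_fraction(path) > threshold


if __name__ == "__main__":
    # "." is a placeholder; a real check would point at the PostgreSQL data volume.
    print(f"used: {disk_used_fraction('.'):.1%}, over threshold: {is_over_threshold('.')}")
```

Running such a check after each replication run, rather than on a fixed timer, would tie any growth in disk usage to a specific import.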

This task is finished and working on the codfw cluster, but the scripts are disabled in eqiad. For that part of the work, please refer to T254014 (Reimport OSM data on eqiad). If you have questions or concerns, please reopen it.