Description

Currently, the importer causes a long spike once per day. I propose that we change it to several shorter spikes: hourly seems like a good compromise, or maybe every 15 minutes?
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +10 -4 | maps - increase osm replication frequency to hourly
Status | Assigned | Task
---|---|---
Resolved | RKemper | T137939 Increase frequency of OSM replication
Resolved | Gehel | T147194 reimage maps-test* servers
Resolved | Gehel | T148031 Maps - error when doing initial tiles generation: "Error: could not create converter for SQL_ASCII"
Resolved | Gehel | T148114 Maps-test was created with incorrect initial encoding
Resolved | MaxSem | T145534 maps - tilerator notification seems stuck on sorting files
Event Timeline
Change 312241 had a related patch set uploaded (by Gehel):
maps - increase osm replication frequency to hourly
Replication frequency is set to 1 hour on the maps-test cluster. We can see that the server load average and IO peak every hour and barely have time to go back down before the next replication. We can also see that PostgreSQL replication often lags by more than 10 minutes. I have no idea what the cause of that is at the moment, but it looks like something that needs to be fixed before we enable this on the production servers.
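For reference, a minimal sketch of how that replay lag could be measured on a replica, assuming psycopg2 is available; the connection parameters are hypothetical placeholders, not the actual maps-test configuration:

```python
# Minimal sketch: measure PostgreSQL streaming-replication lag on a standby.
# Host/database/user below are hypothetical, not the real maps-test settings.
import psycopg2


def replication_lag_seconds(dsn: str) -> float:
    """Return the standby's apparent replay lag in seconds (0 if caught up)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() is the commit time of the last
            # transaction replayed on this standby; comparing it to now()
            # gives the apparent lag (NULL if the server is not a standby).
            cur.execute(
                "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
            )
            return float(cur.fetchone()[0])


if __name__ == "__main__":
    lag = replication_lag_seconds("host=maps-test-replica dbname=gis user=monitoring")
    print(f"replication lag: {lag:.0f}s")
```

Note that on a quiet primary this number can grow even without real lag, since it only advances when new transactions are replayed.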
Tilerator notification is failing regularly on the maps-test cluster, which is the cluster where hourly updates are enabled. This is correlation, not causation; still, we should make sure the problem isn't related (my suspicion: it actually is).
Based on T159631: Tasmania is covered with water at z10+, we should switch to hourly diffs even if we don't change how often we update.
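As a rough illustration of what consuming the hourly diff stream involves, the sketch below fetches the upstream hourly replication state from planet.openstreetmap.org and reports its age; the URL is the public hourly endpoint, while everything else (script structure, output) is just an assumed example, not how the maps servers actually do it:

```python
# Rough sketch: report the age of the latest upstream *hourly* OSM diff.
# The planet.openstreetmap.org URL is the public hourly replication endpoint;
# everything else here is illustrative only.
from datetime import datetime, timezone
from urllib.request import urlopen

HOURLY_STATE_URL = "https://planet.openstreetmap.org/replication/hour/state.txt"


def parse_state(text: str) -> dict:
    """Parse an osmosis-style state.txt (Java properties format)."""
    state = {}
    for line in text.splitlines():
        if "=" in line and not line.startswith("#"):
            key, value = line.split("=", 1)
            # Timestamps are stored with escaped colons, e.g. 2017-03-08T21\:00\:00Z
            state[key.strip()] = value.strip().replace("\\:", ":")
    return state


def upstream_timestamp() -> datetime:
    with urlopen(HOURLY_STATE_URL) as resp:
        state = parse_state(resp.read().decode("utf-8"))
    return datetime.strptime(state["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )


if __name__ == "__main__":
    behind = datetime.now(timezone.utc) - upstream_timestamp()
    print(f"latest upstream hourly diff is {behind} old")
```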
Still something we're interested in doing, but not sufficiently high-priority for Maps-Sprint.
@Gehel Are both of these still the case? (I found a Grafana dashboard for the production cluster[1], but not the maps-test cluster.)
[1] https://grafana.wikimedia.org/dashboard/db/maps-performances
Edit: Of course I found the control to switch to maps-test two seconds after posting...
It looks like, in the course of dealing with T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues, replication was increased to twice daily on 2019/12/19 [1], then again to hourly on 2019/12/24 [1, 2], then switched off on 2020/01/24 due to T243609: Maps master servers running out of space.
Moving this out of Tracking and to the Backlog since PI engineers are actually involved with it.
Changing the status to Stalled to reflect reality.
This task is finished and working on the codfw cluster, but the scripts are disabled in eqiad; for that part of the work, please refer to T254014: Reimport OSM data on eqiad. If you have questions or concerns, please reopen this task.