Maps master servers running out of space
Closed, Resolved, Public

Description

Starting on December 24, 2019, disk usage has been increasing on the maps servers. Most of the increase is in PostgreSQL. There is a strong temporal correlation with the increase in OSM replication frequency.

As a stopgap measure, replication is paused, so no new writes should happen on those servers, which gives us some time to understand the issue. The impact is that no new tiles will be generated and we will be out of sync with OSM.

Event Timeline

I took a look at the sizes of bzipped OSM planet data files over time, and it turns out that they're growing considerably year-over-year. It looks like our increasing storage needs aren't solely an issue of DB management.

Date        Size (.bz2)
2020-01-06  84G
2019-01-07  73G
2018-01-01  63G
2017-01-02  54G
2016-01-04  46G
2015-01-05  39G

https://planet.openstreetmap.org/planet/
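
To put a number on "growing considerably", here is a minimal sketch that computes the growth between consecutive snapshots from the sizes listed above. The function names are illustrative, and the snapshots are treated as exactly one year apart, which is only approximately true of the dates listed.

```python
# Planet dump sizes in GB (.bz2), keyed by snapshot year,
# copied from the sizes listed in this comment.
sizes_gb = {
    2015: 39,
    2016: 46,
    2017: 54,
    2018: 63,
    2019: 73,
    2020: 84,
}


def yearly_growth(sizes):
    """Return {year: percent growth over the previous snapshot}."""
    years = sorted(sizes)
    return {
        year: (sizes[year] / sizes[prev] - 1) * 100
        for prev, year in zip(years, years[1:])
    }


def average_growth(sizes):
    """Compound annual growth rate, in percent, from first to last snapshot."""
    years = sorted(sizes)
    span = years[-1] - years[0]
    ratio = sizes[years[-1]] / sizes[years[0]]
    return (ratio ** (1 / span) - 1) * 100


growth = yearly_growth(sizes_gb)
for year, pct in growth.items():
    print(f"{year}: +{pct:.1f}% over the previous snapshot")
print(f"average: +{average_growth(sizes_gb):.1f}% per year")
```

By these figures the compressed dump grows by roughly 15 to 18 percent per year. This only describes the .bz2 files; it says nothing directly about how large the imported PostgreSQL database becomes.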

It looks like we reach 85% disk space utilization on the maps masters after a fresh import of the planet, even before kicking off any change replication. IIUC, a general hardware update is in the works, which will address the storage issue (among others).
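
For anyone who wants to reproduce a utilization figure like this on their own host, here is a rough sketch using Python's standard library. The path "/" and the 85% threshold default are placeholders; the actual mount point of the PostgreSQL data directory on the maps masters is not stated in this task.

```python
import shutil


def percent_used(used, total):
    """Disk utilization as a percentage of total capacity."""
    return used / total * 100


def utilization_percent(path):
    """Utilization of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


def over_threshold(path, threshold=85.0):
    """True if the filesystem containing `path` is at or above `threshold` percent."""
    return utilization_percent(path) >= threshold


# "/" is a placeholder; point this at the database volume instead.
print(f"{utilization_percent('/'):.1f}% used, over 85%: {over_threshold('/')}")
```

Note that this ratio can differ slightly from the Use% column reported by `df`, which accounts for blocks reserved for root.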

I think this is in Needs Analysis on the Product-Infrastructure-Team-Backlog because we want to better understand the effect of increasing the replication frequency on disk space usage. That analysis should probably happen on T137939, and probably can't really happen at all until we have more storage to play with. In the meantime, this task can live in the Backlog.

Wait, so replication has been disabled for 2.5 months?

@TheDJ, yes. Unfortunately, the fix for this problem was delayed by a variety of events that reduced the availability of maps staff over the past quarter. We hope to push this forward in the upcoming weeks now that we are able to make room for that work.

I'm sorry for the inconvenience, I'll keep you posted.

Any news? The downtime is quite long for such a useful and crucial functionality of Wikipedia.

Now it's been 5 months since replication was disabled... any updates @MSantos @Mholloway @Gehel?

Data has been reimported in our codfw datacenter, and we are in the process of doing the same for the eqiad datacenter (T254014). We should be back to normal operations next week.

That's good to hear. So do normal operations imply that everything should have been reimported by now? I have a test page at https://en.wikipedia.org/wiki/User:%E2%B1%AE/sandbox35, and there is still no luck with the import, at least for this relation that I've been continually monitoring.

@Em sorry for the confusion, I suggest that you track T254014: Reimport OSM data on eqiad in order to have more accurate feedback regarding this fix.

Reminder: we, the German map maniacs, have been pleading by a majority for better support for geospatial information in the latest Technical Wishlist 2020 survey ;)

Ping - any news here? How long will we have to wait till this is fixed?

For what it’s worth, the relation I reported in https://phabricator.wikimedia.org/T218097#6007413 has begun showing up in mapframes, so something has begun moving, though I don’t know if Wikimedia Maps is fully caught up.

Wikimedia Maps appears to be pretty much caught up now, based on some spot checks of recent edits I’ve made to OpenStreetMap. For example, https://www.openstreetmap.org/changeset/93204365 added a covered bridge and creek that show up in https://maps.wikimedia.org/#19/39.71640/-81.61655.