Maps master servers running out of space
Open, High, Public

Description

Since December 24, 2019, disk usage has been increasing on the maps servers. Most of the growth is in PostgreSQL. There is a strong temporal correlation with the increase in OSM replication frequency.

As a stopgap measure, replication is paused, so no new writes should happen on those servers, which will give us some time to understand the issue. The impact is that no new tiles will be generated and we're going to be out of sync with OSM.

Event Timeline

Gehel created this task. Jan 24 2020, 4:09 PM
Restricted Application added a subscriber: Aklapper. Jan 24 2020, 4:09 PM
Mholloway triaged this task as High priority. Feb 11 2020, 4:54 PM

I took a look at the sizes of bzipped OSM planet data files over time, and it turns out that they're growing considerably year-over-year. It looks like our increasing storage needs aren't solely an issue of DB management.

Date          Size (.bz2)
2020-01-06    84G
2019-01-07    73G
2018-01-01    63G
2017-01-02    54G
2016-01-04    46G
2015-01-05    39G

https://planet.openstreetmap.org/planet/
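To put a rough number on "growing considerably year-over-year", here is a minimal sketch that computes the growth from the sizes listed above. It assumes the listed dates can be treated as exactly one year apart (they are off by a few days), and the variable names are only illustrative.

```python
# Planet dump sizes in GB (.bz2), copied from the table in this comment.
sizes_gb = {
    "2015-01-05": 39,
    "2016-01-04": 46,
    "2017-01-02": 54,
    "2018-01-01": 63,
    "2019-01-07": 73,
    "2020-01-06": 84,
}

# ISO dates sort chronologically as plain strings.
dates = sorted(sizes_gb)
values = [sizes_gb[d] for d in dates]

# Absolute and relative growth between consecutive yearly snapshots.
yoy_increase_gb = [b - a for a, b in zip(values, values[1:])]
yoy_growth = [(b - a) / a for a, b in zip(values, values[1:])]

# Compound annual growth rate over the whole span
# (assumption: each interval counts as exactly one year).
years = len(values) - 1
cagr = (values[-1] / values[0]) ** (1 / years) - 1

for date, inc, rate in zip(dates[1:], yoy_increase_gb, yoy_growth):
    print(f"{date}: +{inc}G ({rate:.1%})")
print(f"Compound annual growth over {years} years: {cagr:.1%}")
```

This only describes the compressed planet files; it says nothing about how the imported PostgreSQL data scales, so it should not be read as a disk usage forecast for the maps servers.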

Mholloway added a comment. Edited Mar 4 2020, 5:28 PM

It looks like we reach 85% disk space utilization on the maps masters after a fresh import of the planet, even before kicking off any change replication. IIUC, a general hardware update is in the works, which will address the storage issue (among others).

I think this is in Needs Analysis on the Product-Infrastructure-Team-Backlog because we want to better understand the effect of increasing the replication frequency on disk space usage. That analysis should probably happen on T137939, and probably can't really happen at all until we have more storage to play with. In the meantime, this task can live in the Backlog.

T196474: Externalize tile storage for maps is something to consider once again in this context.

TheDJ added a subscriber: TheDJ. Mar 31 2020, 1:24 PM

Wait, so replication has been disabled for 2.5 months?

@TheDJ, yes. Unfortunately, the fix for this problem was delayed by a variety of events that reduced the availability of maps staff over the past quarter. We hope to push this forward in the upcoming weeks now that we are able to make room for that work.

I'm sorry for the inconvenience, I'll keep you posted.

Jhernandez removed a subscriber: Jhernandez. Apr 2 2020, 6:46 PM
Ainali added a subscriber: Ainali. Apr 25 2020, 5:00 PM
mxn added a subscriber: mxn. Apr 25 2020, 5:00 PM
Renek78 added a subscriber: Renek78. May 7 2020, 4:21 PM

Any news? The downtime is quite long for such a useful and crucial functionality of Wikipedia.

Larske added a subscriber: Larske. May 9 2020, 9:56 AM
Evad37 added a subscriber: Evad37. May 9 2020, 11:56 PM
seav added a subscriber: seav. Jun 17 2020, 11:00 AM
Base added a subscriber: Base. Jun 17 2020, 11:01 AM
GoEThe added a subscriber: GoEThe. Jun 17 2020, 11:51 AM

Now it's been 5 months since replication was disabled... any updates @MSantos @Mholloway @Gehel?

Data has been reimported in our codfw datacenter, and we are in the process of doing the same for the eqiad datacenter (T254014). We should be back to normal operations next week.

Em added a subscriber: Em. Jul 13 2020, 1:21 PM

That's good to hear. So does "normal operations" imply that everything should be reimported by now? I have a test page at https://en.wikipedia.org/wiki/User:%E2%B1%AE/sandbox35, and there is still no luck with the import, at least for this relation, which I've been continually monitoring.

@Em sorry for the confusion, I suggest that you track T254014: Reimport OSM data on eqiad in order to have more accurate feedback regarding this fix.

Reminder: we, the German map maniacs, voted by a majority for better support for geospatial information in the latest Technical Wishlist 2020 survey ;)

jhsoby added a subscriber: jhsoby. Aug 11 2020, 12:33 PM
Kozuch added a subscriber: Kozuch.
Kozuch added a comment. Sep 1 2020, 1:57 PM

Ping - any news here? How long will we have to wait till this is fixed?