
maps-test2001 is low on disk space
Closed, Resolved · Public


maps-test2001 is one of the older servers that are now used for testing. They have smaller disks than the production servers. We are getting low on disk space on the /srv partition, which is used as storage for both Cassandra and PostgreSQL. We might simply be reaching the limit of what we can do with a full OSM dataset, but some investigation is needed to see whether we forgot some cleanup somewhere... Note that the disk usage on maps-test2001 is consistent with the other maps servers.

The large consumers:

  • /srv/postgresql: 600GB
    • /srv/postgresql/9.4/main/pg_xlog: 25GB (do we really need that much space for xlog? @Pnorman knows more about this than I do)
    • /srv/postgresql/9.4/main/base: 575GB (postgres logs indicate that autovacuum is running, so there is probably not much we can save)
  • /srv/cassandra: 370GB
  • /srv/osmosis: 40GB (only nodes.bin, nothing we can remove here)
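On the pg_xlog question: in PostgreSQL 9.4, pg_xlog size is normally bounded by checkpoint_segments and wal_keep_segments (16MB per segment). A rough sketch of that ceiling, with assumed settings rather than the actual maps-test2001 config:

```shell
# Upper bound on pg_xlog size in PostgreSQL 9.4 (sketch).
# Per the 9.4 docs, the number of WAL segments normally stays below roughly
# (2 + checkpoint_completion_target) * checkpoint_segments + 1, or
# checkpoint_segments + wal_keep_segments + 1, whichever is larger.
# The settings below are assumptions, not the real server config.
segment_mb=16
checkpoint_segments=64
wal_keep_segments=1024
bound_a=$(( 3 * checkpoint_segments + 1 ))   # completion_target rounded up to 1
bound_b=$(( checkpoint_segments + wal_keep_segments + 1 ))
max_segments=$(( bound_a > bound_b ? bound_a : bound_b ))
echo "pg_xlog ceiling: $(( max_segments * segment_mb )) MB"
```

For comparison, 25GB of pg_xlog would correspond to roughly 1600 segments, so a large wal_keep_segments (or an archiving backlog) would be a plausible explanation.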

Event Timeline

Restricted Application added a subscriber: Aklapper.

Probably, Cassandra didn't delete old keyspaces.

It looks like Cassandra does not have enough space to do compaction:

ERROR [CompactionExecutor:4465] 2017-12-05 09:24:41,156 - Exception in thread Thread[CompactionExecutor:4465,1,main]
java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 167707607671
        at org.apache.cassandra.db.compaction.CompactionTask.checkAvailableDiskSpace(...) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(...) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at ... ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(...) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(...) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionManager$...(...) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at java.util.concurrent.Executors$...(...) ~[na:1.8.0_151]
        at ... ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(...) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$...(...) ~[na:1.8.0_151]
        at ... ~[na:1.8.0_151]
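The failing check above is just free space vs. the estimated write size of the compaction. A minimal sketch of the same comparison, using the write size from this log line (the free-space figure is an assumption; `df -B1 /srv` would give the live value):

```shell
# Does the estimated compaction output fit on /srv? (sketch)
expected_write=167707607671                  # bytes, from the log line above
free_bytes=$(( 120 * 1024 * 1024 * 1024 ))   # assumed ~120GiB free; use `df -B1 /srv` for real
if [ "$free_bytes" -lt "$expected_write" ]; then
  echo "not enough space for compaction"
else
  echo "compaction fits"
fi
```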

Other maps nodes don't have this issue. I'm not sure we can do much about it... disks are too small for our data set. Since we need to decommission those servers anyway, it does not make much sense to get larger disks.

Mentioned in SAL (#wikimedia-operations) [2017-12-12T09:38:05Z] <gehel> reduce replication factor for cassandra on maps-test cluster and reset cassandra on maps-test2001 to work around limited disk space - T182583

Mentioned in SAL (#wikimedia-operations) [2017-12-14T12:18:23Z] <gehel> re-initialize cassandra on maps-test2001 - T182583

Reducing the Cassandra replication factor frees enough space that we no longer have an immediate issue (compaction is running without errors). Since the goal is to move the maps test environment to WMCS and reduce the dataset size, we should not invest more time here.
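The replication-factor reduction could look roughly like this in cqlsh (keyspace name, datacenter name, and target factor are all assumptions here, not taken from the ticket):

```sql
-- Sketch: lower the replication factor of a keyspace (names are hypothetical).
ALTER KEYSPACE tiles
  WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': 2};
```

After the ALTER, running `nodetool cleanup` on each node drops the replicas the node no longer owns, which is what actually frees the disk space.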

Looking at the production clusters, the free space is 2x the size of the Cassandra data store. So we're good and have some margin (and we don't expect the storage needs to increase).
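That 2x margin matters because a worst-case size-tiered compaction can temporarily need roughly as much free space as the data it rewrites. A trivial sketch of the check, using the Cassandra data size from this ticket and an assumed free-space figure:

```shell
# Sketch: is there enough headroom for a worst-case compaction?
# Values are illustrative; live values would come from `du` and `df`.
data_gb=370    # /srv/cassandra size from this ticket
free_gb=740    # assumed: the ticket says production free space is ~2x the data
if [ "$free_gb" -ge $(( 2 * data_gb )) ]; then
  echo "margin OK"
fi
```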