
disk usage increase on maps servers
Closed, Resolved · Public

Description

While disk usage was mostly constant over the last 6 months, it has started to increase over the past month. We are now running into space issues.

To mitigate the issue in the short term, I have reduced the replication factor of the v4 keyspace in Cassandra from 4 to 3, which should give us some headroom while investigating.
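For reference, a replication change like this is a one-line CQL statement. The keyspace, strategy, and datacenter names below are illustrative, not taken from the task:

```sql
-- Illustrative names; the actual keyspace/datacenter names may differ.
ALTER KEYSPACE v4
  WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3};
```

Note that lowering the factor does not free space by itself; running `nodetool cleanup` on each node afterwards removes the now-over-replicated data.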

The two large consumers of disk space are of course Cassandra and PostgreSQL, but I don't have historical data to see which one has increased (or whether it is both).
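Without historical data, one option is to start recording per-directory sizes so the next growth episode can be attributed. A minimal sketch, assuming hypothetical data-directory paths (the real layout will differ):

```python
import os

def dir_size_bytes(path):
    """Total size of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

# Hypothetical data directories; adjust to the actual host layout.
for label, path in [("cassandra", "/srv/cassandra"),
                    ("postgresql", "/srv/postgresql")]:
    if os.path.isdir(path):
        print(f"{label}: {dir_size_bytes(path) / 1e9:.1f} GB")
```

Logging this periodically (cron, or a Prometheus node_exporter textfile) would give the trend data that is missing here.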

Event Timeline

Gehel created this task.May 18 2018, 4:04 PM
Restricted Application added a project: Discovery.May 18 2018, 4:04 PM
Restricted Application added a subscriber: Aklapper.
Gehel added a comment.May 18 2018, 5:19 PM

Cassandra might have been running into compaction issues while we had both the v3 and v4 keyspaces and not enough space to run compaction. Though I don't see any errors in the Cassandra logs...

StjnVMF renamed this task from disk usage increase on maps servers to unban reguyla.May 18 2018, 5:24 PM
StjnVMF updated the task description.
JJMC89 renamed this task from unban reguyla to disk usage increase on maps servers.May 18 2018, 5:29 PM
JJMC89 updated the task description.

Points from IRC conversation

  • v3 keyspace has already been removed from both
  • cassandra compaction is manually running and recovering space
  • pg_xlog takes 30GB
  • we have checkpoint_segments = 768 / wal_keep_segments = 768, and it's unclear where these numbers came from
  • we should increase storage size (i.e. get bigger disks)
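The 30 GB pg_xlog figure is consistent with those settings. Using the sizing formula from the PostgreSQL documentation for pre-9.5 releases (checkpoint_completion_target = 0.5 is an assumed default, not stated above):

```python
# Back-of-the-envelope pg_xlog sizing for pre-9.5 PostgreSQL settings.
# Per the PostgreSQL docs, pg_xlog can grow to roughly
# (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments,
# and is never trimmed below wal_keep_segments + checkpoint_segments + 1.
SEGMENT_MB = 16                      # default WAL segment size
checkpoint_segments = 768
wal_keep_segments = 768
checkpoint_completion_target = 0.5   # assumed default, not stated in the task

upper = (2 + checkpoint_completion_target) * checkpoint_segments + 1
floor = wal_keep_segments + checkpoint_segments + 1
segments = max(upper, floor)
print(f"pg_xlog upper bound: ~{segments * SEGMENT_MB / 1024:.0f} GB")
```

That works out to roughly 30 GB, i.e. the observed pg_xlog size is expected behavior for these settings, not a leak; shrinking the two 768 values would directly shrink pg_xlog.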

Mentioned in SAL (#wikimedia-operations) [2018-05-21T19:19:52Z] <gehel> clearing cassandra snapshots on maps* nodes to regain some space - T194966
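For context, Cassandra snapshots are hard links that are never cleaned up automatically, so they silently pin old SSTables. A sketch of the relevant nodetool commands (run per node):

```shell
nodetool listsnapshots    # list snapshots with their true (non-shared) size
nodetool clearsnapshot    # remove all snapshots on this node
```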

Gehel mentioned this in Unknown Object (Task).May 22 2018, 7:34 AM

Mentioned in SAL (#wikimedia-operations) [2018-05-22T12:18:48Z] <gehel> set unchecked_tombstone_compaction=true for maps eqiad - T194966

Gehel added a subscriber: Eevans.May 22 2018, 2:44 PM

From conversation with @Eevans:

I would consider altering compaction settings to make it more aggressive about performing tombstone compactions.
See https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_configure_compaction_t.html for how to go about making the changes, and https://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html for the properties you can change.
Of those settings, I would definitely enable unchecked_tombstone_compaction.
You could also consider dropping gc_grace_seconds to something less than 10 days, provided you're committed to dealing with any failures within whatever period you use. It doesn't really change the long-term picture, it would just lower the number of in-situ tombstones (but in your current state, cutting that in half from 10 to 5 days might be a win).
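Applied per table, the suggestions above translate into CQL along these lines. The keyspace/table name and thresholds are illustrative, and the compaction class must match whatever the table already uses:

```sql
-- Illustrative keyspace/table name; apply per table via cqlsh.
ALTER TABLE v4.tiles
  WITH compaction = {
    'class': 'LeveledCompactionStrategy',        -- keep the existing class
    'unchecked_tombstone_compaction': 'true',
    'tombstone_threshold': '0.2'
  }
  AND gc_grace_seconds = 432000;  -- 5 days instead of the 10-day default
```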

unchecked_tombstone_compaction is already enabled on the maps eqiad cluster. Compaction is running. No impact on disk space yet (which is expected). Let's see how this goes...

RobH added a subscriber: RobH.May 22 2018, 2:52 PM
This comment was removed by RobH.
RobH mentioned this in Unknown Object (Task).May 22 2018, 3:08 PM

Just to note, the immediate disk space usage issues will get better when reimaging as part of the new style setup, because it will completely reset Cassandra.

If we were using an object store provided as a service, rather than one run by the maps team, that would also remove any need for more disk space here.

Gehel added a comment.May 23 2018, 2:26 PM

I see no significant drop in disk usage since enabling unchecked_tombstone_compaction. But disk usage remains stable, which means we are probably OK, at least for the short term.

Gehel triaged this task as Medium priority.May 30 2018, 3:27 PM

Disk usage has been stable over the last week, so it looks like the tuning we did, while not recovering much space, helped stabilize things. I'll keep an eye on it for a while. The long-term solution is probably to increase storage, or to move the object store (Cassandra) off the maps servers.

Vvjjkkii renamed this task from disk usage increase on maps servers to lrcaaaaaaa.Jul 1 2018, 1:09 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from lrcaaaaaaa to disk usage increase on maps servers.Jul 2 2018, 1:56 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-09-20T14:44:09Z] <gehel> reduce replication factor to 2 on cassandra maps eqiad - T194966

RobH removed a subscriber: RobH.Tue, Mar 3, 6:23 PM
Mholloway closed this task as Resolved.Fri, Mar 6, 3:55 PM
Mholloway assigned this task to Gehel.

This particular instance of increasing disk usage appears resolved, so I'm resolving it, but see T243609 re: current disk usage.