
disk usage increase on maps servers
Closed, Resolved · Public

Description

While disk usage was mostly constant over the last 6 months, it has started to increase over the past month. We are now running into space issues.

To mitigate the issue in the short term, I have reduced the replication factor of the v4 keyspace in Cassandra from 4 to 3, which should give us some headroom while investigating.
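For reference, a replication change like this is a one-line CQL statement. The keyspace, strategy, and datacenter names below are illustrative, not taken from the task:

```sql
-- Illustrative names; the actual keyspace/datacenter names may differ.
ALTER KEYSPACE v4
  WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3};
```

Note that lowering the factor does not free space by itself; running `nodetool cleanup` on each node afterwards removes the now-over-replicated data.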

The two large consumers of disk space are of course Cassandra and PostgreSQL, but I don't have historical data to see which one has increased (or whether it is both).
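Without historical data, one option is to start recording per-directory sizes so the next growth episode can be attributed. A minimal sketch, assuming hypothetical data-directory paths (the real layout will differ):

```python
import os

def dir_size_bytes(path):
    """Total size of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

# Hypothetical data directories; adjust to the actual host layout.
for label, path in [("cassandra", "/srv/cassandra"),
                    ("postgresql", "/srv/postgresql")]:
    if os.path.isdir(path):
        print(f"{label}: {dir_size_bytes(path) / 1e9:.1f} GB")
```

Logging this periodically (cron, or a Prometheus node_exporter textfile) would give the trend data that is missing here.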

Event Timeline

Gehel created this task.May 18 2018, 4:04 PM
Restricted Application added a project: Discovery.May 18 2018, 4:04 PM
Restricted Application added a subscriber: Aklapper.
Gehel added a comment.May 18 2018, 5:19 PM

Cassandra might have been running into compaction issues while we had both the v3 and v4 keyspaces and not enough space to run compaction. Though I don't see any errors in the Cassandra logs...

StjnVMF renamed this task from disk usage increase on maps servers to unban reguyla.May 18 2018, 5:24 PM
StjnVMF updated the task description.
JJMC89 renamed this task from unban reguyla to disk usage increase on maps servers.May 18 2018, 5:29 PM
JJMC89 updated the task description.

Points from IRC conversation

  • v3 keyspace has already been removed from both
  • cassandra compaction is manually running and recovering space
  • pg_xlog takes 30GB
  • we have checkpoint_segments = 768 / wal_keep_segments = 768, and it's unclear where these numbers came from
  • we should increase storage size (i.e. get bigger disks)
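The 30 GB pg_xlog figure is consistent with those settings. Using the sizing formula from the PostgreSQL documentation for pre-9.5 releases (checkpoint_completion_target = 0.5 is an assumed default, not stated above):

```python
# Back-of-the-envelope pg_xlog sizing for pre-9.5 PostgreSQL settings.
# Per the PostgreSQL docs, pg_xlog can grow to roughly
# (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments,
# and is never trimmed below wal_keep_segments + checkpoint_segments + 1.
SEGMENT_MB = 16                      # default WAL segment size
checkpoint_segments = 768
wal_keep_segments = 768
checkpoint_completion_target = 0.5   # assumed default, not stated in the task

upper = (2 + checkpoint_completion_target) * checkpoint_segments + 1
floor = wal_keep_segments + checkpoint_segments + 1
segments = max(upper, floor)
print(f"pg_xlog upper bound: ~{segments * SEGMENT_MB / 1024:.0f} GB")
```

That works out to roughly 30 GB, i.e. the observed pg_xlog size is expected behavior for these settings, not a leak; shrinking the two 768 values would directly shrink pg_xlog.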

Mentioned in SAL (#wikimedia-operations) [2018-05-21T19:19:52Z] <gehel> clearing cassandra snapshots on maps* nodes to regain some space - T194966
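For context, Cassandra snapshots are hard links that are never cleaned up automatically, so they silently pin old SSTables. A sketch of the relevant nodetool commands (run per node):

```shell
nodetool listsnapshots    # list snapshots with their true (non-shared) size
nodetool clearsnapshot    # remove all snapshots on this node
```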

Gehel mentioned this in Unknown Object (Task).May 22 2018, 7:34 AM

Mentioned in SAL (#wikimedia-operations) [2018-05-22T12:18:48Z] <gehel> set unchecked_tombstone_compaction=true for maps eqiad - T194966

Gehel added a subscriber: Eevans.May 22 2018, 2:44 PM

From conversation with @Eevans:

I would consider altering compaction settings to make it more aggressive about performing tombstone compactions.
See https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_configure_compaction_t.html for how to go about making the changes, and https://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html for the properties you can change.
Of those settings, I would definitely enable unchecked_tombstone_compaction.
You could also consider dropping gc_grace_seconds to something less than 10 days, provided you're committed to dealing with any failures within whatever period you use. It doesn't really change the long-term picture, it would just lower the number of in-situ tombstones (but in your current state, cutting that in half from 10 to 5 days might be a win).
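Applied per table, the suggestions above translate into CQL along these lines. The keyspace/table name and thresholds are illustrative, and the compaction class must match whatever the table already uses:

```sql
-- Illustrative keyspace/table name; apply per table via cqlsh.
ALTER TABLE v4.tiles
  WITH compaction = {
    'class': 'LeveledCompactionStrategy',        -- keep the existing class
    'unchecked_tombstone_compaction': 'true',
    'tombstone_threshold': '0.2'
  }
  AND gc_grace_seconds = 432000;  -- 5 days instead of the 10-day default
```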

unchecked_tombstone_compaction is already enabled on the maps eqiad cluster. Compaction is running. No impact on disk space yet (which is expected). Let's see how this goes...

RobH added a subscriber: RobH.May 22 2018, 2:52 PM
This comment was removed by RobH.
RobH mentioned this in Unknown Object (Task).May 22 2018, 3:08 PM

Just to note, the immediate disk space usage issues will get better when reimaging as part of the new style setup, because it will completely reset Cassandra.

If we were using an object store provided as a service, rather than one run by the maps team, that would also remove any need for more disk space here.

Gehel added a comment.May 23 2018, 2:26 PM

I see no significant drop in disk usage since enabling unchecked_tombstone_compaction. But disk usage remains stable, which means we are probably OK, at least for the short term.

Gehel triaged this task as Medium priority.May 30 2018, 3:27 PM

Disk usage has been stable over the last week, so it looks like the tuning we did, while not recovering much space, helped stabilize things. I'll keep an eye on it for a while. The long-term solution is probably to increase storage, or to move the object store (Cassandra) off the maps servers.

Vvjjkkii renamed this task from disk usage increase on maps servers to lrcaaaaaaa.Jul 1 2018, 1:09 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from lrcaaaaaaa to disk usage increase on maps servers.Jul 2 2018, 1:56 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-09-20T14:44:09Z] <gehel> reduce replication factor to 2 on cassandra maps eqiad - T194966

RobH removed a subscriber: RobH.Tue, Mar 3, 6:23 PM
Mholloway closed this task as Resolved.Fri, Mar 6, 3:55 PM
Mholloway assigned this task to Gehel.

This particular instance of increasing disk usage appears resolved, so I'm resolving it, but see T243609 re: current disk usage.