column family cassandra metrics size
Closed, Resolved (Public)

Description

one ripple effect of multi-instance cassandra will be on graphite metrics: ATM each cassandra JVM uses ~11G on disk in graphite, e.g. restbase1009 (total 222G now)

per-CF (column family) metrics take up two to three orders of magnitude more space than the rest:

$ du -hcs restbase1009/org/apache/cassandra/metrics/*
1.5M	restbase1009/org/apache/cassandra/metrics/CQL
12M	restbase1009/org/apache/cassandra/metrics/Cache
46M	restbase1009/org/apache/cassandra/metrics/ClientRequest
9.9G	restbase1009/org/apache/cassandra/metrics/ColumnFamily
9.9M	restbase1009/org/apache/cassandra/metrics/CommitLog
2.4M	restbase1009/org/apache/cassandra/metrics/Compaction
47M	restbase1009/org/apache/cassandra/metrics/Connection
14M	restbase1009/org/apache/cassandra/metrics/DroppedMessage
3.6M	restbase1009/org/apache/cassandra/metrics/FileCache
4.5M	restbase1009/org/apache/cassandra/metrics/ReadRepair
1.2M	restbase1009/org/apache/cassandra/metrics/Storage
38M	restbase1009/org/apache/cassandra/metrics/ThreadPools
10G	total
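
To put a number on how dominant the ColumnFamily directory is, here is a minimal sketch that parses the `du -h` figures above and compares ColumnFamily with every other directory. The parsing helper and the assumption that `du -h` suffixes are powers of 1024 are mine, not part of the original task.

```python
import math

# Sizes copied from the du listing above (last path component only).
DU_OUTPUT = """\
1.5M CQL
12M Cache
46M ClientRequest
9.9G ColumnFamily
9.9M CommitLog
2.4M Compaction
47M Connection
14M DroppedMessage
3.6M FileCache
4.5M ReadRepair
1.2M Storage
38M ThreadPools
"""

# Assumption: du -h reports binary units (K = 1024 bytes, and so on).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> float:
    """Convert a du -h style size such as '9.9G' into bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


sizes = {}
for line in DU_OUTPUT.splitlines():
    size, name = line.split()
    sizes[name] = parse_size(size)

cf_bytes = sizes["ColumnFamily"]
others = {name: b for name, b in sizes.items() if name != "ColumnFamily"}

# Share of the listed total that is taken by per-CF metrics.
cf_share = cf_bytes / sum(sizes.values())

# Orders of magnitude between ColumnFamily and each other directory.
orders = {name: math.log10(cf_bytes / b) for name, b in others.items()}
min_orders = min(orders.values())
max_orders = max(orders.values())

print(f"ColumnFamily share: {cf_share:.1%}")
print(f"orders of magnitude vs. other dirs: {min_orders:.1f} to {max_orders:.1f}")
```

With these figures ColumnFamily accounts for roughly 98% of the listed space, and the gap to any single other directory is a bit over two orders of magnitude at the low end and just under four at the high end.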

some metric types also have plenty of derived metrics which contribute to that, e.g. latencies such as CasCommitLatency:

-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:50 15MinuteRate.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:49 1MinuteRate.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:49 50percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:49 5MinuteRate.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:51 75percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:49 95percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:50 98percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:50 999percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:47 99percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:47 count.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:47 max.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:48 mean.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:49 meanRate.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:50 min.wsp
-rw-r--r--  1 _graphite _graphite 309088 Sep 25 13:50 stddev.wsp

for each of those there are between roughly 1.4k and 3.5k files, each 309088 bytes (~309 kB):

15MinuteRate.wsp:1389
1MinuteRate.wsp:1389
50percentile.wsp:2141
5MinuteRate.wsp:1389
75percentile.wsp:2141
95percentile.wsp:2141
98percentile.wsp:2141
999percentile.wsp:2009
99percentile.wsp:2141
count.wsp:3548
max.wsp:2141
mean.wsp:2009
meanRate.wsp:1389
min.wsp:2141
stddev.wsp:2009
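
Every `.wsp` file above has the same size because a whisper file is preallocated from its retention schedule. The sketch below computes that size from the on-disk layout (16-byte metadata header, 12 bytes per archive header, 12 bytes per data point). The retention schedule used here (`1m:7d,5m:14d,15m:30d,1h:1y`) is an assumption: it is not stated anywhere in this task, it is simply one schedule that reproduces 309088 bytes.

```python
from typing import List, Tuple

METADATA_BYTES = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_BYTES = 12  # offset, seconds per point, number of points
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double value

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# Assumed schedule (hypothetical, not confirmed by the task):
# (seconds per point, retention in seconds)
ASSUMED_ARCHIVES: List[Tuple[int, int]] = [
    (1 * MINUTE, 7 * DAY),
    (5 * MINUTE, 14 * DAY),
    (15 * MINUTE, 30 * DAY),
    (1 * HOUR, 365 * DAY),
]


def whisper_file_size(archives: List[Tuple[int, int]]) -> int:
    """Return the preallocated size in bytes of a whisper file."""
    points = sum(retention // step for step, retention in archives)
    return (
        METADATA_BYTES
        + ARCHIVE_INFO_BYTES * len(archives)
        + POINT_BYTES * points
    )


print(whisper_file_size(ASSUMED_ARCHIVES))
```

Since the size depends only on the schedule and not on how often a metric is written, each derived metric costs the same amount of disk regardless of how useful it is.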

I think we can trim the list of derived metrics to the most relevant ones, e.g. 50/75/95/99 percentile, count, 1MinuteRate
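
A rough estimate of what that trimming would save, as a sketch: it multiplies the file counts listed above by the 309088-byte file size and compares the full set with the proposed keep list (50/75/95/99 percentile, count, 1MinuteRate). It ignores filesystem block rounding, and it assumes the counts and the file size describe the same set of files.

```python
WSP_BYTES = 309088  # size of each .wsp file, from the listing above

# Number of .wsp files per derived metric, copied from the counts above.
FILE_COUNTS = {
    "15MinuteRate": 1389,
    "1MinuteRate": 1389,
    "50percentile": 2141,
    "5MinuteRate": 1389,
    "75percentile": 2141,
    "95percentile": 2141,
    "98percentile": 2141,
    "999percentile": 2009,
    "99percentile": 2141,
    "count": 3548,
    "max": 2141,
    "mean": 2009,
    "meanRate": 1389,
    "min": 2141,
    "stddev": 2009,
}

# Keep list as proposed in the task description.
KEEP = {
    "50percentile", "75percentile", "95percentile", "99percentile",
    "count", "1MinuteRate",
}

total_files = sum(FILE_COUNTS.values())
kept_files = sum(n for name, n in FILE_COUNTS.items() if name in KEEP)
dropped_files = total_files - kept_files

GIB = 1024 ** 3
print(f"before: {total_files} files, {total_files * WSP_BYTES / GIB:.2f} GiB")
print(f"after:  {kept_files} files, {kept_files * WSP_BYTES / GIB:.2f} GiB")
print(f"saved:  {dropped_files} files ({dropped_files / total_files:.0%})")
```

Under these assumptions the keep list removes a bit over half of the derived-metric files.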

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to High.
fgiunchedi updated the task description.
fgiunchedi added subscribers: Joe, fgiunchedi, GWicke and 4 others.
fgiunchedi lowered the priority of this task from High to Medium. (Sep 25 2015, 3:23 PM)

I still can't help but wish that we didn't need to be so frugal with metrics storage, but if we need to get size down, some of these derived metrics are low hanging fruit.

Is this something that can be filtered out on the Graphite/Carbon end, or should we set up some kind of filtering in cassandra-metrics-collector?

What would metrics filtering in cmc look like? A "patterns file" of some kind? Is it enough to exclude (i.e. assume that everything is an implicit include, and provide pattern for explicit exclude)?

good question! yeah I think a blacklist would be fine for now

re: metric storage space, I concur; related is T85451: scale graphite deployment (tracking). even though it has gotten better since May with smaller whisper files, it's still an issue for heavy hitters like cassandra of course
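
The "implicit include, explicit exclude" idea discussed above can be sketched as a patterns file with one regular expression per line, where a metric is forwarded unless some pattern matches it. This is only an illustration in Python: the function names, the file syntax, the patterns and the dotted metric name are all hypothetical, and this is not the collector's actual implementation.

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical patterns file: one regex per line, '#' starts a comment.
PATTERNS_FILE = r"""
# derived metrics that are rarely graphed (illustrative choice)
\.(5|15)MinuteRate$
\.meanRate$
\.(98|999)percentile$
\.(min|mean|stddev)$
"""


def load_blacklist(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regex per non-empty, non-comment line."""
    patterns = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def is_allowed(metric_name: str, blacklist: List[Pattern[str]]) -> bool:
    """Everything is included unless a blacklist pattern matches."""
    return not any(p.search(metric_name) for p in blacklist)


blacklist = load_blacklist(PATTERNS_FILE.splitlines())

# Hypothetical metric prefix, for illustration only.
prefix = "cassandra.example-host.ColumnFamily.ks.cf.CasCommitLatency"
for suffix in ("99percentile", "999percentile", "count", "stddev"):
    name = f"{prefix}.{suffix}"
    print(name, "->", "send" if is_allowed(name, blacklist) else "drop")
```

An exclude-only list keeps the default behaviour unchanged for anyone who does not configure it, at the cost that newly introduced metrics are sent until someone adds a pattern for them.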

OK, this is implemented in https://github.com/wikimedia/cassandra-metrics-collector/pull/4; please test vigorously.

I'm sorry, but I wouldn't call ~220 GB of storage for Cassandra metrics alone particularly "frugal". :) That's rather a lot compared to the amount of metrics we've been gathering for services overall in the past, and not something we've really planned for.

The problem is we haven't planned on relying on metrics as much as we do. But, as it turns out, in Cassandra's case you can tell a lot by looking at the right metrics.

just tested this in labs and it works, LGTM. I'll send patches to deploy; we'll need to do a rolling restart of cassandra first so that it picks up the instance-id property

Do you need me to deploy a snapshot to Archiva?

yep that would be good, thanks! I'll change puppet accordingly

Done; https://archiva.wikimedia.org/repository/snapshots/org/wikimedia/cassandra-metrics-collector/2.0.0-SNAPSHOT/cassandra-metrics-collector-2.0.0-20151001.182133-1-jar-with-dependencies.jar

Change 243121 had a related patch set uploaded (by Filippo Giunchedi):
update collector version

https://gerrit.wikimedia.org/r/243121

Change 243127 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: new metrics-collector version

https://gerrit.wikimedia.org/r/243127

Change 243121 merged by Filippo Giunchedi:
update collector version

https://gerrit.wikimedia.org/r/243121

Change 243127 merged by Filippo Giunchedi:
cassandra: new metrics-collector version

https://gerrit.wikimedia.org/r/243127

Change 248313 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: enable metric blacklist for restbase

https://gerrit.wikimedia.org/r/248313

Change 248313 merged by Filippo Giunchedi:
cassandra: enable metric blacklist for restbase

https://gerrit.wikimedia.org/r/248313

@Eevans, @fgiunchedi: Are we good with the blacklist? Should we resolve this task?

Yes we can! e.g. for restbase1009-a/org/apache/cassandra/metrics/ColumnFamily/local_group_default_T_summary/data/CasCommitLatency/

-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:20 1MinuteRate.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:42 50percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 12:00 75percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:43 95percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:48 99percentile.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:22 count.wsp
-rw-r--r--  1 _graphite _graphite 309088 Jul 12 11:41 max.wsp