Eventstreams graphite disk usage
Closed, ResolvedPublic

Description

I noticed eventstreams using a significant amount of disk space on graphite, with ~half of the rdkafka metrics being more than 10d old and no longer updated. @Ottomata, is there anything we could do here, like aggregating in a different way or purging old metrics?

root@graphite1001:/var/lib/carbon/whisper/eventstreams# find rdkafka/ -type f -mtime +10 | wc -l
239220
root@graphite1001:/var/lib/carbon/whisper/eventstreams# find rdkafka/ -type f  | wc -l
518155

161G eventstreams
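
For a rough idea of how much of that is reclaimable, something along these lines (GNU find, run from the same directory) sums the size of the stale files:

# sum the size of rdkafka whisper files not updated in more than 10 days
find rdkafka/ -type f -mtime +10 -printf '%s\n' \
  | awk '{ total += $1 } END { printf "%.1f GiB stale\n", total / 2^30 }'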

Event Timeline

Yar, this is because many of the metrics are per-client. I'd like to know if clients start lagging, and there's not a real way to aggregate that.

But, we really don't need to keep history of this data. Can we delete certain data > 2 weeks old?

Yes, we could for sure; we already do something similar for the instances hierarchy.
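
For reference, that cleanup boils down to a periodic find over the whisper tree; a rough sketch (the 14-day threshold and exact invocation here are illustrative, the real thing is in the puppet change below):

# remove whisper files not updated in the last 14 days, then prune empty directories
find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -name '*.wsp' -mtime +14 -delete
find /var/lib/carbon/whisper/eventstreams/rdkafka -type d -empty -delete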

Change 343609 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] graphite: cleanup eventstreams rdkafka stale data

https://gerrit.wikimedia.org/r/343609

Change 343609 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: cleanup eventstreams rdkafka stale data

https://gerrit.wikimedia.org/r/343609

fgiunchedi claimed this task.

eventstreams is at 110G and is cleaned up periodically; good enough for now

Reopening: starting around 6/6, eventstreams has been creating a lot of metrics, consuming ~20% of graphite disk space in 8 days; it is now at around 400G

screenshot_aCnYLV.png (269×473 px, 22 KB)

We're already cleaning up metrics older than 15d, but that doesn't seem to be enough with a big influx of metrics like this

fgiunchedi triaged this task as High priority.
fgiunchedi removed a project: Patch-For-Review.

Change 361818 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] graphite: lower down eventstreams whisper files retention

https://gerrit.wikimedia.org/r/361818

Change 361818 merged by Elukey:
[operations/puppet@production] graphite: lower down eventstreams whisper files retention

https://gerrit.wikimedia.org/r/361818
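
Note that lowering the retention in storage-schemas.conf only affects whisper files created after the change; existing files keep their old archives unless resized, e.g. with something along these lines (the 1m:7d retention and the metric path are placeholders, the actual value is in the change above):

# check the archives of an existing whisper file (path is a placeholder)
whisper-info.py /var/lib/carbon/whisper/eventstreams/rdkafka/<some-metric>.wsp
# resize existing files to the new, shorter retention
find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -name '*.wsp' \
  -exec whisper-resize.py {} 1m:7d \;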

fgiunchedi claimed this task.

Resolving this for now, will reopen if necessary

Reopening: rdkafka metrics for eventstreams have been out of control for the past couple of days

graphite1001:~$ du -hcs /var/lib/carbon/whisper/eventstreams/rdkafka
590G	/var/lib/carbon/whisper/eventstreams/rdkafka

Ideally we would not push so many metrics in the first place; we should also get more aggressive with the cleanup, maybe 5 days (down from the current 10 days).

Change 374500 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::graphite::production: lower down eventstreams rdkafka retention

https://gerrit.wikimedia.org/r/374500

Change 374500 merged by Elukey:
[operations/puppet@production] role::graphite::production: lower down eventstreams rdkafka retention

https://gerrit.wikimedia.org/r/374500

The other step to take would be to limit the amount of data that we store for librdkafka, because with so many clients it is impossible to keep track of all the metrics (https://grafana.wikimedia.org/dashboard/db/eventstreams doesn't even load anymore).
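
To get a sense of how many per-client trees are involved (assuming one subdirectory per client under rdkafka/, which is an assumption about the hierarchy layout), something like:

# count top-level per-client metric trees and show the largest ones by disk usage
find /var/lib/carbon/whisper/eventstreams/rdkafka -mindepth 1 -maxdepth 1 -type d | wc -l
du -s /var/lib/carbon/whisper/eventstreams/rdkafka/* | sort -n | tail -20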

We're doing well space-wise now:

# du -hcs /var/lib/carbon/whisper/eventstreams/
4.8G	/var/lib/carbon/whisper/eventstreams/