Eventstreams graphite disk usage
Closed, ResolvedPublic

Description

I noticed eventstreams using a significant amount of disk space on graphite, with ~half of the rdkafka metrics being more than 10d old and no longer updated. @Ottomata, is there anything we could do here, like aggregating in a different way or purging old metrics?

root@graphite1001:/var/lib/carbon/whisper/eventstreams# find rdkafka/ -type f -mtime +10 | wc -l
239220
root@graphite1001:/var/lib/carbon/whisper/eventstreams# find rdkafka/ -type f  | wc -l
518155

161G eventstreams
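
For a rough idea of how much of that is reclaimable, something along these lines (GNU find, run from the same directory) sums the size of the stale files:

# sum the size of rdkafka whisper files not updated in more than 10 days
find rdkafka/ -type f -mtime +10 -printf '%s\n' \
  | awk '{ total += $1 } END { printf "%.1f GiB stale\n", total / 2^30 }'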

Event Timeline

Yar, this is because many of the metrics are per-client. I'd like to know if clients start lagging, and there's not a real way to aggregate that.

But, we really don't need to keep history of this data. Can we delete certain data > 2 weeks old?

Yes, we could for sure; we already do something similar for the instances hierarchy.
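
For reference, that cleanup boils down to a periodic find over the whisper tree; a rough sketch (the 14-day threshold and exact invocation here are illustrative, the real thing is in the puppet change below):

# remove whisper files not updated in the last 14 days, then prune empty directories
find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -name '*.wsp' -mtime +14 -delete
find /var/lib/carbon/whisper/eventstreams/rdkafka -type d -empty -delete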

Change 343609 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] graphite: cleanup eventstreams rdkafka stale data

https://gerrit.wikimedia.org/r/343609

Change 343609 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: cleanup eventstreams rdkafka stale data

https://gerrit.wikimedia.org/r/343609

fgiunchedi claimed this task.

eventstreams is at 110G and is cleaned up periodically; good enough for now

Reopening: starting around 6/6, eventstreams has been creating a lot of metrics, consuming ~20% of graphite disk space in 8 days; it is now at around 400G

screenshot_aCnYLV.png (269×473 px, 22 KB)

We're already cleaning up metrics older than 15d, but that doesn't seem to be enough with a big influx of metrics like this

fgiunchedi triaged this task as High priority.
fgiunchedi removed a project: Patch-For-Review.

Change 361818 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] graphite: lower down eventstreams whisper files retention

https://gerrit.wikimedia.org/r/361818

Change 361818 merged by Elukey:
[operations/puppet@production] graphite: lower down eventstreams whisper files retention

https://gerrit.wikimedia.org/r/361818
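
Note that lowering the retention in storage-schemas.conf only affects whisper files created after the change; existing files keep their old archives unless resized, e.g. with something along these lines (the 1m:7d retention and the metric path are placeholders, the actual value is in the change above):

# check the archives of an existing whisper file (path is a placeholder)
whisper-info.py /var/lib/carbon/whisper/eventstreams/rdkafka/<some-metric>.wsp
# resize existing files to the new, shorter retention
find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -name '*.wsp' \
  -exec whisper-resize.py {} 1m:7d \;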

fgiunchedi claimed this task.

Resolving this for now, will reopen if necessary

Reopening: rdkafka metrics for eventstreams have been out of control for the past couple of days

graphite1001:~$ du -hcs /var/lib/carbon/whisper/eventstreams/rdkafka
590G	/var/lib/carbon/whisper/eventstreams/rdkafka

Ideally we would not push so many metrics in the first place; we should also get more aggressive with the cleanup, maybe 5 days (down from the current 10 days).

Change 374500 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::graphite::production: lower down eventstreams rdkafka retention

https://gerrit.wikimedia.org/r/374500

Change 374500 merged by Elukey:
[operations/puppet@production] role::graphite::production: lower down eventstreams rdkafka retention

https://gerrit.wikimedia.org/r/374500

The other step to take would be to limit the amount of data that we store for librdkafka, because with so many clients it is impossible to keep track of all the metrics (https://grafana.wikimedia.org/dashboard/db/eventstreams doesn't even load anymore).
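
To get a sense of how many per-client trees are involved (assuming one subdirectory per client under rdkafka/, which is an assumption about the hierarchy layout), something like:

# count top-level per-client metric trees and show the largest ones by disk usage
find /var/lib/carbon/whisper/eventstreams/rdkafka -mindepth 1 -maxdepth 1 -type d | wc -l
du -s /var/lib/carbon/whisper/eventstreams/rdkafka/* | sort -n | tail -20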

We're doing well space-wise now:

# du -hcs /var/lib/carbon/whisper/eventstreams/
4.8G	/var/lib/carbon/whisper/eventstreams/