Regularly purge old ores graphite metrics
Closed, ResolvedPublic

Assigned To

Authored By

	fgiunchedi
	Jul 7 2017, 9:46 AM

Description

I noticed ores is putting a lot of metrics into graphite; it looks like almost half of the metrics currently stored haven't been touched in the last 30d

graphite1001:/var/lib/carbon/whisper/ores$ find . -type f | wc -l
462387
graphite1001:/var/lib/carbon/whisper/ores$ find . -type f -mtime +30 | wc -l
197721

I'm assuming these are old or renamed metrics that can be purged regularly?
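
For readers without shell access to the graphite hosts, the two find invocations above can be approximated with a short script. This is only a sketch: count_stale is a hypothetical helper, and it compares against an exact cutoff of days * 24 hours, whereas find -mtime +30 rounds ages down to whole days first. With the numbers reported above, 197721 of 462387 files is roughly 43%.

```python
import os
import time


def count_stale(root, days=30, now=None):
    """Return (total_files, stale_files) for the tree under root.

    A file counts as stale when its modification time is older than
    `days` * 24 hours. Hypothetical helper, not part of graphite.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    total = 0
    stale = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            if os.path.getmtime(os.path.join(dirpath, name)) < cutoff:
                stale += 1
    return total, stale
```

Calling count_stale("/var/lib/carbon/whisper/ores") would give the same kind of pair as the two wc -l outputs.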

Details

	Subject	Repo	Branch	Lines +/-
	graphite: cleanup stale ORES metrics	operations/puppet	production	+6 -0


Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T1075 Audit groups of metrics in Graphite that allocate a lot of disk space
Resolved	Halfak	T169969 Regularly purge old ores graphite metrics
Resolved	None	T174542 Temporarily access request to graphite nodes

Event Timeline

fgiunchedi created this task. Jul 7 2017, 9:46 AM

Restricted Application added a project: Machine-Learning-Team. Jul 7 2017, 9:46 AM

Restricted Application added a subscriber: Aklapper.

cc @Halfak @Ladsgroup

@fgiunchedi: Hey, how can we purge scores? I couldn't find anything on wikitech.

@Ladsgroup this would be all old graphite metrics for ores, not just scores. Anyway, what we do is set up a cronjob on the graphite machines that does the right thing via graphite::whisper_cleanup and deletes e.g. all whisper files that haven't been touched in the last 30d.

I think we'd like to keep some high level metrics forever, others for just 90 days and many for just 30 days. Is that possible?

Take a look at https://github.com/wikimedia/puppet/blob/c2543d7f80fefbe39901897882c60d91d98c3950/modules/role/manifests/graphite/base.pp

In T169969#3456586, @Halfak wrote:

I think we'd like to keep some high level metrics forever, others for just 90 days and many for just 30 days. Is that possible?

Yes! As @ori mentioned we can purge metrics that haven't been touched based on filename patterns

great! We'll look into it.

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Jul 26 2017, 9:15 AM

Ladsgroup created subtask T174542: Temporarily access request to graphite nodes. Aug 30 2017, 11:11 AM

Ladsgroup closed subtask T174542: Temporarily access request to graphite nodes as Resolved. Aug 30 2017, 1:42 PM

Ladsgroup edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. Aug 30 2017, 3:18 PM

An example of high-level metrics we want to keep forever: request rate over time.

Copying an inventory of metrics here might help.

@awight I shared the list with Amir on T174542, happy to share it with you too (4MB gz file, 500k lines). It is in your home on tin (files generated on Aug 30th).

Halfak claimed this task. Oct 16 2017, 6:59 PM

The files matched by find . -type f -mtime +30 can be purged safely.

I'll build a short list of metrics that should be kept indefinitely. All others will be able to be rotated.

OK looks like we want to keep all of the base metrics for the following names indefinitely:

ores.*.precache_request
ores.*.scores_request
ores.*.revision_scored
ores.*.datasources_extracted
ores.*.score_processed
ores.*.score_processor_overloaded
ores.*.precache_cache_hit
ores.*.score_cache_hit
ores.*.precache_cache_miss
ores.*.score_cache_miss
ores.*.score_errored
ores.*.score_timed_out
ores.*.precache_score
ores.*.uwsgi.core
ores.*.uwsgi.rss_size

Many of these metrics have sub-metrics that we don't need to keep historically, e.g. ores.*.precache_request.enwiki.damaging. All of the sub-directories for each of the metrics can be trimmed so that only the last 30 days of data is available. Does that make sense?
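
To make the base metric versus sub-metric distinction concrete, here is a rough sketch of how a metric name could be checked against the list above. It assumes that * stands for exactly one dot-separated component and that only an exact match on the whole name counts as a base metric; is_base_metric and the example host component are made up for illustration and are not part of the puppet change.

```python
import re

# Patterns copied from the keep list above.
KEEP_PATTERNS = [
    "ores.*.precache_request",
    "ores.*.scores_request",
    "ores.*.revision_scored",
    "ores.*.datasources_extracted",
    "ores.*.score_processed",
    "ores.*.score_processor_overloaded",
    "ores.*.precache_cache_hit",
    "ores.*.score_cache_hit",
    "ores.*.precache_cache_miss",
    "ores.*.score_cache_miss",
    "ores.*.score_errored",
    "ores.*.score_timed_out",
    "ores.*.precache_score",
    "ores.*.uwsgi.core",
    "ores.*.uwsgi.rss_size",
]


def _to_regex(pattern):
    # Assumption: '*' matches exactly one component and never crosses a dot.
    parts = ["[^.]+" if part == "*" else re.escape(part)
             for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


KEEP_REGEXES = [_to_regex(p) for p in KEEP_PATTERNS]


def is_base_metric(name):
    """True if name is one of the base metrics to keep long term."""
    return any(rx.fullmatch(name) for rx in KEEP_REGEXES)
```

Under these assumptions, is_base_metric("ores.somehost.precache_request") is true, while is_base_metric("ores.somehost.precache_request.enwiki.damaging") is false, so the latter would fall into the "last 30 days only" group.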

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board. Dec 21 2017, 10:28 PM

@fgiunchedi, can you help me figure out what our next step should be here?

In T169969#3867919, @Halfak wrote:

@fgiunchedi, can you help me figure out what our next step should be here?

For sure! Thanks for looking into what metrics can be cleaned.

In this task's context "cleaned" means I'm looking to remove metric files that haven't been updated in a certain amount of time (IOW are not receiving datapoints anymore). Graphite creates one file per metric, with a fixed retention time, whenever datapoints for a new metric are pushed. Metrics that are currently receiving datapoints will not be touched by the cleanup. That is to say, the cleanup I'm talking about doesn't have to do with how far back in the past we keep data for certain metrics, but only with when a given metric last received datapoints.

To give a concrete example, if a metric is renamed on the ores side (foo -> bar), then graphite will create a bar.wsp file receiving the new data but never remove foo.wsp in the process, and foo.wsp will stop being updated.

For next steps I'll send a review to implement the "metrics not having received datapoints in the last 30d" cleanup.

Re: retention of metrics: by default (in graphite's config) it is 1m:7d,5m:14d,15m:30d,1h:1y,1d:5y (i.e. the longest retention is 1d aggregation for five years), so there is no indefinite keeping of metrics.
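
As a worked example of what that retention string implies, the sketch below expands it into (seconds per point, number of points) pairs and estimates the size of a single whisper file. The helper names are hypothetical. The size formula assumes the usual whisper on-disk layout (16-byte header, 12 bytes of metadata per archive, 12 bytes per datapoint) and a 365-day year, which is my understanding of the format rather than something stated in this task.

```python
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,  # assumption: a year is counted as 365 days
}


def parse_duration(text):
    """Convert a string such as '15m' or '5y' to seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retention(spec):
    """Return a list of (seconds_per_point, points) tuples, one per archive."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.split(":")
        step = parse_duration(precision)
        archives.append((step, parse_duration(duration) // step))
    return archives


def whisper_file_size(archives):
    """Estimated file size in bytes under the layout assumed above."""
    points = sum(count for _step, count in archives)
    return 16 + 12 * len(archives) + 12 * points


archives = parse_retention("1m:7d,5m:14d,15m:30d,1h:1y,1d:5y")
size_bytes = whisper_file_size(archives)
```

With these assumptions the five archives hold 27577 points in total and each file comes to 331000 bytes, allocated up front regardless of whether the metric still receives data. That is why files for renamed or abandoned metrics keep costing disk space until they are removed.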

1 day aggregation for 5 years is practically indefinite to me. That's OK.

I'm a fan of removing anything that hasn't been written to in the last 30 days. That's all definitely safe to clean up.

Change 401917 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: cleanup stale ORES metrics

https://gerrit.wikimedia.org/r/401917

gerritbot added a project: Patch-For-Review. Jan 4 2018, 9:17 AM

fgiunchedi moved this task from Radar to Doing on the User-fgiunchedi board. Jan 4 2018, 9:24 AM

@Halfak see https://gerrit.wikimedia.org/r/401917

@fgiunchedi, what part of this config specifies that only metrics that haven't been updated in 30 days will be purged? It looks to me like keep_days => 30 means that all metrics will only be kept for a 30-day period. Am I misreading this?

@Halfak It's a scary parameter name, +1 that it should be called "delete_files_not_modified_since" or something. Looking at the code in modules/graphite/manifests/whisper_cleanup.pp, I think it's the logic we want, though: find ${directory} -type f -mtime +${keep_days}
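
To illustrate why keep_days selects by last write rather than by how much history a file holds, here is a small model of the -mtime +N test applied to the foo -> bar rename example from earlier in this task. It assumes GNU find's documented rounding, where the age is truncated to whole 24-hour periods before the comparison; matches_mtime_plus is a hypothetical name and the timestamps are invented for the demo.

```python
DAY = 86400


def matches_mtime_plus(mtime, keep_days, now):
    """Model of `find -mtime +keep_days`: whole days of age must exceed keep_days."""
    age_days = int((now - mtime) // DAY)
    return age_days > keep_days


now = 1_515_000_000  # arbitrary reference time for the demo

# foo.wsp stopped receiving datapoints after the rename; bar.wsp is live.
last_write = {
    "foo.wsp": now - 45 * DAY,
    "bar.wsp": now - 60,
}

to_delete = sorted(
    name for name, mtime in last_write.items()
    if matches_mtime_plus(mtime, 30, now)
)
```

Only foo.wsp ends up in to_delete. bar.wsp is skipped because it was written a minute ago, no matter how many years of datapoints it contains. One side effect of the rounding is that a file last written 30.5 days ago is not matched yet; it needs to reach 31 full days.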

In T169969#3886670, @Halfak wrote:

@fgiunchedi, what part of this config specifies that only metrics that haven't been updated in 30 days will be purged? It looks to me like keep_days => 30 means that all metrics will only be kept for a 30-day period. Am I misreading this?

Naming is indeed hard, but yeah, as @awight said, that'll check only the mtime, not the contents of files, IOW the type of cleanup we were talking about.

OK great. I'll go +1 :)

Change 401917 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: cleanup stale ORES metrics

https://gerrit.wikimedia.org/r/401917

Mentioned in SAL (#wikimedia-operations) [2018-01-11T09:32:35Z] <godog> cleanup ores metrics older than 30d - T169969

All done! Agreed the parameter isn't the best, and naming is hard :(

This task is done from my POV so tentatively resolving.

Ladsgroup moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. Jan 12 2018, 12:46 PM
