Regularly purge old ores graphite metrics
Closed, ResolvedPublic

Description

I noticed ores is putting a lot of metrics into graphite; it looks like almost half of the metrics currently stored haven't been touched in the last 30d:

graphite1001:/var/lib/carbon/whisper/ores$ find . -type f | wc -l
462387
graphite1001:/var/lib/carbon/whisper/ores$ find . -type f -mtime +30 | wc -l
197721

I'm assuming these are old or renamed metrics that can be purged regularly?
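
For readers without shell access to the graphite host, the two counts above can be reproduced with a short Python sketch. It assumes that find's -mtime +30 test matches files whose age in whole 24-hour periods (fraction discarded) is greater than 30; the function name and parameters are hypothetical and not part of any Graphite tooling.

```python
import os
import time

SECONDS_PER_DAY = 86400


def count_whisper_files(root, stale_days=30, now=None):
    """Return (total, stale) counts of regular files under root.

    A file counts as stale when its age in whole days is greater than
    stale_days, mirroring `find -type f -mtime +N`.
    """
    now = time.time() if now is None else now
    total = 0
    stale = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # `find -type f` does not match symlinks, so skip them here too.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            total += 1
            age_days = int((now - os.path.getmtime(path)) // SECONDS_PER_DAY)
            if age_days > stale_days:
                stale += 1
    return total, stale
```

Calling count_whisper_files("/var/lib/carbon/whisper/ores") would return the same pair of numbers as the two find commands, under the assumption stated above.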

Event Timeline

Restricted Application added a subscriber: Aklapper.

@fgiunchedi: Hey, how can we purge scores? I couldn't find anything on wikitech.

@Ladsgroup this would be all old graphite metrics for ores, not just scores. Anyway, what we do is set up a cronjob on the graphite machines that does the right thing via graphite::whisper_cleanup and deletes e.g. all whisper files that haven't been touched in the last 30d.

Halfak raised the priority of this task from Medium to High.Jul 20 2017, 3:07 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

I think we'd like to keep some high level metrics forever, others for just 90 days and many for just 30 days. Is that possible?

Yes! As @ori mentioned we can purge metrics that haven't been touched based on filename patterns

An example of high-level metrics we want to keep forever: request rate over time.

Copying an inventory of metrics here might help.

@awight I shared the list with Amir on T174542, happy to share it with you too (4MB gz file, 500k lines). It is in your home on tin (files generated on Aug 30th).

Everything matched by find . -type f -mtime +30 can be purged safely.

I'll build a short list of metrics that should be kept indefinitely. All others can be rotated.

OK, it looks like we want to keep all of the base metrics for the following names indefinitely:

ores.*.precache_request
ores.*.scores_request
ores.*.revision_scored
ores.*.datasources_extracted
ores.*.score_processed
ores.*.score_processor_overloaded
ores.*.precache_cache_hit
ores.*.score_cache_hit
ores.*.precache_cache_miss
ores.*.score_cache_miss
ores.*.score_errored
ores.*.score_timed_out
ores.*.precache_score
ores.*.uwsgi.core
ores.*.uwsgi.rss_size

Many of these metrics have sub-metrics that we don't need to keep historically, e.g. ores.*.precache_request.enwiki.damaging. All of the sub-directories for each of the metrics can be trimmed so that only the last 30 days of data is available. Does that make sense?
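
The distinction between a base metric and its sub-metrics can be sketched as a name check. This is only an illustration of the naming scheme in the list above, not something taken from the ORES or Graphite code: it assumes that * stands for exactly one dot-separated path component, the list is abbreviated, and the host name ores1001 in the usage note is made up.

```python
import re

# Abbreviated copy of the keep list above; extend with the remaining names.
KEEP_BASE_NAMES = [
    "precache_request",
    "scores_request",
    "revision_scored",
    "score_processed",
    "uwsgi.core",
    "uwsgi.rss_size",
]

# Assumption: "*" matches exactly one path component (no dots).
KEEP_PATTERNS = [
    re.compile(r"^ores\.[^.]+\." + re.escape(name) + r"$")
    for name in KEEP_BASE_NAMES
]


def is_base_metric(metric):
    """Return True if metric is one of the base metrics to keep."""
    return any(pattern.match(metric) for pattern in KEEP_PATTERNS)
```

With these patterns, is_base_metric("ores.ores1001.precache_request") is true, while is_base_metric("ores.ores1001.precache_request.enwiki.damaging") is false because of the extra path components.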

@fgiunchedi, can you help me figure out what our next step should be here?

For sure! Thanks for looking into what metrics can be cleaned.

In this task's context "cleaned" means I'm looking to remove metric files that haven't been updated in a certain amount of time (IOW are not receiving datapoints anymore). Graphite creates one file per metric, with a fixed retention time, whenever datapoints for a new metric are pushed. Metrics that are currently receiving datapoints will not be touched by the cleanup. That is to say, the cleanup I'm talking about doesn't have to do with how far back in the past we keep data for certain metrics, but only with when a given metric last received datapoints.

To give a concrete example: if a metric is renamed on the ores side from foo to bar, then graphite will create a bar.wsp file receiving the new data but never remove foo.wsp in the process, and foo.wsp will stop being updated.

For next steps I'll send a review to implement the "metrics not having received datapoints in the last 30d" cleanup.

Re: retention of metrics, the default (in graphite's config) is 1m:7d,5m:14d,15m:30d,1h:1y,1d:5y (i.e. the longest retention is 1d aggregation for five years), so there is no indefinite keeping of metrics.
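
To make the retention string concrete, here is a small sketch that turns it into (seconds per point, number of points) pairs, one per archive. The parser is a simplified illustration rather than Graphite's own code, and it assumes the usual unit table in which one year is 365 days.

```python
# Assumed unit table: seconds per unit, with one year taken as 365 days.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 604800,
    "y": 31536000,
}


def to_seconds(spec):
    """Convert a value such as '15m' or '5y' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(definition):
    """Parse 'precision:duration,...' into (step_seconds, points) pairs."""
    archives = []
    for part in definition.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives
```

For the default above, the last archive comes out as one point per 86400 seconds with 1825 points, i.e. daily aggregates covering five years.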

1 day aggregation for 5 years is practically indefinite to me. That's OK.

I'm a fan of removing anything that hasn't been written to in the last 30 days. That's all definitely safe to clean up.

Change 401917 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: cleanup stale ORES metrics

https://gerrit.wikimedia.org/r/401917

@fgiunchedi, what part of this config specifies that only metrics that haven't been updated in 30 days will be purged? It looks to me like keep_days => 30 means that all metrics will only be kept for a 30-day period. Am I misreading this?

@Halfak It's a scary parameter name, +1 that it should be called "delete_files_not_modified_since" or something. Looking at the code in modules/graphite/manifests/whisper_cleanup.pp, I think it's the logic we want, though: find ${directory} -type f -mtime +${keep_days}
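
The point about mtime can be sketched in Python. This is a rough model of the find expression quoted above, not the Puppet module itself: the deletion step, the dry_run flag and the function name are assumptions added for illustration. Only each file's modification time is inspected; the datapoints stored inside the file are never read.

```python
import os
import time

SECONDS_PER_DAY = 86400


def whisper_cleanup(directory, keep_days, dry_run=True, now=None):
    """Return files not modified in more than keep_days whole days.

    Mirrors `find ${directory} -type f -mtime +${keep_days}`. When dry_run
    is False the matched files are also removed (an assumed deletion step).
    """
    now = time.time() if now is None else now
    matched = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Age is based purely on mtime; file contents are not opened.
            age_days = int((now - os.path.getmtime(path)) // SECONDS_PER_DAY)
            if age_days > keep_days:
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return sorted(matched)
```

A metric that is still receiving datapoints has a recent mtime and is never matched, however much history its file holds; a renamed or retired metric stops being written and is matched once its age passes keep_days.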

Naming is indeed hard, but yeah, as @awight said, that'll check only the mtime, not the contents of the files. IOW, it's the type of cleanup we were talking about.

Change 401917 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: cleanup stale ORES metrics

https://gerrit.wikimedia.org/r/401917

Mentioned in SAL (#wikimedia-operations) [2018-01-11T09:32:35Z] <godog> cleanup ores metrics older than 30d - T169969

All done! Agreed the parameter name isn't the best, and naming is hard :(

This task is done from my POV so tentatively resolving.