Clean up labs graphite datapoints
Closed, Resolved, Public

Description

Same as T104091, different things to delete this time.

These hosts no longer exist:
deployment-cache-bits01
deployment-cache-mobile03
deployment-cache-text02
deployment-cache-text03
deployment-cache-upload02
deployment-test
deployment-videoscaler01

These individual mounts no longer exist:
deployment-elastic05.diskspace._var_log.byte_
deployment-elastic06.diskspace._var_log.byte_
deployment-elastic07.diskspace._var_log.byte_
deployment-elastic08.diskspace._var.byte_perc
deployment-mediawiki02.diskspace._srv.byte_pe
deployment-mediawiki03.diskspace._var.byte_pe
deployment-restbase01.diskspace._var_log.byte
deployment-restbase02.diskspace._var_log.byte

Oh, and deployment-salt's data for "Long lived cherry-picks on puppetmaster" no longer makes sense, as it is no longer a puppet master. That functionality has been moved to deployment-puppetmaster.

Event Timeline

Krenair raised the priority of this task from to Needs Triage.
Krenair updated the task description.
Krenair added subscribers: Krenair, fgiunchedi.
hashar set Security to None.
hashar moved this task from To Triage to Externally Blocked on the Beta-Cluster-Infrastructure board.
hashar subscribed.

The broader issue seems to be that labs graphite needs a garbage collector. A low-hanging fruit would be to delete all metrics for instances that no longer exist.
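As a rough illustration of that low-hanging fruit, here is a minimal sketch that lists instances which still have metric directories but are no longer in the project inventory. It assumes whisper files are stored as `<root>/<project>/<instance>/...` and that a project -> instances map is available from somewhere; `find_stale_instances` and both of its parameters are hypothetical names, not part of any existing script. It only reports candidates and deletes nothing.

```python
from pathlib import Path


def find_stale_instances(whisper_root, live_instances):
    """Return sorted (project, instance) pairs that have metrics but are not live.

    whisper_root: directory ASSUMED to be laid out as <project>/<instance>/...
    live_instances: dict mapping project name -> set of existing instance names
                    (hypothetical input; the thread does not say where it comes from).
    """
    stale = []
    root = Path(whisper_root)
    for project_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        known = live_instances.get(project_dir.name)
        if known is None:
            # No inventory for this project: skip it rather than guess.
            continue
        for instance_dir in sorted(p for p in project_dir.iterdir() if p.is_dir()):
            if instance_dir.name not in known:
                stale.append((project_dir.name, instance_dir.name))
    return stale
```

Skipping projects with no inventory is a deliberate choice in this sketch: a missing or failed lookup should not make every instance in a project look stale.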

Agreed. What would be the easiest way to get a map of project -> list of instances? @yuvipanda @Andrew ?

See eb3e3dbd81d263791d2ba1909f64f8a84531c65e for the revert of my original garbage collector script. It failed because txstatsd would keep re-creating the metrics even after diamond stopped sending them, so it needed a restart of txstatsd every time. Since we've fixed that now, we can perhaps bring it back. I'd like someone else to own it though :D

I'll take this, setting priority to low.

Change 248317 had a related patch set uploaded (by Filippo Giunchedi):
graphite: enable labs instances archiver

https://gerrit.wikimedia.org/r/248317

Change 248317 merged by Filippo Giunchedi:
graphite: enable labs instances archiver

https://gerrit.wikimedia.org/r/248317

@Krenair I've run archive-instances on labmon1001, so the deployment-prep hosts are gone. I'm not sure about the mount points though, since we can't really detect from graphite what's there and what isn't.

How do we detect that those exist in the first place?

diamond discovers the mount points locally and starts pushing metrics for them
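Since the mounts are only known on the host itself, any staleness check would have to compare the metric names against the host's current mounts. The sketch below assumes the naming visible in this task, where `/var/log` appears as `_var_log` and `/mnt` as `_mnt` (slashes replaced by underscores). All function names are hypothetical, and how the root filesystem `/` is named is not shown in this thread, so it is deliberately left unhandled.

```python
def mount_to_metric_component(mount_point):
    """Map a mount point such as '/var/log' to '_var_log'.

    ASSUMPTION: slashes become underscores, as seen in the metric names
    quoted in this task. Naming of '/' itself is not covered.
    """
    if mount_point == "/":
        raise ValueError("root filesystem naming is not covered by this sketch")
    return mount_point.replace("/", "_")


def metric_path(project, host, mount_point, metric="byte_percentfree"):
    """Build a path like <project>.<host>.diskspace.<mount>.<metric>."""
    component = mount_to_metric_component(mount_point)
    return ".".join([project, host, "diskspace", component, metric])


def stale_mount_components(metric_components, current_mounts):
    """Return metric mount components with no matching current mount."""
    expected = {
        mount_to_metric_component(m) for m in current_mounts if m != "/"
    }
    return sorted(c for c in metric_components if c not in expected)
```

For example, `metric_path("deployment-prep", "deployment-tin", "/mnt")` yields the metric name requested for archiving later in this task.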

Krenair renamed this task from Delete more specific deployment-prep graphite datapoints to Clean up labs graphite datapoints. Apr 14 2016, 10:14 PM
Krenair added a project: SRE.

Change 283779 had a related patch set uploaded (by Alex Monk):
shinken: Allow undefined data in graphite for disk space checks

https://gerrit.wikimedia.org/r/283779

Change 283779 merged by Filippo Giunchedi:
shinken: Allow undefined data in graphite for disk space checks

https://gerrit.wikimedia.org/r/283779

Can someone archive deployment-prep.deployment-tin.diskspace._mnt.byte_percentfree? The mount no longer exists, but there's still a warning in shinken about no valid datapoints being found.

Supposedly that is fixed now that Shinken allows undefined data points: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/283779/

Unlike in 2015, old metrics are now garbage collected.