I noticed an increasing number of metrics being created under the 'instances' hierarchy; I'm assuming these come from the libvirtkvm.py diamond collector.
These are sent to production graphite, though now that labs graphite has SSDs the metrics could live there too.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | fgiunchedi | T1075 Audit groups of metrics in Graphite that allocate a lot of disk space
Resolved | | fgiunchedi | T143405 Move labs 'instances' data to graphite labs
Ah, yeah sorry about this. @yuvipanda enabled this a few days ago as we tracked down the primary symptom of T141673: Track labs instances hanging to io going stale (freezing), which is best caught at this layer. Instances that are from contintcloud (ci-*) we can discard stats for entirely, I believe. That should make this sane enough to keep around?
edit: since these are tracked by uuid atm we'll have to figure out the best way to filter
Indeed, it might be hard to track via the UUIDs; alternatively we could purge instance directories not updated for some period of time, e.g. 4-5 weeks.
Honestly, I would like to do better than having to look up UUIDs every time, but it looks like KVM does not support the domhostname argument for virsh:
error: this function is not supported by the connection driver: virDomainGetHostname
which afaict means a big messy dance of calling out to nova and caching the UUID-to-public-name mapping in diamond on first lookup per service start. A 4-week staleness cleanup seems like a good compromise, with an eye towards filtering client-side in the next iteration, as long as it is something you are comfortable with.
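For illustration only, a minimal sketch of that client-side caching idea; lookup_instance_name is a hypothetical placeholder for whatever nova lookup would actually be used, not part of the existing collector:

```python
# Sketch only: cache UUID -> instance name lookups so nova would be
# queried at most once per UUID per collector process lifetime.

_uuid_name_cache = {}


def lookup_instance_name(uuid):
    # Hypothetical placeholder: a real collector would call the nova
    # API (or read a pre-built mapping) to resolve the UUID here.
    raise NotImplementedError


def instance_metric_prefix(uuid):
    """Prefer the public instance name over the raw UUID when building
    the graphite metric prefix, falling back to the UUID on failure."""
    if uuid not in _uuid_name_cache:
        try:
            _uuid_name_cache[uuid] = lookup_instance_name(uuid)
        except Exception:
            # Don't cache failures so a later collection can retry.
            return 'instances.%s' % uuid
    return 'instances.%s' % _uuid_name_cache[uuid]
```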
Yeah, I agree looking up by UUID isn't great; I'm fine with 4w staleness. It looks like about ~6GB per day on average, so 30d is roughly 200G, which is fine. I won't be able to work on said cleanup script, though there's something similar already for labmon IIRC, so that could be adapted I think.
I'm thinking of just running this in a cron:
find . -type f \! -mtime 672 -delete
672 is 28 days, 4 weeks. Does that sound ok to everyone?
@yuvipanda the above would also remove recent files; it should be something like find . -type f -mtime +672 -delete, and delete empty directories too afterwards.
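For illustration, a rough sketch of that staleness cleanup in Python (the whisper path is a placeholder; the real cleanup is whatever lands in the puppet patch below):

```python
# Sketch only: delete metric files untouched for 28 days, then prune
# directories left empty. WHISPER_ROOT is a placeholder path.
import os
import time

WHISPER_ROOT = '/var/lib/carbon/whisper/instances'  # placeholder
CUTOFF = time.time() - 28 * 24 * 3600

for dirpath, dirnames, filenames in os.walk(WHISPER_ROOT, topdown=False):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) < CUTOFF:
            os.remove(path)
    # Remove the directory itself once everything inside it is gone
    # (topdown=False means children are handled before their parents).
    if dirpath != WHISPER_ROOT and not os.listdir(dirpath):
        os.rmdir(dirpath)
```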
Change 323339 had a related patch set uploaded (by Filippo Giunchedi):
graphite: cleanup labs instances metrics
Change 324820 had a related patch set uploaded (by Filippo Giunchedi):
graphite: switch labs instances cleanup to cron
Change 324820 merged by Filippo Giunchedi:
graphite: switch labs instances cleanup to cron
Change 334342 had a related patch set uploaded (by Filippo Giunchedi):
graphite: keep labs instance data for 30d
I can't work on this at the moment, though 'instances' is now taking 165G on production graphite. It'd be nice to have it moved to labmon instead, @Andrew @madhuvishy @yuvipanda @chasemp
Change 362444 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] libvirt: turn off instance stats collection
Change 362444 merged by Rush:
[operations/puppet@production] libvirt: turn off instance stats collection
@fgiunchedi The patch from @chasemp should stop new ones from being created once puppet does its thing across all of the VMs. I think you can and should just delete the existing metrics that fit the instance.$UUID.* pattern from the libvirt collector.
+1
@fgiunchedi thanks for your patience on this, we just decided to turn it off for now and rethink
Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:48:05Z] <godog> move 'instances' graphite hierarchy out of the way, do not delete yet - T143405
@chasemp @bd808 no problem! thanks for working on it :D
In terms of rethinking, I don't know exactly what the original idea behind libvirt was; I will shamelessly plug two things though: