
Move labs 'instances' data to graphite labs
Closed, ResolvedPublic

Description

I noticed an increasing number of metrics created under the instances hierarchy; I'm assuming these come from the libvirtkvm.py diamond collector.

These are sent to production graphite, though now that labs graphite has SSDs the metrics could live there too.

Event Timeline

Ah, yeah, sorry about this. @yuvipanda enabled this a few days ago, as we have tracked down the primary symptom of T141673: Track labs instances hanging to io going stale (freezing), which is best caught at this layer. Instances that are from contintcloud (ci-*) we can discard stats for entirely, I believe. That should make this sane enough to keep around?

edit: since these are tracked by uuid atm we'll have to figure out the best way to filter

Indeed it might be hard to track via the UUIDs; alternatively we could purge instance directories not updated for some period of time, e.g. 4-5 weeks.

I would like to do better than having to look up UUIDs every time, honestly, but it looks like KVM does not support the domhostname command for virsh:

error: this function is not supported by the connection driver: virDomainGetHostname

which afaict means a big messy dance of calling out to nova and caching the UUID-to-public-name mapping in diamond on first lookup per service start. A 4 week staleness cleanup seems like a good compromise, with an eye towards filtering client-side next iteration, as long as it is something you are comfortable with.
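
A rough sketch of the situation for reference; <domain-uuid> is a placeholder, and the openstack invocation assumes the client and credentials are available on the host:

# the failing call: the QEMU/KVM driver does not implement virDomainGetHostname
virsh domhostname <domain-uuid>

# hypothetical fallback: resolve a UUID to its instance name via the OpenStack
# API, caching the result on first lookup as described above
openstack server show <domain-uuid> -f value -c name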

Yeah, I agree looking up by uuid isn't great; I'm fine with 4w staleness. It looks like ~6GB per day on average, so 30d is roughly 200G, which is fine. I won't be able to work on said cleanup script, though there's something similar already for labmon IIRC, so that could be adapted I think.
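
For reference, 6GB/day over 30 days is about 180G, consistent with the ~200G figure above. A quick way to check the on-disk size, assuming the whisper data root is /var/lib/carbon/whisper (an assumed path; adjust to the actual layout):

# measure the on-disk size of the instances hierarchy (assumed path)
du -sh /var/lib/carbon/whisper/instances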

I'm thinking of just running this in a cron:

find . -type f \! -mtime 672 -delete

672 is 28 days, 4 weeks. That sound ok to everyone?

@yuvipanda the above would also remove recent files; something like find . -type f -mtime +672 -delete instead, and delete the empty directories too afterwards.
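
A minimal sketch of the corrected cleanup, assuming the metrics live under /var/lib/carbon/whisper/instances (an assumed path). Note that find's -mtime counts 24-hour periods, so four weeks is +28; 672 reads like a count of hours, which -mtime does not take:

# delete whisper files not modified in the last 28 days (4 weeks)
find /var/lib/carbon/whisper/instances -type f -mtime +28 -delete
# then prune the directories left empty by the deletion
find /var/lib/carbon/whisper/instances -type d -empty -delete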

Change 323339 had a related patch set uploaded (by Filippo Giunchedi):
graphite: cleanup labs instances metrics

https://gerrit.wikimedia.org/r/323339

Change 323339 merged by Filippo Giunchedi:
graphite: cleanup labs instances metrics

https://gerrit.wikimedia.org/r/323339

Change 324820 had a related patch set uploaded (by Filippo Giunchedi):
graphite: switch labs instances cleanup to cron

https://gerrit.wikimedia.org/r/324820

Change 324820 merged by Filippo Giunchedi:
graphite: switch labs instances cleanup to cron

https://gerrit.wikimedia.org/r/324820
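
What the cron version might look like, as a hypothetical /etc/cron.d entry; the schedule, user, and path are all assumptions, not the contents of the actual patch:

# hypothetical /etc/cron.d/graphite-instances-cleanup
0 4 * * * _graphite find /var/lib/carbon/whisper/instances -type f -mtime +28 -delete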

fgiunchedi renamed this task from lots of graphite metrics under "instances" created to Move labs 'instances' data to graphite labs.Dec 1 2016, 11:04 PM
fgiunchedi removed a project: Patch-For-Review.
fgiunchedi updated the task description. (Show Details)

Change 334342 had a related patch set uploaded (by Filippo Giunchedi):
graphite: keep labs instance data for 30d

https://gerrit.wikimedia.org/r/334342

Change 334342 merged by Filippo Giunchedi:
graphite: keep labs instance data for 30d

https://gerrit.wikimedia.org/r/334342
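
In graphite terms a retention change like this lives in storage-schemas.conf; a hypothetical stanza, where the pattern and the 1-minute resolution are assumptions:

# hypothetical storage-schemas.conf stanza
[instances]
pattern = ^instances\.
retentions = 1m:30d

Note that storage-schemas.conf only applies to newly created whisper files; existing files keep their retention unless resized with whisper-resize.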

fgiunchedi raised the priority of this task from Medium to High.
fgiunchedi added subscribers: madhuvishy, Andrew.

I can't work on this right now, though instances is taking 165G on production graphite. It'd be nice to have it moved to labmon instead, @Andrew @madhuvishy @yuvipanda @chasemp

Poking @bd808 on this, since it's been an issue for us again in the past week.

Change 362444 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] libvirt: turn off instance stats collection

https://gerrit.wikimedia.org/r/362444

Change 362444 merged by Rush:
[operations/puppet@production] libvirt: turn off instance stats collection

https://gerrit.wikimedia.org/r/362444
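
Diamond collectors are typically switched off via their per-collector config; a hypothetical sketch for the libvirtkvm.py collector mentioned in the description (the file name and path assume the stock diamond layout, not necessarily what the puppet patch does):

# hypothetical /etc/diamond/collectors/LibvirtKVMCollector.conf
enabled = False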

@fgiunchedi The patch from @chasemp should stop new ones from being created once puppet does its thing across all of the VMs. I think you can and should just delete the existing metrics that fit the instance.$UUID.* pattern from the libvirt collector.
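
One hedged way to do that deletion, assuming graphite's usual mapping of metric dots to directories, so each UUID becomes a directory under the instances hierarchy; the whisper root is an assumed path:

# delete UUID-named metric directories under the instances hierarchy (assumed layout)
find /var/lib/carbon/whisper/instances -maxdepth 1 -type d \
  -regextype posix-extended \
  -regex '.*/[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
  -exec rm -r {} +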

+1

@fgiunchedi thanks for your patience on this, we just decided to turn it off for now and rethink

Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:48:05Z] <godog> move 'instances' graphite hierarchy out of the way, do not delete yet - T143405
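
The "move out of the way, do not delete yet" step presumably amounts to renaming the directory so it can still be restored; a hypothetical sketch with assumed paths:

# move the hierarchy aside instead of deleting it outright
mv /var/lib/carbon/whisper/instances /var/lib/carbon/whisper/instances.bak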

@chasemp @bd808 no problem! thanks for working on it :D

In terms of rethinking, I don't know exactly what the original idea behind libvirt was; I will shamelessly plug two things though:

fgiunchedi claimed this task.

I've deleted the instances directory for real from the graphite machines, resolving.