
graphite.wmflabs.org no longer purges data for deleted instances
Open, Low, Public

Description

https://tools.wmflabs.org/nagf/?project=integration is becoming more and more confusing because deleted instances are retained. Combined with Diamond's weird echoing behaviour (the last data point keeps being repeated, even if the node no longer exists), this makes things hard to monitor.

I recall this being a problem last year, but from what I remember this was fixed by @yuvipanda. Looks like it might have regressed.

  • The Graphite index includes deleted instances. This is solved by querying wikitech to discover instance names.
  • Wildcard query (e.g. integration.* for cluster overview) includes metrics from deleted instances. For example, https://tools.wmflabs.org/nagf/?project=deployment-prep "cluster Puppet agent" shows 2 puppet failures even though none of the listed instances have puppet failures. This is because Diamond keeps echoing puppet failures from instances that no longer exist.
  • The Graphite index includes deleted metrics (disk_space.* shows mounts that no longer exist). For example, it reports a dangerously low /var on an instance that no longer has that mount.
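
As a rough illustration of the wildcard problem above, a consumer such as nagf could drop series that belong to instances which no longer exist before aggregating them. This is only a sketch: it assumes metric paths are laid out as `<project>.<instance>.<metric...>` (suggested by the `integration.*` example, but not confirmed here), and the helper names and sample metric names are made up for the example.

```python
def instance_of(path):
    """Return the instance component of a metric path, or None.

    ASSUMPTION: paths look like '<project>.<instance>.<metric...>'.
    """
    parts = path.split(".", 2)
    if len(parts) < 2:
        return None
    return parts[1]


def drop_deleted(paths, live_instances):
    """Keep only the metric paths whose instance is in live_instances."""
    live = set(live_instances)
    return [p for p in paths if instance_of(p) in live]


# Hypothetical metric paths, for illustration only.
paths = [
    "integration.integration-puppetmaster.disk_space.var.free",
    "integration.some-deleted-node.puppetagent.failed_events",
    "integration",
]
kept = drop_deleted(paths, ["integration-puppetmaster"])
```

This only hides stale series from the consumer; it does not purge anything from Graphite itself.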

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description.
Krinkle added projects: Cloud-Services, Cloud-VPS.
Krinkle added subscribers: Krinkle, yuvipanda.
Krinkle set Security to None.

Yeah, I reverted my fix because there were bugs in that script and it ended up filling the disk with 'archived' data. See Id5f026abe1ac2de99d962ea9a8777598cb304e15

Also, you should use the wikitech API (wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=deployment-prep&niregion=eqiad&format=json) to get the list of current instances rather than rely on Graphite.
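
A minimal sketch of that lookup, for anyone picking this up. The query parameters are the ones in the URL above; the `https://` scheme, the function names, and above all the response shape (`{"query": {"novainstances": [{"name": ...}]}}`) are assumptions that should be checked against a real response.

```python
from urllib.parse import urlencode

# ASSUMPTION: served over https; the comment above gives the host and path only.
WIKITECH_API = "https://wikitech.wikimedia.org/w/api.php"


def novainstances_url(project, region="eqiad"):
    """Build the novainstances query URL for one project."""
    params = {
        "action": "query",
        "list": "novainstances",
        "niproject": project,
        "niregion": region,
        "format": "json",
    }
    return WIKITECH_API + "?" + urlencode(params)


def instance_names(payload):
    """Extract instance names from a decoded JSON response.

    ASSUMPTION: the response nests entries under query.novainstances
    and each entry has a 'name' key. Verify before relying on this.
    """
    entries = payload.get("query", {}).get("novainstances", [])
    return sorted(e["name"] for e in entries if "name" in e)
```

The payload would typically come from something like `json.load(urllib.request.urlopen(novainstances_url("deployment-prep")))`; the parsing is kept separate so it can be adjusted once the real response shape is known.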

This is not limited to entire instances. It also applies to individual data points. For example, this graph for disk space on integration-puppetmaster shows a steady ~700 MB free on /var when in reality it got unmounted (or rather, we recreated the instance without that mount).

graphite.wmflabs.png (250×400 px, 13 KB)
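
One possible way to spot series like this from the consumer side is to look for a long trailing run of identical values. The sketch below assumes datapoints arrive as `[value, timestamp]` pairs, as in Graphite's JSON render output, with `None` for gaps; the function names and the `min_repeats` threshold are made up. It is only a heuristic, since a genuinely idle metric is flat too.

```python
def trailing_repeat_length(datapoints):
    """Count how many of the latest non-null values are identical.

    datapoints: list of [value, timestamp] pairs; None values are skipped.
    """
    values = [value for value, _ts in datapoints if value is not None]
    if not values:
        return 0
    last = values[-1]
    count = 0
    for value in reversed(values):
        if value != last:
            break
        count += 1
    return count


def looks_echoed(datapoints, min_repeats=30):
    """Heuristic: flag a series whose tail is one value repeated.

    min_repeats is an arbitrary threshold chosen for illustration.
    """
    return trailing_repeat_length(datapoints) >= min_repeats


sample = [[1.0, 0], [2.0, 60], [700.0, 120], [700.0, 180], [None, 240], [700.0, 300]]
```

Cross-checking against the current instance and mount list is more reliable where that information is available.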

Also, you should use the wikitech API (wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=deployment-prep&niregion=eqiad&format=json) to get the list of current instances rather than rely on Graphite.

Cool. I'll do that. Filed https://github.com/wikimedia/nagf/issues/2.

The issue with data points remains, I suppose.

(Needs volunteer :)). Also, replacing txstatsd might help.