https://tools.wmflabs.org/nagf/?project=integration is becoming increasingly confusing because deleted instances are retained. Combined with Diamond's odd echoing behaviour (the last data point keeps being repeated, even after the node no longer exists), this makes things hard to monitor.
I recall this being a problem last year, but as far as I remember it was fixed by @yuvipanda. It looks like it may have regressed.
- The Graphite index includes deleted instances. This is solved by querying wikitech to discover instance names.
- Wildcard queries (e.g. integration.* for the cluster overview) include metrics from deleted instances. For example, on https://tools.wmflabs.org/nagf/?project=deployment-prep "cluster Puppet agent" shows 2 puppet failures even though none of the listed instances has puppet failures. This is because Diamond keeps echoing puppet failures from instances that no longer exist.
- The Graphite index includes deleted metrics (disk_space.* shows mounts that no longer exist). For example, it reports a dangerously low /var on an instance that no longer has that mount.
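
One possible mitigation for the wildcard case, sketched in Python under a few assumptions: metric paths look like `<project>.<instance>.<metric>`, the set of live instance names has already been obtained (e.g. from wikitech), and series are dicts with a "target" key as in Graphite's JSON render output. The function names, instance names and the metric name used below are hypothetical and only for illustration; this is not how Nagf currently works.

```python
def live_target(project, live_instances, metric):
    """Build a Graphite target limited to known-live instances.

    Uses Graphite's {a,b,c} value-list syntax instead of a bare
    wildcard, so deleted instances are never matched.
    """
    names = ",".join(sorted(live_instances))
    return f"{project}.{{{names}}}.{metric}"


def drop_deleted(series, live_instances):
    """Filter already-fetched series down to live instances.

    Assumes the second path component of each target is the
    instance name (an assumption about the metric layout).
    """
    kept = []
    for s in series:
        parts = s["target"].split(".")
        if len(parts) > 1 and parts[1] in live_instances:
            kept.append(s)
    return kept
```

For instance, `live_target("integration", {"slave-a", "slave-b"}, "puppetagent.failed_events")` would query only those two instances rather than `integration.*`. This does not address the stale disk_space.* mounts, which are not tied to instance names.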