- audit/update dashboards to use new metric names
- audit/update icinga checks
- retire compatibility recording rules
|Open||None||T220104 TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal)|
|Resolved||colewhite||T219825 Update dashboards to node-exporter 0.16+ metric names|
@fgiunchedi @colewhite Actually, we now have a private Grafana/Prometheus instance up and running in fundraising, so we can disconnect this and remove the fundraising boards. I'd like to export some FR metrics some day, but for the time being this will be easier in terms of PCI compliance and data safety.
I guess I can simply remove Prometheus from pay-lvs*. I'll also see if I can remove the boards in the UI, unless there's a better way.
Edit: mention T217355
I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.
@Marostegui just found something we forgot: Prometheus metrics used in Grafana's variable definitions (e.g. in a label_values() query).
For instance, on the Host Overview dashboard, Grafana fills in values for the "$server" variable based on all instances that export a node_boot_time metric. That meant the only host available there was labstore1003, which apparently is still running the old node_exporter. Changing the query to use node_boot_time_seconds fixed it.
I haven't looked for other cases, though.
Thanks for the report! I've gone through and updated dashboards where I've found the legacy metric names in dashboard variables. Please let me know if you find any additional instances.
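For anyone hunting for remaining instances, here is a minimal sketch of how an exported dashboard JSON could be scanned for legacy names in both variable queries and panel expressions. It is not the method used for the updates above. The function name `find_legacy_metrics` and the rename table are illustrative assumptions: only the two renames mentioned in this thread are listed, and the JSON layout assumed is top-level `templating.list` and `panels`, so older row-based dashboards would need extra handling.

```python
import re

# Hypothetical rename table: only the two metrics mentioned in this task.
# Extend it with the rest of the node-exporter 0.16+ renames as needed.
LEGACY_TO_NEW = {
    "node_boot_time": "node_boot_time_seconds",
    "node_memory_Cached": "node_memory_Cached_bytes",
}

# Match the legacy name only as a whole metric name, so that
# node_boot_time does not match inside node_boot_time_seconds.
PATTERNS = {
    old: re.compile(r"(?<![A-Za-z0-9_:])" + re.escape(old) + r"(?![A-Za-z0-9_:])")
    for old in LEGACY_TO_NEW
}


def find_legacy_metrics(dashboard):
    """Return (kind, name, legacy_metric) tuples found in a dashboard dict."""
    hits = []

    # Template variables, e.g. label_values(node_boot_time, instance).
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query", "")
        if isinstance(query, dict):  # some Grafana versions nest the query
            query = query.get("query", "")
        if not isinstance(query, str):
            continue
        for old, pattern in PATTERNS.items():
            if pattern.search(query):
                hits.append(("variable", var.get("name"), old))

    # Panel queries (assumes top-level "panels" with Prometheus "expr" targets).
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if not isinstance(expr, str):
                continue
            for old, pattern in PATTERNS.items():
                if pattern.search(expr):
                    hits.append(("panel", panel.get("title"), old))

    return hits
```

Each hit names the variable or panel to fix and the legacy metric found there; the replacement can be looked up in `LEGACY_TO_NEW`.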