- audit/update dashboards to use new metric names
- audit/update icinga checks
- retire compatibility recording rules
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | fgiunchedi | T220104 TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal)
Resolved | | colewhite | T219825 Update dashboards to node-exporter 0.16+ metric names
Event Timeline
Fundraising dashboards cannot be updated at this time. It looks like the nodes may need upgrading or forwards-compatibility rules.
Indeed, cc'ing @cwdent re: upgrading node-exporter to 0.16+ in FR and compatibility rules.
@fgiunchedi @colewhite Actually, we now have a private grafana/prometheus instance up and running in fundraising, so we can disconnect this and remove the fundraising boards. I'd like to export some FR metrics some day, but for the time being this will be easier in terms of PCI compliance and data safety.
I guess I can simply remove prometheus from pay-lvs*. I'll also see if I can remove the boards in the UI, unless there's a better way.
Edit: mention T217355
Change 501399 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard to use new metric names
Change 501519 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: remove frack datasources
Change 501399 merged by Cwhite:
[operations/puppet@production] grafana: update swift dashboard to use new metric names
Change 501519 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: remove frack datasources
Change 502158 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: add frack to deleteDatasources
Change 502158 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: add frack to deleteDatasources
Mentioned in SAL (#wikimedia-operations) [2019-04-08T08:17:26Z] <godog> delete fundraising folder from public grafana - T219825
Change 504993 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback
Change 504993 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback
I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.
@CDanis good catch!
The code I used to detect and update metric names did not recurse into panels nested inside other panels. I've updated the cluster-overview dashboard. Mind having a quick look?
I will make another pass with the updated code to be sure I didn't miss anything.
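For illustration, a recursive pass over a dashboard's panels could look roughly like the sketch below. This is not the script actually used here; `RENAMES`, `rename_metrics` and `update_panels` are hypothetical names, the mapping lists only two of the 0.16+ renames mentioned in this task, and it assumes each panel keeps its queries under `targets[].expr` and its children under `panels`.

```python
import re

# Hypothetical subset of the node-exporter 0.16+ renames; the full list is longer.
RENAMES = {
    "node_memory_Cached": "node_memory_Cached_bytes",
    "node_boot_time": "node_boot_time_seconds",
}

# Match whole metric names only, so already-updated names are left alone.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_:])("
    + "|".join(re.escape(name) for name in RENAMES)
    + r")(?![A-Za-z0-9_:])"
)


def rename_metrics(expr):
    """Return expr with every legacy metric name replaced by its new name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], expr)


def update_panels(panels):
    """Rewrite target expressions in place, descending into nested panels.

    Returns the number of expressions that were changed.
    """
    changed = 0
    for panel in panels:
        for target in panel.get("targets", []):
            old = target.get("expr")
            if isinstance(old, str):
                new = rename_metrics(old)
                if new != old:
                    target["expr"] = new
                    changed += 1
        # The recursive step: a panel may itself contain panels.
        changed += update_panels(panel.get("panels", []))
    return changed
```

A dashboard loaded with `json.load` would be processed with `update_panels(dashboard.get("panels", []))`. Dashboards that use the older `rows` layout would need the same call applied to each row's `panels` list.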
Change 510977 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] update node exporter metrics to 0.16+ names
Change 511734 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: remove prometheus backwards-compatibility rules
Change 510977 merged by Cwhite:
[operations/puppet@production] update node exporter metrics to 0.16+ names
Change 511734 merged by Cwhite:
[operations/puppet@production] role: remove prometheus backwards-compatibility rules
Change 513609 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: update aggregate metrics to new metric names
Change 513609 merged by Cwhite:
[operations/puppet@production] prometheus: update aggregate metrics to new metric names
@Marostegui just found something we forgot: the use of Prometheus metrics in Grafana's variable definitions (e.g. in a label_values() query).
For instance, on the Host Overview dashboard, Grafana fills in values for the "$server" variable based on all instances that export a node_boot_time metric -- which meant that all that was available there was labstore1003; apparently that's still running the old node_exporter. Changing the query to use node_boot_time_seconds fixed it.
I haven't looked for other cases, though.
We found another instance in the cluster overview dashboard where dashboard variables were using node_boot_time. I've fixed it to use node_boot_time_seconds instead.
Thanks for the report! I've gone through and updated dashboards where I've found the legacy metric names in dashboard variables. Please let me know if you find any additional instances.
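To help spot remaining cases, a check along these lines could be run against exported dashboard JSON. It is only a sketch: `LEGACY` and `legacy_variable_queries` are hypothetical names, the list holds just the two legacy metrics mentioned in this task, and it assumes variables live under `templating.list` with the query stored either as a plain string or as an object with a `query` field.

```python
import re

# Hypothetical list of legacy names to look for; extend as needed.
LEGACY = ("node_boot_time", "node_memory_Cached")


def legacy_variable_queries(dashboard):
    """Return (variable name, legacy metric) pairs found in variable queries."""
    hits = []
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query")
        # Some Grafana versions store the query as an object, not a string.
        if isinstance(query, dict):
            query = query.get("query")
        if not isinstance(query, str):
            continue
        for name in LEGACY:
            # Whole-name match, so node_boot_time_seconds is not flagged.
            pattern = r"(?<![A-Za-z0-9_:])" + re.escape(name) + r"(?![A-Za-z0-9_:])"
            if re.search(pattern, query):
                hits.append((var.get("name"), name))
    return hits
```

An empty result means no variable query in that dashboard references the listed legacy names; it says nothing about panel queries, which need a separate pass.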