Update dashboards to node-exporter 0.16+ metric names
Closed, Resolved · Public

Description

  • audit/update dashboards to use new metric names
  • audit/update icinga checks
  • retire compatibility recording rules

Event Timeline

colewhite created this task. Apr 1 2019, 6:41 PM
Restricted Application added a subscriber: Aklapper. Apr 1 2019, 6:41 PM
colewhite triaged this task as Low priority. Apr 1 2019, 6:41 PM
colewhite added a comment (edited). Apr 3 2019, 10:51 PM

Fundraising dashboards cannot be updated at this time. It looks like the nodes may need upgrading or forwards-compatibility rules.

Indeed, cc'ing @cwdent re: upgrading node-exporter to 0.16+ in FR and compatibility rules.

cwdent added a comment (edited). Apr 4 2019, 5:50 PM

@fgiunchedi @colewhite actually we have a private grafana/prometheus instance up and running in fundraising now, so we can disconnect this and remove the fundraising boards. I'd like to export some FR metrics some day, but for the time being this will be easier in terms of PCI compliance and data safety.

I guess I can simply remove prometheus from pay-lvs*. I'll also see if I can remove the boards in the UI, unless there's a better way.

Edit: mention T217355

Change 501399 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard to use new metric names

https://gerrit.wikimedia.org/r/501399

Change 501519 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: remove frack datasources

https://gerrit.wikimedia.org/r/501519

Sounds good to me! We'll remove the grafana datasources and associated dashboards.

Change 501399 merged by Cwhite:
[operations/puppet@production] grafana: update swift dashboard to use new metric names

https://gerrit.wikimedia.org/r/501399

Change 501519 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: remove frack datasources

https://gerrit.wikimedia.org/r/501519

Change 502158 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: add frack to deleteDatasources

https://gerrit.wikimedia.org/r/502158

Change 502158 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: add frack to deleteDatasources

https://gerrit.wikimedia.org/r/502158

Mentioned in SAL (#wikimedia-operations) [2019-04-08T08:17:26Z] <godog> delete fundraising folder from public grafana - T219825

colewhite moved this task from Backlog to In progress on the observability board. Apr 15 2019, 3:14 PM

Change 504993 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback

https://gerrit.wikimedia.org/r/504993

Change 504993 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback

https://gerrit.wikimedia.org/r/504993

colewhite updated the task description. Apr 25 2019, 5:24 PM

I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.

@CDanis good catch!

The code I used to detect and update the old names did not recurse into panels nested inside other panels. I've updated the cluster-overview dashboard. Mind having a quick look?

I will make another pass with the updated code to be sure I didn't miss anything.
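
For illustration, a recursive scan of this kind could look like the sketch below. It is not the script used for this task: the rename table lists only the two names that come up in this discussion, and the dashboard layout (`panels`, `rows`, `targets`, `expr`) is an assumption based on the usual Grafana dashboard JSON export.

```python
import re

# Assumed, partial rename table: only names mentioned in this task.
LEGACY_TO_NEW = {
    "node_memory_Cached": "node_memory_Cached_bytes",
    "node_boot_time": "node_boot_time_seconds",
}


def legacy_pattern(name):
    """Match `name` only as a whole metric name, so that node_boot_time
    does not match inside node_boot_time_seconds."""
    return re.compile(
        r"(?<![A-Za-z0-9_:])" + re.escape(name) + r"(?![A-Za-z0-9_:])"
    )


def iter_panels(container):
    """Yield every panel under `container`, descending into nested panels.

    Covers panels nested in other panels (for example row panels) and the
    older `rows` layout. Both keys are assumptions about the JSON export.
    """
    for panel in container.get("panels", []):
        yield panel
        yield from iter_panels(panel)
    for row in container.get("rows", []):
        yield from iter_panels(row)


def find_legacy_exprs(dashboard):
    """Return (panel title, legacy metric name) pairs found in query exprs."""
    hits = []
    for panel in iter_panels(dashboard):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            for old in LEGACY_TO_NEW:
                if legacy_pattern(old).search(expr):
                    hits.append((panel.get("title", ""), old))
    return hits
```

The whole-name match matters here because every new name in the table starts with its legacy name, so a plain substring search would flag dashboards that are already migrated.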

colewhite updated the task description. May 13 2019, 3:08 PM

Change 510977 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] update node exporter metrics to 0.16+ names

https://gerrit.wikimedia.org/r/510977

Change 511734 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: remove prometheus backwards-compatibility rules

https://gerrit.wikimedia.org/r/511734

Change 510977 merged by Cwhite:
[operations/puppet@production] update node exporter metrics to 0.16+ names

https://gerrit.wikimedia.org/r/510977

colewhite updated the task description. May 23 2019, 5:18 PM

Change 511734 merged by Cwhite:
[operations/puppet@production] role: remove prometheus backwards-compatibility rules

https://gerrit.wikimedia.org/r/511734

colewhite closed this task as Resolved. May 29 2019, 6:01 PM
colewhite updated the task description.

Change 513609 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: update aggregate metrics to new metric names

https://gerrit.wikimedia.org/r/513609

Change 513609 merged by Cwhite:
[operations/puppet@production] prometheus: update aggregate metrics to new metric names

https://gerrit.wikimedia.org/r/513609

@Marostegui just found something we forgot: the use of Prometheus metrics in Grafana's variable definitions (e.g. by a label_values() query).

For instance, on the Host Overview dashboard, Grafana fills in values for the "$server" variable from all instances that export a node_boot_time metric. As a result, the only host available there was labstore1003, which apparently is still running the old node_exporter. Changing the query to use node_boot_time_seconds fixed it.

I haven't looked for other cases, though.
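
A check for this case could be sketched as below. It assumes the dashboard JSON keeps its variables under `templating.list`, with the query stored as a plain string; some Grafana versions store the query as an object, which this sketch skips. The function name and the `instance` label in the usage example are hypothetical.

```python
import re


def rewrite_variable_queries(dashboard, old, new):
    """Replace the metric `old` with `new` in dashboard variable queries.

    Edits `dashboard` in place and returns the names of the variables
    that changed. Only string-valued queries are handled (assumption).
    """
    # Whole-name match, so node_boot_time is not rewritten again inside
    # an already migrated node_boot_time_seconds.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_:])" + re.escape(old) + r"(?![A-Za-z0-9_:])"
    )
    changed = []
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query")
        if isinstance(query, str) and pattern.search(query):
            var["query"] = pattern.sub(new, query)
            changed.append(var.get("name", ""))
    return changed


# Usage with a made-up dashboard fragment:
example = {
    "templating": {
        "list": [
            {"name": "server", "query": "label_values(node_boot_time, instance)"},
        ]
    }
}
changed_vars = rewrite_variable_queries(
    example, "node_boot_time", "node_boot_time_seconds"
)
```

Running the same call a second time should change nothing, which makes it safe to repeat across all dashboards.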

We found another instance in the cluster overview dashboard where dashboard variables were using node_boot_time. I've fixed it to use node_boot_time_seconds instead.

Thanks for the report! I've gone through and updated dashboards where I've found the legacy metric names in dashboard variables. Please let me know if you find any additional instances.