Maniphest T219825

Update dashboards to node-exporter 0.16+ metric names
Closed, Resolved (Public)

Assigned To

Authored By

	colewhite
	Apr 1 2019, 6:41 PM

Description

audit/update dashboards to use new metric names
audit/update icinga checks
retire compatibility recording rules
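
For context, the sketch below shows what a rename pass over a query string could look like. The four renames listed are ones node-exporter 0.16 introduced (the list is deliberately not exhaustive); the `RENAMES` table and the `rename_metrics` helper are hypothetical names used for illustration, not the tooling used for this task.

```python
import re

# A few node-exporter 0.16 renames; illustrative, not an exhaustive list.
RENAMES = {
    "node_boot_time": "node_boot_time_seconds",
    "node_memory_Cached": "node_memory_Cached_bytes",
    "node_memory_MemTotal": "node_memory_MemTotal_bytes",
    "node_cpu": "node_cpu_seconds_total",
}

# Characters that may appear inside a Prometheus metric name.
_NAME_CHARS = "A-Za-z0-9_:"


def rename_metrics(expr, renames=RENAMES):
    """Replace legacy metric names in a PromQL string (hypothetical helper).

    The lookarounds ensure only whole metric names match, so a name that
    already carries the new suffix (e.g. node_boot_time_seconds) is untouched.
    """
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(
        rf"(?<![{_NAME_CHARS}])("
        + "|".join(re.escape(n) for n in names)
        + rf")(?![{_NAME_CHARS}])"
    )
    return pattern.sub(lambda m: renames[m.group(1)], expr)
```

Matching whole names only is what makes the pass safe to run repeatedly on dashboards that are already partly migrated.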

Details

Subject	Repo	Branch	Lines +/-
prometheus: update aggregate metrics to new metric names	operations/puppet	production	+28 -28
role: remove prometheus backwards-compatibility rules	operations/puppet	production	+1 -401
update node exporter metrics to 0.16+ names	operations/puppet	production	+6 -6
grafana: update swift dashboard with legacy metric name fallback	operations/puppet	production	+5 -5
grafana: add frack to deleteDatasources	operations/puppet	production	+9 -0
grafana: remove frack datasources	operations/puppet	production	+0 -22
grafana: update swift dashboard to use new metric names	operations/puppet	production	+4 -4


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		fgiunchedi	T220104 TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal)
		Resolved		colewhite	T219825 Update dashboards to node-exporter 0.16+ metric names

Event Timeline

colewhite created this task. Apr 1 2019, 6:41 PM

Restricted Application added a subscriber: Aklapper. Apr 1 2019, 6:41 PM

colewhite triaged this task as Low priority. Apr 1 2019, 6:41 PM

colewhite added a parent task: T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal). Apr 1 2019, 6:44 PM

Fundraising dashboards cannot be updated at this time. It looks like the nodes may need upgrading or forwards-compatibility rules.

In T219825#5083413, @colewhite wrote:

Fundraising dashboards cannot be updated at this time. It looks like the nodes may need upgrading or forwards-compatibility rules.

Indeed, cc'ing @cwdent re: upgrading node-exporter to 0.16+ in FR and compatibility rules.

fgiunchedi edited parent tasks, added: T220104: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal); removed: T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal). Apr 4 2019, 1:07 PM

@fgiunchedi @colewhite actually, we have a private grafana/prometheus instance up and running in fundraising now, so we can disconnect this and remove the fundraising boards. I'd like to export some FR metrics some day, but for the time being this will be easier in terms of PCI compliance and data safety.

I guess I can simply remove prometheus from pay-lvs* - also I'll see if I can remove the boards in the UI, unless there's a better way.

Edit: mention T217355

Change 501399 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard to use new metric names

https://gerrit.wikimedia.org/r/501399

gerritbot added a project: Patch-For-Review. Apr 4 2019, 8:11 PM

Change 501519 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: remove frack datasources

https://gerrit.wikimedia.org/r/501519

In T219825#5086037, @cwdent wrote:

@fgiunchedi @colewhite actually, we have a private grafana/prometheus instance up and running in fundraising now, so we can disconnect this and remove the fundraising boards. I'd like to export some FR metrics some day, but for the time being this will be easier in terms of PCI compliance and data safety.

Sounds good to me! We'll remove the grafana datasources and associated dashboards.

Change 501399 merged by Cwhite:
[operations/puppet@production] grafana: update swift dashboard to use new metric names

https://gerrit.wikimedia.org/r/501399

Change 501519 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: remove frack datasources

https://gerrit.wikimedia.org/r/501519

Change 502158 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: add frack to deleteDatasources

https://gerrit.wikimedia.org/r/502158

Change 502158 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: add frack to deleteDatasources

https://gerrit.wikimedia.org/r/502158

Mentioned in SAL (#wikimedia-operations) [2019-04-08T08:17:26Z] <godog> delete fundraising folder from public grafana - T219825

colewhite moved this task from Inbox to In progress on the observability board. Apr 15 2019, 3:14 PM

Change 504993 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback

https://gerrit.wikimedia.org/r/504993

Change 504993 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: update swift dashboard with legacy metric name fallback

https://gerrit.wikimedia.org/r/504993
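
A fallback of this kind can be expressed with PromQL's `or` operator, which returns every series from the left-hand side plus any right-hand series whose label set (ignoring the metric name) has no match on the left. The helper below is a minimal sketch of that idea; `with_legacy_fallback` and the example metric names are assumptions for illustration and are not taken from change 504993.

```python
def with_legacy_fallback(new_name, old_name, selector=""):
    """Build a PromQL expression that prefers new_name and falls back to old_name.

    Hypothetical helper: `selector` is a label matcher string such as
    '{instance="$server"}' and is applied to both sides unchanged.
    """
    return f"{new_name}{selector} or {old_name}{selector}"


# Example: hosts exporting the new name win; hosts still on the old
# exporter are picked up by the right-hand side.
expr = with_legacy_fallback(
    "node_memory_Cached_bytes", "node_memory_Cached", '{instance="$server"}'
)
```

This only works cleanly when the old and new metrics share units; renames that also changed units would need a conversion on the legacy side.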

colewhite updated the task description. Apr 25 2019, 5:24 PM

I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.

@CDanis good catch!

The code I used to detect and update legacy metric names did not recurse into panels nested inside other panels. I've updated the cluster-overview dashboard. Mind having a quick look?

I will make another pass with the updated code to be sure I didn't miss anything.
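
The nesting problem described above can be sketched as follows. This is not the script used for the task; it assumes the usual Grafana dashboard JSON layout (a top-level `panels` list, or `rows` in older dashboards, with row panels carrying their own `panels` list and each panel holding `targets` with an `expr` string). The names `iter_panels`, `find_legacy_exprs`, and the two-name `LEGACY` pattern are hypothetical.

```python
import re

# Two legacy names mentioned in this task; extend as needed.
LEGACY = re.compile(
    r"(?<![A-Za-z0-9_:])(node_boot_time|node_memory_Cached)(?![A-Za-z0-9_:])"
)


def iter_panels(container):
    """Yield every panel under `container`, descending into rows and nested panels."""
    for row in container.get("rows", []):
        yield from iter_panels(row)
    for panel in container.get("panels", []):
        yield panel
        # A row-type panel may itself contain panels; recurse so they are not missed.
        yield from iter_panels(panel)


def find_legacy_exprs(dashboard):
    """Return (panel title, expr) pairs whose query still uses a legacy name."""
    hits = []
    for panel in iter_panels(dashboard):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if LEGACY.search(expr):
                hits.append((panel.get("title", ""), expr))
    return hits
```

A non-recursive loop over `dashboard["panels"]` would see only the outer row panel and skip the queries inside it, which matches the miss reported for cluster-overview.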

colewhite updated the task description. May 13 2019, 3:08 PM

Change 510977 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] update node exporter metrics to 0.16+ names

https://gerrit.wikimedia.org/r/510977

Change 511734 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: remove prometheus backwards-compatibility rules

https://gerrit.wikimedia.org/r/511734

Change 510977 merged by Cwhite:
[operations/puppet@production] update node exporter metrics to 0.16+ names

https://gerrit.wikimedia.org/r/510977

colewhite updated the task description. May 23 2019, 5:18 PM

Change 511734 merged by Cwhite:
[operations/puppet@production] role: remove prometheus backwards-compatibility rules

https://gerrit.wikimedia.org/r/511734

Maintenance_bot removed a project: Patch-For-Review. May 29 2019, 5:11 PM

colewhite closed this task as Resolved. May 29 2019, 6:01 PM

colewhite updated the task description.

Change 513609 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: update aggregate metrics to new metric names

https://gerrit.wikimedia.org/r/513609

gerritbot added a project: Patch-For-Review. May 31 2019, 2:18 PM

Change 513609 merged by Cwhite:
[operations/puppet@production] prometheus: update aggregate metrics to new metric names

https://gerrit.wikimedia.org/r/513609

@Marostegui just found something we forgot: the use of Prometheus metrics in Grafana's variable definitions (e.g. in a label_values() query).

For instance, on the Host Overview dashboard, Grafana fills in values for the "$server" variable based on all instances that export a node_boot_time metric -- which meant that all that was available there was labstore1003; apparently that's still running the old node_exporter. Changing the query to use node_boot_time_seconds fixed it.

I haven't looked for other cases, though.
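
One way to look for other cases is to scan each dashboard's variable definitions as well as its panels. The sketch below assumes the common Grafana JSON layout in which variables live under `templating.list` and carry a `query` field (a string, or in newer exports an object with its own `query` key). `find_legacy_variables` and the `LEGACY` pattern are hypothetical names, and only two legacy metrics are listed.

```python
import re

# Two legacy names mentioned in this task; extend as needed.
LEGACY = re.compile(
    r"(?<![A-Za-z0-9_:])(node_boot_time|node_memory_Cached)(?![A-Za-z0-9_:])"
)


def find_legacy_variables(dashboard):
    """Return (variable name, query) pairs whose query uses a legacy metric name."""
    hits = []
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query", "")
        # Newer dashboard exports may wrap the query string in an object.
        if isinstance(query, dict):
            query = query.get("query", "")
        if isinstance(query, str) and LEGACY.search(query):
            hits.append((var.get("name", ""), query))
    return hits
```

Running this alongside a panel scan covers the `$server`-style variables that a panel-only check would not see.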

We found another instance in the cluster overview dashboard where dashboard variables were using node_boot_time; I've fixed it to use node_boot_time_seconds instead.

Thanks for the report! I've gone through and updated dashboards where I've found the legacy metric names in dashboard variables. Please let me know if you find any additional instances.
