WMCS-related dashboards using Diamond metrics
Closed, Resolved · Public

Description

Diamond is deprecated in favour of Prometheus. I checked the existing dashboards, and a number of them use Diamond metrics which should also be provided by prometheus-node-exporter:

(network and load stats, those should all be covered by prometheus-node-exporter -- fixed by Cole)

(CPU, memory, load, disk stats)

(load, IO, network stats)

These seem obsolete and can probably be removed entirely:

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 30 2018, 3:54 PM
Bstorm triaged this task as Medium priority. · Nov 30 2018, 3:59 PM

@MoritzMuehlenhoff thanks for this. Could you give an example of one metric in one dashboard that uses Diamond, and what the equivalent would be in Prometheus? Just a little example to get us started, if it won't take too much time.

As for spotting remaining Diamond metrics, https://phabricator.wikimedia.org/P7680 contains a Paste with remaining Diamond metric references (based on a script by Timo), the patch which lists "Matched" refers to the dashboard in question. This is what I used to create the task.

As for replacing it with Prometheus, there's no good one-line summary, but I suggest looking at existing dashboards for starters: on https://grafana.wikimedia.org you can log in by clicking on the Wikimedia logo. Once logged in, you can search for an existing dashboard, click on the title of a metric and use "Edit" (or alternatively select the gear icon and select "View JSON"). https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats should be a useful starter, as it already uses several of the metrics in question. If you want to look at the raw metrics provided by an exporter to see what you can use, you can log into the server, figure out the port of the exporter (via Puppet) and have a look at the raw metrics via "curl localhost:$PORT/metrics". When you have more specific questions, don't hesitate to ask!
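
To illustrate the "curl localhost:$PORT/metrics" step, here is a minimal Python sketch that fetches an exporter's raw metrics and turns them into a dictionary. The function names are made up for this example, and the parser is deliberately simplified: it assumes samples carry no trailing timestamp and that label values contain no spaces. The port is a parameter because it has to be looked up in Puppet.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Simplified on purpose: comment lines (# HELP / # TYPE) and blank
    lines are skipped, and each sample is assumed to be
    "<series> <value>" with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field; everything
        # before it is the metric name plus optional {labels}.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(port, host="localhost"):
    """Fetch and parse /metrics from an exporter listening on host:port."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

On a host running prometheus-node-exporter, something like `fetch_metrics(9100)` (9100 being node exporter's upstream default port, which may differ locally) would return series such as `node_load1` that can then be used in a Grafana panel.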

GTirloni removed a subscriber: GTirloni. · Mar 23 2019, 8:46 PM
Andrew claimed this task. · May 10 2019, 2:09 PM
Andrew updated the task description. · May 10 2019, 2:38 PM
Andrew updated the task description.
MoritzMuehlenhoff updated the task description.

Thanks arturo! I worked on this a bit last week but didn't make a whole lot of progress.

I would like to maintain some version of these two dashboards; I use them from time to time. If there are already updated versions someplace let me know and I'll update my links.

Sorry folks, there are a couple of things that I don't understand. The nova_fullstack_test.py script is sending collected metrics to statsd. Are we deprecating that?
I have some gaps in my understanding of the different mechanisms we use for collecting/storing/visualizing metrics; I think the one that I understand best is Prometheus.

I'm trying to figure out whether the dashboard https://grafana.wikimedia.org/d/000000339/labs-nova-fullstack just needs its server names refreshed, whether we need to rewrite the fullstack script, or something else.

statsd itself isn't being actively deprecated (although new "things" should use Prometheus); only Diamond is getting deprecated in this case. Diamond writes its metrics to the servers hierarchy, and I see nova_fullstack_test.py also writes to servers. That is likely one of the reasons why the labs-nova-fullstack dashboard came up in the audit. My suggestion is to keep the script for now but move its metrics to a different top-level hierarchy, so that the only producer of statsd metrics to servers is Diamond itself. Does that sound good?

Or simply ignore the hierarchy, since with Diamond being gone for good it won't matter soon? The reason it was listed in this task is that it referenced the labnet* servers, which are now gone. If the dashboard is still useful, from my point of view it also seems fine to simply update it to use the relevant metrics from the cloudnet* servers.

I see! I'd still opt to move all metrics that are not generated by Diamond out of servers, so we can safely say that nothing is writing stats there anymore once Diamond is gone.

Ok! I'm fine renaming the metrics. I would need a suggestion though :-)

cloudvps.novafullstack.*?

Looks good to me

same, +1
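
As a rough sketch of what the agreed rename means on the wire, the Python below formats a metric in the plain-text statsd line protocol under the cloudvps.novafullstack prefix and sends it over UDP. This is not the code from nova_fullstack_test.py: the helper names, the example metric name and the host/port defaults are assumptions made for illustration only.

```python
import socket

# Top-level hierarchy agreed in this task, replacing "servers".
NEW_PREFIX = "cloudvps.novafullstack"


def statsd_line(name, value, metric_type="g", prefix=NEW_PREFIX):
    """Format one metric as "<prefix>.<name>:<value>|<type>".

    metric_type follows the statsd convention: "g" for a gauge,
    "c" for a counter, "ms" for a timer.
    """
    return f"{prefix}.{name}:{value}|{metric_type}"


def send_metric(line, host="localhost", port=8125):
    """Send a single statsd line over UDP (fire and forget).

    host and port are placeholders; 8125 is the usual statsd default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `statsd_line("verify.creation", 42.5, "ms")` (with a hypothetical metric name) yields a line under the new hierarchy, so that servers is left to Diamond alone.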

Change 520636 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-fullstack: rename statsd metrics to cloudvps.novafullstack

https://gerrit.wikimedia.org/r/520636

Change 520636 merged by Andrew Bogott:
[operations/puppet@production] nova-fullstack: rename statsd metrics to cloudvps.novafullstack

https://gerrit.wikimedia.org/r/520636

Andrew added a comment. · Jul 3 2019, 9:46 PM

I've made a new dashboard, https://grafana.wikimedia.org/d/ebJoA6VWz/nova-fullstack -- once I'm convinced that it's doing what I expect I'll delete the older labs-nova-fullstack board.

Andrew updated the task description. · Jul 8 2019, 8:30 PM
Andrew updated the task description. · Jul 8 2019, 8:34 PM
Andrew updated the task description. · Jul 8 2019, 9:08 PM
MoritzMuehlenhoff updated the task description.

Cole fixed the remaining dashboards. Andrew, can you take a final look to check that everything works as expected, so that we can then close the task?

Andrew closed this task as Resolved. · Jul 17 2019, 9:09 PM
Andrew added a subscriber: colewhite.

looks good -- thanks @colewhite