Improvements to Ganglia-equivalent Prometheus dashboards
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Dec 9 2016, 6:22 PM

Description

The Grafana dashboards we're using to replace Ganglia need some usability / visibility improvements, e.g.

The main per-datacenter dashboard with per-cluster breakdown should be tagged featured
The cluster breakdown dashboard should have graphs of the selected view (e.g. memory/cpu) for all hosts in the cluster
Ideally related breakdowns (memory/load) should use the same scale to allow easy comparisons
From an "overview" dashboard it should be possible to select a specific cluster and go to its "drilldown" dashboard easily
The Prometheus-related dashboards don't really need to have "prometheus" in the name. Renaming is easy in itself, though it changes the dashboard URL and therefore breaks existing links. See also Grafana issue #7043 for dashboard redirects.

Event Timeline

fgiunchedi created this task. Dec 9 2016, 6:22 PM

Restricted Application added a subscriber: Aklapper. Dec 9 2016, 6:22 PM

ArielGlenn subscribed. Dec 12 2016, 5:47 PM

jcrespo subscribed. Dec 12 2016, 5:52 PM

elukey subscribed. Dec 12 2016, 6:16 PM

I tried repeating a panel per-host (load average) here: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown . It isn't clear to me yet whether it is possible to change the metric displayed (to e.g. memory) via a dropdown, though. Another possibility to explore is Grafana scripted dashboards: http://docs.grafana.org/reference/scripting/

Adding my 2 cents: I personally don't like the Ganglia way of visualizing metrics, because imho it makes it difficult to compare trends (the same metric on multiple hosts, or multiple metrics from multiple hosts of the same cluster). In the load average example, I would rather have one single graph with all the hosts in it plus another one with the TOP5 outliers (or something similar). This would help a lot in building dashboards that show multiple metric trends at the same time, like https://grafana.wikimedia.org/dashboard/db/kafka does, for example.

My point is that we have been used to debugging problems with Ganglia, but this doesn't mean that we need to have an exact replica of it in prometheus/grafana :)
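
To make the "TOP5 outliers" idea above a bit more concrete, here is a minimal Python sketch. It is only an illustration and not how the dashboards are built: the hostnames and load values are made up, and defining "outlier" as "farthest from the cluster median" is an assumption, since the comment above does not say how outliers would be picked.

```python
from statistics import median


def top_outliers(load_by_host, n=5):
    """Return up to n hostnames whose value is farthest from the cluster median.

    load_by_host maps a hostname to its latest load average (hypothetical data).
    """
    if not load_by_host:
        return []
    mid = median(load_by_host.values())
    # Sort hosts by absolute distance from the median, largest first.
    ranked = sorted(
        load_by_host.items(),
        key=lambda item: abs(item[1] - mid),
        reverse=True,
    )
    return [host for host, _ in ranked[:n]]


if __name__ == "__main__":
    # Made-up sample values for a small cluster.
    sample = {
        "host1": 1.0,
        "host2": 1.1,
        "host3": 0.9,
        "host4": 7.5,
        "host5": 1.2,
        "host6": 0.1,
    }
    print(top_outliers(sample, n=2))
```

Ranking by distance from the median catches unusually idle hosts as well as overloaded ones; a plain "highest N" ranking would only catch the latter.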

I agree with elukey that a good stacked graph can be more expressive than multiple per-server graphs. E.g. despite having only 4-5 metrics, the following is the best dashboard I have for getting a general picture of the mysql state: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated

Having said that, there are several things to consider:
If we want to integrate monitoring backends, people have to get used to new-ish workflows. The philosophy is different, so people have to keep an open mind about new ways of doing the same things.
On the other hand, we cannot drop existing functionality. Here functionality means "user stories", a technique often criticized but better than nothing. However, it cannot fall on Filippo's shoulders to build those stories: we have to come up with a list of things that we need to do, and they should remain equally simple to do (even if the way of doing them is different).

For example, some of my user stories are:

AS A DBA, I WANT TO check the overall state of all clusters, globally and per mysql group, shard and role SO I CAN detect global trend changes in throughput and latency
AS A DBA, I WANT TO check detailed mysql metrics of each mysql server SO I CAN debug server-specific issues
AS A DBA, I WANT TO see a table-based display of: hostname, ip, host up/down state, mysql up/down state, mysql version, uptime, replication lag, QPS and average latency SO I CAN see quickly the latest information of all servers to detect if issues are still ongoing after a problem has been detected

I have the first 2 covered, the last one is WIP.

What are the Ganglia user stories of other ops, so that they can be replicated?

I tried repeating a panel per-host (load average) here .... Though it isn't clear to me yet if it is possible to change the metric displayed

Maybe we can put all of them on the same dashboard and make it toggleable? (I think that is relatively new in the latest grafana)

Thanks @elukey @jcrespo for chiming in!
I agree with some of the points, especially building dashboards for each use case. In particular, the way I see it working is having a set of fixed dashboards with 4/5 key metrics, like Jaime suggested, to pinpoint problems, and then being able to drill down from there to e.g. a single host.
Another important use case, I think, is exploration, which Grafana 4.0 improves on (ad-hoc filter dashboards), though support for Prometheus isn't yet 100% complete (see also https://github.com/grafana/grafana/pull/6138).

Maybe we can put all of them on the same dashboard and make it toggleable? (I think that is relatively new in the latest grafana)

Good point! I've updated https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown with an overview of the cluster like ganglia and I'm playing with per-stat dropdown.

fgiunchedi updated the task description. Dec 16 2016, 6:05 PM

fgiunchedi updated the task description. Dec 16 2016, 6:39 PM

My .02€:
At some point I'd like to put my few snapshot hosts and couple of dataset servers into a cluster for monitoring. Stacked graphs would be virtually useless in this case, since the servers all have different loads (e.g. one is a canary, one also runs misc crons, one runs the huge en wiki dumps, etc). However, I can see that for many other clusters the stacked graphs would be indispensable.

@ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown per-host be enough in this case for what you had in mind?

In T152791#2894804, @fgiunchedi wrote:

@ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown per-host be enough in this case for what you had in mind?

I'd only be able to see one host at a time in that case? Ergh, I'd like to be able to put them all up at once... no go eh?

fgiunchedi updated the task description. Dec 22 2016, 1:02 AM

fgiunchedi updated the task description. Jan 4 2017, 12:13 AM

fgiunchedi moved this task from Backlog to In progress on the Prometheus-metrics-monitoring board. Jan 4 2017, 12:39 AM

In T152791#2894847, @ArielGlenn wrote:

In T152791#2894804, @fgiunchedi wrote:

@ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown per-host be enough in this case for what you had in mind?

I'd only be able to see one host at a time in that case? Ergh, I'd like to be able to put them all up at once... no go eh?

You'd be able to see and compare all hosts at the same time, e.g. https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=1483479983526&to=1483490783527&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=All under the "load" dropdown
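
For anyone who wants to generate links like the one above (e.g. to jump from an overview straight to a cluster's breakdown), here is a minimal Python sketch. The dashboard path and the `from`, `to`, `var-datasource`, `var-cluster` and `var-instance` parameters are copied from the link above; the function name and its defaults are hypothetical and only for illustration.

```python
from urllib.parse import quote, urlencode

# Dashboard path as it appears in the link above.
BASE_URL = "https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown"


def breakdown_url(datasource, cluster, instance="All", time_from=None, time_to=None):
    """Build a cluster-breakdown link with the dashboard variables pre-selected.

    time_from / time_to are epoch timestamps in milliseconds, as in the link above.
    """
    params = {}
    if time_from is not None:
        params["from"] = time_from
    if time_to is not None:
        params["to"] = time_to
    params["var-datasource"] = datasource
    params["var-cluster"] = cluster
    params["var-instance"] = instance
    # quote_via=quote with safe="" encodes spaces as %20 and "/" as %2F,
    # matching the encoding used in the link above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="")


if __name__ == "__main__":
    print(
        breakdown_url(
            "eqiad prometheus/ops",
            "videoscaler",
            time_from=1483479983526,
            time_to=1483490783527,
        )
    )
```

Leaving out `time_from` and `time_to` simply omits those parameters from the link.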

That looks great and covers my use cases. Thanks!

fgiunchedi added a project: User-fgiunchedi. Apr 25 2017, 8:42 AM

I'm resolving this task as all major use cases have been covered.
