The SRE Grafana dashboards are not consistent with each other, have accumulated cruft over time, and (among other deficiencies) lack a good way to navigate between them.
There have been several ideas on how to improve the situation; this task will be used to collect those ideas and use cases and to draft a plan to improve said dashboards.
Filippo's use cases / ideas (limited to "machine level" metrics like cpu/memory/disk/network):
- We're using the dashboard to debug a problem or quantify the impact of an ongoing incident
- There are three main components we can drill down into or back up from: site/cluster/host
- Dashboards for said components all present the same high level metrics, aggregated according to the component we're looking at
- To reduce cognitive overhead there are a limited number of graphs per dashboard, and within each graph a limited number of metrics.
- A nice guideline I've found is the USE method (Utilization, Saturation, Errors: http://www.brendangregg.com/usemethod.html), for which I've tested an implementation on the "host dashboard" here: https://grafana.wikimedia.org/dashboard/db/host-overview
- Another approach is the RED method (Rate, Errors, Duration: https://www.weave.works/docs/cloud/latest/tasks/monitor/best-instrumenting/). The two methods are actually complementary, one being systems-oriented and the other services-oriented.
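
To make the site/cluster/host idea above concrete, here is a minimal sketch of "same metric, different aggregation level". It is only an illustration, not how the dashboards are implemented: the sample data, the host/cluster/site names, the `cpu_utilization` field, and the choice of a plain mean as the aggregation are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-host samples; all names and values are made up.
SAMPLES = [
    {"site": "site-a", "cluster": "web", "host": "web1", "cpu_utilization": 0.5},
    {"site": "site-a", "cluster": "web", "host": "web2", "cpu_utilization": 0.25},
    {"site": "site-a", "cluster": "db", "host": "db1", "cpu_utilization": 0.75},
    {"site": "site-b", "cluster": "web", "host": "web3", "cpu_utilization": 0.25},
]

# Each drill-down level groups by a longer prefix of the same label tuple.
LEVELS = {
    "site": ("site",),
    "cluster": ("site", "cluster"),
    "host": ("site", "cluster", "host"),
}


def aggregate(samples, level, metric):
    """Average `metric` over all samples sharing the labels of `level`."""
    labels = LEVELS[level]
    groups = defaultdict(list)
    for sample in samples:
        key = tuple(sample[label] for label in labels)
        groups[key].append(sample[metric])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # The same metric, viewed at each of the three levels.
    for level in ("site", "cluster", "host"):
        print(level, aggregate(SAMPLES, level, "cpu_utilization"))
```

Drilling down from site to cluster to host then only changes the grouping labels, while the metric shown on each dashboard stays the same.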