
Better organization for SRE grafana dashboards
Open, Low, Public

Description

The SRE grafana dashboards are not consistent with each other, have accumulated cruft over time, and (among other deficiencies) lack a good way to navigate between them.

There have been several ideas on how to improve the situation; this task will be used to collect those ideas and use cases and to draft a plan for improving these dashboards.

Filippo's use cases / ideas (limited to "machine level" metrics like CPU/memory/disk/network):

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 20 2017, 2:29 PM

Another idea for better dashboarding: show vertical lines for events other than deployments, e.g. puppet merges

Volans added a subscriber: Volans. Oct 24 2017, 12:16 PM

I agree. In a setup I built in the past, all annotations were stored in Elasticsearch so that it could scale easily. The smart thing to do here would be to tag those other annotations, so that each dashboard could show only related annotations by default and optionally show all of them. As a practical example, puppet merges could be tagged, and by default a traffic dashboard would show only traffic-tagged merges, with an option to show all of them, because sometimes things correlate in unpredictable ways...
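A minimal sketch of the tag-based filtering described above, in Python. The `Annotation` record, its fields, and the `visible_annotations` helper are hypothetical and not tied to Elasticsearch or Grafana's annotation API; the point is only the "filtered by default, everything on request" behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record (e.g. one puppet merge)."""
    time: int                      # Unix epoch seconds
    text: str
    tags: frozenset = frozenset()  # e.g. {"puppet", "traffic"}


def visible_annotations(annotations, dashboard_tags, show_all=False):
    """Annotations to draw on a dashboard.

    By default keep only annotations sharing at least one tag with the
    dashboard; show_all=True returns everything, for unexpected correlations.
    """
    if show_all:
        return list(annotations)
    wanted = set(dashboard_tags)
    return [a for a in annotations if a.tags & wanted]


# Illustrative events only.
events = [
    Annotation(1000, "puppet merge: varnish VCL change", frozenset({"puppet", "traffic"})),
    Annotation(2000, "puppet merge: mysql grants", frozenset({"puppet", "databases"})),
]
print([a.text for a in visible_annotations(events, {"traffic"})])
```

A traffic dashboard would pass its own tags; the "show all" toggle simply bypasses the filter.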

I'd also like to add to:

There are three main components we can drill down/up: site/cluster/host

that it would be nice to be able to show a given metric either for every host in a cluster side by side, or aggregated across all hosts in the cluster. This adds some complexity, since the right aggregation depends on the metric (sum, average, etc.).
For simple cases that only need a sum, a stacked graph achieves the same result, and for averages a non-stacked graph gives a rough idea, but both quickly become unreadable as the number of plotted hosts/metrics grows.
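To make the "aggregation depends on the metric" point concrete, here is a small Python sketch. The metric names and the aggregation chosen for each are illustrative assumptions, not an agreed convention.

```python
from statistics import mean

# Hypothetical per-metric rules for collapsing per-host values into one
# cluster-wide value; the right choice depends on what the metric measures.
AGGREGATIONS = {
    "requests_per_second": sum,     # throughput adds up across hosts
    "network_bytes_received": sum,
    "cpu_utilization": mean,        # a ratio: averaging is more meaningful
    "disk_free_ratio": min,         # the host with the least free space matters most
}


def aggregate_cluster(metric, per_host):
    """Collapse a {host: value} mapping using the metric's aggregation rule."""
    try:
        aggregate = AGGREGATIONS[metric]
    except KeyError:
        raise ValueError(f"no aggregation rule for {metric!r}") from None
    return aggregate(per_host.values())


print(aggregate_cluster("requests_per_second", {"host1": 100, "host2": 250}))
```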

faidon moved this task from Backlog to Up next on the observability board. Nov 27 2017, 4:15 PM
greg added a subscriber: greg. Dec 11 2017, 10:03 PM

T180784 has some interesting discussion as well.

@ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full email was:

Ganglia may have been buggy and crufty, but when it was accessible anyone could see a high-level overview of Wikimedia's operational metrics at a glance by browsing to https://ganglia.wikimedia.org/.
This was extremely useful for spotting and troubleshooting problems. And the fact that anyone was invited to have a look was a powerful demonstration of what Wikimedia is and what makes it special.
Operational metrics today may be more comprehensive, more accurate, and/or more reliable, but they are not more discoverable.
The list of featured dashboards in https://grafana.wikimedia.org/ is not well-organized. There is a curious mix of important and unimportant dashboards, which are not grouped in any meaningful way. Some dashboards that should be featured aren't, and some featured dashboards shouldn't be.
The names of the dashboards are often obscure, vague, or confusing. What is "Production Logging"? Why do "Prometheus DC overview" and "Prometheus global overview" have "Prometheus" in their name?
Some of the top-level links refer to teams ("Team TCB"), others to topics ("Performance Metrics"), others to specific services ("Swift").
So, this is a plea for a good landing page for operational metrics. I'd really love to see a curated selection of dashboards grouped according to some sensible taxonomy, with their names standardized and revised for clarity. I think the time investment will pay for itself in time saved when debugging issues and onboarding new folks.

I think there are a few different dimensions to this problem:

  • Naming (Varnish vs. Traffic vs. HTTPS, "Prometheus" prefixes, etc.)
  • Organization/hierarchy (AQS has dashboards named like "AQS :: Cassandra :: CF :: Latency/rate Copy", for instance)
  • Similar to the above: 1:N tagging/hashtags
  • Navigation (drilling up/down, featured dashboards/front page)
  • Discoverability, which is an artifact of all of the above
Dzahn added a subscriber: Dzahn. Jan 11 2018, 1:24 AM

We need the following new dashboards / URLs (noticed as part of T183873):

Mathoid is on SCB, not SCA; only Zotero is on SCA. In any case, are these two sufficient?

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb&var-instance=All

Same as above.

I see no "canary" distinction in the links on the wikitech page, so I am guessing it was a mental step for the Parsoid deployer. In that case, following the pattern above, we have:

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-instance=All

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-instance=All

maps-varnish no longer exists (T164608), so there is nothing to do about that. For maps itself, following the pattern above:

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=maps&var-instance=All

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=maps&var-instance=All

The question

All of the above are practically the same link, tweaked to set the cluster and datacenter. Should we take the more brittle approach of updating every page with a cluster+DC-specific link, or the more robust approach of just using the base https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1 and letting the user pick? For the DC part I am pretty sure the latter is better; I am less sure about the cluster, though.
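If the links were generated rather than hand-edited, the brittle/robust trade-off would reduce to which parameters get passed. A sketch in Python, assuming the `<dc> prometheus/ops` datasource naming and the `var-*` parameters seen in the links above; `cluster_breakdown_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown"


def cluster_breakdown_url(datacenter=None, cluster=None):
    """Link to the cluster breakdown dashboard, optionally pinning DC/cluster.

    Omitted parameters are left out of the URL, so the dashboard's own
    defaults (and the user's choice in the dropdowns) apply.
    """
    params = {"orgId": "1"}
    if datacenter is not None:
        # Datasource naming assumed from the links above: "<dc> prometheus/ops".
        params["var-datasource"] = f"{datacenter} prometheus/ops"
    if cluster is not None:
        params["var-cluster"] = cluster
        params["var-instance"] = "All"
    # quote_via=quote encodes spaces as %20 (not "+") and "/" as %2F,
    # matching the hand-written links.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


for dc in ("eqiad", "codfw"):
    print(cluster_breakdown_url(dc, "scb"))
```

Calling it with no arguments yields the base link of the "robust" approach; passing only a cluster leaves the datacenter to the user.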

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Feb 1 2018, 1:39 PM
akosiaris updated the task description. Mar 5 2018, 4:33 PM
fgiunchedi moved this task from Up next to In progress on the observability board. Apr 16 2018, 3:23 PM

I've put together a sample dashboard to play around with some of the concepts/ideas that emerged in this task at https://grafana.wikimedia.org/dashboard/db/dashboard-redesign-proposal . Notably missing is the navigation story between dashboards, but tl;dr it would be based on dashboard tags to create dropdowns. Which groupings/dropdown menus make sense is still TBD.

I put the dashboard together by point-and-click, but the idea is to generate the same dashboard from code, making it simple to create multiple consistent dashboards for different services and purposes.
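A sketch of what generating such a dashboard could look like, in plain Python emitting Grafana's dashboard JSON model. It assumes the tag-based dropdown navigation maps onto Grafana's "dashboards" link type (`asDropdown`, `tags`, `keepTime`, `includeVars`) and enables the shared crosshair via `graphTooltip`; field names should be checked against the Grafana version in use, and the tag names are made up.

```python
import json


def dashboard_skeleton(title, tags, nav_tags):
    """Minimal Grafana dashboard JSON model as a dict.

    tags:     tags applied to this dashboard (used by other dashboards' menus)
    nav_tags: one dropdown menu per tag, listing all dashboards with that tag
    """
    return {
        "title": title,
        "tags": sorted(tags),
        "graphTooltip": 1,  # 0 = default, 1 = shared crosshair, 2 = shared tooltip
        "links": [
            {
                "type": "dashboards",  # link to dashboards matching "tags"
                "tags": [tag],
                "asDropdown": True,
                "title": tag,
                "keepTime": True,      # keep the selected time range
                "includeVars": True,   # keep template variables (e.g. datasource)
            }
            for tag in nav_tags
        ],
        "panels": [],
    }


# Hypothetical tags, for illustration only.
board = dashboard_skeleton("Host overview", {"sre", "host"}, ["sre", "cluster"])
print(json.dumps(board, indent=2))
```

Because every generated dashboard carries both its own tags and the same `links` block, adding a new dashboard with the right tag makes it appear in the menus of all the others.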

As discussed in the monitoring meeting, here is some feedback:

  • while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build generic dashboards for some services, and splitting them into multiple dashboards might make them harder to discover. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".
  • I'm not sure whether we should come up with a best practice for which values to show next to each legend label (min/max/avg/current) and whether they should be inline, as a table, or as a table on the right. Different graphs might need different values/layouts depending on the data shown.
  • I'm personally a big fan of the shared crosshair; maybe we could enable it by default.

As discussed in the monitoring meeting, here is some feedback:

  • while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build generic dashboards for some services, and splitting them into multiple dashboards might make them harder to discover. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".

Agreed. As I understand it, the limit is suggested for performance reasons, and "hidden" rows are not evaluated at dashboard load time but only when they are unhidden.

  • I'm not sure whether we should come up with a best practice for which values to show next to each legend label (min/max/avg/current) and whether they should be inline, as a table, or as a table on the right. Different graphs might need different values/layouts depending on the data shown.

I am ambivalent about the legend as well. I tend to add one, but I have no rule yet and play it by ear. I'd say we use SHOULD instead of MUST for any guideline here and leave it to the graph creator.

  • I'm personally a big fan of the shared crosshair; maybe we could enable it by default.

+1

  • I also think we should add a RED/4 golden signals method example to the proposal before we turn it into a template. Granted, SRE graphs will probably use the USE method (pun intended), but it'd still be great to have an example of that as well.

Thanks for the feedback!

As discussed in the monitoring meeting, here is some feedback:

  • while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build generic dashboards for some services, and splitting them into multiple dashboards might make them harder to discover. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".

Agreed. As I understand it, the limit is suggested for performance reasons, and "hidden" rows are not evaluated at dashboard load time but only when they are unhidden.

I suggested the limit of 5-6 rows per dashboard to avoid putting too much information into a single dashboard, though performance is of course also a concern. I think having one canonical "overview" dashboard plus one or more drill-down dashboards could work. I expanded on this point in the dashboard example.
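The "hidden rows don't count" idea could be enforced by a simple lint on generated dashboard JSON. A sketch in Python, assuming Grafana 5-style dashboards where rows are panels of `type: "row"` with a `collapsed` flag; `visible_rows` and `check_row_limit` are hypothetical names, and the default limit of 6 comes from the 5-6 rows suggestion above.

```python
MAX_VISIBLE_ROWS = 6  # upper end of the "5-6 rows per dashboard" suggestion


def visible_rows(dashboard):
    """Count row panels that are expanded when the dashboard loads.

    Collapsed rows are not counted: their panels are only queried once
    the row is expanded.
    """
    return sum(
        1
        for panel in dashboard.get("panels", [])
        if panel.get("type") == "row" and not panel.get("collapsed", False)
    )


def check_row_limit(dashboard, limit=MAX_VISIBLE_ROWS):
    """Return (ok, count) so a CI job can report the offending count."""
    count = visible_rows(dashboard)
    return count <= limit, count


example = {
    "panels": [
        {"type": "row", "collapsed": False, "title": "Overview"},
        {"type": "graph", "title": "Requests"},
        {"type": "row", "collapsed": True, "title": "Details",
         "panels": [{"type": "graph", "title": "Per-host latency"}]},
    ]
}
print(check_row_limit(example))
```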

  • I'm not sure whether we should come up with a best practice for which values to show next to each legend label (min/max/avg/current) and whether they should be inline, as a table, or as a table on the right. Different graphs might need different values/layouts depending on the data shown.

I am ambivalent about the legend as well. I tend to add one, but I have no rule yet and play it by ear. I'd say we use SHOULD instead of MUST for any guideline here and leave it to the graph creator.

Indeed, my general guideline is to display whichever summary helps in debugging issues, e.g. max for utilization, max/total for errors, min for availability. Ideally there should be no more than two summaries per graph; I added an explanation of this to the sample dashboard too.
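A sketch of that legend guideline as a Python helper producing a graph panel's `legend` block. The metric "kinds" and the helper are hypothetical; the legend keys (`show`, `values`, `min`, `max`, `avg`, `current`, `total`, `alignAsTable`) are those of Grafana's legacy graph panel as far as I know, and should be checked against the version in use.

```python
# Summaries from the guideline above; metric "kinds" are hypothetical labels.
SUMMARIES = {
    "utilization": ("max",),
    "errors": ("max", "total"),
    "availability": ("min",),
}
LEGEND_VALUES = ("min", "max", "avg", "current", "total")


def legend_for(kind, as_table=False):
    """Build a graph panel "legend" block with at most two summaries."""
    chosen = SUMMARIES.get(kind, ())
    if len(chosen) > 2:
        raise ValueError(f"{kind}: use at most two legend summaries")
    legend = {"show": True, "values": bool(chosen), "alignAsTable": as_table}
    for value in LEGEND_VALUES:
        legend[value] = value in chosen
    return legend


print(legend_for("errors"))
```

Unknown kinds get a plain legend with no summaries, in the spirit of SHOULD rather than MUST.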

  • I'm personally a big fan of the shared crosshair; maybe we could enable it by default.

+1

+1 too, added to the dashboard

  • I also think we should add a RED/4 golden signals method example to the proposal before we turn it into a template. Granted, SRE graphs will probably use the USE method (pun intended), but it'd still be great to have an example of that as well.

Agreed, I'll try to come up with a sample dashboard for those too. I think our USE cases (ha ha) depend a lot on whether we're diagnosing performance problems (USE) and/or looking at a service as a whole (RED/4GS).
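For a RED example, the panel queries could be templated per service. A Python sketch producing PromQL strings, assuming a hypothetical metric naming scheme (`<prefix>_requests_total` with a `code` label and a `<prefix>_request_duration_seconds` histogram); real services will differ. A USE dashboard would instead query per-resource utilization, saturation and errors.

```python
def red_queries(prefix, window="5m"):
    """PromQL for RED (Rate, Errors, Duration) panels of one service.

    Assumes hypothetical metric names: <prefix>_requests_total with an HTTP
    status "code" label, and a <prefix>_request_duration_seconds histogram.
    """
    requests = f"{prefix}_requests_total"
    return {
        # Requests per second, summed over all instances.
        "rate": f"sum(rate({requests}[{window}]))",
        # Requests per second that returned a 5xx status.
        "errors": f'sum(rate({requests}{{code=~"5.."}}[{window}]))',
        # 99th percentile latency from the histogram buckets.
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate({prefix}_request_duration_seconds_bucket[{window}])))"
        ),
    }


for name, query in red_queries("myservice").items():
    print(f"{name}: {query}")
```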

elukey added a subscriber: elukey. May 3 2018, 3:50 PM

Change 442301 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP grafana: host overview dashboard as code

https://gerrit.wikimedia.org/r/442301

I coded a strawman using grafanalib at https://gerrit.wikimedia.org/r/c/operations/puppet/+/442301 and it looks good to me so far; please take a look too. I'll expand it to multiple dashboards and use cases as well.

Note that "dashboards as code" is in scope for T171482 (Programmatic generation of grafana dashboards), not for this task, which is about dashboard organization in general.

Change 444219 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: use host-overview in favour of server-board for featured dashboard

https://gerrit.wikimedia.org/r/444219

I agree with @ori's email and the list of problem dimensions above. To better understand what's in Grafana, over the last couple of days I went through the current list of dashboards (~350) and took some notes:

  • Some dashboards are user/private/temporary
  • "Rotting" dashboards, e.g. metrics disappeared, dashboard not functional, etc.
  • I've tagged as "obsolete" the dashboards that seem like they can be deleted (e.g. they are Graphite-based but we have their Prometheus equivalent)
  • I've tagged as "needs-review" the dashboards that need some action to decide whether to delete or keep them (and possibly update them or consolidate them into some other dashboard)
  • restbase (and restbase staging) dashboards I believe can be deleted for the most part as we're now using their prometheus counterpart (cc @Eevans)
  • kubernetes staging dashboards can be folded into their k8s counterparts, since we can vary the datasource to get k8s staging data
  • I will remove the "Prometheus" prefix from "prometheus dc-overview / cluster-breakdown / global-overview" and update their navigation/drilldown accordingly
  • Some dashboards have a text panel at the top that briefly explains the dashboard and contains some pointers, e.g. to wikitech. This is very nice, and we should extend it to all main/important dashboards.
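Folding the Kubernetes staging dashboards into their production counterparts relies on a datasource template variable. A sketch in Python of the templating entry, assuming Grafana's `datasource` variable type with `query` set to the plugin type and an optional `regex` filter on datasource names; the helper names are hypothetical. Panels would then reference `$datasource` instead of a fixed datasource.

```python
def datasource_variable(name="datasource", plugin="prometheus", regex=""):
    """Templating entry that lets one dashboard switch between datasources.

    Panels then use "$datasource" instead of a fixed datasource, so e.g. a
    single Kubernetes dashboard can show production or staging.
    """
    return {
        "name": name,
        "type": "datasource",  # lists datasources of the plugin type in "query"
        "query": plugin,
        "regex": regex,        # optional filter on datasource names
    }


def add_variable(dashboard, variable):
    """Append a template variable, creating templating.list if needed."""
    dashboard.setdefault("templating", {}).setdefault("list", []).append(variable)
    return dashboard


pods = add_variable({"title": "Kubernetes Pods"}, datasource_variable())
print(pods["templating"]["list"])
```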

For the featured dashboards list, I would begin with "four golden signals" dashboards for the systems/services that receive the most user requests: gdns, nginx, varnish, lvs, apache, mediawiki, mysql, swift (likely more; this is off the top of my head and from looking at https://wikitech.wikimedia.org/wiki/Category:Wikimedia_infrastructure#/media/File:Infrastructure_overview.png)

elukey awarded a token. Jul 6 2018, 2:42 PM
Eevans added a comment. Jul 6 2018, 3:16 PM

[ ... ]

  • restbase (and restbase staging) dashboards I believe can be deleted for the most part as we're now using their prometheus counterpart (cc @Eevans)

{{done}}

Joe added a subscriber: Joe. Jul 9 2018, 4:29 PM

Change 444219 merged by Alexandros Kosiaris:
[operations/puppet@production] grafana: use host-overview in favour of server-board for featured dashboard

https://gerrit.wikimedia.org/r/444219

herron added a subscriber: herron. Aug 6 2018, 3:05 PM
fgiunchedi moved this task from In progress to Up next on the observability board. Aug 20 2018, 3:03 PM
fgiunchedi renamed this task from Better organization for ops grafana dashboards to Better organization for SRE grafana dashboards. Sep 26 2018, 8:14 AM
fgiunchedi updated the task description.
fgiunchedi moved this task from Up next to In progress on the observability board. Oct 2 2018, 8:36 AM
fgiunchedi moved this task from In progress to Up next on the observability board. Oct 15 2018, 3:06 PM

Somewhat related: Grafana upstream has an issue collecting feedback on dashboard provisioning workflows: https://github.com/grafana/grafana/issues/13823

CDanis added a subscriber: CDanis. Nov 8 2018, 2:25 PM
CDanis moved this task from Backlog to Radar on the User-CDanis board.
fgiunchedi moved this task from Up next to In progress on the observability board. Nov 26 2018, 4:14 PM
CDanis moved this task from Radar to Doing on the User-CDanis board. Dec 17 2018, 3:00 PM
CDanis added a comment (edited). Jan 2 2019, 6:56 PM

I've generated a list of Grafana dashboards sorted by their last modification time. Details below.

While this isn't a perfect proxy for "is in active use", I think it's probably a pretty good signal.

I suspect we should be trying to find owners for any very old dashboard, and just deleting them if we can't.

sqlite> SELECT d.title, v.dashboard_id, max(v.created),
   ...>        printf('https://grafana.wikimedia.org/d/%09d/%s', d.id, d.slug) AS url
   ...> FROM dashboard_version AS v
   ...> LEFT JOIN dashboard AS d ON d.id = v.dashboard_id
   ...> GROUP BY v.dashboard_id
   ...> ORDER BY max(v.created)
   ...> LIMIT 10;
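The same query can be turned into a reusable check with Python's `sqlite3` module, e.g. to list dashboards not modified since a cutoff date, as a starting point for finding owners. Table and column names come from the query above; the assumption that `created` is stored as `YYYY-MM-DD HH:MM:SS` text (so string comparison is chronological) is based on the output below and may not hold for every Grafana version. The demo tables and rows are made up.

```python
import sqlite3

# Latest version timestamp per dashboard, keeping only those older than a cutoff.
STALE_QUERY = """
SELECT d.title, v.dashboard_id, max(v.created) AS last_modified
FROM dashboard_version AS v
LEFT JOIN dashboard AS d ON d.id = v.dashboard_id
GROUP BY v.dashboard_id
HAVING max(v.created) < ?
ORDER BY last_modified
"""


def stale_dashboards(conn, cutoff):
    """(title, dashboard_id, last_modified) for dashboards untouched since cutoff.

    cutoff is a 'YYYY-MM-DD HH:MM:SS' string; this relies on timestamps being
    stored in that text format, which sorts chronologically.
    """
    return conn.execute(STALE_QUERY, (cutoff,)).fetchall()


# Demo against a tiny in-memory stand-in for the two tables (the real Grafana
# tables have more columns); point sqlite3.connect() at a copy of the real DB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dashboard (id INTEGER PRIMARY KEY, title TEXT, slug TEXT);
CREATE TABLE dashboard_version (dashboard_id INTEGER, created TEXT);
INSERT INTO dashboard VALUES (1, 'Old dashboard', 'old'), (2, 'Active dashboard', 'active');
INSERT INTO dashboard_version VALUES
    (1, '2016-01-12 18:52:57'),
    (2, '2017-01-01 00:00:00'),
    (2, '2018-09-06 13:26:47');
""")
for row in stale_dashboards(conn, "2018-01-01 00:00:00"):
    print(row)
```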

List is here:
https://phabricator.wikimedia.org/P7951

| title | dashboard_id | max(v.created) | url |
| --- | --- | --- | --- |
| Activity | 1 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000001/activity |
| echo-job-pickup | 16 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000016/echo-job |
| RelEng :: Gerrit | 63 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000063/releng-g |
| Releng :: Main page | 64 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000064/releng-m |
| Authentications | 131 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000131/authenti |
| Service :: Tilerator | 90 | 2015-12-22 04:19:41 | https://grafana.wikimedia.org/d/000000090/service- |
| Scap | 86 | 2016-01-12 18:52:57 | https://grafana.wikimedia.org/d/000000086/scap |
| Abuse | 217 | 2016-02-16 21:22:00 | https://grafana.wikimedia.org/d/000000217/abuse |
| echoflyout | 15 | 2016-02-18 11:57:41 | https://grafana.wikimedia.org/d/000000015/echoflyo |
| Client Connections | 12 | 2016-07-11 15:04:19 | https://grafana.wikimedia.org/d/000000012/client-c |
| Service :: Kartotherian | 30 | 2016-09-29 20:40:44 | https://grafana.wikimedia.org/d/000000030/service- |
| RelEng :: KPIs | 108 | 2016-10-12 12:57:41 | https://grafana.wikimedia.org/d/000000108/releng-k |
| Service :: Graphoid | 21 | 2016-10-20 07:15:33 | https://grafana.wikimedia.org/d/000000021/service- |
| Prometheus Stats | 271 | 2016-11-01 23:57:38 | https://grafana.wikimedia.org/d/000000271/promethe |
| Varnish: HTTP Errors (datacenters) | 166 | 2016-11-02 01:42:42 | https://grafana.wikimedia.org/d/000000166/varnish- |
| HTTP/2 | 319 | 2016-12-20 15:06:38 | https://grafana.wikimedia.org/d/000000319/http-2 |
| Continuous Integration | 284 | 2016-12-21 23:21:26 | https://grafana.wikimedia.org/d/000000284/continuo |
| Parsoid Heap Usage | 281 | 2017-01-20 01:01:14 | https://grafana.wikimedia.org/d/000000281/parsoid- |
| Nodepool Migration | 324 | 2017-02-16 15:39:26 | https://grafana.wikimedia.org/d/000000324/nodepool |
| Nodepool Tasks | 323 | 2017-03-15 05:04:23 | https://grafana.wikimedia.org/d/000000323/nodepool |
| BetaFeatures | 259 | 2017-03-16 17:53:27 | https://grafana.wikimedia.org/d/000000259/betafeat |
| MediaWiki Catwatch Feature | 189 | 2017-03-16 18:01:39 | https://grafana.wikimedia.org/d/000000189/mediawik |
| MediaWiki WatchedItemStore | 237 | 2017-03-16 18:34:41 | https://grafana.wikimedia.org/d/000000237/mediawik |
| Reading Web :: mobileview API | 333 | 2017-03-22 08:13:35 | https://grafana.wikimedia.org/d/000000333/reading- |
| Zuul top jobs | 348 | 2017-03-28 09:05:46 | https://grafana.wikimedia.org/d/000000348/zuul-top |
| OpenStreetMap | 349 | 2017-04-03 08:32:56 | https://grafana.wikimedia.org/d/000000349/openstre |
| Extension Distributor Downloads | 161 | 2017-05-10 01:29:45 | https://grafana.wikimedia.org/d/000000161/extensio |
| MediaWiki Cognate | 355 | 2017-05-23 12:34:34 | https://grafana.wikimedia.org/d/000000355/mediawik |
| qdisc stats | 361 | 2017-05-24 13:23:19 | https://grafana.wikimedia.org/d/000000361/qdisc-st |
| Network performances | 365 | 2017-06-13 19:22:22 | https://grafana.wikimedia.org/d/000000365/network- |
| Nodepool Pool Details | 345 | 2017-06-30 13:44:24 | https://grafana.wikimedia.org/d/000000345/nodepool |
| CAPTCHA failure rates | 370 | 2017-07-03 10:43:26 | https://grafana.wikimedia.org/d/000000370/captcha- |
| Service :: Maps - Varnish | 190 | 2017-07-04 02:29:32 | https://grafana.wikimedia.org/d/000000190/service- |
| OTRS | 371 | 2017-07-11 13:05:52 | https://grafana.wikimedia.org/d/000000371/otrs |
| Varnish Transient Storage Usage | 359 | 2017-07-27 14:44:52 | https://grafana.wikimedia.org/d/000000359/varnish- |
| Login timing | 383 | 2017-08-23 00:19:38 | https://grafana.wikimedia.org/d/000000383/login-ti |
| Network probes | 387 | 2017-08-24 14:26:15 | https://grafana.wikimedia.org/d/000000387/network- |
| MediaWiki MySQL LoadBalancer | 363 | 2017-08-25 20:00:44 | https://grafana.wikimedia.org/d/000000363/mediawik |
| Zuul job | 283 | 2017-09-11 21:16:39 | https://grafana.wikimedia.org/d/000000283/zuul-job |
| MediaWiki ElectronPdfService | 309 | 2017-09-28 17:01:14 | https://grafana.wikimedia.org/d/000000309/mediawik |
| Zuul :: Gearman | 322 | 2017-10-04 21:35:04 | https://grafana.wikimedia.org/d/000000322/zuul-gea |
| Varnish Transient Memory Breakdown | 367 | 2017-10-06 06:21:42 | https://grafana.wikimedia.org/d/000000367/varnish- |
| Maps performances | 305 | 2017-11-08 10:16:25 | https://grafana.wikimedia.org/d/000000305/maps-per |
| MySQL Replication Lag | 303 | 2017-11-23 18:12:10 | https://grafana.wikimedia.org/d/000000303/mysql-re |
| Cloud codfw | 449 | 2017-11-28 00:55:55 | https://grafana.wikimedia.org/d/000000449/cloud-co |
| Zuul | 321 | 2017-12-07 21:57:31 | https://grafana.wikimedia.org/d/000000321/zuul |
| Etherpad | 193 | 2017-12-13 14:35:49 | https://grafana.wikimedia.org/d/000000193/etherpad |
| AQS Wikistats 2 Traffic | 456 | 2017-12-14 10:44:08 | https://grafana.wikimedia.org/d/000000456/aqs-wiki |
| Nutcracker | 216 | 2017-12-20 11:36:19 | https://grafana.wikimedia.org/d/000000216/nutcrack |
| Kubernetes Kubelets | 436 | 2017-12-20 15:38:33 | https://grafana.wikimedia.org/d/000000436/kubernet |
| Postgres | 469 | 2017-12-21 11:09:06 | https://grafana.wikimedia.org/d/000000469/postgres |
| Kubernetes Staging Kubelets | 472 | 2017-12-21 11:58:54 | https://grafana.wikimedia.org/d/000000472/kubernet |
| Kubernetes Staging API | 471 | 2017-12-21 12:00:52 | https://grafana.wikimedia.org/d/000000471/kubernet |
| Kubernetes Staging Pods | 473 | 2017-12-21 12:04:32 | https://grafana.wikimedia.org/d/000000473/kubernet |
| Kubernetes Pods | 445 | 2017-12-21 12:06:53 | https://grafana.wikimedia.org/d/000000445/kubernet |
| CI Docker Jobs | 420 | 2017-12-21 12:56:40 | https://grafana.wikimedia.org/d/000000420/ci-docke |
| Redis | 174 | 2018-01-02 09:37:05 | https://grafana.wikimedia.org/d/000000174/redis |
| PyBal instances | 426 | 2018-01-08 15:39:01 | https://grafana.wikimedia.org/d/000000426/pybal-in |
| PyBal service | 422 | 2018-01-08 15:39:41 | https://grafana.wikimedia.org/d/000000422/pybal-se |
| IPVS Backend Connections | 395 | 2018-01-08 15:45:43 | https://grafana.wikimedia.org/d/000000395/ipvs-bac |
| RCFilters performance | 419 | 2018-01-09 00:25:05 | https://grafana.wikimedia.org/d/000000419/rcfilter |
| Kernel deployment | 302 | 2018-01-09 15:20:46 | https://grafana.wikimedia.org/d/000000302/kernel-d |
| VisualEditor load / save | 94 | 2018-02-02 23:11:08 | https://grafana.wikimedia.org/d/000000094/visualed |
| Parsoid Timing - html2wt | 46 | 2018-02-17 15:56:12 | https://grafana.wikimedia.org/d/000000046/parsoid- |
| Parsoid Timing - wt2html | 48 | 2018-02-17 15:57:23 | https://grafana.wikimedia.org/d/000000048/parsoid- |
| PyBal BGP | 488 | 2018-03-13 16:10:31 | https://grafana.wikimedia.org/d/000000488/pybal-bg |
| EventLogging-schema Jumbo | 494 | 2018-03-21 19:59:23 | https://grafana.wikimedia.org/d/000000494/eventlog |
| ResourceLoader: feature-test | 238 | 2018-03-21 22:49:13 | https://grafana.wikimedia.org/d/000000238/resource |
| Cache Hosts Software Versions | 474 | 2018-03-22 18:02:06 | https://grafana.wikimedia.org/d/000000474/cache-ho |
| TLS Ciphers by Data Center | 452 | 2018-03-23 08:26:07 | https://grafana.wikimedia.org/d/000000452/tls-ciph |
| Puppetdb | 477 | 2018-03-26 08:34:54 | https://grafana.wikimedia.org/d/000000477/puppetdb |
| Reading List Service | 457 | 2018-03-29 12:50:53 | https://grafana.wikimedia.org/d/000000457/reading- |
| Mobile WebPageTest | 130 | 2018-04-03 08:27:56 | https://grafana.wikimedia.org/d/000000130/mobile-w |
| Performance - Singapore caching center | 502 | 2018-04-05 13:36:41 | https://grafana.wikimedia.org/d/000000502/performa |
| WDQS Paper data | 514 | 2018-04-06 20:17:19 | https://grafana.wikimedia.org/d/000000514/wdqs-pap |
| Cassandra | 418 | 2018-04-10 20:43:44 | https://grafana.wikimedia.org/d/000000418/cassandr |
| Varnish Mailbox Lag | 478 | 2018-04-15 21:39:25 | https://grafana.wikimedia.org/d/000000478/varnish- |
| Zookeeper | 261 | 2018-04-18 09:28:04 | https://grafana.wikimedia.org/d/000000261/zookeepe |
| Kubernetes | 519 | 2018-04-18 12:54:55 | https://grafana.wikimedia.org/d/000000519/kubernet |
| Elasticsearch Per-node Percentiles | 486 | 2018-04-20 16:44:09 | https://grafana.wikimedia.org/d/000000486/elastics |
| Load Balancers | 343 | 2018-04-23 17:19:57 | https://grafana.wikimedia.org/d/000000343/load-bal |
| Elasticsearch Node Comparison - Promethe | 460 | 2018-04-24 18:14:29 | https://grafana.wikimedia.org/d/000000460/elastics |
| Analytics NUMA | 525 | 2018-04-25 09:18:37 | https://grafana.wikimedia.org/d/000000525/analytic |
| AQS | 526 | 2018-04-25 10:49:04 | https://grafana.wikimedia.org/d/000000526/aqs |
| Elasticsearch | 14 | 2018-04-30 18:02:01 | https://grafana.wikimedia.org/d/000000014/elastics |
| Caches NUMA stats | 539 | 2018-05-02 14:37:16 | https://grafana.wikimedia.org/d/000000539/caches-n |
| Kafka By Topic | 234 | 2018-05-02 16:20:30 | https://grafana.wikimedia.org/d/000000234/kafka-by |
| Cassandra Client Request | 483 | 2018-05-03 09:12:12 | https://grafana.wikimedia.org/d/000000483/cassandr |
| Cassandra Read-repair | 497 | 2018-05-03 09:12:35 | https://grafana.wikimedia.org/d/000000497/cassandr |
| Cassandra Tables | 453 | 2018-05-03 09:14:45 | https://grafana.wikimedia.org/d/000000453/cassandr |
| Cassandra Threadpools | 433 | 2018-05-03 09:15:20 | https://grafana.wikimedia.org/d/000000433/cassandr |
| Varnish Failed Fetches | 352 | 2018-05-03 14:10:47 | https://grafana.wikimedia.org/d/000000352/varnish- |
| Varnish Daemons Hitrate | 443 | 2018-05-07 10:21:15 | https://grafana.wikimedia.org/d/000000443/varnish- |
| Ganeti | 545 | 2018-05-09 12:30:35 | https://grafana.wikimedia.org/d/000000545/ganeti |
| Kafka MirrorMaker | 521 | 2018-05-14 17:42:45 | https://grafana.wikimedia.org/d/000000521/kafka-mi |
| Kafka (graphite) | 523 | 2018-05-14 20:15:03 | https://grafana.wikimedia.org/d/000000523/kafka-gr |
| Kafka Consumer Lag | 484 | 2018-05-17 06:31:11 | https://grafana.wikimedia.org/d/000000484/kafka-co |
| Prometheus Varnish HTTP Requests | 501 | 2018-06-05 10:11:51 | https://grafana.wikimedia.org/d/000000501/promethe |
| TLS CipherSuite Explorer | 458 | 2018-06-08 00:18:25 | https://grafana.wikimedia.org/d/000000458/tls-ciph |
| Cassandra System | 417 | 2018-06-15 15:15:46 | https://grafana.wikimedia.org/d/000000417/cassandr |
| Service Endpoint performance | 358 | 2018-06-21 10:53:13 | https://grafana.wikimedia.org/d/000000358/service- |
| Synthetic performance sdev/mdev | 554 | 2018-06-27 08:49:02 | https://grafana.wikimedia.org/d/000000554/syntheti |
| Kubernetes API | 435 | 2018-06-28 15:25:37 | https://grafana.wikimedia.org/d/000000435/kubernet |
| VarnishKafka | 253 | 2018-06-29 09:07:38 | https://grafana.wikimedia.org/d/000000253/varnishk |
| Elasticsearch Percentiles - Prometheus | 455 | 2018-07-05 12:41:26 | https://grafana.wikimedia.org/d/000000455/elastics |
| Elasticsearch Percentiles Beta | 250 | 2018-07-05 12:41:46 | https://grafana.wikimedia.org/d/000000250/elastics |
| Hook calls | 24 | 2018-07-05 12:55:01 | https://grafana.wikimedia.org/d/000000024/hook-cal |
| Interactive team KPI | 285 | 2018-07-05 12:58:25 | https://grafana.wikimedia.org/d/000000285/interact |
| Interactive team KPI (backup) | 300 | 2018-07-05 13:02:28 | https://grafana.wikimedia.org/d/000000300/interact |
| LoginNotify | 385 | 2018-07-05 14:01:58 | https://grafana.wikimedia.org/d/000000385/loginnot |
| Maps Dashboard - draft | 314 | 2018-07-05 14:03:50 | https://grafana.wikimedia.org/d/000000314/maps-das |
| Maps KPI | 310 | 2018-07-05 14:04:12 | https://grafana.wikimedia.org/d/000000310/maps-kpi |
| Maps :: Cassandra | 233 | 2018-07-05 14:05:05 | https://grafana.wikimedia.org/d/000000233/maps-cas |
| MediaWiki BounceHandler | 142 | 2018-07-05 15:34:57 | https://grafana.wikimedia.org/d/000000142/mediawik |
| Mediawiki AdvancedSearch | 434 | 2018-07-06 08:46:55 | https://grafana.wikimedia.org/d/000000434/mediawik |
| Node Exporter Server Metrics | 342 | 2018-07-06 09:05:03 | https://grafana.wikimedia.org/d/000000342/node-exp |
| PAWS | 235 | 2018-07-06 09:07:14 | https://grafana.wikimedia.org/d/000000235/paws |
| Cluster hardware specs differences | 332 | 2018-07-06 09:25:49 | https://grafana.wikimedia.org/d/000000332/cluster- |
| Prometheus Varnish: HTTP Errors (datacen | 508 | 2018-07-06 09:28:51 | https://grafana.wikimedia.org/d/000000508/promethe |
| Apache/HHVM | 327 | 2018-07-06 09:33:16 | https://grafana.wikimedia.org/d/000000327/apache-h |
| RecDNS | 375 | 2018-07-06 10:01:05 | https://grafana.wikimedia.org/d/000000375/recdns |
| Site power usage | 397 | 2018-07-06 10:08:01 | https://grafana.wikimedia.org/d/000000397/site-pow |
| Trending Service | 315 | 2018-07-06 10:12:06 | https://grafana.wikimedia.org/d/000000315/trending |
| machine disk I/O | 236 | 2018-07-06 10:37:26 | https://grafana.wikimedia.org/d/000000236/machine- |
| Microcode Updates | 556 | 2018-07-09 09:01:47 | https://grafana.wikimedia.org/d/000000556/microcod |
| Job Queue Health | 107 | 2018-07-09 21:41:31 | https://grafana.wikimedia.org/d/000000107/job-queu |
| Job Queue Rate | 105 | 2018-07-09 21:42:32 | https://grafana.wikimedia.org/d/000000105/job-queu |
| prometheus-varnish-http-Errors | 557 | 2018-07-13 04:37:35 | https://grafana.wikimedia.org/d/000000557/promethe |
| Reading Web :: Page Previews | 340 | 2018-07-16 14:49:25 | https://grafana.wikimedia.org/d/000000340/reading- |
| EventStreams | 336 | 2018-07-25 07:07:44 | https://grafana.wikimedia.org/d/000000336/eventstr |
| API frontend summary | 202 | 2018-07-27 01:26:49 | https://grafana.wikimedia.org/d/000000202/api-fron |
| Service :: Mathoid | 187 | 2018-07-31 12:48:50 | https://grafana.wikimedia.org/d/000000187/service- |
| Save Timing Alerts | 362 | 2018-08-05 20:33:46 | https://grafana.wikimedia.org/d/000000362/save-tim |
| API backend summary | 2 | 2018-08-06 06:08:46 | https://grafana.wikimedia.org/d/000000002/api-back |
| Varnish machine stats | 330 | 2018-08-14 14:34:09 | https://grafana.wikimedia.org/d/000000330/varnish- |
| Nodepool | 276 | 2018-08-15 03:58:13 | https://grafana.wikimedia.org/d/000000276/nodepool |
| HHVM APC Usage breakdown | 499 | 2018-08-20 22:42:58 | https://grafana.wikimedia.org/d/000000499/hhvm-apc |
| HHVM APC Usage | 496 | 2018-08-20 22:43:30 | https://grafana.wikimedia.org/d/000000496/hhvm-apc |
| Varnish Caching | 500 | 2018-08-22 01:11:29 | https://grafana.wikimedia.org/d/000000500/varnish- |
| JobQueue EventBus | 400 | 2018-08-22 17:33:53 | https://grafana.wikimedia.org/d/000000400/jobqueue |
| Elasticsearch Indexing - prometheus | 461 | 2018-08-22 17:41:08 | https://grafana.wikimedia.org/d/000000461/elastics |
| Mobile Dashboard | 35 | 2018-08-23 18:59:39 | https://grafana.wikimedia.org/d/000000035/mobile-d |
| ResourceLoader Modules | 430 | 2018-08-24 00:29:47 | https://grafana.wikimedia.org/d/000000430/resource |
| ResourceLoaderModule | 67 | 2018-08-24 00:31:47 | https://grafana.wikimedia.org/d/000000067/resource |
| Edit Count | 208 | 2018-09-05 15:52:19 | https://grafana.wikimedia.org/d/000000208/edit-cou |
| Thumbor | 291 | 2018-09-06 13:26:47 | https://grafana.wikimedia.org/d/000000291/thumbor |
| Varnish Backend Connections | 439 | 2018-09-11 14:24:17 | https://grafana.wikimedia.org/d/000000439/varnish- |
| Production Logging | 102 | 2018-09-12 15:00:38 | https://grafana.wikimedia.org/d/000000102/producti |
| ORES extension | 263 | 2018-09-19 15:58:27 | https://grafana.wikimedia.org/d/000000263/ores-ext |
| MediaWiki Static | 212 | 2018-09-19 23:45:09 | https://grafana.wikimedia.org/d/000000212/mediawik |
| Echo Mention Errors | 254 | 2018-09-25 09:45:46 | https://grafana.wikimedia.org/d/000000254/echo-men |
| Echo Mention Status Notifications | 270 | 2018-09-25 09:46:12 | https://grafana.wikimedia.org/d/000000270/echo-men |
| MediaWiki Edit Conflicts | 213 | 2018-09-25 09:47:05 | https://grafana.wikimedia.org/d/000000213/mediawik |
| Mediawiki TwoColConflict | 346 | 2018-09-25 09:52:13 | https://grafana.wikimedia.org/d/000000346/mediawik |
| Mediawiki RevisionSlider | 260 | 2018-09-25 09:53:46 | https://grafana.wikimedia.org/d/000000260/mediawik |
| RESTBase external overview | 577 | 2018-09-25 22:33:28 | https://grafana.wikimedia.org/d/000000577/restbase |
| MediaWiki FileImporter | 553 | 2018-09-26 22:24:30 | https://grafana.wikimedia.org/d/000000553/mediawik |
160Navigation Timing by Browser 218 2018-09-28 08:58:30 https://grafana.wikimedia.org/d/000000218/navigati
161Piwik 354 2018-10-02 07:54:14 https://grafana.wikimedia.org/d/000000354/piwik
162MySQL 273 2018-10-02 09:23:54 https://grafana.wikimedia.org/d/000000273/mysql
163Parsoid http status codes 42 2018-10-03 16:27:11 https://grafana.wikimedia.org/d/000000042/parsoid-
164MySQL Aggregated 278 2018-10-03 19:12:17 https://grafana.wikimedia.org/d/000000278/mysql-ag
165WebPageTest Portals 146 2018-10-03 19:13:44 https://grafana.wikimedia.org/d/000000146/webpaget
166Mobile 2G 205 2018-10-03 19:14:34 https://grafana.wikimedia.org/d/000000205/mobile-2
167HTTPS 25 2018-10-03 19:14:58 https://grafana.wikimedia.org/d/000000025/https
168Media 34 2018-10-09 09:44:06 https://grafana.wikimedia.org/d/000000034/media
169PyBal 421 2018-10-09 15:21:09 https://grafana.wikimedia.org/d/000000421/pybal
170Varnish: HTTP Errors 503 2018-10-10 21:01:02 https://grafana.wikimedia.org/d/000000503/varnish-
171Host overview grafanalib 555 2018-10-13 16:48:20 https://grafana.wikimedia.org/d/000000555/host-ove
172MediaWiki AbuseFilter Profiling 393 2018-10-14 01:53:20 https://grafana.wikimedia.org/d/000000393/mediawik
173Memcache-historic-data 586 2018-10-18 06:54:03 https://grafana.wikimedia.org/d/000000586/memcache
174Proton 563 2018-10-18 18:31:23 https://grafana.wikimedia.org/d/000000563/proton
175Article Placeholder 244 2018-10-18 18:37:45 https://grafana.wikimedia.org/d/000000244/article-
176Analytics Hadoop 258 2018-10-19 06:54:36 https://grafana.wikimedia.org/d/000000258/analytic
177EventLogging 505 2018-10-23 16:59:43 https://grafana.wikimedia.org/d/000000505/eventlog
178Kafka 27 2018-10-24 15:56:40 https://grafana.wikimedia.org/d/000000027/kafka
179Hive 379 2018-10-25 14:04:51 https://grafana.wikimedia.org/d/000000379/hive
180Elasticsearch - Mjolnir Bulk Updates 591 2018-10-30 21:01:28 https://grafana.wikimedia.org/d/000000591/elastics
181EventLogging-schema 18 2018-10-31 17:00:42 https://grafana.wikimedia.org/d/000000018/eventlog
182MediaWiki Graphite Alerts 438 2018-11-02 01:07:33 https://grafana.wikimedia.org/d/000000438/mediawik
183Varnish Traffic - Instance Breakdown 450 2018-11-05 14:54:07 https://grafana.wikimedia.org/d/000000450/varnish-
184NTP time servers 228 2018-11-06 18:58:41 https://grafana.wikimedia.org/d/000000228/ntp-time
185parsoid servers cpu usage 44 2018-11-06 19:17:44 https://grafana.wikimedia.org/d/000000044/parsoid-
186Zuul :: Pipeline 594 2018-11-12 11:03:53 https://grafana.wikimedia.org/d/000000594/zuul-pip
187Swift 4GS 584 2018-11-16 10:14:42 https://grafana.wikimedia.org/d/000000584/swift-4g
188Parsoid: perf trends 135 2018-11-16 22:04:14 https://grafana.wikimedia.org/d/000000135/parsoid-
189parsoid times vs doc size 45 2018-11-16 22:06:29 https://grafana.wikimedia.org/d/000000045/parsoid-
190TCP Fast Open 257 2018-11-19 21:35:05 https://grafana.wikimedia.org/d/000000257/tcp-fast
191Elasticsearch Memory - prometheus 462 2018-11-20 13:07:10 https://grafana.wikimedia.org/d/000000462/elastics
192Graphite (eqiad) 20 2018-11-20 14:27:22 https://grafana.wikimedia.org/d/000000020/graphite
193Graphite (codfw) 337 2018-11-20 17:27:38 https://grafana.wikimedia.org/d/000000337/graphite
194Rsyslog 596 2018-11-21 16:33:14 https://grafana.wikimedia.org/d/000000596/rsyslog
195API requests Breakdown 559 2018-11-21 18:21:13 https://grafana.wikimedia.org/d/000000559/api-requ
196Datacenter global overview 605 2018-11-26 14:50:04 https://grafana.wikimedia.org/d/000000605/datacent
197EventBus 201 2018-11-27 15:13:12 https://grafana.wikimedia.org/d/000000201/eventbus
198RESTBase 68 2018-11-27 15:29:41 https://grafana.wikimedia.org/d/000000068/restbase
199mw-js-deprecate 37 2018-11-30 19:17:13 https://grafana.wikimedia.org/d/000000037/mw-js-de
200Elasticsearch - Mjolnir msearch 616 2018-11-30 19:44:25 https://grafana.wikimedia.org/d/000000616/elastics
201Druid 538 2018-12-03 08:23:04 https://grafana.wikimedia.org/d/000000538/druid
202MediaWiki Application servers 550 2018-12-04 16:06:11 https://grafana.wikimedia.org/d/000000550/mediawik
203Navigation Timing by Country 232 2018-12-05 14:19:02 https://grafana.wikimedia.org/d/000000232/navigati
204Navigation Timing by Continent 230 2018-12-05 14:19:05 https://grafana.wikimedia.org/d/000000230/navigati
205Hadoop 585 2018-12-05 15:59:07 https://grafana.wikimedia.org/d/000000585/hadoop
206Service :: Citoid 11 2018-12-05 19:17:12 https://grafana.wikimedia.org/d/000000011/service-
207Varnish HTTP Requests 180 2018-12-06 14:49:00 https://grafana.wikimedia.org/d/000000180/varnish-
208Varnish traffic 93 2018-12-06 20:06:56 https://grafana.wikimedia.org/d/000000093/varnish-
209Arc Lamp 578 2018-12-07 21:27:43 https://grafana.wikimedia.org/d/000000578/arc-lamp
210Traffic 621 2018-12-10 21:06:15 https://grafana.wikimedia.org/d/000000621/traffic
211ATS Cache Operations 569 2018-12-10 21:08:03 https://grafana.wikimedia.org/d/000000569/ats-cach
212Prometheus Varnish DC stats 304 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000304/promethe
213Prometheus Varnish Aggregate Client Stat 464 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000464/promethe
214Varnish Caching Last Week Comparison 541 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000541/varnish-
215Performance Metrics 50 2018-12-10 23:39:54 https://grafana.wikimedia.org/d/000000050/performa
216ResourceLoader 66 2018-12-11 00:00:49 https://grafana.wikimedia.org/d/000000066/resource
217Navigation Timing by Platform 38 2018-12-11 01:41:19 https://grafana.wikimedia.org/d/000000038/navigati
218ResourceLoader Alerts 402 2018-12-11 01:48:44 https://grafana.wikimedia.org/d/000000402/resource
219Mcrouter 549 2018-12-11 08:35:36 https://grafana.wikimedia.org/d/000000549/mcrouter
220Swift 622 2018-12-11 14:47:39 https://grafana.wikimedia.org/d/000000622/swift
221Varnish: Aggregate Client Status Codes 623 2018-12-11 14:47:40 https://grafana.wikimedia.org/d/000000623/varnish-
222ATS Instance Drilldown 610 2018-12-11 16:51:08 https://grafana.wikimedia.org/d/000000610/ats-inst
223WebPageReplay 431 2018-12-12 06:06:19 https://grafana.wikimedia.org/d/000000431/webpager
224WebPageReplay drilldown 572 2018-12-12 06:13:23 https://grafana.wikimedia.org/d/000000572/webpager
225WebPageReplay Desktop Alerts 491 2018-12-12 06:19:10 https://grafana.wikimedia.org/d/000000491/webpager
226WebPageReplay Mobile Alerts 490 2018-12-12 06:23:12 https://grafana.wikimedia.org/d/000000490/webpager
227WebPageTest 210 2018-12-12 06:26:44 https://grafana.wikimedia.org/d/000000210/webpaget
228WebPageTest drilldown 95 2018-12-12 06:28:44 https://grafana.wikimedia.org/d/000000095/webpaget
229WebPageTest alerts 318 2018-12-12 06:30:23 https://grafana.wikimedia.org/d/000000318/webpaget
230mobileapps 183 2018-12-12 19:08:07 https://grafana.wikimedia.org/d/000000183/mobileap
231Navigation Timing alerts 326 2018-12-12 23:02:58 https://grafana.wikimedia.org/d/000000326/navigati
232xxxx Zotero debugging kubernetes 620 2018-12-13 12:01:32 https://grafana.wikimedia.org/d/000000620/xxxx-zot
233ORES 255 2018-12-13 14:41:24 https://grafana.wikimedia.org/d/000000255/ores
234Service :: CXServer 593 2018-12-14 06:10:27 https://grafana.wikimedia.org/d/000000593/service-
235Navigation Timing 143 2018-12-14 23:52:15 https://grafana.wikimedia.org/d/000000143/navigati
236Joal Kafka Test 26 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000026/joal-kaf
237Experimental - backend 5xx 219 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000219/experime
238EventLogging-schema - to be deleted 506 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000506/eventlog
239Kafka MirrorMaker old to delete 520 2018-12-17 13:47:15 https://grafana.wikimedia.org/d/000000520/kafka-mi
240Kafka By Topic (graphite) 524 2018-12-17 13:47:15 https://grafana.wikimedia.org/d/000000524/kafka-by
241Labs Project Board 112 2018-12-17 13:47:16 https://grafana.wikimedia.org/d/000000112/labs-pro
242Dashboards to be deleted (T178690) 627 2018-12-17 13:47:44 https://grafana.wikimedia.org/d/000000627/dashboar
243User dashboards 628 2018-12-17 13:54:31 https://grafana.wikimedia.org/d/000000628/user-das
244bd808-test 5 2018-12-17 13:54:32 https://grafana.wikimedia.org/d/000000005/bd808-te
245dashboard-redesign-proposal-4gs 552 2018-12-17 13:54:33 https://grafana.wikimedia.org/d/000000552/dashboar
246imarlier db debug 487 2018-12-17 13:54:34 https://grafana.wikimedia.org/d/000000487/imarlier
247jgreen frdev1001 414 2018-12-17 13:54:36 https://grafana.wikimedia.org/d/000000414/jgreen-f
248Julien Maps Dashboard 311 2018-12-17 13:54:37 https://grafana.wikimedia.org/d/000000311/julien-m
249Joal NUMA 527 2018-12-17 13:54:37 https://grafana.wikimedia.org/d/000000527/joal-num
250Krinkle Dashboard 31 2018-12-17 13:54:38 https://grafana.wikimedia.org/d/000000031/krinkle-
251JVM overview - work in progress - gehel 537 2018-12-17 13:54:38 https://grafana.wikimedia.org/d/000000537/jvm-over
252Ladsgroup-test 378 2018-12-17 13:54:39 https://grafana.wikimedia.org/d/000000378/ladsgrou
253Krinkle Sanbox 558 2018-12-17 13:54:39 https://grafana.wikimedia.org/d/000000558/krinkle-
254Logstash (herron WIP) 564 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000564/logstash
255Lucas Sandbox 592 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000592/lucas-sa
256memcache-elukey 614 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000614/memcache
257Niedzielski Sandbox - Mobile 2G 518 2018-12-17 13:54:41 https://grafana.wikimedia.org/d/000000518/niedziel
258xxxx cdanis Host overview Copy 600 2018-12-17 13:54:41 https://grafana.wikimedia.org/d/000000600/xxxx-cda
259xxxx cdanis test 595 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000595/xxxx-cda
260xxxx cdanis thermal health 597 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000597/xxxx-cda
261xxxx cwhite temp 603 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000603/xxxx-cwh
262xxxxx Kubernetes Pods (Fsero) 619 2018-12-17 13:54:43 https://grafana.wikimedia.org/d/000000619/xxxxx-ku
263Content translation 598 2018-12-18 04:01:13 https://grafana.wikimedia.org/d/000000598/content-
264Save Timing 85 2018-12-18 20:40:14 https://grafana.wikimedia.org/d/000000085/save-tim
265Edit Stash 249 2018-12-19 01:19:57 https://grafana.wikimedia.org/d/000000249/edit-sta
266Team TCB 288 2018-12-19 16:40:52 https://grafana.wikimedia.org/d/000000288/team-tcb
267Scribunto 139 2018-12-19 16:41:42 https://grafana.wikimedia.org/d/000000139/scribunt
268DNS 341 2018-12-20 11:43:18 https://grafana.wikimedia.org/d/000000341/dns
269DNS recursors 399 2018-12-20 11:43:35 https://grafana.wikimedia.org/d/000000399/dns-recu
270Frontend Traffic 479 2018-12-20 11:44:31 https://grafana.wikimedia.org/d/000000479/frontend
271Host overview 377 2018-12-20 11:46:52 https://grafana.wikimedia.org/d/000000377/host-ove
272Filippo home test 630 2018-12-20 11:47:04 https://grafana.wikimedia.org/d/000000630/filippo-
273Cluster overview 607 2018-12-20 15:20:52 https://grafana.wikimedia.org/d/000000607/cluster-
274Drafts 631 2018-12-20 15:41:41 https://grafana.wikimedia.org/d/000000631/drafts
275Network Performances Global 366 2018-12-20 15:42:10 https://grafana.wikimedia.org/d/000000366/network-
276Apache Backend-Timing 580 2018-12-20 15:42:10 https://grafana.wikimedia.org/d/000000580/apache-b
277Parser Cache 106 2018-12-20 15:42:11 https://grafana.wikimedia.org/d/000000106/parser-c
278Ping offload 513 2018-12-20 15:42:11 https://grafana.wikimedia.org/d/000000513/ping-off
279Fundraising 632 2018-12-20 15:45:12 https://grafana.wikimedia.org/d/000000632/fundrais
280fundraising database 403 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000403/fundrais
281fundraising overview 408 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000408/fundrais
282fundraising database (all) 412 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000412/fundrais
283fundraising mariadb 424 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000424/fundrais
284fundraising host overview 425 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000425/fundrais
285fundraising redis 401 2018-12-20 15:45:40 https://grafana.wikimedia.org/d/000000401/fundrais
286WMCS 633 2018-12-20 15:47:41 https://grafana.wikimedia.org/d/000000633/wmcs
287CloudVPS eqiad1 571 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000571/cloudvps
288cloud-capacity-planning 576 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000576/cloud-ca
289cloudvps-rabbitmq 617 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000617/cloudvps
290Labs Monitoring 32 2018-12-20 15:47:59 https://grafana.wikimedia.org/d/000000032/labs-mon
291Labs DNS Dashboard 240 2018-12-20 15:47:59 https://grafana.wikimedia.org/d/000000240/labs-dns
292Labs Nova Fullstack 339 2018-12-20 15:48:00 https://grafana.wikimedia.org/d/000000339/labs-nov
293labs-capacity-planning 225 2018-12-20 15:48:01 https://grafana.wikimedia.org/d/000000225/labs-cap
294Labstore - NFS Directory Sizes 338 2018-12-20 15:48:01 https://grafana.wikimedia.org/d/000000338/labstore
295labvirt node disk stats 33 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000033/labvirt-
296OpenLDAP Labs 181 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000181/openldap
297labstore1004/1005 568 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000568/labstore
298WMCS API uptimes 405 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000405/wmcs-api
299WMCS OpenStack eqiad1 579 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000579/wmcs-ope
300WMCS openstack eqiad1 hypervisor 624 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000624/wmcs-ope
301BlockNotices Alex 629 2018-12-20 15:48:56 https://grafana.wikimedia.org/d/000000629/blocknot
302frdb2001 394 2018-12-20 16:25:24 https://grafana.wikimedia.org/d/000000394/frdb2001
303frtechmail dashboard 567 2018-12-20 16:25:24 https://grafana.wikimedia.org/d/000000567/frtechma
304Jobqueues-elukey 360 2018-12-20 16:25:40 https://grafana.wikimedia.org/d/000000360/jobqueue
305Pageviews API 196 2018-12-20 16:28:44 https://grafana.wikimedia.org/d/000000196/pageview
306PHP metrics 609 2018-12-20 16:29:36 https://grafana.wikimedia.org/d/000000609/php-metr
307Datacenter overview 608 2018-12-20 16:33:06 https://grafana.wikimedia.org/d/000000608/datacent
308Prometheus machine stats 274 2018-12-20 16:33:41 https://grafana.wikimedia.org/d/000000274/promethe
309Frontend Responses NGINX vs Varnish 612 2018-12-20 16:35:14 https://grafana.wikimedia.org/d/000000612/frontend
310Logstash 561 2018-12-20 16:38:02 https://grafana.wikimedia.org/d/000000561/logstash
311Mail 451 2018-12-20 16:38:25 https://grafana.wikimedia.org/d/000000451/mail
312MySQL core 272 2018-12-20 16:40:44 https://grafana.wikimedia.org/d/000000272/mysql-co
313Network Errors by Cluster 562 2018-12-20 16:45:43 https://grafana.wikimedia.org/d/000000562/network-
314Wikidata dashboards 634 2018-12-20 16:48:39 https://grafana.wikimedia.org/d/000000634/wikidata
315Wikidata 154 2018-12-20 16:48:42 https://grafana.wikimedia.org/d/000000154/wikidata
316Wikidata API 169 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000169/wikidata
317Wikidata change propagation 485 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000485/wikidata
318Wikidata Addshore Monitoring 601 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000601/wikidata
319Wikidata Datamodel 167 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000167/wikidata
320Wikidata Datamodel References 182 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000182/wikidata
321Wikidata Co-Editors 560 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000560/wikidata
322Wikidata Datamodel Terms 168 2018-12-20 16:48:45 https://grafana.wikimedia.org/d/000000168/wikidata
323Wikidata Datamodel Statements 175 2018-12-20 16:48:45 https://grafana.wikimedia.org/d/000000175/wikidata
324Wikidata Dispatch 156 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000156/wikidata
325Wikidata Dispatch Script 239 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000239/wikidata
326Wikidata Dump Downloads 264 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000264/wikidata
327Wikidata Edits 170 2018-12-20 16:48:47 https://grafana.wikimedia.org/d/000000170/wikidata
328Wikidata EditEntity 615 2018-12-20 16:48:47 https://grafana.wikimedia.org/d/000000615/wikidata
329Wikidata Page views (per domain) 158 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000158/wikidata
330Wikidata Entity Usage 160 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000160/wikidata
331Wikidata Entity Usage Project 176 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000176/wikidata
332Wikidata Quality 344 2018-12-20 16:48:49 https://grafana.wikimedia.org/d/000000344/wikidata
333Wikidata Query Service 489 2018-12-20 16:48:49 https://grafana.wikimedia.org/d/000000489/wikidata
334Wikidata Site Stats 162 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000162/wikidata
335Wikidata Query Service UI 290 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000290/wikidata
336Wikidata Query Service Frontend 522 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000522/wikidata
337Wikidata Social Followers 159 2018-12-20 16:48:51 https://grafana.wikimedia.org/d/000000159/wikidata
338Wikidata Tasks 172 2018-12-20 16:48:51 https://grafana.wikimedia.org/d/000000172/wikidata
339Wikidata Special:EntityData 188 2018-12-20 16:48:51 https://grafana.wikimedia.org/d/000000188/wikidata
340Wikidata top page views 163 2018-12-20 16:48:52 https://grafana.wikimedia.org/d/000000163/wikidata
341Wikidata WebPageTest 209 2018-12-20 16:48:52 https://grafana.wikimedia.org/d/000000209/wikidata
342WikidataClient change handling (WikiPage 227 2018-12-20 16:48:53 https://grafana.wikimedia.org/d/000000227/wikidata
343T204083 investigation 574 2018-12-20 16:52:59 https://grafana.wikimedia.org/d/000000574/t204083-
344Switchover/Switchback 581 2018-12-20 16:57:54 https://grafana.wikimedia.org/d/000000581/switchov
345Wikibase API error rate 226 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000226/wikibase
346Wikibase API wbgetentities 265 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000265/wikibase
347Wikibase docker images 516 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000516/wikibase
348Wikibase wb_terms 548 2018-12-20 17:19:59 https://grafana.wikimedia.org/d/000000548/wikibase
349Wikibase FormatterCache 626 2018-12-20 17:19:59 https://grafana.wikimedia.org/d/000000626/wikibase
350Wikibase wb_terms newItemIdFormatter 599 2018-12-20 17:20:00 https://grafana.wikimedia.org/d/000000599/wikibase
351Ciphers 10 2018-12-20 17:20:26 https://grafana.wikimedia.org/d/000000010/ciphers
352WMCS - Node Exporter Full 590 2018-12-20 17:20:28 https://grafana.wikimedia.org/d/000000590/wmcs-nod
353Backend Save Timing Breakdown 429 2018-12-20 23:13:32 https://grafana.wikimedia.org/d/000000429/backend-
354Phabricator 587 2018-12-20 23:20:50 https://grafana.wikimedia.org/d/000000587/phabrica
355BlockNotices 618 2018-12-21 05:15:50 https://grafana.wikimedia.org/d/000000618/blocknot
356Performance perception survey 551 2018-12-21 10:08:37 https://grafana.wikimedia.org/d/000000551/performa
357Reading Web Dashboard 566 2018-12-21 11:36:07 https://grafana.wikimedia.org/d/000000566/reading-
358Dashboard redesign proposal 536 2018-12-21 14:20:53 https://grafana.wikimedia.org/d/000000536/dashboar
359Authentication metrics 4 2018-12-28 14:45:25 https://grafana.wikimedia.org/d/000000004/authenti
360Memcache 316 2018-12-30 15:24:22 https://grafana.wikimedia.org/d/000000316/memcache
361Elasticsearch Indexing - prometheus (Add 635 2019-01-02 07:40:20 https://grafana.wikimedia.org/d/000000635/elastics
362Memcache-Slabs 636 2019-01-02 14:15:56 https://grafana.wikimedia.org/d/000000636/memcache

I've just seen that a dashboard I use is scheduled for deletion. I don't find the replacement particularly better, and it lacks some things. Could you have a look at how others approach this, such as Percona's system overview (https://pmmdemo.percona.com/graph/d/qyzrQGHmk/system-overview)? Their dashboards can be downloaded from https://github.com/percona/grafana-dashboards

CDanis added a comment. Edited Jan 14 2019, 12:56 PM

Jaime, going to have to guess here; are you referring to "Prometheus machine stats" (marked for deletion) vs "Host overview"?

Yes.

I've just run into the same problem and had the same immediate reaction as @jcrespo. But the more I look at "Host overview", the more I like it. @jcrespo, I think it's important to spell out exactly what you don't like or find lacking in the new dashboard. It follows the USE method [1] (Utilization, Saturation, Errors), so it takes a bit to wrap your mind around the methodology, which may or may not suit your use case. If it doesn't, we should know.

AFAICT, my gripe is the misc section, where it's not at all clear to me what "misc: errors" is about. I also think we might need to add some ICMP/TCP/UDP graphs, perhaps in a row that is collapsed by default?

[1] http://www.brendangregg.com/usemethod.html
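One possible source for such ICMP/TCP/UDP graphs is the kernel's per-protocol counters in /proc/net/snmp, which node_exporter's netstat collector exposes as node_netstat_* metrics (subject to its field filter). The Python sketch below is illustrative only: the function names are hypothetical and the sample data is made up. It parses that file's alternating header/value lines and derives two rates one might graph: retransmitted TCP segments relative to OutSegs, and UDP receive errors over an interval.

```python
def parse_proc_net_snmp(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}.

    The file alternates a header line of counter names and a line of values,
    both prefixed with the protocol name (e.g. "Tcp:").
    """
    rows = [line.split() for line in text.strip().splitlines()]
    stats: dict[str, dict[str, int]] = {}
    for header, values in zip(rows[0::2], rows[1::2]):
        proto = header[0].rstrip(":")
        stats[proto] = dict(zip(header[1:], map(int, values[1:])))
    return stats


def tcp_retransmit_ratio(prev: dict, cur: dict) -> float:
    """RetransSegs relative to OutSegs over the interval between two snapshots."""
    sent = cur["Tcp"]["OutSegs"] - prev["Tcp"]["OutSegs"]
    retrans = cur["Tcp"]["RetransSegs"] - prev["Tcp"]["RetransSegs"]
    return retrans / sent if sent else 0.0


def udp_errors(prev: dict, cur: dict) -> int:
    """Received UDP datagrams that could not be delivered (InErrors) in the interval."""
    return cur["Udp"]["InErrors"] - prev["Udp"]["InErrors"]


# Trimmed, made-up samples; the real file has many more counters per protocol.
T0 = """Tcp: ActiveOpens OutSegs RetransSegs
Tcp: 10 1000 5
Udp: InDatagrams InErrors
Udp: 500 0"""
T1 = """Tcp: ActiveOpens OutSegs RetransSegs
Tcp: 12 3000 45
Udp: InDatagrams InErrors
Udp: 900 3"""

before, after = parse_proc_net_snmp(T0), parse_proc_net_snmp(T1)
print(tcp_retransmit_ratio(before, after))  # 0.02
print(udp_errors(before, after))            # 3
```

In a Grafana panel the same arithmetic would normally be done in the query over successive scrapes rather than in Python.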

akosiaris triaged this task as Low priority. Jan 25 2019, 2:54 PM
CDanis added a comment. Edited Jan 25 2019, 2:57 PM

Ah, I forgot to update the task: at the time, @jcrespo, @fgiunchedi and I talked, and Jaime's biggest gripe was that the iostat-reported "disk IO utilization" is not a very useful metric. It is the fraction of time during which at least one I/O request was outstanding in the disk's queue, and it says nothing about how many requests the device could have served in parallel. On a server that has any load at all, this metric will generally read "100%" all the time; what you actually care about are stats like "queue depth" and "request latencies".

https://www.percona.com/blog/2017/08/28/looking-disk-utilization-and-saturation/ had some good thoughts on the issue as well.
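To make the distinction concrete, here is a minimal Python sketch (not taken from any existing dashboard or tool) that derives three iostat-style numbers from two snapshots of the cumulative Linux /proc/diskstats counters. The DiskCounters field names are my own labels for the relevant columns, and the two devices are made up: both show 100% utilization, yet one has a single request in flight at a time and the other 32, with correspondingly different latencies.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """A few cumulative counters from one device line of /proc/diskstats."""
    reads_completed: int
    read_time_ms: int
    writes_completed: int
    write_time_ms: int
    io_time_ms: int           # time with at least one request in flight (%util source)
    weighted_io_time_ms: int  # in-flight request count integrated over time


def disk_metrics(prev: DiskCounters, cur: DiskCounters, interval_ms: float) -> dict:
    """Derive iostat-style metrics from two snapshots taken interval_ms apart."""
    ios = (cur.reads_completed - prev.reads_completed) + (
        cur.writes_completed - prev.writes_completed
    )
    io_wait_ms = (cur.read_time_ms - prev.read_time_ms) + (
        cur.write_time_ms - prev.write_time_ms
    )
    return {
        # Fraction of the interval during which the device was busy at all.
        "util": (cur.io_time_ms - prev.io_time_ms) / interval_ms,
        # Average number of requests in flight (iostat's aqu-sz).
        "avg_queue_depth": (cur.weighted_io_time_ms - prev.weighted_io_time_ms) / interval_ms,
        # Mean time per completed request, queueing included (iostat's await).
        "avg_latency_ms": io_wait_ms / ios if ios else 0.0,
    }


# Two made-up devices, both busy for the whole 1-second interval.
start = DiskCounters(0, 0, 0, 0, 0, 0)
shallow = DiskCounters(1000, 1000, 0, 0, 1000, 1000)
deep = DiskCounters(1000, 32000, 0, 0, 1000, 32000)
print(disk_metrics(start, shallow, 1000))
# {'util': 1.0, 'avg_queue_depth': 1.0, 'avg_latency_ms': 1.0}
print(disk_metrics(start, deep, 1000))
# {'util': 1.0, 'avg_queue_depth': 32.0, 'avg_latency_ms': 32.0}
```

node_exporter exposes the underlying counters as, for example, node_disk_io_time_seconds_total and node_disk_io_time_weighted_seconds_total; a rate() over the latter would give the average queue depth. That is offered as a pointer, not as the query the dashboards actually use.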

@akosiaris we chatted about the details. I don't mind the USE pattern, but following USE doesn't make a graph good if the chosen metrics are poor, as in the example above. Note also that the graphs were probably in a worse state before my comments 0:-)

Sure. Seems like I was just missing context. Thanks for the update

@fgiunchedi I had a stab at a RED (Rate, Errors, Duration) pattern dashboard for mathoid. Let me know what you think.

https://grafana.wikimedia.org/d/000000187/service-mathoid?refresh=1m&orgId=1

Adding p50, p90 and p99 latency, which is currently missing, is still on my TODO list.
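For reference, assuming request latency were exported as a Prometheus-style cumulative histogram (an assumption; the mathoid dashboard may be built on other metrics), the p50/p90/p99 panels would typically come from PromQL's histogram_quantile(). The Python sketch below is a simplified re-implementation of the linear interpolation that function performs, with made-up bucket data, to show what such a panel actually computes: it finds the bucket containing the target rank and assumes observations are spread uniformly inside it.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket, like Prometheus "le" buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: return the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation, assuming a uniform spread inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up latency buckets in seconds, 100 requests in total.
buckets = [(0.05, 40), (0.1, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.9, 0.99):
    print(f"p{round(q * 100)}: {histogram_quantile(q, buckets):.3f}s")
```

The estimate can never be more precise than the bucket layout, so the bucket boundaries should be chosen around the latencies one actually wants to distinguish.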

fgiunchedi moved this task from In progress to Up next on the observability board.Mar 18 2019, 1:58 PM
fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board.May 13 2019, 8:57 AM
Tgr added a subscriber: Tgr.May 15 2019, 11:08 AM

It would be nice to make it a little clearer what the intended replacement is (either put it in the task description or the description of the dashboard to be deleted) so one does not have to read through the conversation in this task to know what to update links to.

fgiunchedi moved this task from Up next to Backlog on the User-fgiunchedi board.Wed, Oct 9, 11:31 PM