Better organization for SRE grafana dashboards
Open, Low, Public

Assigned To

None

Authored By

	fgiunchedi
	Oct 20 2017, 2:28 PM

Description

The SRE grafana dashboards are not consistent with each other, have accumulated cruft over time, and (among other deficiencies) lack a good way to navigate between them.

There have been several ideas on how to improve the situation; this task will be used to collect those ideas and use cases and to draft a plan to improve said dashboards.

Filippo's use cases / ideas (limited to "machine level" metrics like cpu/memory/disk/network)

We're using the dashboard to debug a problem or quantify the impact of an ongoing incident
There are three main components we can drill down/up: site/cluster/host
Dashboards for said components all present the same high level metrics, aggregated according to the component we're looking at
To reduce cognitive overhead there are a limited number of graphs per dashboard, and within each graph a limited number of metrics.
A nice guideline I've found is the USE method (http://www.brendangregg.com/usemethod.html), for which I've tested an implementation for the "host dashboard" here: https://grafana.wikimedia.org/dashboard/db/host-overview
Another approach is the RED method (https://www.weave.works/docs/cloud/latest/tasks/monitor/best-instrumenting/). The two methods are actually complementary, one being systems-oriented and the other services-oriented.

Details

	Subject	Repo	Branch	Lines +/-
	grafana: use host-overview in favour of server-board for featured dashboard	operations/puppet	production	+0 -1


Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Open	None	T178690 Better organization for SRE grafana dashboards
Resolved	• Mathew.onipe	T212839 Remove "prometheus" from elasticsearch grafana dashboard names

Event Timeline


Restricted Application added a subscriber: Aklapper. Oct 20 2017, 2:29 PM

Another idea for better dashboarding: show vertical lines for events other than deployments, e.g. puppet merges

In T178690#3706215, @fgiunchedi wrote:

Another idea for better dashboarding: show vertical lines for events other than deployments, e.g. puppet merges

I agree. In a setup I built in the past I had all annotations saved to Elasticsearch, which allowed it to scale quite easily. The smart thing to do here would be to tag those other annotations so that each dashboard shows only related annotations by default and can optionally show all of them. As a practical example, puppet merges could be tagged, and a traffic dashboard would show only traffic-tagged merges by default; optionally you could show all of them, because sometimes things correlate in unpredictable ways...
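A minimal sketch of that default-versus-all behaviour, assuming annotations are plain records carrying a set of tags. The `Annotation` class, the `visible_annotations` helper, and the sample events are all hypothetical and are not how Grafana or Elasticsearch store annotations:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One event drawn as a vertical line on a graph (hypothetical model)."""
    timestamp: int
    text: str
    tags: frozenset = frozenset()


def visible_annotations(annotations, dashboard_tags, show_all=False):
    """Return annotations sharing at least one tag with the dashboard.

    With show_all=True every annotation is returned, for the cases where
    unrelated events turn out to correlate after all.
    """
    if show_all:
        return list(annotations)
    wanted = frozenset(dashboard_tags)
    return [a for a in annotations if a.tags & wanted]


# Made-up sample events, for illustration only.
events = [
    Annotation(1508509740, "puppet merge: varnish change", frozenset({"puppet", "traffic"})),
    Annotation(1508510000, "puppet merge: mysql change", frozenset({"puppet", "databases"})),
    Annotation(1508510300, "deployment", frozenset({"deploy"})),
]

traffic_default = visible_annotations(events, {"traffic"})
traffic_everything = visible_annotations(events, {"traffic"}, show_all=True)
```

Here a traffic dashboard would draw only the first event by default and all three when the "show all" option is switched on.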

I'd also like to add to:

There are three main components we can drill down/up: site/cluster/host

that it would be nice to be able to show a given metric either for all hosts in a cluster together or aggregated across all hosts in the cluster. This adds a bit of complexity, given that you need to know how to aggregate based on the metric (sum, average, etc.).
For the simple cases that just require a sum, a stacked graph can achieve the same, and for the average case a non-stacked graph gives you an idea of it, but they quickly become unreadable as the number of hosts/metrics plotted grows.
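To illustrate why the aggregation has to be chosen per metric, here is a small sketch that collapses per-host series into one cluster-level series. The metric names and the `AGGREGATIONS` mapping are assumptions made up for this example, not names used by any existing dashboard:

```python
from statistics import mean

# Hypothetical mapping from metric name to the function that combines the
# samples of all hosts taken at the same point in time.
AGGREGATIONS = {
    "network_bytes": sum,       # throughput adds up across hosts
    "cpu_utilization": mean,    # utilization is averaged
    "disk_utilization": max,    # the fullest disk is the interesting one
}


def aggregate_cluster(metric, per_host_series):
    """Collapse {host: [samples...]} into a single cluster-level series."""
    combine = AGGREGATIONS[metric]
    # zip(*...) groups together the samples that share a timestamp.
    return [combine(samples) for samples in zip(*per_host_series.values())]


network = aggregate_cluster("network_bytes", {"host1": [1, 2], "host2": [3, 4]})
cpu = aggregate_cluster("cpu_utilization", {"host1": [20, 40], "host2": [40, 80]})
```

The same per-host input produces very different cluster views depending on the function picked, which is the extra knowledge a generic cluster dashboard would need per metric.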

faidon moved this task from Inbox to Up next on the observability board.Nov 27 2017, 4:15 PM

Bawolff awarded a token.Dec 11 2017, 4:46 PM

ayounsi subscribed.Dec 11 2017, 4:52 PM

Jdforrester-WMF subscribed.Dec 11 2017, 4:56 PM

• Mholloway subscribed.Dec 11 2017, 5:35 PM

Eevans awarded a token.Dec 11 2017, 7:27 PM

greg subscribed.Dec 11 2017, 10:03 PM

Quiddity subscribed.Dec 12 2017, 8:09 PM

fgiunchedi claimed this task.Jan 8 2018, 4:33 PM

T180784 has some interesting discussion as well.

fgiunchedi added a project: User-fgiunchedi.Jan 9 2018, 2:20 PM

@ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full email was:

Ganglia may have been buggy and crufty, but when it was accessible anyone could see a high-level overview of Wikimedia's operational metrics at a glance by browsing to https://ganglia.wikimedia.org/.

This was extremely useful for spotting and troubleshooting problems. And the fact that anyone was invited to have a look was a powerful demonstration of what Wikimedia is and what makes it special.

Operational metrics today may be more comprehensive, more accurate, and/or more reliable, but they are not more discoverable.
The list of featured dashboards in https://grafana.wikimedia.org/ is not well-organized. There is a curious mix of important and unimportant dashboards, which are not grouped in any meaningful way. Some dashboards that should be featured aren't, and some featured dashboards shouldn't be.

The names of the dashboards are often obscure, vague, or confusing. What is "Production Logging"? Why do "Prometheus DC overview" and "Prometheus global overview" have "Prometheus" in their name?

Some of the top-level links refer to teams ("Team TCB"), others to topics ("Performance Metrics"), others to specific services ("Swift").

So, this is a plea for a good landing page for operational metrics. I'd really love to see a curated selection of dashboards grouped according to some sensible taxonomy, their names standardized and revised for clarity. I think the time investment will pay for itself in time saved when debugging issues and on-boarding new folks.

I think there are a few different dimensions to this problem:

Naming (Varnish vs. Traffic vs. HTTPS, "Prometheus" prefixes, etc.)
Organization/hierarchy (AQS has dashboards named as "AQS :: Cassandra :: CF :: Latency/rate Copy" for instance)
Similar to the above: 1:N tagging/hashtags
Navigation (drilling up/down, featured dashboards/frontpage)
Discoverability, which is an artifact of all of the above

• Mholloway unsubscribed.Jan 10 2018, 4:18 PM

We need the following new dashboards / URLs (noticed as part of T183873):

service cluster A overview (single link) (replace link on https://wikitech.wikimedia.org/wiki/Mathoid)
service cluster B overview (single link) (replace link on https://wikitech.wikimedia.org/wiki/Citoid)
parsoid (canary) machines (replace link " Watch the canary machines on ganglia for eqiad and codfw" on https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_the_latest_version_of_Parsoid)
maps cluster / maps-varnish cluster (replace links on https://wikitech.wikimedia.org/wiki/Maps)
bits caches, apache cluster, fatals in last hour as used on https://www.mediawiki.org/wiki/Performance_profiling_for_Wikimedia_code#Ganglia

Dzahn mentioned this in T183873: Update ganglia mentions in prominent documentation.Jan 11 2018, 1:25 AM

In T178690#3891946, @Dzahn wrote:

We need the following new dashboards / URLs (noticed as part of T183873):

service cluster A overview (single link) (replace link on https://wikitech.wikimedia.org/wiki/Mathoid)

Mathoid is on SCB, not SCA. Only zotero is on SCA. In any case, are these two sufficient?

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb&var-instance=All

service cluster B overview (single link) (replace link on https://wikitech.wikimedia.org/wiki/Citoid)

Same as above.

parsoid (canary) machines (replace link " Watch the canary machines on ganglia for eqiad and codfw" on https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_the_latest_version_of_Parsoid)

I see no differentiation for "canary" in the links in the wikitech page. So I am guessing it was a mental process for the parsoid deployer. In that case, following the pattern above we have

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-instance=All

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-instance=All

maps cluster / maps-varnish cluster (replace links on https://wikitech.wikimedia.org/wiki/Maps)

The maps-varnish does not exist anymore (T164608) so there is nothing to do about that. For maps itself, following the pattern above

[eqiad] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=maps&var-instance=All

[codfw] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=maps&var-instance=All

The question

All of the above are practically the exact same link with a bit of tweaking to set the cluster and datacenter. Should we follow the more brittle approach of updating every page with the cluster+DC specific link, or should we go for the more robust approach of just using the base https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1 and let the user figure it out? For the DC part at least I am pretty sure the latter is better; I am not so sure about the cluster though.
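Since the links differ only in datacenter and cluster, they can be generated instead of copied around. A minimal sketch, assuming the query parameters stay exactly as in the links above; the `cluster_breakdown_url` helper is hypothetical:

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown"


def cluster_breakdown_url(site, cluster, instance="All"):
    """Build a cluster-breakdown link for one datacenter and one cluster."""
    params = {
        "orgId": 1,
        "var-datasource": f"{site} prometheus/ops",
        "var-cluster": cluster,
        "var-instance": instance,
    }
    # quote_via=quote encodes spaces as %20 (not "+") and "/" as %2F,
    # matching the encoding used in the links above.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


for site in ("eqiad", "codfw"):
    print(f"[{site}] {cluster_breakdown_url(site, 'parsoid')}")
```

A wiki template or a small script along these lines would keep the per-page links consistent while still pointing readers at the right cluster.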

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.Feb 1 2018, 1:39 PM

akosiaris updated the task description. Mar 5 2018, 4:33 PM

fgiunchedi moved this task from Up next to In progress on the observability board.Apr 16 2018, 3:23 PM

I've put together a sample dashboard to play around with some concepts/ideas that emerged in this task at https://grafana.wikimedia.org/dashboard/db/dashboard-redesign-proposal . Notably missing is the navigation story among different dashboards, but tl;dr it would be based on dashboard tags to create dropdowns. Which grouping/dropdown menus make sense is still TBD.

I put the dashboard together by point-and-click, though the idea is to have the same dashboard generated by code, making it simple to create multiple consistent dashboards for different services and purposes.
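As a rough illustration of what "generated by code" buys in consistency, the sketch below builds the same row layout for any service from one function. The dictionary structure is a simplified stand-in and not the full Grafana dashboard schema; the `use_dashboard` helper, the row names taken from the USE method, and the query placeholders are assumptions for this example only:

```python
import json

# One row per USE-method category, in a fixed order for every dashboard.
USE_ROWS = ("Utilization", "Saturation", "Errors")


def use_dashboard(service, resources, tags=()):
    """Return a simplified dashboard description for one service."""
    rows = []
    for kind in USE_ROWS:
        panels = [
            {
                "type": "graph",
                "title": f"{resource} {kind.lower()}",
                # Placeholder text, not a real query expression.
                "targets": [{"expr": f"<{resource} {kind.lower()} query for {service}>"}],
            }
            for resource in resources
        ]
        rows.append({"title": kind, "panels": panels})
    return {"title": f"{service} overview", "tags": sorted(tags), "rows": rows}


swift = use_dashboard("swift", ["cpu", "disk"], tags=["sre", "overview"])
print(json.dumps(swift, indent=2))
```

Every dashboard produced this way has the same rows in the same order, and the tags attached at generation time are what tag-based dropdown navigation could key on.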

As discussed in the monitoring meeting, here is some feedback:

while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build a generic dashboard for some services, and splitting them into multiple dashboards might hurt their discoverability. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".
I'm not sure if we should come up with some best practice for the values to show near a single label (min/max/avg/current) and whether they should be inline/as table/as table on the right. Different graphs might need different values/layout based on the data shown.
I'm personally a big fan of the shared crosshair, maybe we could set it on by default.

In T178690#4168673, @Volans wrote:

As discussed in the monitoring meeting, here is some feedback:

while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build a generic dashboard for some services, and splitting them into multiple dashboards might hurt their discoverability. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".

Agreed. From what I understood anyway, the limit is suggested for performance reasons, and "hidden" rows do not get evaluated at dashboard load time but rather on "unhiding".

I'm not sure if we should come up with some best practice for the values to show near a single label (min/max/avg/current) and whether they should be inline/as table/as table on the right. Different graphs might need different values/layout based on the data shown.

I am very ambivalent on the legend as well. I tend to create it, but I have no rule yet and rather play it by ear. I'd say we use SHOULD instead of MUST for any kind of guideline here and leave it to the graph creator.

I'm personally a big fan of the shared crosshair, maybe we could set it on by default.

I also think we should add a RED/4 golden signals method example to the proposal before we make it into a template. Granted, SRE graphs will probably use the USE method (pun intended), but still, it'd be great to have an example of that as well.

Thanks for the feedback!

In T178690#4171433, @akosiaris wrote:

In T178690#4168673, @Volans wrote:

As discussed in the monitoring meeting, here is some feedback:

while the limit on the number of rows/panels/metrics is understandable, it could make it harder to build a generic dashboard for some services, and splitting them into multiple dashboards might hurt their discoverability. One option is to exclude the rows/panels/metrics that are hidden by default from the "limit".

Agreed. From what I understood anyway, the limit is suggested for performance reasons, and "hidden" rows do not get evaluated at dashboard load time but rather on "unhiding".

I suggested the limit of 5/6 rows per dashboard to avoid too much information per dashboard, though performance is also a concern of course. I think having one "overview" dashboard that is canonical and one/more dashboards for drilldown(s) could work. I expanded on this point in the dashboard example.

I'm not sure if we should come up with some best practice for the values to show near a single label (min/max/avg/current) and whether they should be inline/as table/as table on the right. Different graphs might need different values/layout based on the data shown.

I am very ambivalent on the legend as well. I tend to create it, but I have no rule yet and rather play it by ear. I'd say we use SHOULD instead of MUST for any kind of guideline here and leave it to the graph creator.

Indeed, my guideline generally is to display whichever summary aids in issue debugging, e.g. max for utilization, max/total for errors, min for availability, etc. I'd say ideally no more than two summaries per graph; I added an explanation to the dashboard sample for this too.
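That guideline is simple enough to encode as a lookup with a cap, which a dashboard generator could apply automatically. A sketch under that assumption; the `SUGGESTED_SUMMARIES` table and the `legend_summaries` helper are hypothetical names, and the values only restate the examples given above:

```python
# Suggested legend summaries per kind of metric, following the examples above.
SUGGESTED_SUMMARIES = {
    "utilization": ("max",),
    "errors": ("max", "total"),
    "availability": ("min",),
}
MAX_SUMMARIES_PER_GRAPH = 2


def legend_summaries(kind, extra=()):
    """Return the summaries to show for a graph, refusing to exceed the cap."""
    chosen = list(SUGGESTED_SUMMARIES.get(kind, ()))
    chosen += [s for s in extra if s not in chosen]
    if len(chosen) > MAX_SUMMARIES_PER_GRAPH:
        raise ValueError(
            f"{len(chosen)} summaries requested, at most "
            f"{MAX_SUMMARIES_PER_GRAPH} per graph are suggested"
        )
    return chosen


print(legend_summaries("utilization", extra=["avg"]))
```

Treating the cap as an error here is a design choice for the sketch; in line with the SHOULD-not-MUST sentiment in this thread, a real tool might only warn.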

I'm personally a big fan of the shared crosshair, maybe we could set it on by default.

+1

+1 too, added to the dashboard

I also think we should add a RED/4 golden signals method example to the proposal before we make it into a template. Granted, SRE graphs will probably use the USE method (pun intended), but still, it'd be great to have an example of that as well.

Agreed, I'll try to come up with a sample dashboard for those too. Our USE cases (ha ha), I think, depend a whole lot on whether we're diagnosing performance problems (USE) and/or looking at a service as a whole (RED/4GS).

elukey subscribed.May 3 2018, 3:50 PM

Change 442301 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP grafana: host overview dashboard as code

https://gerrit.wikimedia.org/r/442301

gerritbot added a project: Patch-For-Review.Jun 27 2018, 1:51 PM

I coded a strawman using grafanalib at https://gerrit.wikimedia.org/r/c/operations/puppet/+/442301 and it looks good to me so far, please take a look too. I'll expand it to multiple dashboards and use cases as well.

In T178690#4319336, @fgiunchedi wrote:

I coded a strawman using grafanalib at https://gerrit.wikimedia.org/r/c/operations/puppet/+/442301 and it looks good to me so far, please take a look too. I'll expand it to multiple dashboards and use cases as well.

Note that "dashboards as code" is in scope for T171482: Programmatic generation of grafana dashboards, not for this task, which is about dashboard organization in general.

Change 444219 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: use host-overview in favour of server-board for featured dashboard

https://gerrit.wikimedia.org/r/444219

In T178690#3890148, @faidon wrote:

@ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full email was:

Ganglia may have been buggy and crufty, but when it was accessible anyone could see a high-level overview of Wikimedia's operational metrics at a glance by browsing to https://ganglia.wikimedia.org/.

This was extremely useful for spotting and troubleshooting problems. And the fact that anyone was invited to have a look was a powerful demonstration of what Wikimedia is and what makes it special.

Operational metrics today may be more comprehensive, more accurate, and/or more reliable, but they are not more discoverable.
The list of featured dashboards in https://grafana.wikimedia.org/ is not well-organized. There is a curious mix of important and unimportant dashboards, which are not grouped in any meaningful way. Some dashboards that should be featured aren't, and some featured dashboards shouldn't be.

The names of the dashboards are often obscure, vague, or confusing. What is "Production Logging"? Why do "Prometheus DC overview" and "Prometheus global overview" have "Prometheus" in their name?

Some of the top-level links refer to teams ("Team TCB"), others to topics ("Performance Metrics"), others to specific services ("Swift").

So, this is a plea for a good landing page for operational metrics. I'd really love to see a curated selection of dashboards grouped according to some sensible taxonomy, their names standardized and revised for clarity. I think the time investment will pay for itself in time saved when debugging issues and on-boarding new folks.

I think there are a few different dimensions to this problem:

Naming (Varnish vs. Traffic vs. HTTPS, "Prometheus" prefixes, etc.)

Organization/hierarchy (AQS has dashboards named as "AQS :: Cassandra :: CF :: Latency/rate Copy" for instance)

Similar to the above: 1:N tagging/hashtags

Navigation (drilling up/down, featured dashboards/frontpage)

Discoverability, which is an artifact of all of the above

I agree with the above. With the intent of better understanding what's in grafana, over the last couple of days I went through the current list of dashboards (~350) and took some notes:

Some dashboards are user/private/temporary
"Rotting" dashboards, e.g. metrics disappeared, dashboard not functional, etc.
I've tagged as "obsolete" the dashboards that seem like they can be deleted (e.g. are graphite-based but we have their prometheus equivalent)
I've tagged as "needs-review" the dashboards that need some action to decide whether to delete or keep (and eventually update or consolidate in some other dashboard)
restbase (and restbase staging) dashboards I believe can be deleted for the most part as we're now using their prometheus counterpart (cc @Eevans)
kubernetes staging dashboards can be folded into their k8s counterparts, since we can vary the datasource to get k8s staging data
I will remove the "Prometheus" prefix from "prometheus dc-overview / cluster-breakdown / global-overview" and update their navigation/drilldown accordingly
Some dashboards have a text panel at the top that explains the dashboard a little bit and contains some pointers, e.g. to wikitech. This is very nice and we should extend it to all main/important dashboards.

For the featured dashboards list I would say to begin with "four golden signals" for the systems/services that get most user requests: gdns, nginx, varnish, lvs, apache, mediawiki, mysql, swift (likely more; this is off the top of my head and from looking at https://wikitech.wikimedia.org/wiki/Category:Wikimedia_infrastructure#/media/File:Infrastructure_overview.png)

elukey awarded a token.Jul 6 2018, 2:42 PM

In T178690#4403664, @fgiunchedi wrote:

[ ... ]

restbase (and restbase staging) dashboards I believe can be deleted for the most part as we're now using their prometheus counterpart (cc @Eevans)

{{done}}

Quiddity awarded a token.Jul 6 2018, 10:00 PM

Joe subscribed.Jul 9 2018, 4:29 PM

Change 444219 merged by Alexandros Kosiaris:
[operations/puppet@production] grafana: use host-overview in favour of server-board for featured dashboard

https://gerrit.wikimedia.org/r/444219

herron subscribed.Aug 6 2018, 3:05 PM

fgiunchedi moved this task from In progress to Up next on the observability board.Aug 20 2018, 3:03 PM

fgiunchedi renamed this task from Better organization for ops grafana dashboards to Better organization for SRE grafana dashboards.Sep 26 2018, 8:14 AM

fgiunchedi updated the task description.

fgiunchedi added a parent task: T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal).Oct 1 2018, 1:42 PM

fgiunchedi moved this task from Up next to In progress on the observability board.Oct 2 2018, 8:36 AM

fgiunchedi moved this task from In progress to Up next on the observability board.Oct 15 2018, 3:06 PM

Somewhat related, Grafana upstream has this issue for feedback on dashboard provisioning workflows https://github.com/grafana/grafana/issues/13823

CDanis subscribed.Nov 8 2018, 2:25 PM

CDanis added a project: User-CDanis.Nov 9 2018, 7:31 PM

CDanis moved this task from Backlog to Radar on the User-CDanis board.

fgiunchedi moved this task from Up next to In progress on the observability board.Nov 26 2018, 4:14 PM

CDanis mentioned this in T210416: Upgrade grafana to 5.x.Dec 6 2018, 10:38 PM

CDanis moved this task from Radar to Doing on the User-CDanis board.Dec 17 2018, 3:00 PM

I've generated a list of Grafana dashboards sorted by their last modification time. Details below.

While this isn't a perfect proxy for "is in active use", I think it's probably a pretty good signal.

I suspect we should be trying to find owners for any very old dashboard, and just deleting them if we can't.

sqlite> select d.title, dashboard_id, max(v.created), printf("https://grafana.wikimedia.org/d/%09d/%s", d.id, d.slug) as url from dashboard_version as v left join dashboard as d on d.id = v.dashboard_id group by dashboard_id order by v.created limit 10;
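For anyone wanting to try the query without access to the Grafana database, here is a sketch that runs an equivalent statement against an in-memory SQLite database. The two-table schema is an assumption reduced to the columns the query touches, and the sample rows are made up for illustration (the middle timestamp in particular is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema: only the columns used by the query.
    CREATE TABLE dashboard (id INTEGER PRIMARY KEY, title TEXT, slug TEXT);
    CREATE TABLE dashboard_version (dashboard_id INTEGER, created TEXT);
    INSERT INTO dashboard VALUES (86, 'Scap', 'scap'), (174, 'Redis', 'redis');
    INSERT INTO dashboard_version VALUES
        (86, '2016-01-12 18:52:57'),
        (174, '2017-06-01 10:00:00'),
        (174, '2018-01-02 09:37:05');
""")

# Same shape as the query above: one row per dashboard with the timestamp
# of its newest saved version, oldest dashboards first.
QUERY = """
    SELECT d.title,
           v.dashboard_id,
           max(v.created) AS last_modified,
           printf('https://grafana.wikimedia.org/d/%09d/%s', d.id, d.slug) AS url
    FROM dashboard_version AS v
    LEFT JOIN dashboard AS d ON d.id = v.dashboard_id
    GROUP BY v.dashboard_id
    ORDER BY last_modified
    LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

This variant sorts on the `last_modified` alias so the ordering explicitly follows the aggregated value, and it uses single quotes for the `printf` format string because some SQLite builds reject double-quoted string literals.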

List is here:
https://phabricator.wikimedia.org/P7951

P7951 Grafana dashboards sorted by last modification time; generated 2019-01-02

Titles for paste lines 105-362, in paste order:

Kubernetes API
VarnishKafka
Elasticsearch
Elasticsearch Percentiles Beta
Hook calls
Interactive team KPI
Interactive team KPI (backup)
LoginNotify
Maps Dashboard - draft
Maps KPI
Maps :: Cassandra
MediaWiki BounceHandler
Mediawiki AdvancedSearch
Node Exporter Server Metrics
PAWS
Cluster hardware
Prometheus
Apache/HHVM
RecDNS
Site power usage
Trending Service
machine disk I/O
Microcode Updates
Job Queue Health
Job Queue Rate
prometheus-varnish-http-Errors
Reading Web :: Page Previews
EventStreams
API frontend summary
Service :: Mathoid
Save Timing Alerts
API backend summary
Varnish machine stats
Nodepool
HHVM APC Usage breakdown
HHVM APC Usage
Varnish Caching
JobQueue EventBus
Elasticsearch
Mobile Dashboard
ResourceLoader Modules
ResourceLoaderModule
Edit Count
Thumbor
Varnish Backend Connections
Production Logging
ORES extension
MediaWiki Static
Echo Mention Errors
Echo Mention
MediaWiki Edit Conflicts
Mediawiki TwoColConflict
Mediawiki RevisionSlider
RESTBase external overview
MediaWiki FileImporter
Navigation Timing by Browser
Piwik
MySQL
Parsoid http status codes
MySQL Aggregated
WebPageTest Portals
Mobile 2G
HTTPS
Media
PyBal
Varnish: HTTP Errors
Host overview grafanalib
MediaWiki
Memcache-historic-data
Proton
Article Placeholder
Analytics Hadoop
EventLogging
Kafka
Hive
Elasticsearch
EventLogging-schema
MediaWiki Graphite Alerts
Varnish Traffic
NTP time servers
parsoid servers cpu usage
Zuul :: Pipeline
Swift 4GS
Parsoid: perf trends
parsoid times vs doc size
TCP Fast Open
Elasticsearch
Graphite (eqiad)
Graphite (codfw)
Rsyslog
API requests Breakdown
Datacenter global overview
EventBus
RESTBase
mw-js-deprecate
Elasticsearch
Druid
MediaWiki Application servers
Navigation Timing by Country
Navigation Timing by Continent
Hadoop
Service :: Citoid
Varnish HTTP Requests
Varnish traffic
Arc Lamp
Traffic
ATS Cache Operations
Prometheus Varnish DC stats
Prometheus
Varnish Caching
Performance Metrics
ResourceLoader
Navigation Timing by Platform
ResourceLoader Alerts
Mcrouter
Swift
Varnish:
ATS Instance Drilldown
WebPageReplay
WebPageReplay drilldown
WebPageReplay Desktop Alerts
WebPageReplay Mobile Alerts
WebPageTest
WebPageTest drilldown
WebPageTest alerts
mobileapps
Navigation Timing alerts
xxxx Zotero
ORES
Service :: CXServer
Navigation Timing
Joal Kafka Test
Experimental - backend 5xx
EventLogging-schema
Kafka MirrorMaker
Kafka By Topic (graphite)
Labs Project Board
Dashboards
User dashboards
bd808-test
dashboard-redesign-proposal-4gs
imarlier db debug
jgreen frdev1001
Julien Maps Dashboard
Joal NUMA
Krinkle Dashboard
JVM overview
Ladsgroup-test
Krinkle Sanbox
Logstash (herron WIP)
Lucas Sandbox
memcache-elukey
Niedzielski
xxxx cdanis Host overview Copy
xxxx cdanis test
xxxx cdanis thermal health
xxxx cwhite temp
xxxxx Kubernetes Pods (Fsero)
Content translation
Save Timing
Edit Stash
Team TCB
Scribunto
DNS
DNS recursors
Frontend Traffic
Host overview
Filippo home test
Cluster overview
Drafts
Network Performances Global
Apache Backend-Timing
Parser Cache
Ping offload
Fundraising
fundraising database
fundraising overview
fundraising database (all)
fundraising mariadb
fundraising host overview
fundraising redis
WMCS
CloudVPS eqiad1
cloud-capacity-planning
cloudvps-rabbitmq
Labs Monitoring
Labs DNS Dashboard
Labs Nova Fullstack
labs-capacity-planning
Labstore - NFS Directory Sizes
labvirt node disk stats
OpenLDAP Labs
labstore1004/1005
WMCS API uptimes
WMCS OpenStack eqiad1
WMCS openstack
BlockNotices Alex
frdb2001
frtechmail dashboard
Jobqueues-elukey
Pageviews API
PHP metrics
Datacenter overview
Prometheus machine stats
Frontend
Logstash
Mail
MySQL core
Network Errors by Cluster
Wikidata dashboards
Wikidata
Wikidata API
Wikidata change propagation
Wikidata Addshore Monitoring
Wikidata Datamodel
Wikidata Datamodel References
Wikidata Co-Editors
Wikidata Datamodel Terms
Wikidata Datamodel Statements
Wikidata Dispatch
Wikidata Dispatch Script
Wikidata Dump Downloads
Wikidata Edits
Wikidata EditEntity
Wikidata
Wikidata Entity Usage
Wikidata Entity Usage Project
Wikidata Quality
Wikidata Query Service
Wikidata Site Stats
Wikidata Query Service UI
Wikidata
Wikidata Social Followers
Wikidata Tasks
Wikidata Special:EntityData
Wikidata top page views
Wikidata WebPageTest
WikidataClient
T204083 investigation
Switchover/Switchback
Wikibase API error rate
Wikibase API wbgetentities
Wikibase docker images
Wikibase wb_terms
Wikibase FormatterCache
Wikibase
Ciphers
WMCS - Node Exporter Full
Backend Save Timing Breakdown
Phabricator
BlockNotices
Performance perception survey
Reading Web Dashboard
Dashboard redesign proposal
Authentication metrics
Memcache
Elasticsearch
Memcache-Slabs

Rows for paste lines 3-104 (titles and URLs are truncated as in the paste):

title | dashboard_id | max(v.created) | url
Activity | 1 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000001/activity
echo-job-pickup | 16 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000016/echo-job
RelEng :: Gerrit | 63 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000063/releng-g
Releng :: Main page | 64 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000064/releng-m
Authentications | 131 | 0001-01-01 00:00:00 | https://grafana.wikimedia.org/d/000000131/authenti
Service :: Tilerator | 90 | 2015-12-22 04:19:41 | https://grafana.wikimedia.org/d/000000090/service-
Scap | 86 | 2016-01-12 18:52:57 | https://grafana.wikimedia.org/d/000000086/scap
Abuse | 217 | 2016-02-16 21:22:00 | https://grafana.wikimedia.org/d/000000217/abuse
echoflyout | 15 | 2016-02-18 11:57:41 | https://grafana.wikimedia.org/d/000000015/echoflyo
Client Connections | 12 | 2016-07-11 15:04:19 | https://grafana.wikimedia.org/d/000000012/client-c
Service :: Kartotherian | 30 | 2016-09-29 20:40:44 | https://grafana.wikimedia.org/d/000000030/service-
RelEng :: KPIs | 108 | 2016-10-12 12:57:41 | https://grafana.wikimedia.org/d/000000108/releng-k
Service :: Graphoid | 21 | 2016-10-20 07:15:33 | https://grafana.wikimedia.org/d/000000021/service-
Prometheus Stats | 271 | 2016-11-01 23:57:38 | https://grafana.wikimedia.org/d/000000271/promethe
Varnish: HTTP Errors (datacenters) | 166 | 2016-11-02 01:42:42 | https://grafana.wikimedia.org/d/000000166/varnish-
HTTP/2 | 319 | 2016-12-20 15:06:38 | https://grafana.wikimedia.org/d/000000319/http-2
Continuous Integration | 284 | 2016-12-21 23:21:26 | https://grafana.wikimedia.org/d/000000284/continuo
Parsoid Heap Usage | 281 | 2017-01-20 01:01:14 | https://grafana.wikimedia.org/d/000000281/parsoid-
Nodepool Migration | 324 | 2017-02-16 15:39:26 | https://grafana.wikimedia.org/d/000000324/nodepool
Nodepool Tasks | 323 | 2017-03-15 05:04:23 | https://grafana.wikimedia.org/d/000000323/nodepool
BetaFeatures | 259 | 2017-03-16 17:53:27 | https://grafana.wikimedia.org/d/000000259/betafeat
MediaWiki Catwatch Feature | 189 | 2017-03-16 18:01:39 | https://grafana.wikimedia.org/d/000000189/mediawik
MediaWiki WatchedItemStore | 237 | 2017-03-16 18:34:41 | https://grafana.wikimedia.org/d/000000237/mediawik
Reading Web :: mobileview API | 333 | 2017-03-22 08:13:35 | https://grafana.wikimedia.org/d/000000333/reading-
Zuul top jobs | 348 | 2017-03-28 09:05:46 | https://grafana.wikimedia.org/d/000000348/zuul-top
OpenStreetMap | 349 | 2017-04-03 08:32:56 | https://grafana.wikimedia.org/d/000000349/openstre
Extension Distributor Downloads | 161 | 2017-05-10 01:29:45 | https://grafana.wikimedia.org/d/000000161/extensio
MediaWiki Cognate | 355 | 2017-05-23 12:34:34 | https://grafana.wikimedia.org/d/000000355/mediawik
qdisc stats | 361 | 2017-05-24 13:23:19 | https://grafana.wikimedia.org/d/000000361/qdisc-st
Network performances | 365 | 2017-06-13 19:22:22 | https://grafana.wikimedia.org/d/000000365/network-
Nodepool Pool Details | 345 | 2017-06-30 13:44:24 | https://grafana.wikimedia.org/d/000000345/nodepool
CAPTCHA failure rates | 370 | 2017-07-03 10:43:26 | https://grafana.wikimedia.org/d/000000370/captcha-
Service :: Maps - Varnish | 190 | 2017-07-04 02:29:32 | https://grafana.wikimedia.org/d/000000190/service-
OTRS | 371 | 2017-07-11 13:05:52 | https://grafana.wikimedia.org/d/000000371/otrs
Varnish Transient Storage Usage | 359 | 2017-07-27 14:44:52 | https://grafana.wikimedia.org/d/000000359/varnish-
Login timing | 383 | 2017-08-23 00:19:38 | https://grafana.wikimedia.org/d/000000383/login-ti
Network probes | 387 | 2017-08-24 14:26:15 | https://grafana.wikimedia.org/d/000000387/network-
MediaWiki MySQL LoadBalancer | 363 | 2017-08-25 20:00:44 | https://grafana.wikimedia.org/d/000000363/mediawik
Zuul job | 283 | 2017-09-11 21:16:39 | https://grafana.wikimedia.org/d/000000283/zuul-job
MediaWiki ElectronPdfService | 309 | 2017-09-28 17:01:14 | https://grafana.wikimedia.org/d/000000309/mediawik
Zuul :: Gearman | 322 | 2017-10-04 21:35:04 | https://grafana.wikimedia.org/d/000000322/zuul-gea
Varnish Transient Memory Breakdown | 367 | 2017-10-06 06:21:42 | https://grafana.wikimedia.org/d/000000367/varnish-
Maps performances | 305 | 2017-11-08 10:16:25 | https://grafana.wikimedia.org/d/000000305/maps-per
MySQL Replication Lag | 303 | 2017-11-23 18:12:10 | https://grafana.wikimedia.org/d/000000303/mysql-re
Cloud codfw | 449 | 2017-11-28 00:55:55 | https://grafana.wikimedia.org/d/000000449/cloud-co
Zuul | 321 | 2017-12-07 21:57:31 | https://grafana.wikimedia.org/d/000000321/zuul
Etherpad | 193 | 2017-12-13 14:35:49 | https://grafana.wikimedia.org/d/000000193/etherpad
AQS Wikistats 2 Traffic | 456 | 2017-12-14 10:44:08 | https://grafana.wikimedia.org/d/000000456/aqs-wiki
Nutcracker | 216 | 2017-12-20 11:36:19 | https://grafana.wikimedia.org/d/000000216/nutcrack
Kubernetes Kubelets | 436 | 2017-12-20 15:38:33 | https://grafana.wikimedia.org/d/000000436/kubernet
Postgres | 469 | 2017-12-21 11:09:06 | https://grafana.wikimedia.org/d/000000469/postgres
Kubernetes Staging Kubelets | 472 | 2017-12-21 11:58:54 | https://grafana.wikimedia.org/d/000000472/kubernet
Kubernetes Staging API | 471 | 2017-12-21 12:00:52 | https://grafana.wikimedia.org/d/000000471/kubernet
Kubernetes Staging Pods | 473 | 2017-12-21 12:04:32 | https://grafana.wikimedia.org/d/000000473/kubernet
Kubernetes Pods | 445 | 2017-12-21 12:06:53 | https://grafana.wikimedia.org/d/000000445/kubernet
CI Docker Jobs | 420 | 2017-12-21 12:56:40 | https://grafana.wikimedia.org/d/000000420/ci-docke
Redis | 174 | 2018-01-02 09:37:05 | https://grafana.wikimedia.org/d/000000174/redis
PyBal instances | 426 | 2018-01-08 15:39:01 | https://grafana.wikimedia.org/d/000000426/pybal-in
PyBal service | 422 | 2018-01-08 15:39:41 | https://grafana.wikimedia.org/d/000000422/pybal-se
IPVS Backend Connections | 395 | 2018-01-08 15:45:43 | https://grafana.wikimedia.org/d/000000395/ipvs-bac
RCFilters performance | 419 | 2018-01-09 00:25:05 | https://grafana.wikimedia.org/d/000000419/rcfilter
Kernel deployment | 302 | 2018-01-09 15:20:46 | https://grafana.wikimedia.org/d/000000302/kernel-d
VisualEditor load / save | 94 | 2018-02-02 23:11:08 | https://grafana.wikimedia.org/d/000000094/visualed
Parsoid Timing - html2wt | 46 | 2018-02-17 15:56:12 | https://grafana.wikimedia.org/d/000000046/parsoid-
Parsoid Timing - wt2html | 48 | 2018-02-17 15:57:23 | https://grafana.wikimedia.org/d/000000048/parsoid-
PyBal BGP | 488 | 2018-03-13 16:10:31 | https://grafana.wikimedia.org/d/000000488/pybal-bg
EventLogging-schema Jumbo | 494 | 2018-03-21 19:59:23 | https://grafana.wikimedia.org/d/000000494/eventlog
ResourceLoader: feature-test | 238 | 2018-03-21 22:49:13 | https://grafana.wikimedia.org/d/000000238/resource
Cache Hosts Software Versions | 474 | 2018-03-22 18:02:06 | https://grafana.wikimedia.org/d/000000474/cache-ho
TLS Ciphers by Data Center | 452 | 2018-03-23 08:26:07 | https://grafana.wikimedia.org/d/000000452/tls-ciph
Puppetdb | 477 | 2018-03-26 08:34:54 | https://grafana.wikimedia.org/d/000000477/puppetdb
Reading List Service | 457 | 2018-03-29 12:50:53 | https://grafana.wikimedia.org/d/000000457/reading-
Mobile WebPageTest | 130 | 2018-04-03 08:27:56 | https://grafana.wikimedia.org/d/000000130/mobile-w
Performance - Singapore caching center | 502 | 2018-04-05 13:36:41 | https://grafana.wikimedia.org/d/000000502/performa
WDQS Paper data | 514 | 2018-04-06 20:17:19 | https://grafana.wikimedia.org/d/000000514/wdqs-pap
Cassandra | 418 | 2018-04-10 20:43:44 | https://grafana.wikimedia.org/d/000000418/cassandr
Varnish Mailbox Lag | 478 | 2018-04-15 21:39:25 | https://grafana.wikimedia.org/d/000000478/varnish-
Zookeeper | 261 | 2018-04-18 09:28:04 | https://grafana.wikimedia.org/d/000000261/zookeepe
Kubernetes | 519 | 2018-04-18 12:54:55 | https://grafana.wikimedia.org/d/000000519/kubernet
Elasticsearch Per-node Percentiles | 486 | 2018-04-20 16:44:09 | https://grafana.wikimedia.org/d/000000486/elastics
Load Balancers | 343 | 2018-04-23 17:19:57 | https://grafana.wikimedia.org/d/000000343/load-bal
Elasticsearch Node Comparison - Promethe | 460 | 2018-04-24 18:14:29 | https://grafana.wikimedia.org/d/000000460/elastics
Analytics NUMA | 525 | 2018-04-25 09:18:37 | https://grafana.wikimedia.org/d/000000525/analytic
AQS | 526 | 2018-04-25 10:49:04 | https://grafana.wikimedia.org/d/000000526/aqs
Elasticsearch | 14 | 2018-04-30 18:02:01 | https://grafana.wikimedia.org/d/000000014/elastics
Caches NUMA stats | 539 | 2018-05-02 14:37:16 | https://grafana.wikimedia.org/d/000000539/caches-n
Kafka By Topic | 234 | 2018-05-02 16:20:30 | https://grafana.wikimedia.org/d/000000234/kafka-by
Cassandra Client Request | 483 | 2018-05-03 09:12:12 | https://grafana.wikimedia.org/d/000000483/cassandr
Cassandra Read-repair | 497 | 2018-05-03 09:12:35 | https://grafana.wikimedia.org/d/000000497/cassandr
Cassandra Tables | 453 | 2018-05-03 09:14:45 | https://grafana.wikimedia.org/d/000000453/cassandr
Cassandra Threadpools | 433 | 2018-05-03 09:15:20 | https://grafana.wikimedia.org/d/000000433/cassandr
Varnish Failed Fetches | 352 | 2018-05-03 14:10:47 | https://grafana.wikimedia.org/d/000000352/varnish-
Varnish Daemons Hitrate | 443 | 2018-05-07 10:21:15 | https://grafana.wikimedia.org/d/000000443/varnish-
Ganeti | 545 | 2018-05-09 12:30:35 | https://grafana.wikimedia.org/d/000000545/ganeti
Kafka MirrorMaker | 521 | 2018-05-14 17:42:45 | https://grafana.wikimedia.org/d/000000521/kafka-mi
Kafka (graphite) | 523 | 2018-05-14 20:15:03 | https://grafana.wikimedia.org/d/000000523/kafka-gr
Kafka Consumer Lag | 484 | 2018-05-17 06:31:11 | https://grafana.wikimedia.org/d/000000484/kafka-co
Prometheus Varnish HTTP Requests | 501 | 2018-06-05 10:11:51 | https://grafana.wikimedia.org/d/000000501/promethe
TLS CipherSuite Explorer | 458 | 2018-06-08 00:18:25 | https://grafana.wikimedia.org/d/000000458/tls-ciph
Cassandra System | 417 | 2018-06-15 15:15:46 | https://grafana.wikimedia.org/d/000000417/cassandr
Service Endpoint performance | 358 | 2018-06-21 10:53:13 | https://grafana.wikimedia.org/d/000000358/service-
Synthetic performance sdev/mdev | 554 | 2018-06-27 08:49:02 |
https://grafana.wikimedia.org/d/000000554/syntheti 435 2018-06-28 15:25:37 https://grafana.wikimedia.org/d/000000435/kubernet 253 2018-06-29 09:07:38 https://grafana.wikimedia.org/d/000000253/varnishk Percentiles - Prometheus 455 2018-07-05 12:41:26 https://grafana.wikimedia.org/d/000000455/elastics 250 2018-07-05 12:41:46 https://grafana.wikimedia.org/d/000000250/elastics 24 2018-07-05 12:55:01 https://grafana.wikimedia.org/d/000000024/hook-cal 285 2018-07-05 12:58:25 https://grafana.wikimedia.org/d/000000285/interact 300 2018-07-05 13:02:28 https://grafana.wikimedia.org/d/000000300/interact 385 2018-07-05 14:01:58 https://grafana.wikimedia.org/d/000000385/loginnot 314 2018-07-05 14:03:50 https://grafana.wikimedia.org/d/000000314/maps-das 310 2018-07-05 14:04:12 https://grafana.wikimedia.org/d/000000310/maps-kpi 233 2018-07-05 14:05:05 https://grafana.wikimedia.org/d/000000233/maps-cas 142 2018-07-05 15:34:57 https://grafana.wikimedia.org/d/000000142/mediawik 434 2018-07-06 08:46:55 https://grafana.wikimedia.org/d/000000434/mediawik 342 2018-07-06 09:05:03 https://grafana.wikimedia.org/d/000000342/node-exp 235 2018-07-06 09:07:14 https://grafana.wikimedia.org/d/000000235/paws specs differences 332 2018-07-06 09:25:49 https://grafana.wikimedia.org/d/000000332/cluster- Varnish: HTTP Errors (datacen 508 2018-07-06 09:28:51 https://grafana.wikimedia.org/d/000000508/promethe 327 2018-07-06 09:33:16 https://grafana.wikimedia.org/d/000000327/apache-h 375 2018-07-06 10:01:05 https://grafana.wikimedia.org/d/000000375/recdns 397 2018-07-06 10:08:01 https://grafana.wikimedia.org/d/000000397/site-pow 315 2018-07-06 10:12:06 https://grafana.wikimedia.org/d/000000315/trending 236 2018-07-06 10:37:26 https://grafana.wikimedia.org/d/000000236/machine- 556 2018-07-09 09:01:47 https://grafana.wikimedia.org/d/000000556/microcod 107 2018-07-09 21:41:31 https://grafana.wikimedia.org/d/000000107/job-queu 105 2018-07-09 21:42:32 https://grafana.wikimedia.org/d/000000105/job-queu 557 
2018-07-13 04:37:35 https://grafana.wikimedia.org/d/000000557/promethe 340 2018-07-16 14:49:25 https://grafana.wikimedia.org/d/000000340/reading- 336 2018-07-25 07:07:44 https://grafana.wikimedia.org/d/000000336/eventstr 202 2018-07-27 01:26:49 https://grafana.wikimedia.org/d/000000202/api-fron 187 2018-07-31 12:48:50 https://grafana.wikimedia.org/d/000000187/service- 362 2018-08-05 20:33:46 https://grafana.wikimedia.org/d/000000362/save-tim 2 2018-08-06 06:08:46 https://grafana.wikimedia.org/d/000000002/api-back 330 2018-08-14 14:34:09 https://grafana.wikimedia.org/d/000000330/varnish- 276 2018-08-15 03:58:13 https://grafana.wikimedia.org/d/000000276/nodepool 499 2018-08-20 22:42:58 https://grafana.wikimedia.org/d/000000499/hhvm-apc 496 2018-08-20 22:43:30 https://grafana.wikimedia.org/d/000000496/hhvm-apc 500 2018-08-22 01:11:29 https://grafana.wikimedia.org/d/000000500/varnish- 400 2018-08-22 17:33:53 https://grafana.wikimedia.org/d/000000400/jobqueue Indexing - prometheus 461 2018-08-22 17:41:08 https://grafana.wikimedia.org/d/000000461/elastics 35 2018-08-23 18:59:39 https://grafana.wikimedia.org/d/000000035/mobile-d 430 2018-08-24 00:29:47 https://grafana.wikimedia.org/d/000000430/resource 67 2018-08-24 00:31:47 https://grafana.wikimedia.org/d/000000067/resource 208 2018-09-05 15:52:19 https://grafana.wikimedia.org/d/000000208/edit-cou 291 2018-09-06 13:26:47 https://grafana.wikimedia.org/d/000000291/thumbor 439 2018-09-11 14:24:17 https://grafana.wikimedia.org/d/000000439/varnish- 102 2018-09-12 15:00:38 https://grafana.wikimedia.org/d/000000102/producti 263 2018-09-19 15:58:27 https://grafana.wikimedia.org/d/000000263/ores-ext 212 2018-09-19 23:45:09 https://grafana.wikimedia.org/d/000000212/mediawik 254 2018-09-25 09:45:46 https://grafana.wikimedia.org/d/000000254/echo-men Status Notifications 270 2018-09-25 09:46:12 https://grafana.wikimedia.org/d/000000270/echo-men 213 2018-09-25 09:47:05 https://grafana.wikimedia.org/d/000000213/mediawik 346 2018-09-25 
09:52:13 https://grafana.wikimedia.org/d/000000346/mediawik 260 2018-09-25 09:53:46 https://grafana.wikimedia.org/d/000000260/mediawik 577 2018-09-25 22:33:28 https://grafana.wikimedia.org/d/000000577/restbase 553 2018-09-26 22:24:30 https://grafana.wikimedia.org/d/000000553/mediawik 218 2018-09-28 08:58:30 https://grafana.wikimedia.org/d/000000218/navigati 354 2018-10-02 07:54:14 https://grafana.wikimedia.org/d/000000354/piwik 273 2018-10-02 09:23:54 https://grafana.wikimedia.org/d/000000273/mysql 42 2018-10-03 16:27:11 https://grafana.wikimedia.org/d/000000042/parsoid- 278 2018-10-03 19:12:17 https://grafana.wikimedia.org/d/000000278/mysql-ag 146 2018-10-03 19:13:44 https://grafana.wikimedia.org/d/000000146/webpaget 205 2018-10-03 19:14:34 https://grafana.wikimedia.org/d/000000205/mobile-2 25 2018-10-03 19:14:58 https://grafana.wikimedia.org/d/000000025/https 34 2018-10-09 09:44:06 https://grafana.wikimedia.org/d/000000034/media 421 2018-10-09 15:21:09 https://grafana.wikimedia.org/d/000000421/pybal 503 2018-10-10 21:01:02 https://grafana.wikimedia.org/d/000000503/varnish- 555 2018-10-13 16:48:20 https://grafana.wikimedia.org/d/000000555/host-ove AbuseFilter Profiling 393 2018-10-14 01:53:20 https://grafana.wikimedia.org/d/000000393/mediawik 586 2018-10-18 06:54:03 https://grafana.wikimedia.org/d/000000586/memcache 563 2018-10-18 18:31:23 https://grafana.wikimedia.org/d/000000563/proton 244 2018-10-18 18:37:45 https://grafana.wikimedia.org/d/000000244/article- 258 2018-10-19 06:54:36 https://grafana.wikimedia.org/d/000000258/analytic 505 2018-10-23 16:59:43 https://grafana.wikimedia.org/d/000000505/eventlog 27 2018-10-24 15:56:40 https://grafana.wikimedia.org/d/000000027/kafka 379 2018-10-25 14:04:51 https://grafana.wikimedia.org/d/000000379/hive - Mjolnir Bulk Updates 591 2018-10-30 21:01:28 https://grafana.wikimedia.org/d/000000591/elastics 18 2018-10-31 17:00:42 https://grafana.wikimedia.org/d/000000018/eventlog 438 2018-11-02 01:07:33 
https://grafana.wikimedia.org/d/000000438/mediawik - Instance Breakdown 450 2018-11-05 14:54:07 https://grafana.wikimedia.org/d/000000450/varnish- 228 2018-11-06 18:58:41 https://grafana.wikimedia.org/d/000000228/ntp-time 44 2018-11-06 19:17:44 https://grafana.wikimedia.org/d/000000044/parsoid- 594 2018-11-12 11:03:53 https://grafana.wikimedia.org/d/000000594/zuul-pip 584 2018-11-16 10:14:42 https://grafana.wikimedia.org/d/000000584/swift-4g 135 2018-11-16 22:04:14 https://grafana.wikimedia.org/d/000000135/parsoid- 45 2018-11-16 22:06:29 https://grafana.wikimedia.org/d/000000045/parsoid- 257 2018-11-19 21:35:05 https://grafana.wikimedia.org/d/000000257/tcp-fast Memory - prometheus 462 2018-11-20 13:07:10 https://grafana.wikimedia.org/d/000000462/elastics 20 2018-11-20 14:27:22 https://grafana.wikimedia.org/d/000000020/graphite 337 2018-11-20 17:27:38 https://grafana.wikimedia.org/d/000000337/graphite 596 2018-11-21 16:33:14 https://grafana.wikimedia.org/d/000000596/rsyslog 559 2018-11-21 18:21:13 https://grafana.wikimedia.org/d/000000559/api-requ 605 2018-11-26 14:50:04 https://grafana.wikimedia.org/d/000000605/datacent 201 2018-11-27 15:13:12 https://grafana.wikimedia.org/d/000000201/eventbus 68 2018-11-27 15:29:41 https://grafana.wikimedia.org/d/000000068/restbase 37 2018-11-30 19:17:13 https://grafana.wikimedia.org/d/000000037/mw-js-de - Mjolnir msearch 616 2018-11-30 19:44:25 https://grafana.wikimedia.org/d/000000616/elastics 538 2018-12-03 08:23:04 https://grafana.wikimedia.org/d/000000538/druid 550 2018-12-04 16:06:11 https://grafana.wikimedia.org/d/000000550/mediawik 232 2018-12-05 14:19:02 https://grafana.wikimedia.org/d/000000232/navigati 230 2018-12-05 14:19:05 https://grafana.wikimedia.org/d/000000230/navigati 585 2018-12-05 15:59:07 https://grafana.wikimedia.org/d/000000585/hadoop 11 2018-12-05 19:17:12 https://grafana.wikimedia.org/d/000000011/service- 180 2018-12-06 14:49:00 https://grafana.wikimedia.org/d/000000180/varnish- 93 2018-12-06 20:06:56 
https://grafana.wikimedia.org/d/000000093/varnish- 578 2018-12-07 21:27:43 https://grafana.wikimedia.org/d/000000578/arc-lamp 621 2018-12-10 21:06:15 https://grafana.wikimedia.org/d/000000621/traffic 569 2018-12-10 21:08:03 https://grafana.wikimedia.org/d/000000569/ats-cach 304 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000304/promethe Varnish Aggregate Client Stat 464 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000464/promethe Last Week Comparison 541 2018-12-10 21:08:05 https://grafana.wikimedia.org/d/000000541/varnish- 50 2018-12-10 23:39:54 https://grafana.wikimedia.org/d/000000050/performa 66 2018-12-11 00:00:49 https://grafana.wikimedia.org/d/000000066/resource 38 2018-12-11 01:41:19 https://grafana.wikimedia.org/d/000000038/navigati 402 2018-12-11 01:48:44 https://grafana.wikimedia.org/d/000000402/resource 549 2018-12-11 08:35:36 https://grafana.wikimedia.org/d/000000549/mcrouter 622 2018-12-11 14:47:39 https://grafana.wikimedia.org/d/000000622/swift Aggregate Client Status Codes 623 2018-12-11 14:47:40 https://grafana.wikimedia.org/d/000000623/varnish- 610 2018-12-11 16:51:08 https://grafana.wikimedia.org/d/000000610/ats-inst 431 2018-12-12 06:06:19 https://grafana.wikimedia.org/d/000000431/webpager 572 2018-12-12 06:13:23 https://grafana.wikimedia.org/d/000000572/webpager 491 2018-12-12 06:19:10 https://grafana.wikimedia.org/d/000000491/webpager 490 2018-12-12 06:23:12 https://grafana.wikimedia.org/d/000000490/webpager 210 2018-12-12 06:26:44 https://grafana.wikimedia.org/d/000000210/webpaget 95 2018-12-12 06:28:44 https://grafana.wikimedia.org/d/000000095/webpaget 318 2018-12-12 06:30:23 https://grafana.wikimedia.org/d/000000318/webpaget 183 2018-12-12 19:08:07 https://grafana.wikimedia.org/d/000000183/mobileap 326 2018-12-12 23:02:58 https://grafana.wikimedia.org/d/000000326/navigati debugging kubernetes 620 2018-12-13 12:01:32 https://grafana.wikimedia.org/d/000000620/xxxx-zot 255 2018-12-13 14:41:24 
https://grafana.wikimedia.org/d/000000255/ores 593 2018-12-14 06:10:27 https://grafana.wikimedia.org/d/000000593/service- 143 2018-12-14 23:52:15 https://grafana.wikimedia.org/d/000000143/navigati 26 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000026/joal-kaf 219 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000219/experime - to be deleted 506 2018-12-17 13:47:14 https://grafana.wikimedia.org/d/000000506/eventlog old to delete 520 2018-12-17 13:47:15 https://grafana.wikimedia.org/d/000000520/kafka-mi 524 2018-12-17 13:47:15 https://grafana.wikimedia.org/d/000000524/kafka-by 112 2018-12-17 13:47:16 https://grafana.wikimedia.org/d/000000112/labs-pro to be deleted (T178690) 627 2018-12-17 13:47:44 https://grafana.wikimedia.org/d/000000627/dashboar 628 2018-12-17 13:54:31 https://grafana.wikimedia.org/d/000000628/user-das 5 2018-12-17 13:54:32 https://grafana.wikimedia.org/d/000000005/bd808-te 552 2018-12-17 13:54:33 https://grafana.wikimedia.org/d/000000552/dashboar 487 2018-12-17 13:54:34 https://grafana.wikimedia.org/d/000000487/imarlier 414 2018-12-17 13:54:36 https://grafana.wikimedia.org/d/000000414/jgreen-f 311 2018-12-17 13:54:37 https://grafana.wikimedia.org/d/000000311/julien-m 527 2018-12-17 13:54:37 https://grafana.wikimedia.org/d/000000527/joal-num 31 2018-12-17 13:54:38 https://grafana.wikimedia.org/d/000000031/krinkle- - work in progress - gehel 537 2018-12-17 13:54:38 https://grafana.wikimedia.org/d/000000537/jvm-over 378 2018-12-17 13:54:39 https://grafana.wikimedia.org/d/000000378/ladsgrou 558 2018-12-17 13:54:39 https://grafana.wikimedia.org/d/000000558/krinkle- 564 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000564/logstash 592 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000592/lucas-sa 614 2018-12-17 13:54:40 https://grafana.wikimedia.org/d/000000614/memcache Sandbox - Mobile 2G 518 2018-12-17 13:54:41 https://grafana.wikimedia.org/d/000000518/niedziel 600 2018-12-17 13:54:41 
https://grafana.wikimedia.org/d/000000600/xxxx-cda 595 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000595/xxxx-cda 597 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000597/xxxx-cda 603 2018-12-17 13:54:42 https://grafana.wikimedia.org/d/000000603/xxxx-cwh 619 2018-12-17 13:54:43 https://grafana.wikimedia.org/d/000000619/xxxxx-ku 598 2018-12-18 04:01:13 https://grafana.wikimedia.org/d/000000598/content- 85 2018-12-18 20:40:14 https://grafana.wikimedia.org/d/000000085/save-tim 249 2018-12-19 01:19:57 https://grafana.wikimedia.org/d/000000249/edit-sta 288 2018-12-19 16:40:52 https://grafana.wikimedia.org/d/000000288/team-tcb 139 2018-12-19 16:41:42 https://grafana.wikimedia.org/d/000000139/scribunt 341 2018-12-20 11:43:18 https://grafana.wikimedia.org/d/000000341/dns 399 2018-12-20 11:43:35 https://grafana.wikimedia.org/d/000000399/dns-recu 479 2018-12-20 11:44:31 https://grafana.wikimedia.org/d/000000479/frontend 377 2018-12-20 11:46:52 https://grafana.wikimedia.org/d/000000377/host-ove 630 2018-12-20 11:47:04 https://grafana.wikimedia.org/d/000000630/filippo- 607 2018-12-20 15:20:52 https://grafana.wikimedia.org/d/000000607/cluster- 631 2018-12-20 15:41:41 https://grafana.wikimedia.org/d/000000631/drafts 366 2018-12-20 15:42:10 https://grafana.wikimedia.org/d/000000366/network- 580 2018-12-20 15:42:10 https://grafana.wikimedia.org/d/000000580/apache-b 106 2018-12-20 15:42:11 https://grafana.wikimedia.org/d/000000106/parser-c 513 2018-12-20 15:42:11 https://grafana.wikimedia.org/d/000000513/ping-off 632 2018-12-20 15:45:12 https://grafana.wikimedia.org/d/000000632/fundrais 403 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000403/fundrais 408 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000408/fundrais 412 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000412/fundrais 424 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000424/fundrais 425 2018-12-20 15:45:39 https://grafana.wikimedia.org/d/000000425/fundrais 401 
2018-12-20 15:45:40 https://grafana.wikimedia.org/d/000000401/fundrais 633 2018-12-20 15:47:41 https://grafana.wikimedia.org/d/000000633/wmcs 571 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000571/cloudvps 576 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000576/cloud-ca 617 2018-12-20 15:47:58 https://grafana.wikimedia.org/d/000000617/cloudvps 32 2018-12-20 15:47:59 https://grafana.wikimedia.org/d/000000032/labs-mon 240 2018-12-20 15:47:59 https://grafana.wikimedia.org/d/000000240/labs-dns 339 2018-12-20 15:48:00 https://grafana.wikimedia.org/d/000000339/labs-nov 225 2018-12-20 15:48:01 https://grafana.wikimedia.org/d/000000225/labs-cap 338 2018-12-20 15:48:01 https://grafana.wikimedia.org/d/000000338/labstore 33 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000033/labvirt- 181 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000181/openldap 568 2018-12-20 15:48:02 https://grafana.wikimedia.org/d/000000568/labstore 405 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000405/wmcs-api 579 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000579/wmcs-ope eqiad1 hypervisor 624 2018-12-20 15:48:03 https://grafana.wikimedia.org/d/000000624/wmcs-ope 629 2018-12-20 15:48:56 https://grafana.wikimedia.org/d/000000629/blocknot 394 2018-12-20 16:25:24 https://grafana.wikimedia.org/d/000000394/frdb2001 567 2018-12-20 16:25:24 https://grafana.wikimedia.org/d/000000567/frtechma 360 2018-12-20 16:25:40 https://grafana.wikimedia.org/d/000000360/jobqueue 196 2018-12-20 16:28:44 https://grafana.wikimedia.org/d/000000196/pageview 609 2018-12-20 16:29:36 https://grafana.wikimedia.org/d/000000609/php-metr 608 2018-12-20 16:33:06 https://grafana.wikimedia.org/d/000000608/datacent 274 2018-12-20 16:33:41 https://grafana.wikimedia.org/d/000000274/promethe Responses NGINX vs Varnish 612 2018-12-20 16:35:14 https://grafana.wikimedia.org/d/000000612/frontend 561 2018-12-20 16:38:02 https://grafana.wikimedia.org/d/000000561/logstash 451 2018-12-20 
16:38:25 https://grafana.wikimedia.org/d/000000451/mail 272 2018-12-20 16:40:44 https://grafana.wikimedia.org/d/000000272/mysql-co 562 2018-12-20 16:45:43 https://grafana.wikimedia.org/d/000000562/network- 634 2018-12-20 16:48:39 https://grafana.wikimedia.org/d/000000634/wikidata 154 2018-12-20 16:48:42 https://grafana.wikimedia.org/d/000000154/wikidata 169 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000169/wikidata 485 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000485/wikidata 601 2018-12-20 16:48:43 https://grafana.wikimedia.org/d/000000601/wikidata 167 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000167/wikidata 182 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000182/wikidata 560 2018-12-20 16:48:44 https://grafana.wikimedia.org/d/000000560/wikidata 168 2018-12-20 16:48:45 https://grafana.wikimedia.org/d/000000168/wikidata 175 2018-12-20 16:48:45 https://grafana.wikimedia.org/d/000000175/wikidata 156 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000156/wikidata 239 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000239/wikidata 264 2018-12-20 16:48:46 https://grafana.wikimedia.org/d/000000264/wikidata 170 2018-12-20 16:48:47 https://grafana.wikimedia.org/d/000000170/wikidata 615 2018-12-20 16:48:47 https://grafana.wikimedia.org/d/000000615/wikidata Page views (per domain) 158 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000158/wikidata 160 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000160/wikidata 176 2018-12-20 16:48:48 https://grafana.wikimedia.org/d/000000176/wikidata 344 2018-12-20 16:48:49 https://grafana.wikimedia.org/d/000000344/wikidata 489 2018-12-20 16:48:49 https://grafana.wikimedia.org/d/000000489/wikidata 162 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000162/wikidata 290 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000290/wikidata Query Service Frontend 522 2018-12-20 16:48:50 https://grafana.wikimedia.org/d/000000522/wikidata 159 2018-12-20 
16:48:51 https://grafana.wikimedia.org/d/000000159/wikidata 172 2018-12-20 16:48:51 https://grafana.wikimedia.org/d/000000172/wikidata 188 2018-12-20 16:48:51 https://grafana.wikimedia.org/d/000000188/wikidata 163 2018-12-20 16:48:52 https://grafana.wikimedia.org/d/000000163/wikidata 209 2018-12-20 16:48:52 https://grafana.wikimedia.org/d/000000209/wikidata change handling (WikiPage 227 2018-12-20 16:48:53 https://grafana.wikimedia.org/d/000000227/wikidata 574 2018-12-20 16:52:59 https://grafana.wikimedia.org/d/000000574/t204083- 581 2018-12-20 16:57:54 https://grafana.wikimedia.org/d/000000581/switchov 226 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000226/wikibase 265 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000265/wikibase 516 2018-12-20 17:19:58 https://grafana.wikimedia.org/d/000000516/wikibase 548 2018-12-20 17:19:59 https://grafana.wikimedia.org/d/000000548/wikibase 626 2018-12-20 17:19:59 https://grafana.wikimedia.org/d/000000626/wikibase wb_terms newItemIdFormatter 599 2018-12-20 17:20:00 https://grafana.wikimedia.org/d/000000599/wikibase 10 2018-12-20 17:20:26 https://grafana.wikimedia.org/d/000000010/ciphers 590 2018-12-20 17:20:28 https://grafana.wikimedia.org/d/000000590/wmcs-nod 429 2018-12-20 23:13:32 https://grafana.wikimedia.org/d/000000429/backend- 587 2018-12-20 23:20:50 https://grafana.wikimedia.org/d/000000587/phabrica 618 2018-12-21 05:15:50 https://grafana.wikimedia.org/d/000000618/blocknot 551 2018-12-21 10:08:37 https://grafana.wikimedia.org/d/000000551/performa 566 2018-12-21 11:36:07 https://grafana.wikimedia.org/d/000000566/reading- 536 2018-12-21 14:20:53 https://grafana.wikimedia.org/d/000000536/dashboar 4 2018-12-28 14:45:25 https://grafana.wikimedia.org/d/000000004/authenti 316 2018-12-30 15:24:22 https://grafana.wikimedia.org/d/000000316/memcache Indexing - prometheus (Add 635 2019-01-02 07:40:20 https://grafana.wikimedia.org/d/000000635/elastics 636 2019-01-02 14:15:56 
https://grafana.wikimedia.org/d/000000636/memcache

fgiunchedi mentioned this in T212839: Remove "prometheus" from elasticsearch grafana dashboard names. Jan 3 2019, 9:56 AM

• Mathew.onipe closed subtask T212839: Remove "prometheus" from elasticsearch grafana dashboard names as Resolved. Jan 4 2019, 1:55 AM

I've just seen that a dashboard I use is scheduled for deletion. I don't see the replacement as particularly better, and I find it lacking. Could you have a look at how other people are doing these, such as https://pmmdemo.percona.com/graph/d/qyzrQGHmk/system-overview ? Those dashboards can be downloaded at https://github.com/percona/grafana-dashboards

Jaime, going to have to guess here; are you referring to "Prometheus machine stats" (marked for deletion) vs "Host overview"?

In T178690#4876994, @CDanis wrote:

Jaime, going to have to guess here; are you referring to "Prometheus machine stats" (marked for deletion) vs "Host overview"?

Yes.

greg unsubscribed. Jan 14 2019, 6:28 PM

In T178690#4877021, @jcrespo wrote:

In T178690#4876994, @CDanis wrote:

Jaime, going to have to guess here; are you referring to "Prometheus machine stats" (marked for deletion) vs "Host overview"?

Yes.

I've kind of just met the same problem, and had the same immediate reaction as @jcrespo. But the more I look at "Host overview", the more I like it. @jcrespo, I think it's important to spell out exactly what you don't like or find lacking in the new dashboard. It does use the "USE" [1] method, so it takes a bit to wrap your mind around the methodology, which might or might not be suited to your use case. But in that case we should know.

AFAICT, my gripe is the misc section, where it's not at all clear to me what "misc: errors" is about. I also find that we might need to add some icmp/tcp/udp graphs, perhaps in a row that is collapsed by default?

[1] http://www.brendangregg.com/usemethod.html

akosiaris triaged this task as Low priority. Jan 25 2019, 2:54 PM

Ah, forgot to update the task, but at the time @jcrespo and @fgiunchedi and I talked, and Jaime's biggest gripe was that iostat-reported "disk IO utilization" is not a very useful metric: it's the fraction of time that at least one outstanding iop was in the disk's queue. On a server that has any load at all, this metric will generally be "100%" all the time; what you actually care about are stats like "queue depth" and "request latencies".

https://www.percona.com/blog/2017/08/28/looking-disk-utilization-and-saturation/ had some good thoughts on the issue as well.
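To make the difference between those numbers concrete, here is a minimal sketch that derives utilization, average queue depth, and average request latency from two samples of a Linux `/proc/diskstats` line. The names `DiskSample`, `parse_diskstats_line`, and `disk_metrics` are made up for illustration, and the sample values are synthetic; this is not how any of the dashboards discussed here compute their panels.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Subset of the per-device counters in /proc/diskstats (hypothetical helper type)."""
    reads: int        # field 1: reads completed
    read_ms: int      # field 4: milliseconds spent reading
    writes: int       # field 5: writes completed
    write_ms: int     # field 8: milliseconds spent writing
    io_ticks_ms: int  # field 10: milliseconds with at least one I/O in flight
    weighted_ms: int  # field 11: in-flight I/O count integrated over time (ms)


def parse_diskstats_line(line: str) -> tuple[str, DiskSample]:
    """Parse one /proc/diskstats line into (device name, counters)."""
    fields = line.split()
    stats = [int(x) for x in fields[3:]]  # fields[0:3] are major, minor, name
    sample = DiskSample(
        reads=stats[0],
        read_ms=stats[3],
        writes=stats[4],
        write_ms=stats[7],
        io_ticks_ms=stats[9],
        weighted_ms=stats[10],
    )
    return fields[2], sample


def disk_metrics(prev: DiskSample, curr: DiskSample, interval_ms: float) -> dict:
    """Compute utilization, queue depth and latency over one sampling interval."""
    ios = (curr.reads - prev.reads) + (curr.writes - prev.writes)
    wait_ms = (curr.read_ms - prev.read_ms) + (curr.write_ms - prev.write_ms)
    return {
        # Fraction of the interval with at least one request outstanding.
        "util": (curr.io_ticks_ms - prev.io_ticks_ms) / interval_ms,
        # Average number of requests in flight during the interval.
        "avg_queue_depth": (curr.weighted_ms - prev.weighted_ms) / interval_ms,
        # Average time per completed request, in milliseconds.
        "await_ms": wait_ms / ios if ios else 0.0,
    }


# Two synthetic samples of the same device, taken one second apart.
_, before = parse_diskstats_line("8 0 sda 1000 0 8000 2000 500 0 4000 3000 0 5000 6000")
_, after = parse_diskstats_line("8 0 sda 1600 0 12800 5000 900 0 7200 9000 4 6000 15000")
metrics = disk_metrics(before, after, interval_ms=1000)
print(metrics)
```

In this synthetic case the disk is "100% utilized", yet that alone says nothing about whether requests are waiting behind nine others or behind none; the queue depth and latency figures are what distinguish the two.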

@akosiaris we had some chat about details. I don't mind the USE pattern, but using USE doesn't make a graph good if the chosen metrics are poor, like in the above example. Note also that they were probably in a worse state before my comments 0:-)

In T178690#4909073, @jcrespo wrote:

@akosiaris we had some chat about details. I don't mind the USE pattern, but using USE doesn't make a graph good if the chosen metrics are poor, like in the above example. Note also that they were probably in a worse state before my comments 0:-)

Sure. Seems like I was just missing context. Thanks for the update.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:31 PM

@fgiunchedi I had a stab at a RED pattern dashboard for mathoid. Let me know what you think.

https://grafana.wikimedia.org/d/000000187/service-mathoid?refresh=1m&orgId=1

Still on my TODO list: adding p50, p90, and p99 for latency, which are currently missing.
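For readers unfamiliar with RED (Rate, Errors, Duration), the sketch below shows what the three numbers, including p50/p90/p99 latency, look like when computed from raw request records. It is only an illustration: `Request`, `percentile`, and `red_summary` are hypothetical names, the data is synthetic, counting only 5xx responses as errors is an assumption, and a real dashboard would normally derive these from the metrics backend rather than from individual requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record type)."""
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def percentile(sorted_values: list, p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("percentile of empty list")
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests: list, window_s: float) -> dict:
    """Summarise a window of requests as Rate, Errors and Duration."""
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "p50_ms": percentile(durations, 50),
        "p90_ms": percentile(durations, 90),
        "p99_ms": percentile(durations, 99),
    }


# Synthetic window: 100 requests over 10 seconds, 5 of them failing.
requests = [Request(500 if i <= 5 else 200, float(i)) for i in range(1, 101)]
summary = red_summary(requests, window_s=10.0)
print(summary)
```

Showing several percentiles rather than a single average matters because a small share of slow requests can be invisible in the mean while dominating p99.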

fgiunchedi moved this task from In progress to Up next on the observability board. Mar 18 2019, 1:58 PM

fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. May 13 2019, 8:57 AM

It would be nice to make it a little clearer what the intended replacement is (either put it in the task description or the description of the dashboard to be deleted) so one does not have to read through the conversation in this task to know what to update links to.

fgiunchedi moved this task from Up next to Backlog on the User-fgiunchedi board. Oct 9 2019, 11:31 PM

fgiunchedi moved this task from Up next to Inbox on the observability board. Oct 28 2019, 2:16 PM

fgiunchedi moved this task from Inbox to Backlog on the observability board. Apr 6 2020, 12:35 PM

Paladox subscribed. Apr 12 2020, 8:12 PM

CDanis mentioned this in T253655: Document and/or improve navigation of the various HTTP frontend Grafana dashboards. May 26 2020, 4:54 PM

lmata subscribed. Jun 30 2020, 4:25 PM

Aklapper removed a project: Patch-For-Review. Oct 15 2020, 1:38 PM

fgiunchedi removed fgiunchedi as the assignee of this task. Oct 27 2020, 3:39 PM

hashar mentioned this in T161227: Prometheus graph incorrectly sums CPU user and CPU guest. Jun 9 2021, 9:08 AM

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:22 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Metrics; removed SRE Observability. Aug 9 2021, 1:10 AM

lmata moved this task from Inbox to Backlog on the Observability-Metrics board. Sep 6 2022, 7:32 PM

Mentioned in SAL (#wikimedia-operations) [2023-01-10T13:44:57Z] <godog> delete grafana dashboards from "sre dashboards for deletion" folder - T178690

fgiunchedi removed a project: User-fgiunchedi. Mar 8 2023, 12:29 PM
