
test prometheus mysqld-exporter
Closed, Resolved, Public

Description

I've played with https://github.com/prometheus/mysqld_exporter in labs to monitor some labsdb boxes. For example, https://prometheus.wmflabs.org/grafana/dashboard/db/mysql-labsdb-cluster reads from the https://prometheus.wmflabs.org prometheus server, which queries several mysqld_exporter endpoints, one per labsdb box.

Tracking here what's missing:

  • package mysqld_exporter into prometheus-mysqld-exporter package
  • add puppetization for mysqld_exporter
  • pick a sample of db hosts to monitor
  • read-only mysql user access to sample db hosts
  • enable all relevant flags from https://github.com/prometheus/mysqld_exporter#collector-flags in the configuration (Note: some flags will have to be handled separately due to privacy concerns)
  • add monitored db hosts to prometheus config

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +1 -0
operations/puppet | production | +0 -1
operations/puppet | production | +95 -0
operations/puppet | production | +63 -0
operations/puppet | production | +2 -1
operations/puppet | production | +5 -5
operations/puppet | production | +6 -0
operations/puppet | production | +54 -0
operations/puppet | production | +6 -2
operations/puppet | production | +107 -1
operations/puppet | production | +33 -1
operations/puppet | production | +19 -1
operations/puppet | production | +23 -0
operations/puppet | production | +18 -0
operations/puppet | production | +13 -0
operations/puppet | production | +2 -5
operations/puppet | production | +5 -20
operations/puppet | production | +244 -6
operations/puppet | production | +10 -20
operations/puppet | production | +11 -3
operations/puppet | production | +11 -0
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +40 -1
operations/puppet | production | +120 -21
operations/puppet | production | +11 -1
operations/puppet | production | +10 -1
operations/puppet | production | +76 -0

Event Timeline

Change 302680 merged by Jcrespo:
Add prometheus's mysql-exporter to all jessie's production codfw dbs

https://gerrit.wikimedia.org/r/302680

Change 302904 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: add mysql job configuration

https://gerrit.wikimedia.org/r/302904

Change 302904 merged by Filippo Giunchedi:
prometheus: add mysql job configuration

https://gerrit.wikimedia.org/r/302904

Change 305972 had a related patch set uploaded (by Filippo Giunchedi):
mariadb: add mysql/node prometheus metrics for db2034

https://gerrit.wikimedia.org/r/305972

Change 305972 merged by Filippo Giunchedi:
mariadb: add mysql/node prometheus metrics for db2034

https://gerrit.wikimedia.org/r/305972

Change 305976 had a related patch set uploaded (by Filippo Giunchedi):
mariadb: collect prometheus stats in codfw

https://gerrit.wikimedia.org/r/305976

Change 305976 merged by Filippo Giunchedi:
mariadb: collect prometheus stats in codfw

https://gerrit.wikimedia.org/r/305976

Change 306174 had a related patch set uploaded (by Filippo Giunchedi):
mariadb: install node/mysql exporters in eqiad too

https://gerrit.wikimedia.org/r/306174

So, some impressions I got after using grafana as a frontend for prometheus:

  • We are missing key metrics: SHOW ENGINE INNODB STATUS, SHOW PROCESSLIST and latency options. Latency is a more complex issue; I will talk about it separately.

    "-collect.global_status -collect.global_variables -collect.info_schema.innodb_metrics -collect.info_schema.processlist -collect.info_schema.processlist.min_time X -collect.slave_status" I do not know what X should be (0?). innodb_metrics and processlist should solve the above problem.

    Let's test this before adding the deployment to eqiad for performance implications. Once we have that, we will have covered all current tendril graphing options.
  • I do not think we will have proper support for multi-source replication. This is not a blocker for core or for the goal, but we definitely need support for that. We should try dbstore2* hosts to see what happens.
  • pt-heartbeat: we need support for it. Currently, mediawiki uses this rather than Seconds_behind_master(!). It is more reliable (if replication is not running, SHOW SLAVE STATUS returns null) and easier to handle with multi-source. It is also less prone to getting blocked.
  • table statistics: currently, we gather table statistics for all available tables; as the prometheus mysql user only has access to the mysql and heartbeat tables, it only provides those. This is good, because running those on s3 hosts would be almost impossible (tens of thousands of tables), and bad, because it doesn't run on the interesting tables. These metrics are interesting, but maybe they should run on a single server, or on all servers, once a day or so. They are very static, so we do not need the number of rows every minute.
  • latency metrics: this is the most problematic part. There is some performance_schema support, but we talked about not exposing queries publicly, as they could contain private data. There are easier alternatives as extra metrics, but they require plugins installed on mysql that (in most cases) are not very taxing, and the whole idea of performance_schema is not to have those plugins. I do not know how to go with this, but we definitely need an "average latency" and a "latency of queries by group" measurement.
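
To make the pt-heartbeat point concrete, here is a minimal sketch of how lag could be derived so that a stopped replica does not simply report null. It assumes the heartbeat table yields a timezone-aware UTC timestamp written on the master; the function and parameter names are hypothetical and are not part of mysqld_exporter.

```python
from datetime import datetime, timezone
from typing import Optional


def replication_lag_seconds(
    heartbeat_ts: Optional[datetime],
    seconds_behind_master: Optional[int],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return replication lag in seconds, preferring the heartbeat timestamp.

    heartbeat_ts: last timestamp seen in the heartbeat table (assumed to be
        timezone-aware UTC), or None if no row is available.
    seconds_behind_master: value from SHOW SLAVE STATUS, which is None when
        replication is not running.
    """
    now = now or datetime.now(timezone.utc)
    if heartbeat_ts is not None:
        # The heartbeat keeps growing even while replication is stopped,
        # so it still reflects how stale the replica is.
        return max(0.0, (now - heartbeat_ts).total_seconds())
    if seconds_behind_master is not None:
        return float(seconds_behind_master)
    # Neither source is available: lag is unknown.
    return None
```

With multi-source replication, the same function could be called once per source, assuming each source writes its own heartbeat row.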

I have created a sample dashboard (per server), similar to that of tendril:
https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-server=db2019

Performance seems to be very good.

Next, I will create a dashboard with aggregated values to push it to the limits.

Change 306404 had a related patch set uploaded (by Jcrespo):
Configuration changes regarding firewall and mysql for prometheus

https://gerrit.wikimedia.org/r/306404

Change 306404 merged by Jcrespo:
Configuration changes regarding mysql exporter for prometheus

https://gerrit.wikimedia.org/r/306404

Change 306470 had a related patch set uploaded (by Jcrespo):
Fix mysqld exporter prometheus config not working in trusty

https://gerrit.wikimedia.org/r/306470

Change 306470 merged by Jcrespo:
Fix mysqld exporter prometheus config not working in trusty

https://gerrit.wikimedia.org/r/306470

So the changes to the defaults do not work on trusty; I will revert that to:
"ARGS="

I cannot find any log of why it fails, so I cannot debug it, but the syntax is probably different.
The above patch at least will keep it working until a more permanent solution is figured out.
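
As a rough illustration of the defaults line being discussed, the sketch below renders an ARGS assignment with the whole flag list inside one pair of double quotes. The flag names are the ones quoted earlier in this task; the helper name, the quoting approach and the min_time value are assumptions, and the real cause of the trusty failure is not confirmed here.

```python
from typing import List, Optional

# Collector flags quoted earlier in this task.
BASE_FLAGS = [
    "-collect.global_status",
    "-collect.global_variables",
    "-collect.info_schema.innodb_metrics",
    "-collect.info_schema.processlist",
    "-collect.slave_status",
]


def render_args_line(flags: List[str], processlist_min_time: Optional[int] = None) -> str:
    """Render a single ARGS="..." line for a shell-style defaults file (hypothetical helper)."""
    parts = list(flags)
    if processlist_min_time is not None:
        # The right value for min_time is an open question in this task.
        parts.append(f"-collect.info_schema.processlist.min_time {processlist_min_time}")
    for part in parts:
        if '"' in part:
            # Keep the sketch simple: refuse values that would break the quoting.
            raise ValueError(f"flag contains a double quote: {part!r}")
    return 'ARGS="' + " ".join(parts) + '"'


if __name__ == "__main__":
    # min_time=0 is only the guess mentioned above, not a recommendation.
    print(render_args_line(BASE_FLAGS, processlist_min_time=0))
```

Calling it with an empty flag list yields ARGS="", which corresponds to running the exporter with its built-in defaults.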

Another thing I cannot easily have in grafana is the ability to create dashboards; that, together with the fact that handling text is not very friendly, may require some alternative solution for a dashboard. The idea is to have a table for quick reference in case of ongoing issues, with:

hostname, ip, mysql version, memory, uptime, QPS, average latency, replication state (running/stopped) and replication lag

On click, it would lead me to the detailed view (https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-server=$hostname)
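
A minimal sketch of that quick-reference table, assuming the per-host values have already been fetched from somewhere (how they are collected is out of scope here). The column list and the detail URL come from the description above; the function names and the tab-separated output are hypothetical choices.

```python
from typing import Dict, List

# URL pattern of the per-server dashboard mentioned above.
DETAIL_URL = "https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-server={hostname}"

# Columns listed in the comment above.
COLUMNS = [
    "hostname", "ip", "mysql version", "memory", "uptime",
    "QPS", "average latency", "replication state", "replication lag",
]


def detail_url(hostname: str) -> str:
    """Link to the detailed per-server view for one host."""
    return DETAIL_URL.format(hostname=hostname)


def render_overview(rows: List[Dict[str, str]]) -> str:
    """Render rows as tab-separated text, one line per host, plus a details link."""
    lines = ["\t".join(COLUMNS + ["details"])]
    for row in rows:
        # Missing values are shown as "?" instead of failing the whole table.
        cells = [str(row.get(col, "?")) for col in COLUMNS]
        cells.append(detail_url(row["hostname"]))
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

A real version would more likely emit HTML with clickable links; plain text keeps the sketch short.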

Change 306645 had a related patch set uploaded (by Jcrespo):
Quote ARGS parameters for trusty compatibility

https://gerrit.wikimedia.org/r/306645

Change 306645 merged by Jcrespo:
Quote ARGS parameters for trusty compatibility

https://gerrit.wikimedia.org/r/306645

Change 306675 had a related patch set uploaded (by Jcrespo):
Puppetize static configuration for prometheus-mysqld-exporter

https://gerrit.wikimedia.org/r/306675

Change 306675 merged by Jcrespo:
Puppetize static configuration for prometheus-mysqld-exporter

https://gerrit.wikimedia.org/r/306675

Change 306906 had a related patch set uploaded (by Jcrespo):
Fix puppet issues generating empty files for mysql configuration

https://gerrit.wikimedia.org/r/306906

Change 306906 merged by Jcrespo:
Fix puppet issues generating empty files for mysql configuration

https://gerrit.wikimedia.org/r/306906

Change 306174 merged by Filippo Giunchedi:
mariadb: install node/mysql exporters in eqiad too

https://gerrit.wikimedia.org/r/306174

Change 306928 had a related patch set uploaded (by Jcrespo):
prometheus: Test mysqld-exporter on s6 slaves to check load impact

https://gerrit.wikimedia.org/r/306928

Change 306928 merged by Jcrespo:
prometheus: Test mysqld-exporter on s6 slaves to check load impact

https://gerrit.wikimedia.org/r/306928

Change 306936 had a related patch set uploaded (by Jcrespo):
Labsdb: include labs salt groups and prometheus monitoring for dbs

https://gerrit.wikimedia.org/r/306936

Change 306937 had a related patch set uploaded (by Jcrespo):
dbproxy: add prometheus node monitoring

https://gerrit.wikimedia.org/r/306937

Change 306939 had a related patch set uploaded (by Jcrespo):
es2001-4: add node exporter to this standalones hosts

https://gerrit.wikimedia.org/r/306939

Change 306936 merged by Jcrespo:
Labsdb: include labs salt groups and prometheus monitoring for dbs

https://gerrit.wikimedia.org/r/306936

Change 307249 had a related patch set uploaded (by Jcrespo):
prometheus: add labsdb eqiad hosts to monitoring

https://gerrit.wikimedia.org/r/307249

Change 307250 had a related patch set uploaded (by Jcrespo):
prometheus: Add parsercaches on eqiad (and fix the ones on codfw)

https://gerrit.wikimedia.org/r/307250

Change 307249 merged by Jcrespo:
prometheus: add labsdb eqiad hosts to monitoring

https://gerrit.wikimedia.org/r/307249

Change 307250 merged by Jcrespo:
prometheus: Add parsercaches on eqiad (and fix the ones on codfw)

https://gerrit.wikimedia.org/r/307250

Change 307254 had a related patch set uploaded (by Jcrespo):
prometheus: add misc eqiad hosts to mysqld exporter

https://gerrit.wikimedia.org/r/307254

Change 296596 abandoned by Filippo Giunchedi:
prometheus: add mysql mediawiki production db discovery

Reason:
we'll be doing discovery through puppet data, not mw

https://gerrit.wikimedia.org/r/296596

Change 296595 abandoned by Filippo Giunchedi:
prometheus: generate mysql targets from mw config

Reason:
we'll be doing discovery through puppet data, not mw

https://gerrit.wikimedia.org/r/296595

Change 307285 had a related patch set uploaded (by Jcrespo):
Workaround still existing, but irrelevant, precise hosts

https://gerrit.wikimedia.org/r/307285

Change 307285 merged by Jcrespo:
prometheus exporter: avoid still existing precise hosts

https://gerrit.wikimedia.org/r/307285

Change 307293 had a related patch set uploaded (by Jcrespo):
prometheus mysqld exporter: Add dbstore-eqiad hosts

https://gerrit.wikimedia.org/r/307293

@fgiunchedi So there are things that do not work: many labsdb hosts use a different configuration than production for the location of their socket and other exotic options, like datadir. For me that is a bug, and it makes many things complex, but they wanted it like that, and they told me they would maintain the labs/db roles themselves (that is why they are separate from the other roles), so talk to them to solve that.

The other thing was a known issue: db1069 has many instances running at the same time. I mentioned this was not a hard requirement for the goal, but we want to support that at some point, as we will probably increase the number of such instances in the future. The probable way to go is to monitor each instance on a separate port.
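
One way the "separate port per instance" idea could look on the prometheus side is sketched below: one target per instance, in the JSON shape used by prometheus file-based service discovery. The instance names, the port numbering scheme and the label name are hypothetical; 9104 is assumed here as the exporter's usual default port.

```python
import json
from typing import Dict, List

BASE_PORT = 9104  # assumed default mysqld_exporter port; adjust as needed


def instance_targets(host: str, instances: List[str], base_port: int = BASE_PORT) -> List[Dict]:
    """Build one target group per mysql instance, each on its own exporter port."""
    groups = []
    # Sort so that the instance-to-port mapping is stable between runs.
    for offset, instance in enumerate(sorted(instances)):
        groups.append({
            "targets": [f"{host}:{base_port + offset}"],
            "labels": {"mysql_instance": instance},
        })
    return groups


if __name__ == "__main__":
    # Placeholder instance names, for illustration only.
    print(json.dumps(instance_targets("db1069", ["instance_a", "instance_b"]), indent=2))
```

The label lets a dashboard tell instances on the same host apart without parsing the port out of the target address.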

Change 307298 had a related patch set uploaded (by Jcrespo):
prometheus mysqld exporter: disable labsdb1005 because "precise"

https://gerrit.wikimedia.org/r/307298

Change 307254 merged by Jcrespo:
prometheus: add misc eqiad hosts to mysqld exporter

https://gerrit.wikimedia.org/r/307254

Change 307293 merged by Jcrespo:
prometheus mysqld exporter: Add dbstore-eqiad hosts

https://gerrit.wikimedia.org/r/307293

Change 307298 merged by Jcrespo:
prometheus mysqld exporter: disable labsdb1005 because "precise"

https://gerrit.wikimedia.org/r/307298

Change 306939 merged by Jcrespo:
es2001-4: add node exporter to this standalones hosts

https://gerrit.wikimedia.org/r/306939

Change 307479 had a related patch set uploaded (by Jcrespo):
prometheus mysqld exporter: add a bunch of selected slaves from core

https://gerrit.wikimedia.org/r/307479

Change 307479 merged by Jcrespo:
prometheus mysqld exporter: add a bunch of selected slaves from core

https://gerrit.wikimedia.org/r/307479

Change 307503 had a related patch set uploaded (by Jcrespo):
prometheus mysqld exporter: add all pending database instances

https://gerrit.wikimedia.org/r/307503

Change 307503 merged by Jcrespo:
prometheus mysqld exporter: add all pending database instances

https://gerrit.wikimedia.org/r/307503

Change 309241 had a related patch set uploaded (by Jcrespo):
prometheus: Remove db1075 from the s3 slaves; it was duplicated

https://gerrit.wikimedia.org/r/309241

Change 309241 merged by Jcrespo:
prometheus: Remove db1075 from the s3 slaves; it was duplicated

https://gerrit.wikimedia.org/r/309241

Change 306937 abandoned by Jcrespo:
dbproxy: add prometheus node monitoring

Reason:
Already included on standard

https://gerrit.wikimedia.org/r/306937