Page MenuHomePhabricator

Prepare mysql account and options for prometheus
Closed, ResolvedPublic

Description

  • a read only mysql user, allowing connection from localhost only is fine as mysqld_exporter would be co-located on the db box
  • a jessie db host where to run mysqld_exporter
  • (optional) a list of desired flags for mysqld_exporter to be enabled, from https://github.com/prometheus/mysqld_exporter#collector-flags

Event Timeline

jcrespo created this task.Feb 26 2016, 12:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 26 2016, 12:53 PM
jcrespo moved this task from Triage to Backlog on the DBA board.Feb 26 2016, 1:02 PM

I've been testing creating a prometheus user with default options for prometheus-mysqld-exporter

create user 'prometheus'@'::1';
grant replication client on *.* to 'prometheus'@'::1';

this yields 800 metrics

$ curl localhost:9104/metrics -s | grep -v '^#' | wc -l
836

enabling extra metrics yields more results

-collect.info_schema.tablestats=true
-collect.info_schema.userstats=true
-collect.info_schema.query_response_time=true
-collect.auto_increment.columns=true
-collect.binlog_size=true
-collect.perf_schema.eventsstatements=true
-collect.perf_schema.eventswaits=true
-collect.perf_schema.file_events=true
-collect.perf_schema.indexiowaits=true
-collect.perf_schema.tableiowaits=true
-collect.perf_schema.tablelocks=true

[assuming performance schema is already enabled in my.cnf]

grant select on performance_schema.* to 'prometheus'@'::1';
INSTALL SONAME 'query_response_time';
SET GLOBAL query_response_time_stats = on

$ curl localhost:9104/metrics -s | grep -v '^#' | wc -l
2027

though some extra metrics can be of very high cardinality, e.g.

$ curl localhost:9104/metrics -s | grep -v '^#' | grep 'SHOW TABLES'
mysql_perf_schema_events_statements_errors_total{digest="455a5a9375b3816e3b9651f74efc81a2",digest_text="SHOW TABLES ",schema="foo"} 0
mysql_perf_schema_events_statements_no_index_used_total{digest="455a5a9375b3816e3b9651f74efc81a2",digest_text="SHOW TABLES ",schema="foo"} 1

or those per-schema / per-table / per-index, e.g.

mysql_perf_schema_external_lock_waits_total{name="pet",operation="read",schema="foo"} 0
mysql_perf_schema_index_io_waits_seconds_total{index="NONE",name="pet",operation="fetch",schema="foo"}
mysql_perf_schema_table_io_waits_seconds_total{name="pet",operation="insert",schema="foo"} 0
mysql_perf_schema_file_events_total{event_name="wait/io/file/sql/slow_log",mode="read"} 0
jcrespo moved this task from Backlog to In progress on the DBA board.Apr 1 2016, 11:15 AM

I've been testing creating a prometheus user with default options for prometheus-mysqld-exporter

Where? Which host?

I've been testing creating a prometheus user with default options for prometheus-mysqld-exporter

Where? Which host?

I tried on monitoring-prometheus2.eqiad.wmflabs with mariadb::config and mariadb::packages_wmf applied and using prometheus-mysqld-exporter package available on jessie-wikimedia apt

I will set it up on all of production, but only actually apply it to one test host (for now). I have to remember with one it was.

Change 280939 had a related patch set uploaded (by Jcrespo):
[WIP]New user for prometheus monitoring

https://gerrit.wikimedia.org/r/280939

jcrespo added a subscriber: Volans.Apr 1 2016, 4:16 PM

Aside from the above, I have created manually a prometheus@localhost user (temporarily with the same password as nagios, and its same grants: PROCESS, REPLICATION CLIENT) on db2070.

db2070 is a *production server* on codfw, but unlike db2030, it has performance_schema already enabled (aside from jessie) and it is closer to a real server than a misc one. Currently, it receives the same load than any other enwiki slave throught replication, but none of the reads. Depool it if you need to do something delicate (@Volans will know how to), it is not essential otherwise. Generate some traffic (T124697) if you need read traffic in addition to write traffic from its master.

Regarding flags, as I mentioned several times, that collector is nice, but not enough from the point of view of production monitoring. We would want all of them enables except performance_schema or user_stats, as we enable one or the other, but not bothm and engine_tokudb_status, which would not be needed for core servers (only on dbstore/labs, etc.). In order to test performance/etc., focus on SHOW GLOBAL STATUS SHOW GLOBAL VARIABLES and SHOW SLAVE STATUS (the first 3 flags on that page) and that will give you already 400 metrics, and it is MySQL monitoring 101, we will manually tailor the rest for our specific setup (mariadb, multisource replication, pt-heartbeat, etc.). Not all 400 are useful, but that is about the number of metrics we will have in the end.

fgiunchedi added a comment.EditedApr 11 2016, 2:01 PM

thanks @jcrespo ! I had to change the password hashing from old to new or it wouldn't work.

I've started a test at 13.54 UTC by curl'ing every 30s for metrics, for 24h (root screen on db2070) so we can evaluate what would be the impact with default flags:

timeout 24h bash -c 'while sleep 30s ; do curl localhost:9104/metrics > /dev/null ; done'

this has finished, the obvious impact on the graphs afaics while the loop was running is a jump in implicit temporary tables from ~340 to ~370.

I've launched another 24h test with sleep 15s

@fgiunchedi it looks that the behaviour was the same, with increased implicit temporary tables.

I'm curious which metric is generating them.

indeed @Volans ! likely some of the queries from https://github.com/prometheus/mysqld_exporter/blob/e33fca4c900b93f46a70648ed9cf46e3d7b85194/mysqld_exporter.go but I can't tell offhand which

other than that it seems the impact is pretty minimal (?) we'll need to test with more flags to the collector i.e. more metrics

Running "show global status" creates a temporary table. I wouldn't worry too much about it (it is only KBytes on memory). I would be worried more about potential locking needed by things such as SHOW SLAVE STATUS (but we are already doing that every 10 seconds).

That server is receiving 222 QPS. I have not investigated yet, but if that is caused by new monitoring (it is the only server of that kind with such traffic, others have 22 QPS from monitoring, watchdog, LVS, etc.) that is non-negible. There are active servers with less traffic. Needs debugging to see the impact (that servers has performance schema so we can analyze the actual impact) and the actual query patterns (are connections reused or do we have 200 new connections per second?).

no queries from user prometheus should be showing up, the tests ran from 11/04 14.00 UTC to 13/04 15 UTC

Then it is something else 0:-)

Mentioned in SAL [2016-05-10T15:48:45Z] <godog> collect mysqld metrics with prometheus-metrics-collector 0.8.1 on db2070 for 24h T128185

INSTALL SONAME 'query_response_time'; That is a new plugin on mysql itself, and needs review.

see also this recommendation about adding WITH MAX_USER_CONNECTIONS 3; to the prometheus user, for good measure

https://github.com/prometheus/mysqld_exporter/commit/88e7062f002a686586c035003cb4281b2a19d24a

Ok, will do.

jcrespo raised the priority of this task from Low to High.Jul 1 2016, 11:08 AM
jcrespo moved this task from Blocked external/Not db team to Next on the DBA board.

Change 299969 had a related patch set uploaded (by Jcrespo):
Add fake prometheus mysql password

https://gerrit.wikimedia.org/r/299969

Change 299969 merged by Jcrespo:
Add fake prometheus mysql password

https://gerrit.wikimedia.org/r/299969

jcrespo moved this task from Next to In progress on the DBA board.Jul 20 2016, 11:03 AM

Change 280939 merged by Jcrespo:
New user for prometheus monitoring

https://gerrit.wikimedia.org/r/280939

Change 301778 had a related patch set uploaded (by Jcrespo):
Restrict prometheus connections to 5 simultaneous connections

https://gerrit.wikimedia.org/r/301778

Change 301778 merged by Jcrespo:
Restrict prometheus connections to 5 simultaneous connections

https://gerrit.wikimedia.org/r/301778

This is done, but only has been deployed to s1. Only pending its deploy to the other shards.

Mentioned in SAL [2016-08-03T09:29:55Z] <jynus> applying prometheus required grants to all databases T128185

jcrespo closed this task as Resolved.Aug 3 2016, 10:01 AM

I would say we close this as resolved - we will use the default options for now, which already contains plenty of information (GLOBAL VARIABLES, GLOBAL STATUS), later tune for things such as performance_schema and pt-heartbeat. The reason is that that probably an always-ongoing task.