Upgrade mysqld_exporter in production
Closed, Resolved, Public

Description

Buster will ship with https://github.com/prometheus/mysqld_exporter >= 0.11

## 0.11.0 / 2018-06-29

### BREAKING CHANGES:
* Flags now use the Kingpin library, and require double-dashes. #222

This also changes the behavior of boolean flags.
* Enable: `--collect.global_status`
* Disable: `--no-collect.global_status`
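
When migrating an existing exporter configuration, the flag change above can be applied mechanically. The following is a minimal sketch, assuming the old arguments were written in the single-dash `-name=true` / `-name=false` form; `convert_flag` is a hypothetical helper, not part of mysqld_exporter or any packaging.

```python
def convert_flag(old_flag: str) -> str:
    """Convert a single-dash flag to the double-dash form used from 0.11.

    Assumption: old boolean flags look like "-collect.x=true" or
    "-collect.x=false"; anything else is treated as a valued flag.
    """
    name, sep, value = old_flag.lstrip("-").partition("=")
    if not sep:
        # Bare flag without a value: only the dashes change.
        return f"--{name}"
    if value.lower() == "true":
        # Enabled boolean: the value is dropped.
        return f"--{name}"
    if value.lower() == "false":
        # Disabled boolean: expressed with the "no-" prefix.
        return f"--no-{name}"
    # Non-boolean flag: keep the value, fix the dashes.
    return f"--{name}={value}"


if __name__ == "__main__":
    for flag in ("-collect.global_status=true",
                 "-collect.global_status=false",
                 "-web.listen-address=:9104"):
        print(flag, "->", convert_flag(flag))
```

The output of such a helper is worth checking against the exporter's `--help` before deploying, since the set of available collectors differs between versions.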

### Changes:
* [CHANGE] Limit number and lifetime of connections #208
* [ENHANCEMENT] Move session params to DSN #259
* [ENHANCEMENT] Use native DB.Ping() instead of self-written implementation #210
* [FEATURE] Add collector duration metrics #197
* [FEATURE] Add 'collect[]' URL parameter to filter enabled collectors #235
* [FEATURE] Set a `lock_wait_timeout` on the MySQL connection #252
* [FEATURE] Set `last_scrape_error` when an error occurs #237
* [FEATURE] Collect metrics from `performance_schema.replication_group_member_stats` #271
* [FEATURE] Add innodb compression statistic #275
* [FEATURE] Add metrics for the output of `SHOW SLAVE HOSTS` #279
* [FEATURE] Support custom CA truststore and client SSL keypair. #255
* [BUGFIX] Fix perfEventsStatementsQuery #213
* [BUGFIX] Fix `file_instances` metric collector #205
* [BUGFIX] Fix prefix removal in `perf_schema_file_instances` #257
* [BUGFIX] Fix 32bit compile issue #273
* [BUGFIX] Ignore boolean keys in my.cnf. #283

## v0.10.0 / 2017-03-22
 
* [FEATURE] Add read/write query response time #166
* [FEATURE] Add Galera gcache size metric #169
* [FEATURE] Add MariaDB multi source replication support #178
* [FEATURE] Implement heartbeat metrics #183
* [FEATURE] Add basic file_summary_by_instance metrics #189
* [BUGFIX] Workaround MySQL bug 79533 #173

Particularly interesting to us are multi-source replication (https://github.com/prometheus/mysqld_exporter/pull/178) and heartbeat metrics (https://github.com/prometheus/mysqld_exporter/pull/183).

The latter is likely to require some changes due to the way we use pt-heartbeat.

Event Timeline

jcrespo subscribed.

Is this blocked on me to configure/upgrade the deployed exporter, or is the package/release not yet available?

@jcrespo the release isn't out yet, though we can test what's in git now on a sample of servers. Do you have some we could use?

We can deploy to codfw now, where, in the worst-case scenario, it would not cause a visible outage. We are really keen on those new features.

@jcrespo I have a package of mysqld-exporter 0.10.0 built on copper, if you'd like to give it a try.

Diff in variables on db2048 (i.e. connection_name is added to the mysql_slave metrics, effectively renaming them, and five new mysql_slave_status metrics appear; no other changes):

-mysql_slave_status_connect_retry{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_exec_master_log_pos{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_last_errno{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_last_io_errno{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_last_sql_errno{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_master_port{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_master_server_id{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_master_ssl_allowed{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_master_ssl_verify_server_cert{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_read_master_log_pos{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_relay_log_pos{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_relay_log_space{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_seconds_behind_master{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_skip_counter{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_slave_io_running{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_slave_sql_running{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
-mysql_slave_status_until_log_pos{channel_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_connect_retry{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_exec_master_log_pos{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_executed_log_entries{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_last_errno{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_last_io_errno{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_last_sql_errno{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_master_port{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_master_server_id{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_master_ssl_allowed{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_master_ssl_verify_server_cert{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_max_relay_log_size{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_read_master_log_pos{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_relay_log_pos{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_relay_log_space{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_retried_transactions{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_seconds_behind_master{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_skip_counter{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_slave_heartbeat_period{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_slave_io_running{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_slave_received_heartbeats{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_slave_sql_running{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
+mysql_slave_status_until_log_pos{channel_name="",connection_name="",master_host="db2016.codfw.wmnet",master_uuid=""}
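
A diff like the one above can be checked mechanically by comparing the two sets of series with the `connection_name` label dropped, so that only genuinely new or missing metrics remain. This is a rough sketch under the assumption that each line has the `name{label="value",...}` shape shown here; `parse_series` and `compare` are illustrative names, not part of any Prometheus tooling.

```python
import re

# Matches one label pair such as master_host="db2016.codfw.wmnet".
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_series(line: str):
    """Split 'name{k="v",...}' into (name, {k: v})."""
    name, _, rest = line.partition("{")
    return name, dict(LABEL_RE.findall(rest))


def compare(old_lines, new_lines, ignored_label="connection_name"):
    """Return (added, removed) series once ignored_label is dropped."""
    def normalise(lines):
        result = set()
        for line in lines:
            # Tolerate the leading "+" / "-" of a unified diff.
            name, labels = parse_series(line.strip().lstrip("+-"))
            labels.pop(ignored_label, None)
            result.add((name, tuple(sorted(labels.items()))))
        return result

    old, new = normalise(old_lines), normalise(new_lines)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    host = 'master_host="db2016.codfw.wmnet",master_uuid=""'
    old = ['-mysql_slave_status_connect_retry{channel_name="",%s}' % host]
    new = [
        '+mysql_slave_status_connect_retry{channel_name="",connection_name="",%s}' % host,
        '+mysql_slave_status_executed_log_entries{channel_name="",connection_name="",%s}' % host,
    ]
    added, removed = compare(old, new)
    print("added:", [name for name, _ in added])
    print("removed:", [name for name, _ in removed])
```

Series reported as added are metrics that exist only in the new version; an empty removed list means nothing disappeared apart from the label change.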

Mentioned in SAL (#wikimedia-operations) [2017-05-16T07:39:01Z] <godog> upload prometheus-mysqld-exporter 0.10.0 to jessie-wikimedia - T161296

So, as far as I can see, the metrics work without problems and without changing the configuration. However, hosts with multi-source replication now return replication metrics as an "array" (which is a good thing, compared to no metrics at all), which means we have to change the dashboards to support that.

prometheus_multisource_replication_support.png (462×949 px, 36 KB)

Probably the best option is to upgrade most servers at the same time, rather than having 2 formats.

Probably good enough for now?

prometheus_multisource_replication_support_new.png (414×950 px, 34 KB)

@jcrespo yeah if it works in both cases that's good enough IMO

I have upgraded all jessie mysql servers to the latest version. We now have to see whether we can enable the pt-heartbeat monitoring. Probably the best way to handle that is to test manually on a non-critical host and later create some test graphs?

@jcrespo yeah, manually testing (i.e. stop puppet) on a test host sounds good, then we can progressively roll out via puppet by changing the mysqld_exporter options

This could be done en masse right now. Hosts still on 0.9.0 (excluding those set as spares, waiting for decommissioning):

  • db[2033-2034,2036-2037,2042,2044,2069-2078,2080-2082,2084-2093].codfw.wmnet
  • db[1051,1053,1055-1056,1059,1063,1065,1073,1096-1099,1101,1103,1105,1107-1108,1113-1115].eqiad.wmnet
  • es[1012-1013,1017].eqiad.wmnet
  • es[2011-2019].codfw.wmnet

However, at least some, if not most, of them say: prometheus-mysqld-exporter is already the newest version (0.9.0+ds-3+b2)
I guess the package is not available on stretch?
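
To work with a host list like the one above host by host (for example, to check the installed package version on each), the bracketed range notation can be expanded into individual names. The sketch below is a hand-written stand-in that only handles a single bracket group of numbers and ranges; `expand_hosts` is a hypothetical helper and not the parser used by any fleet tooling.

```python
import re


def expand_hosts(expr: str):
    """Expand 'db[2033-2034,2042].codfw.wmnet' into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: treat the expression as a single host.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        width = len(start)  # preserve any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("es[1012-1013,1017].eqiad.wmnet"):
        print(host)
```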

Indeed, 0.10.0+git20180201.a71f4bb+ds-2 is in testing. The preferred way to have 0.10 in stretch would be through an official backport, I don't know how easy/hard it is to do that yet. The other way is to import that version internally from testing.

Note I was not asking for it; the main improvement of 0.10.0 is multi-source support, which we are moving away from. We can wait for buster.

fgiunchedi renamed this task from Upgrade mysqld_exporter to 0.10.0 to Upgrade mysqld_exporter in production. Jan 2 2019, 10:42 AM
fgiunchedi updated the task description.

After some minimal changes, it starts correctly.

e="2019-03-04T14:44:23Z" level=info msg="Starting mysqld_exporter (version=0.11.0+ds, branch=debian/sid, revision=0.11.0
e="2019-03-04T14:44:23Z" level=info msg="Build context (go=go1.7.4, user=pkg-go-maintainers@lists.alioth.debian.org, dat
e="2019-03-04T14:44:23Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
e="2019-03-04T14:44:23Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
e="2019-03-04T14:44:23Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
e="2019-03-04T14:44:23Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
e="2019-03-04T14:44:23Z" level=info msg=" --collect.info_schema.processlist" source="mysqld_exporter.go:222"
e="2019-03-04T14:44:23Z" level=info msg="Listening on :9104" source="mysqld_exporter.go:232"
jcrespo added a project: DBA.
jcrespo moved this task from Triage to In progress on the DBA board.

Change 494236 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Change the default arguments for buster

https://gerrit.wikimedia.org/r/494236

Change 494469 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Reimage db1114 to buster

https://gerrit.wikimedia.org/r/494469

Change 494469 merged by Jcrespo:
[operations/puppet@production] install_server: Reimage db1114 to buster

https://gerrit.wikimedia.org/r/494469

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903051304_jynus_206262.log.

Completed auto-reimage of hosts:

['db1114.eqiad.wmnet']

Of which those FAILED:

['db1114.eqiad.wmnet']

Change 494236 merged by Jcrespo:
[operations/puppet@production] mysqld-prometheus-exporter: Change the default arguments for buster

https://gerrit.wikimedia.org/r/494236

Change 494759 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mysqld-prometheus-exporter: Fix typo on configuration

https://gerrit.wikimedia.org/r/494759

Change 494759 merged by Jcrespo:
[operations/puppet@production] mysqld-prometheus-exporter: Fix typo on configuration

https://gerrit.wikimedia.org/r/494759

jcrespo changed the task status from Open to Stalled. Mar 6 2019, 4:37 PM
jcrespo removed jcrespo as the assignee of this task.
jcrespo moved this task from In progress to Meta/Epic on the DBA board.

Fixed configuration for buster, but with no additional metrics (same metrics as before).

We can think of enabling extra metrics later, but I am stalling this as the basic work is done (not a blocker anymore).

@jcrespo: Could a separate, low-priority follow-up task for "extra metrics" be created, and this task be set to "resolved"?
Asking as I do not want tickets to sit in "stalled" status ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on") for years and be forgotten, plus this does not look like a case of "stalled" anyway... Thanks! :)