
mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade
Closed, Resolved · Public

Description

It was noticed that all the buster 10.4 hosts were reporting the following prometheus-mysqld-exporter metric as 1 on every scrape (once per minute): mysql_exporter_last_scrape_error
As can be seen on this dashboard: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
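
For reference, affected hosts can also be spotted outside of Grafana by querying Prometheus directly. A minimal sketch, assuming a reachable Prometheus endpoint (the hostname below is a placeholder, not the real ops instance):

# Hypothetical Prometheus URL; replace with the actual ops instance for the DC
curl -sG 'http://prometheus.example.org/ops/api/v1/query' \
  --data-urlencode 'query=mysql_exporter_last_scrape_error == 1'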

A prometheus-mysqld-exporter restart seems to solve the issue, and the metric reports 0 errors from then on:

root@db1111:/var/log/prometheus# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0

This was confirmed by restarting the exporter on all the buster + 10.4 hosts but one: all of the restarted hosts recovered, while the one left untouched did not.
My current theory is that the problem is a race condition between when the exporter initially starts and when we run mysql_upgrade to update all the internal MySQL tables to the new 10.4 schemas.

The workflow is normally:

  • Host reimages to buster and installs 10.4
  • Puppet runs
  • MySQL exporter starts
  • MySQL gets started and the exporter starts connecting to it
  • mysql_upgrade runs (depending on the host it can take a while) and updates all the tables.

The exporter might be reading old, non-working structures, variables, etc., and might need a restart to pick up all the new changes.
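
If that theory holds, the race can be avoided by keeping the exporter stopped until the upgrade has finished. A minimal sketch of that ordering on a single-instance host (the mariadb.service unit name is an assumption here):

# Keep the exporter down until the internal tables are on the 10.4 schema
systemctl stop prometheus-mysqld-exporter.service
systemctl start mariadb.service
mysql_upgrade
systemctl start prometheus-mysqld-exporter.service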

Workaround for this issue

  • Once mysql_upgrade has been run, restart the exporter with systemctl restart prometheus-mysqld-exporter.service or, if it is a multi-instance host, systemctl restart prometheus-mysqld-exporter@sX.service (see the sketch below).
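
On a multi-instance host the same restart can be applied per section. A minimal sketch, where the section names (s1, s2, ...) are placeholders for whatever instances the host actually runs:

# Section names are examples; adjust to the host
for section in s1 s2 s3; do
    systemctl restart prometheus-mysqld-exporter@${section}.service
done

The metric check from above can then be repeated against each instance's metrics port to confirm mysql_exporter_last_scrape_error is back to 0.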

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

I still have to install some hosts with buster and 10.4, so I am going to confirm whether that is the issue by making sure to stop the exporter before I start MySQL and upgrade the tables, and only then allow the exporter to start.

Mentioned in SAL (#wikimedia-operations) [2020-03-10T11:26:07Z] <marostegui> Restart mysqld exporter on db2125 to see if the collection errors decrease from 30 T247290

The last host pending restart from the above dashboard:

root@db2125:~# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 1
mysql_exporter_scrapes_total 3151
root@db2125:~# systemctl restart prometheus-mysqld-exporter.service ; sleep 60 ; curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
mysql_exporter_scrapes_total 3
root@db2125:~#

I have done the workflow in a different way, essentially starting MySQL and running mysql_upgrade BEFORE starting the exporter, and:

root@db2121:/srv# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
mysql_exporter_scrapes_total 4
root@db2121:/srv#

Going to give it a few hours to see how it goes.

Marostegui claimed this task.

Added to the list of known issues: https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues
I am going to consider this fixed with the above workaround:

  • Once mysql_upgrade has been run, restart the exporter with systemctl restart prometheus-mysqld-exporter.service or, if it is a multi-instance host, systemctl restart prometheus-mysqld-exporter@sX.service.

The host is ok, and https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1583857057406&to=1583857357406&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&fullscreen&panelId=4 and https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1583857057406&to=1583857357406&fullscreen&panelId=4&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All both look ok.
db1114 can be ignored.