It was noticed that all the buster hosts running 10.4 were reporting the prometheus-mysqld-exporter metric mysql_exporter_last_scrape_error with a value of 1 every minute.
As seen on this dashboard: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
Restarting prometheus-mysqld-exporter seems to solve the issue, and the metric has reported 0 errors ever since:
root@db1111:/var/log/prometheus# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
This was confirmed by restarting the exporter on all the buster 10.4 hosts except one: every restarted host recovered, while the one left without a restart kept reporting errors.
My current theory is that the problem is a race condition between the exporter's initial start and the mysql_upgrade run that migrates all the internal MySQL tables to the new 10.4 schemas.
The workflow is normally:
- Host is reimaged to buster and 10.4 is installed
- Puppet runs
- MySQL exporter starts
- MySQL is started and the exporter starts connecting to it
- mysql_upgrade runs (depending on the host it can take a while) and updates all the tables
The exporter might still be reading the old, pre-upgrade table structures, variables, etc., and might need a restart to pick up the new ones.
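A quick way to check whether a host is affected is to read the gauge directly from the exporter's /metrics endpoint, as done above with curl. The sketch below is self-contained for illustration: the heredoc stands in for the live `curl -s localhost:9104/metrics` output, and the sample value of 1 is hypothetical.

```shell
# Simulated exporter output (in real use: metrics=$(curl -s localhost:9104/metrics))
metrics=$(cat <<'EOF'
# HELP mysql_exporter_last_scrape_error Whether the last scrape of metrics from MySQL resulted in an error (1 for error, 0 for success).
# TYPE mysql_exporter_last_scrape_error gauge
mysql_exporter_last_scrape_error 1
EOF
)

# Extract the gauge value, skipping the HELP/TYPE comment lines.
value=$(printf '%s\n' "$metrics" | awk '!/^#/ && /mysql_exporter_last_scrape_error/ {print $2}')

if [ "$value" = "1" ]; then
  echo "scrape errors detected: restart prometheus-mysqld-exporter"
else
  echo "exporter healthy"
fi
```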
Workaround for this issue
- Once mysql_upgrade has been run, restart the exporter with systemctl restart prometheus-mysqld-exporter.service, or, if it is a multi-instance host, systemctl restart prometheus-mysqld-exporter@sX.service.
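The workaround above could be wrapped in a small helper that picks the right unit name for single- vs multi-instance hosts. This is only a sketch, not an existing script: the function name, the "s1" instance suffix, and the DRY_RUN switch are all illustrative.

```shell
# Hypothetical helper: restart the exporter unit matching the host layout.
# Pass an instance suffix (e.g. "s1") on multi-instance hosts, or no
# argument on single-instance hosts. DRY_RUN=1 prints the command
# instead of executing it.
restart_exporter() {
  local instance="$1" unit
  if [ -n "$instance" ]; then
    unit="prometheus-mysqld-exporter@${instance}.service"
  else
    unit="prometheus-mysqld-exporter.service"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "systemctl restart $unit"
  else
    systemctl restart "$unit"
  fi
}

# Example (dry run); run only after mysql_upgrade has completed.
DRY_RUN=1 restart_exporter s1
# prints: systemctl restart prometheus-mysqld-exporter@s1.service
```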