Change Details

It was noticed that all the buster 10.4 hosts were reporting the prometheus-mysqld-exporter metric `mysql_exporter_last_scrape_error` with a value of `1` every minute, as seen on this dashboard: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All

A prometheus-mysqld-exporter restart seems to solve the issue, and the metric has reported 0 errors ever since:
```
root@db1111:/var/log/prometheus# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
```
This was confirmed by restarting the exporter on all the hosts with buster and 10.4 but one. All of them recovered except the one that was not restarted.

My current theory is that the problem is a race condition between when the exporter initially starts and when we run mysql_upgrade to update all the internal MySQL tables to the new 10.4 schemas. The workflow is normally:
* Host is reimaged to buster and 10.4 is installed
* Puppet runs
* MySQL exporter starts
* MySQL gets started and the exporter starts connecting to it
* mysql_upgrade runs (depending on the host it can take a while) and updates all the tables.

The exporter might be reading old, non-working structures, variables, etc. and might need a restart to pick up all the new changes.

**Workaround for this issue**
- Restart the exporter with `systemctl restart prometheus-mysqld-exporter.service` (or, on a multi-instance host, `systemctl restart prometheus-mysqld-exporter@sX.service`) once `mysql_upgrade` has been run.
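
The check and the workaround above could be combined into a small helper that restarts the exporter only when it is actually reporting a scrape error. This is only a sketch, not something that is deployed: the function names `last_scrape_error` and `check_and_restart` and the `METRICS_URL` and `UNIT` variables are made up for illustration, and the defaults assume a single-instance host with the exporter on port 9104, as in the `curl` example.

```shell
#!/bin/sh
# Sketch only: names and defaults below are assumptions, not existing tooling.
# Override UNIT on a multi-instance host (prometheus-mysqld-exporter@sX.service).
METRICS_URL="${METRICS_URL:-http://localhost:9104/metrics}"
UNIT="${UNIT:-prometheus-mysqld-exporter.service}"

# Read exporter metrics text on stdin and print the value of
# mysql_exporter_last_scrape_error. Comment lines start with "#",
# so their first field never matches the metric name.
last_scrape_error() {
    awk '$1 == "mysql_exporter_last_scrape_error" { print $2; exit }'
}

# Restart the exporter only if it currently reports a scrape error.
# Meant to be run after mysql_upgrade has finished.
check_and_restart() {
    value=$(curl -s "$METRICS_URL" | last_scrape_error)
    if [ "$value" = "1" ]; then
        systemctl restart "$UNIT"
    fi
}
```

Calling `check_and_restart` once `mysql_upgrade` has completed would apply the workaround only on hosts that need it. Note that the metric reflects the last scrape, so it may be worth waiting for at least one scrape after `mysql_upgrade` before checking.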