
mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade
Closed, Resolved · Public

Description

It was noticed that all the buster 10.4 hosts were reporting the following prometheus-mysqld-exporter metric as 1 on every scrape (once per minute): mysql_exporter_last_scrape_error
As can be seen on this dashboard: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
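
For reference, affected hosts can also be spotted outside of Grafana by querying Prometheus directly. A minimal sketch, assuming a reachable Prometheus endpoint (the hostname below is a placeholder, not the real ops instance):

# Hypothetical Prometheus URL; replace with the actual ops instance for the DC
curl -sG 'http://prometheus.example.org/ops/api/v1/query' \
  --data-urlencode 'query=mysql_exporter_last_scrape_error == 1'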

A prometheus-mysqld-exporter restart seems to solve the issue, and the metric reports 0 errors from then on:

root@db1111:/var/log/prometheus# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0

This was confirmed by restarting the exporter on all the buster + 10.4 hosts but one: all of the restarted hosts recovered, while the one left untouched did not.
My current theory is that the problem is a race condition between when the exporter initially starts and when we run mysql_upgrade to update all the internal MySQL tables to the new 10.4 schemas.

The workflow is normally:

  • Host reimages to buster and installs 10.4
  • Puppet runs
  • MySQL exporter starts
  • MySQL gets started and the exporter starts connecting to it
  • mysql_upgrade runs (depending on the host it can take a while) and updates all the tables.

The exporter might be reading old, non-working structures, variables, etc., and might need a restart to pick up all the new changes.
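
If that theory holds, the race can be avoided by keeping the exporter stopped until the upgrade has finished. A minimal sketch of that ordering on a single-instance host (the mariadb.service unit name is an assumption here):

# Keep the exporter down until the internal tables are on the 10.4 schema
systemctl stop prometheus-mysqld-exporter.service
systemctl start mariadb.service
mysql_upgrade
systemctl start prometheus-mysqld-exporter.service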

Workaround for this issue

  • Once mysql_upgrade has been run, restart the exporter with systemctl restart prometheus-mysqld-exporter.service or, if it is a multi-instance host, systemctl restart prometheus-mysqld-exporter@sX.service (see the sketch below).
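
On a multi-instance host the same restart can be applied per section. A minimal sketch, where the section names (s1, s2, ...) are placeholders for whatever instances the host actually runs:

# Section names are examples; adjust to the host
for section in s1 s2 s3; do
    systemctl restart prometheus-mysqld-exporter@${section}.service
done

The metric check from above can then be repeated against each instance's metrics port to confirm mysql_exporter_last_scrape_error is back to 0.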

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

I still have to install some hosts with buster and 10.4, so I am going to confirm whether that is the issue by making sure to stop the exporter before I start MySQL and upgrade the tables, and only then allow the exporter to start.

Mentioned in SAL (#wikimedia-operations) [2020-03-10T11:26:07Z] <marostegui> Restart mysqld exporter on db2125 to see if the collection errors decrease from 30 T247290

The last host pending restart from the above dashboard:

root@db2125:~# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 1
mysql_exporter_scrapes_total 3151
root@db2125:~# systemctl restart prometheus-mysqld-exporter.service ; sleep 60 ; curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
mysql_exporter_scrapes_total 3
root@db2125:~#

I have done the workflow in a different way, essentially starting MySQL and running mysql_upgrade BEFORE starting the exporter, and:

root@db2121:/srv# curl -s localhost:9104/metrics | grep scrape | grep -v "#"
mysql_exporter_last_scrape_error 0
mysql_exporter_scrapes_total 4
root@db2121:/srv#

Going to give it a few hours to see how it goes.

Marostegui claimed this task.

Added to the list of known issues: https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues
I am going to consider this fixed with the above workaround:

  • Once mysql_upgrade has been run, restart the exporter with systemctl restart prometheus-mysqld-exporter.service or, if it is a multi-instance host, systemctl restart prometheus-mysqld-exporter@sX.service.

The host is ok, and https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1583857057406&to=1583857357406&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&fullscreen&panelId=4 and https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1583857057406&to=1583857357406&fullscreen&panelId=4&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All both look ok.
db1114 can be ignored.