Investigate why Prometheus doesn't report on some graphs on MariaDB 10.3
Closed, ResolvedPublic

Description

db1114 is a host running buster+mariadb 10.3.
Its MySQL dashboard is at: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104&from=now-24h&to=now

As can be seen, some graphs don't contain any data:

  • Monitoring queries latency
  • InnoDB Purge Lag
  • Change Buffer
  • Adaptive Hash and other memory usage
  • InnoDB Semaphores

A possible explanation is that some metrics/options in information_schema and performance_schema might have changed in versions newer than 10.1 (the one we currently use in production), as we saw in T231182: Tendril activity column: trx_adaptive_hash_latched column removed on 10.2 (and onwards) from information_schema.innodb_trx causes the `*_activity` event to fail

One of the blockers (apart from a lot more testing) is the fact that we don't have full visibility on 10.3 at the moment.

Update: the MySQL-specific variables have been removed upstream in 10.3: https://jira.mariadb.org/browse/MDEV-18582

Event Timeline

Interesting! Since on buster there's an implicit upgrade of mysqld-exporter to 0.11, maybe some of the innodb-related performance schema options need to be enabled, e.g. --collect.info_schema.innodb_metrics? That's just a guess, though, as I don't know where all the needed metrics live.

re: "monitoring queries latency" the expression needs to be changed like this (i.e. to handle multiple handlers)

http_request_duration_microseconds{instance="$server:$port", handler=~"(prometheus|metrics)",quantile="0.5"}

The innodb variables, as kinda expected, are failing due to the fact that they've been removed upstream: https://jira.mariadb.org/browse/MDEV-18582
I haven't checked 100% of the ones we used, but I have randomly checked some of the ones we have failing on some of the graphs, and they are all gone.

Those graphs aren't super critical, so we can probably live without them on 10.3 anyways.

re: "monitoring queries latency" the expression needs to be changed like this (i.e. to handle multiple handlers)

http_request_duration_microseconds{instance="$server:$port", handler=~"(prometheus|metrics)",quantile="0.5"}

Thanks! Would that change break the existing 10.1 hosts?
If so, how can we handle the fact that things might differ depending on the version? The migration from 10.1 to 10.3 will be done slowly and the two versions will co-exist for months.
Can you help us get this working? :)

For sure! The metric above will work for both versions of mysqld-exporter (not mysqld, since the metric belongs to the exporter itself) because one has handler=prometheus and the buster version has handler=metrics.

In general, when metrics are renamed you can either add a brand new expression to the panel with the new query, or combine both with the PromQL `or` operator (old_query or new_query) in the same expression, so that whichever one returns data is graphed. Once the migration is complete, old_query can be removed.
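
As an illustration of the second option, the sketch below spells out the latency query with `or` instead of the regex matcher. It assumes, as described above, that the older exporter labels the series handler="prometheus" and the buster version labels it handler="metrics"; $server and $port are the dashboard's Grafana variables. It only shows the shape of such an expression and is not necessarily the exact query used on the dashboard.

```promql
# Old exporter: the series carries handler="prometheus"
http_request_duration_microseconds{instance="$server:$port", handler="prometheus", quantile="0.5"}
  or
# Buster exporter: the same metric, but with handler="metrics"
http_request_duration_microseconds{instance="$server:$port", handler="metrics", quantile="0.5"}
```

On any given host only one side returns series, so the panel shows a single line either way. Once all hosts run the new exporter, the first operand can be dropped.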

Hope that helps! Let us know if you have more questions.

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:21:37Z] <marostegui> Change MySQL.monitoring queries latency graph parameters to support buster+mariadb 10.3 - T231190

Marostegui claimed this task.

Thanks @fgiunchedi for the explanation and guidance to get it changed.
I have replaced it on the dashboard and confirmed that 10.1 hosts keep working as before and that the new 10.3 host also starts graphing:

10.1 https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104
10.3 https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104

I am going to consider this task resolved. The most important graph (the query latency) now works, and the InnoDB ones are not that critical. There is also nothing we can do about them anyway if they were removed upstream, and they will apparently be re-introduced in 10.5: T231190#5437489