toolsdb: review alerting
Open, Medium, Public

Description

We currently have two alerts in the metricsinfra prometheus server for toolsdb:

  • read-only status (there should be exactly 1 writable server)
  • simple replication status, based on the last error number reported by SHOW SLAVE STATUS\G
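
As a rough illustration of the first check, the "exactly 1 writable server" condition could be sketched like this. This is not the actual alert rule in metricsinfra; the function name, the input shape (a mapping of host name to its read_only flag) and the host names in the usage note are all hypothetical.

```python
from typing import Dict, Optional


def check_writable_servers(read_only_by_host: Dict[str, bool]) -> Optional[str]:
    """Return an alert message, or None when exactly one server is writable.

    read_only_by_host maps a (hypothetical) host name to the value of its
    read_only setting: True means read-only, False means writable.
    """
    # Collect the hosts that currently accept writes.
    writable = sorted(host for host, ro in read_only_by_host.items() if not ro)

    if len(writable) == 1:
        return None  # healthy: one primary, everything else read-only
    if not writable:
        return "no writable toolsdb server"
    return "multiple writable toolsdb servers: " + ", ".join(writable)
```

For example, `check_writable_servers({"host-a": False, "host-b": True})` returns `None`, while a mapping where both hosts are read-only, or both are writable, returns an alert message.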

This could be improved with, for example:

  • actual replication lag (replag) times
    • T334925 using pt-heartbeat
    • T301994 using the one built into MariaDB
  • disk space
  • prometheus exporter being down
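
To make the heartbeat-based replag idea concrete, here is a minimal sketch of how lag could be derived from a heartbeat timestamp. It assumes the primary periodically writes an ISO-8601 UTC timestamp that is then replicated (roughly what pt-heartbeat does), and that the replica's copy is read and compared with the current time. The function names and the 300-second threshold are illustrative assumptions, not values taken from this task.

```python
from datetime import datetime, timezone


def replication_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Estimate lag as 'now' minus the last heartbeat seen on the replica.

    heartbeat_ts is assumed to be an ISO-8601 timestamp written in UTC,
    e.g. "2023-04-18T10:00:00.000000"; now must be timezone-aware.
    """
    written = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    # Clamp at zero so small clock skew does not produce negative lag.
    return max(0.0, (now - written).total_seconds())


def lag_alert(lag_seconds: float, threshold_seconds: float = 300.0) -> bool:
    """Return True when lag exceeds the (hypothetical) alert threshold."""
    return lag_seconds > threshold_seconds
```

Unlike the error-number check, a measure like this also catches a replica that is running without errors but falling behind.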