- Mark MariaDB10 masters explicitly.
- ensure => pt-heartbeat is running on those masters; ensure pt-heartbeat is stopped otherwise
- icinga check that pt-heartbeat is running
- icinga replication lag => using heartbeat.heartbeat
- Replicate these tables to Labs
- Display that lag on tendril and dbtree
- Send lag to graphite
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jcrespo | T114752 Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends
Open | | None | T141968 Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master
Event Timeline
Change 244651 had a related patch set uploaded (by Jcrespo):
Add pt-heartbeat start & execution script to mariadb
Change 244651 merged by Jcrespo:
Add pt-heartbeat start & execution script to mariadb
Change 253665 had a related patch set uploaded (by Jcrespo):
[WIP] Use heartbeat when possible to check slave lag
This is semi-working now. We have to decide what to show on icinga for multi-tier slaves and fix some issues on multi-source replication slaves like dbstore.
Change 271495 had a related patch set uploaded (by Jcrespo):
Allow heartbeat table to replicate to dbstore* hosts
Change 271495 merged by Jcrespo:
Allow heartbeat table to replicate to dbstore* hosts
Change 271792 had a related patch set uploaded (by Jcrespo):
Add nagios@localhost the permissions to read the heartbeat table
Change 271792 merged by Jcrespo:
Add nagios@localhost the permissions to read the heartbeat table
Change 271815 had a related patch set uploaded (by Jcrespo):
Small formatting fixes for replication lag check
@aaron, I have patched pt-heartbeat to create and update automatically a "shard" column:
mysql> SELECT * FROM heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| 2016-03-01T16:05:04.109250 | 180359186 | db2030-bin.000004 | 374798448 | db1009-bin.000185     |           619775551 | s1    |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
1 row in set (0.03 sec)
This should simplify the latency checking and hopefully not require memcache keys; only "check the largest ts value within my shard", something like:
SELECT ts FROM heartbeat WHERE shard='s1' ORDER BY ts DESC LIMIT 1
(substituting ts for the appropriate subtraction, i.e. the current time minus ts). I need to properly check that this works (I will puppetize it on all servers first), and maybe add a critical error if heartbeat.heartbeat does not contain the appropriate shard value.
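To make the idea above concrete, here is a minimal sketch of the lag calculation in Python. It is not the actual check script: the function names, the (ts, shard) row shape, and the assumption that ts is written in UTC are all illustrative. It takes the largest ts within one shard, subtracts it from the current time, and raises an error when the shard has no row at all.

```python
from datetime import datetime, timezone


def parse_heartbeat_ts(ts):
    """Parse a heartbeat ts string such as 2016-03-01T16:05:04.109250.

    Assumption: the timestamp is written in UTC.
    """
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def shard_lag_seconds(rows, shard, now):
    """Return lag in seconds for one shard.

    rows  -- iterable of (ts, shard) tuples read from heartbeat.heartbeat
             (hypothetical shape, chosen for this sketch)
    shard -- shard name to check, e.g. "s1"
    now   -- timezone-aware datetime for the current time
    """
    stamps = [parse_heartbeat_ts(ts) for ts, row_shard in rows
              if row_shard == shard]
    if not stamps:
        # Corresponds to the "critical error" case: no row for this shard.
        raise LookupError("no heartbeat row for shard %r" % shard)
    # Largest ts within the shard, subtracted from the current time.
    return (now - max(stamps)).total_seconds()


if __name__ == "__main__":
    sample = [("2016-03-01T16:05:04.109250", "s1")]
    print(shard_lag_seconds(sample, "s1", datetime.now(timezone.utc)))
```

A monitoring check could compare the returned value against warning and critical thresholds and treat the LookupError as critical.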
Change 274134 had a related patch set uploaded (by Jcrespo):
[WIP] Adding custom heartbeat script with "shard" additional column
Change 274134 merged by Jcrespo:
Add custom heartbeat script with "shard" additional column
Change 274377 had a related patch set uploaded (by Jcrespo):
[WIP]Test the new heartbeat functionality on m5-master
Change 274415 had a related patch set uploaded (by Jcrespo):
Fixes for pt-heartbeat daemon init script (fails automatic runs)
Change 274415 merged by Jcrespo:
Fixes for pt-heartbeat daemon init script (fails automatic runs)
Change 274423 had a related patch set uploaded (by Jcrespo):
Previous heartbeat fix was not enough (introduced extra errors)
Change 274423 merged by Jcrespo:
Previous heartbeat fix was not enough (introduced extra errors)
Change 274640 had a related patch set uploaded (by Jcrespo):
Enable pt-heartbeat on all misc master (except m1)
Change 274670 had a related patch set uploaded (by Jcrespo):
Enable the new pt-heartbeat on core production hosts
Change 274670 merged by Jcrespo:
Enable the new pt-heartbeat on core production hosts
pt-heartbeat is puppetized and in production on all main core, misc and labs servers.
There are some minor pending tasks:
- toolsdb slave (not active)
- dbstore[12]001 (delayed slave, it will take 24 hours to take effect)
- db1047 and db2002 (need a restart/change of replication filter)
- db1048 and db2012 need to whitelist heartbeat, too
The others are done too, now. It was a combination of replication filters not having been updated (pending restart) and missing permissions for non-core servers (nagios grants).
This is progressing by adding a datacenter field.
The pending scope of this ticket may be changed; as T126757 is progressing, probably tendril/graphite work will evolve into prometheus.