Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends
Closed, Resolved, Public

Description

  • Mark MariaDB10 masters explicitly.
  • ensure => pt-heartbeat is running; ensure pt-heartbeat is stopped otherwise
  • icinga check pt-heartbeat is running
  • icinga replication lag => using heartbeat.heartbeat
  • Replicate these tables to Labs
  • Display that lag on tendril and dbtree
  • Send lag to graphite

Event Timeline

jcrespo created this task. Oct 6 2015, 12:04 PM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: DBA, acl*sre-team.
jcrespo added a subscriber: jcrespo.
Restricted Application added subscribers: Matanya, Aklapper. Oct 6 2015, 12:04 PM
jcrespo added a subscriber: aaron. Oct 6 2015, 12:05 PM

CCing Aaron so that he knows about its progress.

jcrespo updated the task description. Oct 6 2015, 12:06 PM
jcrespo set Security to None.
jcrespo updated the task description. Oct 6 2015, 12:09 PM

Change 244651 had a related patch set uploaded (by Jcrespo):
Add pt-heartbeat start & execution script to mariadb

https://gerrit.wikimedia.org/r/244651

jcrespo claimed this task. Oct 19 2015, 7:10 PM
jcrespo triaged this task as Normal priority.
jcrespo moved this task from Triage to In progress on the DBA board.

Change 244651 merged by Jcrespo:
Add pt-heartbeat start & execution script to mariadb

https://gerrit.wikimedia.org/r/244651

jcrespo updated the task description. Oct 20 2015, 3:53 PM

Change 253665 had a related patch set uploaded (by Jcrespo):
[WIP] Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

jcrespo moved this task from In progress to Backlog on the DBA board. Jan 29 2016, 10:20 AM
Volans added a subscriber: Volans. Feb 14 2016, 2:56 PM

Change 253665 merged by Jcrespo:
Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

jcrespo updated the task description. Feb 18 2016, 12:02 PM

This is semi-working now. We have to decide what to show on icinga for multi-tier slaves and fix some issues on multi-source replication slaves like dbstore.

Change 271495 had a related patch set uploaded (by Jcrespo):
Allow heartbeat table to replicate to dbstore* hosts

https://gerrit.wikimedia.org/r/271495

Change 271495 merged by Jcrespo:
Allow heartbeat table to replicate to dbstore* hosts

https://gerrit.wikimedia.org/r/271495

jcrespo moved this task from Backlog to In progress on the DBA board. Feb 19 2016, 3:07 PM

Change 271792 had a related patch set uploaded (by Jcrespo):
Add nagios@localhost the permissions to read the heartbeat table

https://gerrit.wikimedia.org/r/271792

Change 271792 merged by Jcrespo:
Add nagios@localhost the permissions to read the heartbeat table

https://gerrit.wikimedia.org/r/271792

Change 271815 had a related patch set uploaded (by Jcrespo):
Small formatting fixes for replication lag check

https://gerrit.wikimedia.org/r/271815

Change 271815 merged by Jcrespo:
Small formatting fixes for replication lag check

https://gerrit.wikimedia.org/r/271815

jcrespo added a comment. Edited Mar 1 2016, 4:15 PM

@aaron, I have patched pt-heartbeat to automatically create and update a "shard" column:

mysql> SELECT * FROM heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| 2016-03-01T16:05:04.109250 | 180359186 | db2030-bin.000004 | 374798448 | db1009-bin.000185     |           619775551 | s1    |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
1 row in set (0.03 sec)

This should simplify the latency checking and hopefully not require memcache keys; only "check the largest ts value within my shard", something like:

SELECT ts FROM heartbeat WHERE shard='s1' ORDER BY ts DESC LIMIT 1

(replacing ts with the appropriate subtraction). I need to properly check that this works (I will puppetize it on all servers first), and maybe add a critical error if heartbeat.heartbeat does not contain the appropriate shard value.
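
As a rough illustration of the check described above, here is a minimal Python sketch. It is not the actual Icinga check: an in-memory SQLite table stands in for heartbeat.heartbeat, the function names (shard_lag_seconds, check_lag) and the warning/critical thresholds are hypothetical, and it assumes ts and the current time are expressed in the same timezone. The only parts taken from this task are the query shape, the sample row, and the idea of raising a critical error when the shard has no row. The exit codes follow the usual Nagios/Icinga plugin convention.

```python
import sqlite3
from datetime import datetime

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def shard_lag_seconds(conn, shard, now):
    """Seconds between `now` and the newest heartbeat ts for `shard`.

    Returns None when the table has no row for that shard.
    """
    row = conn.execute(
        "SELECT ts FROM heartbeat WHERE shard = ? ORDER BY ts DESC LIMIT 1",
        (shard,),
    ).fetchone()
    if row is None:
        return None
    return (now - datetime.fromisoformat(row[0])).total_seconds()


def check_lag(conn, shard, now, warn=5.0, crit=10.0):
    """Map the lag to an exit code; the thresholds are hypothetical."""
    lag = shard_lag_seconds(conn, shard, now)
    if lag is None:
        return CRITICAL, f"no heartbeat row for shard {shard}"
    if lag >= crit:
        return CRITICAL, f"lag {lag:.2f}s"
    if lag >= warn:
        return WARNING, f"lag {lag:.2f}s"
    return OK, f"lag {lag:.2f}s"


# Demo data: an in-memory SQLite table standing in for heartbeat.heartbeat,
# reduced to the three columns the check reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (ts TEXT, server_id INTEGER, shard TEXT)")
conn.execute(
    "INSERT INTO heartbeat VALUES (?, ?, ?)",
    ("2016-03-01T16:05:04.109250", 180359186, "s1"),
)

now = datetime(2016, 3, 1, 16, 5, 6, 109250)  # two seconds after the sample row
print(check_lag(conn, "s1", now))  # lag-based status for a shard that is present
print(check_lag(conn, "s2", now))  # a missing shard is reported as CRITICAL
```

On a real MariaDB host the subtraction would more likely be done in SQL against the server clock; doing it in Python here only keeps the sketch self-contained.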

Change 274134 had a related patch set uploaded (by Jcrespo):
[WIP] Adding custom heartbeat script with "shard" additional column

https://gerrit.wikimedia.org/r/274134

Change 274134 merged by Jcrespo:
Add custom heartbeat script with "shard" additional column

https://gerrit.wikimedia.org/r/274134

Change 274377 had a related patch set uploaded (by Jcrespo):
[WIP]Test the new heartbeat functionality on m5-master

https://gerrit.wikimedia.org/r/274377

Change 274377 merged by Jcrespo:
Test the new heartbeat functionality on m5-master

https://gerrit.wikimedia.org/r/274377

Change 274415 had a related patch set uploaded (by Jcrespo):
Fixes for pt-heartbeat daemon init script (fails automatic runs)

https://gerrit.wikimedia.org/r/274415

Change 274415 merged by Jcrespo:
Fixes for pt-heartbeat daemon init script (fails automatic runs)

https://gerrit.wikimedia.org/r/274415

Change 274423 had a related patch set uploaded (by Jcrespo):
Previous heartbeat fix was not enough (introduced extra errors)

https://gerrit.wikimedia.org/r/274423

Change 274423 merged by Jcrespo:
Previous heartbeat fix was not enough (introduced extra errors)

https://gerrit.wikimedia.org/r/274423

Change 274640 had a related patch set uploaded (by Jcrespo):
Enable pt-heartbeat on all misc master (except m1)

https://gerrit.wikimedia.org/r/274640

Change 274640 merged by Jcrespo:
Enable pt-heartbeat on all misc masters (except m1)

https://gerrit.wikimedia.org/r/274640

Change 274670 had a related patch set uploaded (by Jcrespo):
Enable the new pt-heartbeat on core production hosts

https://gerrit.wikimedia.org/r/274670

Change 274670 merged by Jcrespo:
Enable the new pt-heartbeat on core production hosts

https://gerrit.wikimedia.org/r/274670

pt-heartbeat is puppetized and in production on all main core, misc and labs servers.

There are some minor pending tasks:

  • toolsdb slave (not active)
  • dbstore[12]001 (delayed slave, it will take 24 hours to take effect)
  • db1047 and db2002 (need a restart/change of replication filter)
  • db1048 and db2012 need to whitelist heartbeat, too

jcrespo updated the task description. Mar 4 2016, 2:22 PM

dbstore2002 and db1047 fixed

The others too, now. It was a combination of replication filters not having been updated (pending restart) and missing permissions for non-core servers (nagios grants).

jcrespo moved this task from In progress to Backlog on the DBA board. Mar 9 2016, 11:10 AM

Check dbstore2001; it seems to have issues with pt-heartbeat.

jcrespo renamed this task from Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga to Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends. Apr 22 2016, 5:41 PM
jcrespo moved this task from Backlog to In progress on the DBA board. Aug 2 2016, 5:30 PM

This is progressing by adding a datacenter field.

The remaining scope of this ticket may change: as T126757 progresses, the tendril/graphite work will probably evolve into prometheus.

jcrespo closed this task as Resolved. Aug 3 2016, 9:07 AM

Resolving this; T141968 will handle tendril/dbtree, and T126757 will handle the other monitoring separately.