Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends
Closed, Resolved (Public)

Assigned To

Authored By

	• jcrespo
	Oct 6 2015, 12:04 PM

Description

Mark MariaDB10 masters explicitly.
ensure => pt-heartbeat is running; ensure pt-heartbeat is stopped otherwise
icinga check pt-heartbeat is running
icinga replication lag => using heartbeat.heartbeat
Replicate these tables to Labs
Display that lag on tendril and dbtree
Send lag to graphite

Details

Subject	Repo	Branch	Lines +/-
Enable the new pt-heartbeat on core production hosts	operations/puppet	production	+40 -3
Enable pt-heartbeat on all misc masters (except m1)	operations/puppet	production	+17 -4
Previous heartbeat fix was not enough (introduced extra errors)	operations/puppet/mariadb	master	+5 -4
Fixes for pt-heartbeat daemon init script (fails automatic runs)	operations/puppet/mariadb	master	+5 -4
Test the new heartbeat functionality on m5-master	operations/puppet	production	+6 -0
Add custom heartbeat script with "shard" additional column	operations/puppet/mariadb	master	+6 K -15
Small formatting fixes for replication lag check	operations/puppet	production	+4 -4
Add nagios@localhost the permissions to read the heartbeat table	operations/puppet	production	+3 -1
Allow heartbeat table to replicate to dbstore* hosts	operations/puppet	production	+10 -0
Use heartbeat when possible to check slave lag	operations/puppet	production	+51 -21
Add pt-heartbeat start & execution script to mariadb	operations/puppet/mariadb	master	+66 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• jcrespo	T114752 Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends
		Open		None	T141968 Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master

Event Timeline

• jcrespo created this task. Oct 6 2015, 12:04 PM

• jcrespo raised the priority of this task from to Needs Triage.

• jcrespo updated the task description. (Show Details)

• jcrespo added projects: DBA, acl*sre-team.

• jcrespo subscribed.

Restricted Application added subscribers: Matanya, Aklapper. Oct 6 2015, 12:04 PM

CCing Aaron so that he knows about its progress.

• jcrespo updated the task description. (Show Details) Oct 6 2015, 12:06 PM

• jcrespo set Security to None.

• jcrespo updated the task description. (Show Details) Oct 6 2015, 12:09 PM

• jcrespo mentioned this in T95501: Fix causes of replica lag and get it to under 5 seconds at peak. Oct 9 2015, 7:31 AM

Change 244651 had a related patch set uploaded (by Jcrespo):
Add pt-heartbeat start & execution script to mariadb

https://gerrit.wikimedia.org/r/244651

gerritbot added a project: Patch-For-Review. Oct 9 2015, 9:50 AM

• jcrespo claimed this task. Oct 19 2015, 7:10 PM

• jcrespo triaged this task as Medium priority.

• jcrespo moved this task from Triage to In progress on the DBA board.

Change 244651 merged by Jcrespo:
Add pt-heartbeat start & execution script to mariadb

https://gerrit.wikimedia.org/r/244651

• jcrespo updated the task description. (Show Details) Oct 20 2015, 3:53 PM

• jcrespo added a parent task: T111266: Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat). Oct 24 2015, 9:48 PM

Change 253665 had a related patch set uploaded (by Jcrespo):
[WIP] Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

• jcrespo moved this task from In progress to Backlog on the DBA board. Jan 29 2016, 10:20 AM

ArielGlenn subscribed. Feb 14 2016, 2:49 PM

Volans subscribed. Feb 14 2016, 2:56 PM

Change 253665 merged by Jcrespo:
Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

• jcrespo updated the task description. (Show Details) Feb 18 2016, 12:02 PM

This is semi-working now. We have to decide what to show on icinga for multi-tier slaves and fix some issues on multi-source replication slaves like dbstore.

Change 271495 had a related patch set uploaded (by Jcrespo):
Allow heartbeat table to replicate to dbstore* hosts

https://gerrit.wikimedia.org/r/271495

Change 271495 merged by Jcrespo:
Allow heartbeat table to replicate to dbstore* hosts

https://gerrit.wikimedia.org/r/271495

• jcrespo moved this task from Backlog to In progress on the DBA board. Feb 19 2016, 3:07 PM

Change 271792 had a related patch set uploaded (by Jcrespo):
Add nagios@localhost the permissions to read the heartbeat table

https://gerrit.wikimedia.org/r/271792

Change 271792 merged by Jcrespo:
Add nagios@localhost the permissions to read the heartbeat table

https://gerrit.wikimedia.org/r/271792

Change 271815 had a related patch set uploaded (by Jcrespo):
Small formatting fixes for replication lag check

https://gerrit.wikimedia.org/r/271815

Change 271815 merged by Jcrespo:
Small formatting fixes for replication lag check

https://gerrit.wikimedia.org/r/271815

@aaron, I have patched pt-heartbeat to automatically create and update a "shard" column:

mysql> SELECT * FROM heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
| 2016-03-01T16:05:04.109250 | 180359186 | db2030-bin.000004 | 374798448 | db1009-bin.000185     |           619775551 | s1    |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+
1 row in set (0.03 sec)

This should simplify the latency checking and hopefully not require memcache keys; only "check the largest ts value within my shard", something like:

SELECT ts FROM heartbeat WHERE shard='s1' ORDER BY ts DESC LIMIT 1

(replacing ts with the appropriate subtraction, i.e. the current time minus ts). I need to properly check that this works (I will puppetize it on all servers first), and maybe add a critical error if heartbeat.heartbeat does not contain the appropriate shard value.
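
To make the intended check concrete, here is a minimal sketch of the "largest ts within my shard, then subtract" logic in Python. It is only an illustration, not the check that was deployed: the function name, the exception, the sample rows and the assumption that ts and the current time are both UTC are all hypothetical. Only the ts format and the shard column come from the table above.

```python
from datetime import datetime

# Format of heartbeat.ts as shown in the SELECT output above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


class MissingShardError(Exception):
    """Hypothetical error: no heartbeat row carries the requested shard."""


def heartbeat_lag_seconds(rows, shard, now):
    """Return `now` minus the newest ts among the rows of `shard`.

    rows: iterable of (ts, shard) pairs, as a hypothetical
          "SELECT ts, shard FROM heartbeat" would return them.
    now:  naive datetime on the same clock as ts (assumed UTC here).
    """
    stamps = [datetime.strptime(ts, TS_FORMAT) for ts, s in rows if s == shard]
    if not stamps:
        # Mirrors the idea of a critical error when the shard value is absent.
        raise MissingShardError(shard)
    # Clamp at zero so small clock skew never reports a negative lag.
    return max((now - max(stamps)).total_seconds(), 0.0)


# Made-up sample data; only the first ts is taken from the output above.
rows = [
    ("2016-03-01T16:05:04.109250", "s1"),
    ("2016-03-01T16:05:03.109250", "s1"),
    ("2016-03-01T16:04:00.000000", "s2"),
]
now = datetime(2016, 3, 1, 16, 5, 6, 609250)
print(heartbeat_lag_seconds(rows, "s1", now))  # prints 2.5
```

Filtering by shard before taking the maximum is what keeps this usable on multi-source hosts, where rows from several masters can share one heartbeat table.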

Change 274134 had a related patch set uploaded (by Jcrespo):
[WIP] Adding custom heartbeat script with "shard" additional column

https://gerrit.wikimedia.org/r/274134

Change 274134 merged by Jcrespo:
Add custom heartbeat script with "shard" additional column

https://gerrit.wikimedia.org/r/274134

Change 274377 had a related patch set uploaded (by Jcrespo):
[WIP]Test the new heartbeat functionality on m5-master

https://gerrit.wikimedia.org/r/274377

Change 274377 merged by Jcrespo:
Test the new heartbeat functionality on m5-master

https://gerrit.wikimedia.org/r/274377

Change 274415 had a related patch set uploaded (by Jcrespo):
Fixes for pt-heartbeat daemon init script (fails automatic runs)

https://gerrit.wikimedia.org/r/274415

Change 274415 merged by Jcrespo:
Fixes for pt-heartbeat daemon init script (fails automatic runs)

https://gerrit.wikimedia.org/r/274415

Change 274423 had a related patch set uploaded (by Jcrespo):
Previous heartbeat fix was not enough (introduced extra errors)

https://gerrit.wikimedia.org/r/274423

Change 274423 merged by Jcrespo:
Previous heartbeat fix was not enough (introduced extra errors)

https://gerrit.wikimedia.org/r/274423

Change 274640 had a related patch set uploaded (by Jcrespo):
Enable pt-heartbeat on all misc master (except m1)

https://gerrit.wikimedia.org/r/274640

Change 274640 merged by Jcrespo:
Enable pt-heartbeat on all misc masters (except m1)

https://gerrit.wikimedia.org/r/274640

Change 274670 had a related patch set uploaded (by Jcrespo):
Enable the new pt-heartbeat on core production hosts

https://gerrit.wikimedia.org/r/274670

Change 274670 merged by Jcrespo:
Enable the new pt-heartbeat on core production hosts

https://gerrit.wikimedia.org/r/274670

pt-heartbeat is puppetized and in production on all main core, misc and labs servers.

There are some minor pending tasks:

toolsdb slave (not active)
dbstore[12]001 (delayed slave, it will take 24 hours to take effect)
db1047 and db2002 (need a restart/change of replication filter)
db1048 and db2012 need to whitelist heartbeat, too

• jcrespo updated the task description. (Show Details) Mar 4 2016, 2:22 PM

dbstore2002 and db1047 fixed

The others are fixed too, now. It was a combination of replication filters not having been updated (pending restart) and missing permissions for non-core servers (nagios grants).

• jcrespo removed a parent task: T111266: Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat). Mar 4 2016, 5:46 PM

• jcrespo mentioned this in T111266: Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat).

• jcrespo moved this task from In progress to Backlog on the DBA board. Mar 9 2016, 11:10 AM

Adding graphite, too, as a TODO.

Check dbstore2001; it seems to have issues with pt-heartbeat.

Krinkle subscribed. Apr 19 2016, 4:31 AM

• jcrespo renamed this task from Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga to Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends. Apr 22 2016, 5:41 PM

• jcrespo merged a task: T50694: Show replication lags in Graphite.

• jcrespo added subscribers: Ricordisamoa, Krenair, yuvipanda and 4 others.

This is progressing by adding a datacenter field.

The pending scope of this ticket may be changed; as T126757 is progressing, probably tendril/graphite work will evolve into prometheus.

• jcrespo mentioned this in rOPUP272a5bfcbc15: Add datacenter to lag checks. Aug 3 2016, 7:05 AM

• jcrespo mentioned this in rOPUPf33671ba2b74: Add datacenter to lag checks. Aug 3 2016, 7:11 AM

• jcrespo mentioned this in rOPUP1ec2334b155c: Add datacenter to lag checks.

• jcrespo created subtask T141968: Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master. Aug 3 2016, 9:05 AM

Resolving this; T141968 will handle tendril/dbtree, and T126757 will handle other monitoring separately.

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:47 PM
