ToolsDB: setup pt-heartbeat replication monitor
Open, In Progress, Low, Public

Assigned To

None

Authored By

	fnegri
	Apr 18 2023, 10:30 AM

Description

This seems easy to set up and might have some advantages over the basic replication lag alert I created in T301994.

https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat

Details

	Subject	Repo	Branch	Lines +/-
	toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*	operations/puppet	production	+2 -0


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T306453 toolsdb: review alerting
		In Progress		None	T334925 ToolsDB: setup pt-heartbeat replication monitor

Event Timeline

fnegri created this task. Apr 18 2023, 10:30 AM

Restricted Application added a project: User-bd808. Apr 18 2023, 10:30 AM

Restricted Application added a subscriber: Aklapper.

Change 909397 had a related patch set uploaded (by FNegri; author: Bryan Davis):

[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*

https://gerrit.wikimedia.org/r/909397

gerritbot added a project: Patch-For-Review. Apr 18 2023, 10:31 AM

fnegri changed the task status from Open to In Progress. Apr 18 2023, 5:55 PM

fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.

@jcrespo do you think there is any benefit in using pt-heartbeat for ToolsDB?

It is very easy to set up and works better than the usual method, so there is a benefit. But given the small size of ToolsDB (wikireplicas already get that for free from production), it wouldn't be a huge priority (you don't need second-accurate lag metrics there, I believe). I'd say it should eventually be deployed, for consistency, although Alertmanager still lacks support for it. So it's up to you when to do it. If I were in your position I would say "TODO when there is time" 0:-).

Thanks @jcrespo, I agree it's nice to have for consistency. The pt-heartbeat service is actually already running (as @bd808 noticed) and updating the heartbeat table, but I'm not sure what would be the next step. Creating an alert linked to the heartbeat? Creating a web page like https://replag.toolforge.org/? Or maybe we are happy with just querying the heartbeat table manually when we want to check the lag?
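For reference, checking the lag manually from the heartbeat table could look roughly like the sketch below. It is not taken from the ToolsDB setup: it assumes the replica exposes a `heartbeat.heartbeat` table whose `ts` column holds an ISO-8601 UTC timestamp written by pt-heartbeat on the primary, and that the row can be selected by the shard name 'toolsdb'. The function name `heartbeat_lag_seconds` and the query in the comment are hypothetical.

```python
from datetime import datetime, timezone

# Assumed query to run on the replica (table and column names are assumptions):
#   SELECT ts FROM heartbeat.heartbeat WHERE shard = 'toolsdb'
#   ORDER BY ts DESC LIMIT 1;


def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Return the seconds elapsed between the last replicated heartbeat and `now`.

    `heartbeat_ts` is the `ts` value read from the replica; it is assumed to be
    an ISO-8601 string in UTC, e.g. '2023-04-18T10:30:02.500000'.
    """
    written = datetime.fromisoformat(heartbeat_ts)
    if written.tzinfo is None:
        # Treat naive timestamps as UTC (assumption about how ts is stored).
        written = written.replace(tzinfo=timezone.utc)
    # Clamp to zero so small clock skew never reports a negative lag.
    return max(0.0, (now - written).total_seconds())


# Example with fixed values, so it runs without a database connection.
now = datetime(2023, 4, 18, 10, 30, 5, tzinfo=timezone.utc)
lag = heartbeat_lag_seconds("2023-04-18T10:30:02.500000", now)
print(f"replication lag: {lag:.1f}s")  # prints "replication lag: 2.5s"
```

The same number could feed an alert or a status page later; the calculation itself does not depend on which of those options is chosen.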

The goal for the mediawiki cluster is: T315866: Migrate mysql icinga alerts to alert manager (which you should be able to reuse). The main blocker is: T141968: Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master. Here you can do 2 things: do the work yourself, in coordination with the DBAs (so it works for both), or wait for them to solve it, but that will depend on your availability.

@jcrespo all clear, thanks for the links! I'll move this to "low priority" for now. @bd808 I will also remove you from the "assignee" field, actually I'm sorry for having assigned this task to you in the first place (I should have asked!)

fnegri removed a parent task: T301949: ToolsDB upgrade => Bullseye, MariaDB 10.4. Apr 24 2023, 4:03 PM

bd808 removed a project: User-bd808. Apr 24 2023, 4:05 PM

Change 909397 merged by FNegri:

[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*

https://gerrit.wikimedia.org/r/909397

Maintenance_bot removed a project: Patch-For-Review. Apr 26 2023, 8:29 AM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (FY2022/2023-Q4). Apr 28 2023, 10:30 AM

JJMC89 moved this task from Backlog to ToolsDB on the Data-Services board. Apr 28 2023, 4:19 PM

fnegri added a parent task: T306453: toolsdb: review alerting. Apr 28 2023, 4:39 PM

fnegri mentioned this in T306453: toolsdb: review alerting.
