ToolsDB: setup pt-heartbeat replication monitor
Open, In Progress, Low · Public

Description

This seems easy to set up and might have some advantages over the basic replication lag alert I created in T301994.

https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 909397 had a related patch set uploaded (by FNegri; author: Bryan Davis):

[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*

https://gerrit.wikimedia.org/r/909397

fnegri changed the task status from Open to In Progress. · Apr 18 2023, 5:55 PM
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.
fnegri triaged this task as Medium priority. · Apr 24 2023, 10:54 AM
fnegri added a subscriber: jcrespo.

@jcrespo do you think there is any benefit in using pt-heartbeat for ToolsDB?

It is very easy to set up and works better than the usual method, so there is a benefit. But given the small size of ToolsDB (wikireplicas already get that for free from production), it wouldn't be a huge priority (you don't need second-accurate lag metrics there, I believe). I'd say it should eventually be deployed, for consistency, although Alertmanager still lacks support for it. So it's up to you when to do it. If I were in your position I would say "TODO when there is time" 0:-).

Thanks @jcrespo, I agree it's nice to have for consistency. The pt-heartbeat service is actually already running (as @bd808 noticed) and updating the heartbeat table, but I'm not sure what the next step would be. Creating an alert linked to the heartbeat? Creating a web page like https://replag.toolforge.org/? Or maybe we are happy with just querying the heartbeat table manually when we want to check the lag?
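For reference, a rough sketch of what the "query the heartbeat table manually" option boils down to: read the newest heartbeat timestamp on the replica and subtract it from the current time. Everything named here is an assumption for illustration, not taken from this task: the `heartbeat.heartbeat` table name, the `ts` column, the ISO-style UTC timestamp format, and the `lag_seconds` helper itself.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical query to run on the replica; the table and column names
# are assumptions and may not match the actual ToolsDB setup.
HEARTBEAT_QUERY = "SELECT MAX(ts) FROM heartbeat.heartbeat"


def lag_seconds(heartbeat_ts: str, now: Optional[datetime] = None) -> float:
    """Return the seconds elapsed since the newest heartbeat row.

    Assumes heartbeat_ts is a naive ISO 8601 string expressed in UTC,
    for example "2023-04-24T15:00:00.500000".
    """
    written = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero so small clock skew never reports a negative lag.
    return max(0.0, (now - written).total_seconds())


if __name__ == "__main__":
    fixed_now = datetime(2023, 4, 24, 15, 0, 3, tzinfo=timezone.utc)
    print(lag_seconds("2023-04-24T15:00:00.500000", fixed_now))
```

An alert or a web page would be a thin wrapper around the same calculation, so the choice between them is mostly about where the number gets displayed.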

The goal for the mediawiki cluster is: T315866: Migrate mysql icinga alerts to alert manager (which you should be able to reuse). The main blocker is: T141968: Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master. Here you can do 2 things: do the work yourself, in coordination with the DBAs (so it works for both), or wait for them to solve it, but that will depend on your availability.

fnegri removed bd808 as the assignee of this task. · Apr 24 2023, 3:00 PM
fnegri lowered the priority of this task from Medium to Low.
fnegri added a subscriber: bd808.

@jcrespo all clear, thanks for the links! I'll move this to "low priority" for now. @bd808 I will also remove you from the "assignee" field. I'm sorry for having assigned this task to you in the first place (I should have asked!).

Change 909397 merged by FNegri:

[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*

https://gerrit.wikimedia.org/r/909397