This seems easy to set up and might have some advantages over the basic replication lag alert I created in T301994.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T306453 toolsdb: review alerting
In Progress | | None | T334925 ToolsDB: setup pt-heartbeat replication monitor
Event Timeline
Change 909397 had a related patch set uploaded (by FNegri; author: Bryan Davis):
[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*
It is very easy to set up and works better than the usual method, so there is a benefit. But given the small size of ToolsDB (wikireplicas already get that for free from production), it wouldn't be a huge priority (you don't need second-accurate lag metrics there, I believe). I'd say it should eventually be deployed, for consistency, although Alertmanager still lacks support for it. So it's up to you when to do it. If I were in your position I would say "TODO when there is time" 0:-).
Thanks @jcrespo, I agree it's nice to have for consistency. The pt-heartbeat service is actually already running (as @bd808 noticed) and updating the heartbeat table, but I'm not sure what the next step would be. Creating an alert linked to the heartbeat? Creating a web page like https://replag.toolforge.org/? Or maybe we are happy with just querying the heartbeat table manually when we want to check the lag?
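For reference, the "query the heartbeat table manually" option boils down to comparing the last heartbeat timestamp seen on the replica with the current time. Below is a minimal Python sketch of that calculation. It assumes the table is `heartbeat.heartbeat` with a `ts` column holding an ISO-8601 string in UTC (the pt-heartbeat defaults, not verified against the ToolsDB setup); the query constant and the `lag_seconds` helper are hypothetical names used only for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed query: table and column names follow pt-heartbeat defaults and
# may differ on ToolsDB. Run it on the replica with any MySQL client.
HEARTBEAT_QUERY = "SELECT ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"


def lag_seconds(heartbeat_ts: str, now: Optional[datetime] = None) -> float:
    """Return replication lag in seconds for a heartbeat timestamp.

    heartbeat_ts is the `ts` value read on the replica, e.g.
    "2023-04-18T10:15:30.000510". It is assumed to be in UTC.
    """
    written = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero so small clock skew never shows up as negative lag.
    return max(0.0, (now - written).total_seconds())


if __name__ == "__main__":
    # Example with a fixed "now" so the result is reproducible.
    fixed_now = datetime(2023, 4, 18, 10, 15, 32, 500000, tzinfo=timezone.utc)
    print(lag_seconds("2023-04-18T10:15:30.000000", now=fixed_now))
```

An alert or a replag-style web page would use the same subtraction; only the place where the result is published changes.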
The goal for the MediaWiki cluster is T315866: Migrate mysql icinga alerts to alert manager (which you should be able to reuse). The main blocker is T141968: Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master. You have two options here: do the work yourself, in coordination with the DBAs (so it works for both), or wait for them to solve it. Which one makes sense will depend on your availability.
Change 909397 merged by FNegri:
[operations/puppet@production] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_*