
write irc bot to report high replag of s{1,2,3}.labsdb on #wikimedia-labsdb
Open, Lowest, Public


If there is high replication lag on the labs replica DBs, it is unknown to labs admins/operators until somebody reports replag problems on IRC.

Event Timeline

Merl created this task. Jul 17 2015, 3:43 PM
Merl raised the priority of this task from to Needs Triage.
Merl updated the task description.
Merl added a subscriber: Merl.
Restricted Application added a subscriber: Aklapper. Jul 17 2015, 3:43 PM

Why not report it in Cloud-Services and DBA? Do we really need yet another channel?

Sitic added a subscriber: Sitic. Jul 20 2015, 2:38 AM
chasemp triaged this task as Lowest priority. Nov 30 2015, 4:24 PM
chasemp added a subscriber: chasemp.

I believe @Krinkle already has a bot that does this on IRC for the production DB lag. I wonder if it could be modified to accommodate these hosts instead of writing a whole new bot.

"it is unknown to labs admins/operators until somebody reports replag problems on IRC." That is untrue: lag is monitored by several admin-only tools, but those are not public because they could expose sensitive query data. We get real-time data for all DB servers. By the time someone reports lag, we already know about it, but sadly replication lag is either a) created by user locks, which happens often, and we do give those some grace time to finish, or b) created by issues or maintenance, and it takes time to recover.

If someone wants to help with making the lag available publicly, help is welcome, although the heartbeat tables and a web interface already exist (the DBA created the underlying infrastructure and users created the web interface).

The best option is probably to make the heartbeat tables available through a Diamond collector. That allows us to aggregate the data in Graphite, and to connect it to Shinken alerts as required.

If anyone is interested in picking this up, take a look at rOPUP Wikimedia Puppet modules/diamond/files/collector to implement the collector. Alternatively, the replag tool could use a crontab to regularly push data to Graphite directly.
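As a rough illustration of the second option (a cron job pushing to Graphite directly), here is a minimal Python sketch. It assumes the lag is derived from the timestamp of the latest heartbeat row and that the Graphite (carbon) plaintext listener accepts `path value timestamp` lines, by default on port 2003. The metric prefix `labsdb.replag`, the function names, and the sample timestamps are hypothetical; reading the actual heartbeat tables is left out.

```python
import socket
from datetime import datetime, timedelta, timezone


def lag_seconds(heartbeat_ts, now):
    """Lag is the time elapsed since the last heartbeat row was written."""
    return max(0.0, (now - heartbeat_ts).total_seconds())


def graphite_lines(lags, now, prefix="labsdb.replag"):
    """Build one Graphite plaintext line per shard: 'path value timestamp'."""
    epoch = int(now.timestamp())
    return [f"{prefix}.{shard} {lag:.0f} {epoch}"
            for shard, lag in sorted(lags.items())]


def push(lines, host, port=2003, timeout=5.0):
    """Send the lines to a carbon plaintext listener (host is site-specific)."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Placeholder values; a real job would read these from the heartbeat tables.
    heartbeats = {"s1": now - timedelta(seconds=42), "s2": now}
    lags = {shard: lag_seconds(ts, now) for shard, ts in heartbeats.items()}
    # Only print here; call push(lines, "<graphite host>") from the cron job.
    for line in graphite_lines(lags, now):
        print(line)
```

A Diamond collector would wrap the same lag computation and leave delivery to Diamond, which is probably the better fit if Shinken alerts are meant to consume the data.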

We are doing that for production (directly from the database), so no need for a separate ticket for labs.

I have been working for the last two weeks to prepare the production infrastructure for that. The tickets are T50694, T114752, T99485 and T71463.

Yeah, the use case for the bot (and my bot) has three parts, two of which would presumably be covered by the bot requested in this ticket:

  • Proactively report if replag is non-zero (or some other threshold).
  • Respond to queries for specific replag numbers.
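The two behaviours listed above could be sketched roughly as follows in Python. This is only an illustration of the logic, not how any existing bot works: the `!replag` command syntax, the 300-second default threshold, and the function names are assumptions, and the IRC connection itself is omitted. The sample lag for s1 is the 73970-second figure reported later in this task.

```python
def format_duration(seconds):
    """Render a lag in seconds as hours, minutes and seconds."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


def alerts(lags, threshold=300):
    """Proactive part: one message per shard whose lag exceeds the threshold."""
    return [f"replag on {shard}.labsdb: {format_duration(lag)}"
            for shard, lag in sorted(lags.items())
            if lag > threshold]


def answer(query, lags):
    """Query part: reply to a '!replag <shard>' message, ignore anything else."""
    parts = query.split()
    if len(parts) != 2 or parts[0] != "!replag":
        return None
    shard = parts[1]
    if shard not in lags:
        return f"unknown shard: {shard}"
    return f"{shard}.labsdb replag: {format_duration(lags[shard])}"


if __name__ == "__main__":
    lags = {"s1": 73970, "s2": 0, "s3": 10}
    for message in alerts(lags):
        print(message)
    print(answer("!replag s3", lags))
```

Keeping the threshold check and the query handler as plain functions means the same logic could be attached to an existing bot rather than requiring a new one.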

The former is covered by monitoring bots from Icinga and, in the future, Shinken.

The latter is handled by web tools such as the replag tool. We could probably set up another instance in prod for the prod slaves if ops prefer that, though I imagine it can already be retrieved through an existing dashboard (e.g. Icinga). And once T50694 is resolved we can have a Grafana dashboard for it as well. The replag tool is quite nice though. Perhaps we can add it to

@Krinkle, already exists, and it is linked from noc (but it still uses the old replication lag definition). The next step is to actually change it to use heartbeat (T114752), and I hope to complete that soon.

If you want something more elaborate, like a webservice, I would ask you for a patch (O:-P) to (but something non-static would be problematic to cache and could create security issues).

If you need it yourself, that is OK, but please believe me when I say that ops are more than satisfied with the internal monitoring of replication lag (they actually think it is at times too noisy).

The problem is that for years it has been tangled with query monitoring, so it cannot be made public for privacy and security reasons. That is changing now; I am working on that, as I previously mentioned. But contributions from non-ops for other uses are still welcome!

BTW, lab hosts are not shown on dbtree on purpose, but they can be exposed (on a separate section outside of the coredb servers) with a simple config change, if that could be useful.

Please note T138378; replag is currently 73970 sec on s1.

Krinkle removed a subscriber: Krinkle. May 29 2018, 1:39 PM