If there is high replication lag on the labs replica DBs, it is unknown to labs admins/operators until somebody reports replag problems on IRC.
Description
Related Objects
- Mentioned In
  - T50628: Provide replication lag as a database function
- Mentioned Here
  - T138378: Replag on s1 (enwiki) is 70370 and is still growing fast.
  - T50694: Show replication lags in Graphite
  - T71463: Create a table in labs with replication lag data
  - T99485: implement performance_schema for mysql monitoring
  - T114752: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends
Event Timeline
I believe @Krinkle already has a bot that does this for the production DB lags on IRC. I wonder if it could be modified to accommodate these instead of writing a whole new bot.
"it is unknown to labs admins/operators until somebody reports replag problems on IRC." That is untrue: lag is monitored by several admin-only tools; however, those are not public because they could expose sensitive query data. We get real-time data from all DB servers. By the time someone reports it, we already know, but sadly a) replication lag is often caused by user locks, and we do give those some grace time to continue, or b) it is caused by issues or maintenance, and it takes time to recover.
If someone wants to help with making the lag available publicly, help is welcome, although the heartbeat tables and https://tools.wmflabs.org/replag/ already exist (the DBA created the underlying infrastructure and users created the web interface).
The best option is probably to make the heartbeat tables available through a Diamond collector. That allows us to aggregate the data in Graphite, and to connect it to Shinken alerts as required.
If anyone is interested in picking this up, take a look at modules/diamond/files/collector in the Wikimedia Puppet repository (rOPUP) to implement the collector. Alternatively, the replag tool could use a crontab to regularly push data to Graphite directly.
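For the crontab alternative, a minimal sketch of pushing lag values to Graphite might look like the following. It uses Graphite's plaintext protocol (one `path value timestamp` line per metric, sent to the Carbon listener, conventionally on port 2003). The metric prefix `labsdb.replag`, the function names, and the idea that the lag values arrive as a shard-to-seconds dict are all assumptions for illustration; how the values are read from the heartbeat tables is left out.

```python
import socket
import time


def build_payload(lags, prefix="labsdb.replag", timestamp=None):
    """Render {shard: lag_seconds} as Graphite plaintext lines.

    The metric prefix is a hypothetical name, not an existing metric.
    """
    if timestamp is None:
        timestamp = int(time.time())
    lines = [
        "%s.%s %s %d\n" % (prefix, shard, lag, int(timestamp))
        for shard, lag in sorted(lags.items())
    ]
    return "".join(lines)


def push_to_graphite(payload, host, port=2003):
    """Send the payload to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload.encode("ascii"))
```

A cron job would call `build_payload()` with the current lag per shard and hand the result to `push_to_graphite()` with the Graphite host; the host name is deliberately not filled in here.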
Yeah, the use case for the bot (and my bot) has three parts, two of which would presumably be covered by the bot requested in this ticket:
- Proactively report if replag is non-zero (or some other threshold).
- Respond to queries for specific replag numbers.
The former is covered by monitoring bots from Icinga and, in the future, Shinken.
The latter is handled by web tools such as https://tools.wmflabs.org/replag/. We could probably set up another instance in prod for the prod slaves if ops prefer that, though I imagine it can already be retrieved through an existing dashboard (e.g. Icinga). And once T50694 is resolved we can have a Grafana dashboard for it as well. The replag tool is quite nice though. Perhaps we can add it to noc.wikimedia.org?
@Krinkle, https://dbtree.wikimedia.org/ already exists, and it is linked from noc (but it still uses the old replication lag definition). The next step is to actually change it to use heartbeat (T114752), and I hope to complete it soon.
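As a rough illustration of the two bot behaviours listed above (proactive reports above a threshold, and answers to queries for a specific shard), a sketch could look like this. The function names and the shard-to-seconds dict are hypothetical; this is not how any existing bot is implemented.

```python
def shards_over_threshold(lags, threshold=0):
    """Return only the shards whose lag (in seconds) exceeds the threshold."""
    return {shard: lag for shard, lag in lags.items() if lag > threshold}


def answer_query(lags, shard):
    """Build a one-line reply for a query about a single shard."""
    if shard not in lags:
        return "No replag data for %s" % shard
    return "Replag on %s is %d s" % (shard, lags[shard])
```

The first function would drive the proactive report; the second would back a command such as asking the bot for the lag of `s1`.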
If you want something more elaborate, like a webservice, I would ask you for a patch (O:-P) to https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/dbtree (but something non-static would be problematic to cache and could create security issues).
If you need it yourself it is ok, but please believe me if I say that ops are more than satisfied with the internal monitoring of replication lag (they actually think it is at times too noisy):
The problem is that for years, it has been tangled with query monitoring, and it cannot be made public for privacy and security reasons. That is changing now; I am working on that, as I previously mentioned. But contributions from non-ops for other uses are still welcome!
BTW, labs hosts are not shown on dbtree on purpose, but they can be exposed (in a separate section outside of the coredb servers) with a simple config change, if that would be useful.
