Show replication lags in Graphite
Closed, Duplicate · Public

Description

The replication lags of the database servers should be shown in Graphite.

Details

Reference
bz48694

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:22 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz48694.

As a test, I have set up ~scfc/bin/replagstats to run every minute. The statistics are available at http://ganglia.wmflabs.org/ -> tools -> tools-login -> "Replication Lags metrics".

Looks like it's working fine to me.

(In reply to comment #3)

> Looks like it's working fine to me.

No, as discussed on IRC, it's still running under my personal account.

As it would be useful to show replication lag for every MariaDB slave, I wanted to discuss this as a wider change with Asher. But:

a) the chance never came about, and
b) it's already there! For db1035, go to http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1035.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 and search for "mysql_slave_lag".

However, this isn't available for labsdb* yet (cf. http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=labsdb1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2). At the moment it can't be enabled anyway: the monitoring for db1035 et al. assumes that only /one/ MariaDB instance runs on any server, while on labsdb* there are several, so mysql_slave_lag & Co. need to be prefixed by, for example, "s1_".

So to resolve this bug, we need to:

a) refactor the monitoring bits and pieces so that they handle multiple instances on one server,
b) enable such monitoring for labsdb*, and
c) create a ganglia::view where *_mysql_slave_lag for labsdb* is combined in one report so that the information isn't scattered over three pages and literally hundreds of graphs.
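
The per-instance naming needed for step a) can be sketched roughly as follows. This is only an illustration: the helper names `prefixed_metric` and `metrics_for_host` and the instance list are hypothetical, and only the "s1_" prefix and the `mysql_slave_lag` metric name come from the discussion above.

```python
def prefixed_metric(instance, metric="mysql_slave_lag"):
    """Return a per-instance metric name, e.g. 's1_mysql_slave_lag'.

    With no instance name (a server running a single MariaDB instance),
    the bare metric name is returned unchanged.
    """
    if not instance:
        return metric
    return f"{instance}_{metric}"


def metrics_for_host(instances, metric="mysql_slave_lag"):
    """Build one metric name per MariaDB instance on a host."""
    return [prefixed_metric(name, metric) for name in instances]
```

Keeping the unprefixed name for the single-instance case means hosts such as db1035 would not need their existing graphs renamed.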

(I moved ~scfc/bin/replagstats to ~tools.admin/bin/, rewrote it from a cron job to a continuous job and started it with jstart.)
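
For readers unfamiliar with the cron-to-continuous-job change, the general shape of such a loop might look like the sketch below. It is not the actual replagstats script, which is not shown in this task: `sample` and `report` are hypothetical callables standing in for "measure the lags" and "publish them", and the 60-second default mirrors the once-a-minute schedule mentioned earlier.

```python
import time


def run_continuously(sample, report, interval=60.0, iterations=None,
                     sleep=time.sleep):
    """Call sample() and pass its result to report() every `interval` seconds.

    iterations=None loops forever, as a continuous job would; a finite
    value is mainly useful for testing. Returns the number of rounds run.
    """
    count = 0
    while iterations is None or count < iterations:
        started = time.monotonic()
        try:
            report(sample())
        except Exception as exc:
            # Keep the job alive across transient failures.
            print(f"sampling failed: {exc}")
        count += 1
        # Sleep only for the remainder of the interval so rounds stay
        # roughly one interval apart regardless of how long sampling took.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval - elapsed))
    return count
```

Unlike a cron entry, a loop like this has to survive its own errors, which is why a failed round is logged and skipped rather than allowed to end the job.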

(I needed to group the statistics at Ganglia under the virtual host "tools-replags".)

yuvipanda subscribed.

Ganglia has been dead for a while now.

yuvipanda renamed this task from Show replication lags in Ganglia to Show replication lags in Graphite. Mar 25 2015, 9:48 PM
yuvipanda added a project: ToolLabs-Goals-Q4.
yuvipanda set Security to None.

This would also allow us to alert based on it.
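
To make that concrete, here is a minimal sketch of pushing lag values to Graphite and checking them against a threshold. It assumes Graphite's plaintext protocol (one "path value timestamp" line per data point, by default on TCP port 2003); the metric path `tools.replag.s1`, the helper names and the threshold are made up for illustration and are not part of any existing setup.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def lagging(lags, threshold_seconds):
    """Return the names of instances whose lag exceeds the threshold."""
    return sorted(name for name, lag in lags.items()
                  if lag > threshold_seconds)


def send_to_graphite(lines, host, port=2003):
    """Send preformatted lines to a Graphite (carbon) plaintext listener."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

For example, `lagging({"s1": 5, "s2": 700}, 300)` would flag only "s2"; how such a result is turned into an actual alert depends on the monitoring setup in use.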

scfc removed coren as the assignee of this task. Apr 6 2015, 8:10 AM
scfc updated the task description.
scfc moved this task from "Backlog" to "Ready to be worked on" on the Toolforge board.
Restricted Application added a subscriber: Aklapper.