
Set cron script to dump MediaWiki DB lag times into statsd
Closed, Resolved (Public)

Description

Similar to getJobQueueLengths.php, we could modify getLagTimes.php and have a cron job call it with a --report option. This would be good for showing:
a) MW perceived lag times
b) Not counting depooled servers (which often have technically high lag that doesn't matter)

Event Timeline

Depooled servers will continue to show up in Graphite and Grafana (particularly when using wildcards to select metrics), because the metric doesn't get deleted when you stop reporting values. For that reason, it might be useful to have a metric representing the average and maximum lag of all pooled servers.
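The aggregate metric suggested above could be computed along these lines. This is only a minimal sketch: the function name `pooled_lag_summary`, the host names, and the input shapes are hypothetical and are not taken from getLagTimes.php or from MediaWiki.

```python
from statistics import mean


def pooled_lag_summary(lag_by_host, pooled_hosts):
    """Return (average, maximum) lag in seconds over pooled hosts only.

    Hosts that are not pooled, or whose lag is unknown (None), are skipped.
    Returns None when no pooled host has a known lag.
    """
    lags = [
        lag
        for host, lag in lag_by_host.items()
        if host in pooled_hosts and lag is not None
    ]
    if not lags:
        return None
    return mean(lags), max(lags)


# Hypothetical sample data: "db-c" is depooled and badly lagged,
# so it must not influence the summary.
lag_by_host = {"db-a": 0.5, "db-b": 1.5, "db-c": 3600.0}
pooled_hosts = {"db-a", "db-b"}

summary = pooled_lag_summary(lag_by_host, pooled_hosts)
```

Reporting the two aggregate values under fixed metric names would keep a dashboard from depending on wildcard matches that still include stale per-host series.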

Change 318215 had a related patch set uploaded (by Aaron Schulz):
Improve getLagTimes.php output and add statsD flag

https://gerrit.wikimedia.org/r/318215

Change 318215 merged by jenkins-bot:
Improve getLagTimes.php output and add statsD flag

https://gerrit.wikimedia.org/r/318215

Change 327667 had a related patch set uploaded (by Aaron Schulz):
Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/327667

I have unblocked this, but please have a look at my comments on gerrit.

Change 327667 merged by Jcrespo:
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/327667

This has been deployed to terbium. Please keep me updated on how this is used; I am interested in any graphs or anything else created from it.

Sadly, I had to revert the script (https://gerrit.wikimedia.org/r/344174) because I am almost sure these are not the intended results:

Please follow up with me to see how to deploy it again.

Change 344472 had a related patch set uploaded (by Aaron Schulz):
[mediawiki/core@master] Send integer ms to DB lag time guage instead of seconds

https://gerrit.wikimedia.org/r/344472

Change 344472 merged by jenkins-bot:
[mediawiki/core@master] Send integer ms to DB lag time guage instead of seconds

https://gerrit.wikimedia.org/r/344472
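The conversion named in the change title can be illustrated as follows. The thread does not say why reporting seconds gave unintended results, so this is only a sketch of the conversion itself. The function name `lag_gauge_line` and the metric name are hypothetical; the `name:value|g` form is the standard StatsD gauge line.

```python
def lag_gauge_line(metric, lag_seconds):
    """Format one StatsD gauge line carrying the lag as integer milliseconds."""
    # Convert seconds to whole milliseconds.
    lag_ms = int(round(lag_seconds * 1000))
    # Assumption: clamp at zero, because StatsD treats a gauge value with a
    # leading sign as a delta rather than as an absolute value.
    lag_ms = max(0, lag_ms)
    return f"{metric}:{lag_ms}|g"


# Hypothetical metric name, for illustration only.
line = lag_gauge_line("example.db.lag.db-a", 0.25)
```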

Looks like the puppet change can be attempted again.

No problem, but unless you have a specific reason to want it sooner (a specific test or issue to debug), I would wait until DC failback.

Of course, it's lower priority; I'm just making a note since I keep forgetting about this.

I just got a request from wikiuser checking the lag on a depooled server (hours ago): db2062. Could this be related in some way to this script incorrectly checking non-pooled servers? It is not Icinga; it comes from MediaWiki.

Doubtful, the script just uses LoadBalancer::getLagTimes(), which is what the main code does for getConnection() lag checks.

Change 354138 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/354138

Change 354138 merged by Filippo Giunchedi:
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/354138