Set cron script to dump MediaWiki DB lag times into statsd
Closed, Resolved · Public

Description

Similar to getJobQueueLengths.php, we could modify getLagTimes.php and have a cron job call it with a --report option. This would be good for showing:
a) Lag times as perceived by MW
b) Lag without counting depooled servers (which often have technically high lag that doesn't matter)

Event Timeline

aaron created this task. Oct 26 2016, 4:46 PM
Restricted Application added a subscriber: Aklapper. Oct 26 2016, 4:46 PM
ori added a subscriber: ori. Oct 26 2016, 5:16 PM

Depooled servers will continue to show up in Graphite and Grafana (particularly when using wildcards to select metrics), because the metric doesn't get deleted when you stop reporting values. For that reason, it might be useful to have a metric representing the average and maximum lag of all pooled servers.
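A minimal sketch of that aggregation, written in Python purely for illustration (the actual script is PHP). The function name `summarize_pooled_lag`, the host names, and the input shapes are all hypothetical. It assumes lag is available per host in seconds, that a missing measurement is represented as `None`, and that the set of currently pooled hosts is known.

```python
from statistics import mean


def summarize_pooled_lag(lag_by_host, pooled_hosts):
    """Return (average, maximum) lag in seconds over pooled hosts only.

    lag_by_host: dict mapping host name to lag in seconds, or None when
        no measurement is available (hypothetical input shape).
    pooled_hosts: set of host names currently pooled (hypothetical input).
    Returns (None, None) when no pooled host has a usable measurement.
    """
    lags = [
        lag
        for host, lag in lag_by_host.items()
        if host in pooled_hosts and lag is not None
    ]
    if not lags:
        return None, None
    return mean(lags), max(lags)


if __name__ == "__main__":
    # Host names and values are made up for illustration.
    lag = {"db1": 0.5, "db2": 1.5, "db3": 3600.0}
    print(summarize_pooled_lag(lag, {"db1", "db2"}))
```

Reporting only these two aggregates means a depooled host with hours of lag (`db3` in the example) never contributes to the metric, so stale per-host series matched by a wildcard would not distort it.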

Change 318215 had a related patch set uploaded (by Aaron Schulz):
Improve getLagTimes.php output and add statsD flag

https://gerrit.wikimedia.org/r/318215

Gilles moved this task from Inbox to Doing on the Performance-Team board. Oct 27 2016, 8:11 PM

Change 318215 merged by jenkins-bot:
Improve getLagTimes.php output and add statsD flag

https://gerrit.wikimedia.org/r/318215

Change 327667 had a related patch set uploaded (by Aaron Schulz):
Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/327667

I have unblocked this, but please have a look at my comments on gerrit.

Change 327667 merged by Jcrespo:
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/327667

This has been deployed to terbium. Please keep me updated on how this is used; I am interested in any graphs or anything else created from it.

Sadly, I had to revert the script (https://gerrit.wikimedia.org/r/344174) because I am almost sure these are not the intended results:

Please follow up with me to see how to deploy it again.

Change 344472 had a related patch set uploaded (by Aaron Schulz):
[mediawiki/core@master] Send integer ms to DB lag time guage instead of seconds

https://gerrit.wikimedia.org/r/344472

Change 344472 merged by jenkins-bot:
[mediawiki/core@master] Send integer ms to DB lag time guage instead of seconds

https://gerrit.wikimedia.org/r/344472
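For illustration, a sketch of the conversion the commit title describes: turning a lag measured in seconds into a whole number of milliseconds and formatting it as a StatsD gauge line (`<name>:<value>|g`). This is Python rather than the PHP of the actual change, and the helper names and the metric name are hypothetical, not taken from the patch.

```python
def lag_to_gauge_ms(lag_seconds):
    """Convert a lag in seconds to a whole number of milliseconds."""
    return int(round(lag_seconds * 1000))


def format_gauge(metric, value):
    """Format a StatsD gauge line: '<metric>:<value>|g'."""
    return f"{metric}:{value}|g"


if __name__ == "__main__":
    # 'example.lag.db1' is a made-up metric name for illustration.
    print(format_gauge("example.lag.db1", lag_to_gauge_ms(0.25)))
```

Sending integer milliseconds keeps sub-second lag visible even if a consumer truncates the value to an integer, which would otherwise turn most healthy readings into 0.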

aaron added a comment. Apr 20 2017, 7:08 PM

Looks like the puppet change can be attempted again.

No problem, but unless you have a specific reason to want it sooner (a specific test or issue to debug), I would wait until DC failback.

aaron added a comment. Apr 20 2017, 7:17 PM

Of course, it's lower priority. I'm just making a note since I keep forgetting about this.

I just saw a request from wikiuser checking the lag on a server that was depooled hours ago (db2062). Could this be related in some way to this script incorrectly checking non-pooled servers? It is not Icinga; it comes from MediaWiki.

aaron added a comment. May 10 2017, 7:14 PM

Doubtful, the script just uses LoadBalancer::getLagTimes(), which is what the main code does for getConnection() lag checks.

Change 354138 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/354138

Change 354138 merged by Filippo Giunchedi:
[operations/puppet@production] Set cron script to dump MediaWiki DB lag times into statsd

https://gerrit.wikimedia.org/r/354138