Write wmf replag ircbot
Closed, Resolved (Public)

Description

Like the Toolserver's replag bot; it would get its data from the API:

action=query&meta=siteinfo&siprop=dbrepllag
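
As a rough illustration of the query above, here is a minimal Python sketch that builds the request URL and reduces a JSON response to a host-to-lag mapping. The `api_base` value, the function names and the sample payload are assumptions for illustration; `sishowalldb` and `format=json` are standard MediaWiki API parameters added here so that every replica is listed.

```python
import json
from urllib.parse import urlencode

# Parameters from the task description, plus two additions (assumptions):
# sishowalldb lists every replica instead of only the most lagged one,
# and format=json asks for a JSON response.
API_PARAMS = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "dbrepllag",
    "sishowalldb": "1",
    "format": "json",
}


def build_replag_url(api_base):
    """Return the full query URL for a wiki's api.php endpoint."""
    return api_base + "?" + urlencode(API_PARAMS)


def parse_replag(body):
    """Map each database host to its replication lag in seconds."""
    data = json.loads(body)
    return {entry["host"]: entry["lag"] for entry in data["query"]["dbrepllag"]}
```

One request against any wiki on a cluster should be enough to read the lag of that cluster's hosts, so the bot would only need one known wiki per cluster.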

Commands somewhat like:

[#wikimedia-tech] <Krinkle>: @replag
[#wikimedia-tech] <wmfreplag>: [s1] db26: 6; [s5] db14: 1, db35: 1

[#wikimedia-tech] <Krinkle>: @replag all
[#wikimedia-tech] <wmfreplag>: [s1] db36: 0, db32: 0, db12: 0, db26: 0, db38: 0; [s2] db13: 0, db30: 0, db24: 0; [s4] db31: 0, db22: 0, db33: 0;
[#wikimedia-tech] <wmfreplag>: [s5] db23: 0, db14: 0, db35: 0; [s6] db29: 0, db21: 0, db7: 0; [s7] db37: 0, db18: 0, db16: 0;

[#wikimedia-dev] <Krinkle>: @replag s4
[#wikimedia-dev] <wmfreplag>: [s4] db31: 0, db22: 0, db33: 0

[#wikimedia-dev] <Krinkle>: @replag db36
[#wikimedia-dev] <wmfreplag>: db36: 0 (s1)

[#wikimedia-dev] <Krinkle>: @replag commonswiki
[#wikimedia-dev] <wmfreplag>: [commonswiki: s4] db31: 0, db22: 0, db33: 0
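
The reply format in the examples above could be produced along these lines. This is only a sketch: the function names and the `max_len` cut-off are assumptions (IRC limits message length, which is presumably why the "all" reply spans two lines), not taken from the actual bot.

```python
def format_cluster(cluster, lags):
    """Render one cluster as '[s1] db36: 0, db32: 0'."""
    hosts = ", ".join(f"{host}: {lag}" for host, lag in lags.items())
    return f"[{cluster}] {hosts}"


def format_reply(lags_by_cluster, max_len=400):
    """Join clusters with '; ' and start a new IRC line when one gets too long.

    max_len is a hypothetical safety margin below the IRC message limit.
    """
    lines = []
    current = ""
    for cluster, lags in lags_by_cluster.items():
        part = format_cluster(cluster, lags)
        candidate = part if not current else current + "; " + part
        if current and len(candidate) > max_len:
            # The current line is full: emit it and start a new one.
            lines.append(current)
            current = part
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

For example, `format_reply({"s1": {"db26": 6}, "s5": {"db14": 1, "db35": 1}})` yields a single line matching the first `@replag` reply above.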

Info such as dbserver numbers, server cluster numbers and wikidb names will be fetched periodically from Wikimedia's conf/db.php [1].

This is basically a reminder for myself right now. I haven't started on this yet, so if anyone feels like it, go ahead and assign it to yourself :-)

Krinkle

[1]
http://noc.wikimedia.org/conf/highlight.php?file=db.php
http://noc.wikimedia.org/conf/db.php.txt


Version: unspecified
Severity: enhancement

Details

Reference
bz28492

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:24 PM
bzimport set Reference to bz28492.

Can we do this in a saner way for say all, rather than just hitting an API page on each cluster...?

(In reply to comment #1)

> Can we do this in a saner way for say all, rather than just hitting an API page
> on each cluster...?

Based on the info from db.php, it would only have to make 1, 2 or 7 HTTP requests, depending on the IRC command. Note that I do not intend to create a bot that warns when replag is too high (in other words, it would not make any requests while idling), since that is probably something that should be caught server-side and would indicate a larger issue.

Although it could of course check 'all' silently once every 15 minutes and report anything out of the ordinary; that's not a big deal.

Stupid question (I'm just curious): if it's not going to check repeatedly in case things go wrong, what's the use case for knowing the replag? If it's big enough to make a difference, I'd imagine that'd fall in the category of something gone wrong.

(In reply to Bawolff comment #3)

> Stupid question (I'm just curious): if it's not going to check repeatedly in
> case things go wrong, what's the use case for knowing the replag?

(In reply to Krinkle comment #2)

> Although it could of course check 'all' silently once every 15 minutes and
> report anything out of the ordinary, not that big a deal.

Okay, it *will* check periodically!

(In reply to comment #4)

> (In reply to Bawolff comment #3)
>
> > Stupid question (I'm just curious): if it's not going to check repeatedly in
> > case things go wrong, what's the use case for knowing the replag?
>
> (In reply to Krinkle comment #2)
>
> > Although it could of course check 'all' silently once every 15 minutes and
> > report anything out of the ordinary, not that big a deal.
>
> Okay, it *will* check periodically!

Um, doesn't the nagios bot already report this in channel if it goes too high?

A basic start has been made.

Booted it for a test run in #wikimedia-dev, #wikimedia-tech, #wmfDbBot.

Account: wmfDbBot

It doesn't do the periodic checks and nagging yet. For now it only answers on demand, to see whether it is wanted or not.

Currently supported commands:

@info <id>
@replag <id>

id:

  • cluster (s1-s7; @info also supports 'DEFAULT')
  • dbhost (e.g. db18)
  • dbname (e.g. enwiki, dewiktionary; @info also supports 'centralauth')
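
To illustrate how the three kinds of id listed above might be told apart, here is a small sketch. The function name, the regular expressions and the `wiki_to_cluster` mapping are assumptions; in the real bot that mapping would come from db.php. The @info-only values 'DEFAULT' and 'centralauth' are deliberately left out.

```python
import re


def classify_id(arg, wiki_to_cluster):
    """Return 'cluster', 'dbhost', 'dbname', or None for an unknown id.

    wiki_to_cluster is a hypothetical dict such as {"commonswiki": "s4"}.
    """
    if re.fullmatch(r"s[1-7]", arg):
        return "cluster"
    if re.fullmatch(r"db\d+", arg):
        return "dbhost"
    if arg in wiki_to_cluster:
        return "dbname"
    return None
```

Checking the cluster and host patterns before the wiki lookup keeps the common cases cheap and means a stale wiki list cannot shadow a host name.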

"@replag" without arguments will check all hosts and only return those that have a replag higher than 1 second (or alternatively, "No replag").

"@replag all" will check all clusters and return all their dbhosts+lag counts.

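The behaviour of "@replag" without arguments, as described above, could be sketched like this. The names are hypothetical, and the threshold follows the description (strictly more than 1 second), which differs slightly from the original proposal, where hosts with a lag of 1 were also shown.

```python
def lagged_hosts(lags_by_cluster, threshold=1):
    """Keep only hosts whose lag is strictly greater than threshold seconds."""
    lagged = {}
    for cluster, lags in lags_by_cluster.items():
        slow = {host: lag for host, lag in lags.items() if lag > threshold}
        if slow:
            lagged[cluster] = slow
    return lagged


def replag_summary(lags_by_cluster, threshold=1):
    """Return the lagged hosts grouped by cluster, or 'No replag'."""
    lagged = lagged_hosts(lags_by_cluster, threshold)
    if not lagged:
        return "No replag"
    return "; ".join(
        f"[{cluster}] " + ", ".join(f"{host}: {lag}" for host, lag in lags.items())
        for cluster, lags in lagged.items()
    )
```

Passing `threshold=-1` would list every host, which is roughly what "@replag all" reports.
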
(In reply to comment #5)

> (In reply to Bawolff comment #3)
>
> > Stupid question (I'm just curious): if it's not going to check repeatedly in
> > case things go wrong, what's the use case for knowing the replag?
>
> Um, doesn't the nagios bot already report this in channel if it goes too high?

I have never seen it do that. Can someone verify this?

AFAIK I'm sure it doesn't...

Marking as fixed.

It's been running for a while and works nicely.

Source code for bot: https://svn.toolserver.org/svnroot/krinkle/trunk/Kribo/
wmf-replag backend + bridge to Kribo-bot: https://svn.toolserver.org/svnroot/krinkle/trunk/Kribo%20(plugins)/wmfDbBot_KriboBridge/