
Implement a system to monitor tools on tool-labs
Open, MediumPublic

Description

Automated monitoring + alerts for tool users would be awesome, and would probably increase the reliability of toollabs a fair bit. No need to play the 'is this tool up or not?!' guessing game.


Version: unspecified
Severity: normal

Details

Reference
bz51434

Event Timeline

bzimport raised the priority of this task to High.Nov 22 2014, 2:02 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz51434.

This should be a separate setup from what we have for production, and also from the one for critical infrastructure on toollabs (such as the mysql or apache daemons).

sumanah wrote:

I am willing to be told I'm wrong - but I think this is a pretty important step in improving our own reliability, and in providing high-reassurance support to our users.

scfc added a comment.Jan 19 2014, 9:11 PM

(In reply to comment #2)

I am willing to be told I'm wrong - but I think this is a pretty important step in improving our own reliability, and in providing high-reassurance support to our users.

IIRC the scope of this bug is Icinga for users' tools; for Tools' reliability in general we have icinga.wmflabs.org with (currently) various shortcomings (if it is running at all) that should be addressed in a different bug. For the latter, I remember hashar being interested in using it more for beta as well.

scott.leea wrote:

Is this something I can work on?

coren added a comment.Jul 9 2014, 5:03 PM

Not just yet; we're currently at the stage where we are setting equipment aside for the task and doing our first round of specifications. I expect we'll spend some time at the Hackathon in London working on this; if you're around then you'd be welcome to join us.

Otherwise, as we return, we'll probably have something worth hacking on.

coren added a comment.Aug 27 2014, 5:18 PM

Handing off to Yuvi, who is the gatekeeper of labmon1001

sumanah wrote:

Good luck, Yuvi!

coren moved this task from Triage to Backlog on the Toolforge board.Nov 25 2014, 4:21 PM
yuvipanda raised the priority of this task from High to Needs Triage.Jan 12 2015, 7:52 AM

We have shinken!

yuvipanda closed this task as Declined.Mar 23 2015, 7:59 PM

Not going to happen. We will probably end up doing some monitoring as part of the service manifests work, however.

Restricted Application added a project: Cloud-Services.Nov 26 2015, 12:52 PM
Restricted Application added a subscriber: StudiesWorld.
Matthewrbowker renamed this task from Setup an icinga instance to monitor tools on tool-labs to Implement a system to montior tools on tool-labs.Sep 14 2016, 6:10 PM
Matthewrbowker reopened this task as Open.
Matthewrbowker claimed this task.
Matthewrbowker triaged this task as Medium priority.

I am re-opening this ticket and taking it on in my capacity as a volunteer.

Icinga is not a given solution for this, so I've also generalized the title. @yuvipanda wants to look at "prometheus blackbox_exporter + alertmanager".

Aklapper renamed this task from Implement a system to montior tools on tool-labs to Implement a system to monitor tools on tool-labs.Nov 13 2016, 4:12 PM
Aklapper set Security to None.

Okay, after some examination here's what I'd like to propose. @yuvipanda this is subject to your OK.

I currently have a Labs project set up. For the short term, I'd like to set up icinga tied into LDAP. Monitoring would be set up with an email to me, manually configured. This will get something out there, functional and relatively useful.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Does this make sense? If not, feel free to contact me on IRC (nick: Matthew_) or post here.

scfc added a comment.Dec 27 2016, 2:53 AM

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, tweaking it to use some OAuth backend probably workable, having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)
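The restart-throttling idea sketched in that thought can be illustrated in a few lines. This is a hypothetical sketch, not an existing webservice feature: the health-check URL, the "x restarts per y minutes" limit, and what to do when the limit trips are all placeholders.

```python
import time
import urllib.request
from collections import deque


def url_is_healthy(url, timeout=5):
    """Return True if the configured health-check path answers HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


class RestartLimiter:
    """Allow at most max_restarts restarts per window seconds.

    Once the limit is exceeded, record_restart() returns False, which a
    caller could use as the trigger to mail the maintainers instead of
    restarting again.
    """

    def __init__(self, max_restarts, window, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window = window
        self.clock = clock
        self._restarts = deque()

    def record_restart(self):
        now = self.clock()
        self._restarts.append(now)
        # Forget restarts that fell out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window:
            self._restarts.popleft()
        return len(self._restarts) <= self.max_restarts
```

A supervising loop would call `url_is_healthy()` periodically, restart the webservice while `record_restart()` stays True, and escalate to mail once it returns False.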

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

What about the bots? Many of the current (pseudo-)monitoring service are for webservices (check 200, etc), but for bots, do we have anything at all?

scfc added a comment.Dec 27 2016, 5:05 AM

@zhuyifei1999: Depends on the definition of monitoring. If a bot is started by bigbrother and jstart and it fails, it will be restarted a couple of times, each time with a mail to the maintainers. But this only monitors that the process has not failed, i.e. if the process is "stuck", it won't notice that type of failure.

But this only monitors that the process has not failed, i.e. if the process is "stuck", it won't notice that type of failure.

Exactly. Things like T145633: Deadlock can be caused by raising SpamfilterError in site.editpage() happen. Also, when a periodic task submitted via jsub in cron fails, no one knows until someone checks the logs.

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).
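The second variant (checking the latest matching edit) can be sketched against the public MediaWiki action API's `list=usercontribs` module. The wiki URL, bot name, regex, and staleness threshold below are placeholders, and the exact freshness policy is an assumption:

```python
import json
import re
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone


def latest_matching_edit(api_url, username, summary_regex, limit=50):
    """Return the UTC timestamp of the bot's most recent edit whose
    summary matches summary_regex, or None if none of the last `limit`
    edits match."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "ucprop": "timestamp|comment",
        "uclimit": limit,
        "format": "json",
    })
    with urllib.request.urlopen(f"{api_url}?{params}") as resp:
        data = json.load(resp)
    pattern = re.compile(summary_regex)
    for contrib in data["query"]["usercontribs"]:
        if pattern.search(contrib.get("comment", "")):
            ts = datetime.strptime(contrib["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            return ts.replace(tzinfo=timezone.utc)
    return None


def task_is_stale(last_edit, max_age_hours, now=None):
    """True when the bot task has not edited recently enough to be
    considered alive."""
    now = now or datetime.now(timezone.utc)
    return last_edit is None or now - last_edit > timedelta(hours=max_age_hours)
```

A per-task config mapping a summary regex to a `max_age_hours` value would cover the "many bots have multiple tasks" case, with one staleness check per task.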

Hi, all. Apologies about the delay, I didn't see emails related to this task for some reason...

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, tweaking it to use some OAuth backend probably workable, having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

You do raise a fair point. Another idea I can play with (and I've actually already played with it at work) is providing a front-end interface for icinga management or something of the sort. Again, the key is a low barrier to entry for tool developers who simply want to know if a tool is up or down. As far as I can tell, there is no set procedure to set up tools to be monitored with Shinken.

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)

I define a monitor as a software check that determines whether a given piece of software is working correctly.

For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.

For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.
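A custom check like the ones discussed here would follow the Nagios/Icinga plugin convention: exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN plus a single status line on stdout. The following is a generic sketch of that convention, not Icinga's bundled check_http:

```python
import sys
import urllib.error
import urllib.request

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_webservice(url, timeout=10):
    """Probe a tool's URL and map the result onto the Nagios plugin
    exit-code convention, returning (exit_code, status_line)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status == 200:
                return OK, f"OK - {url} answered 200"
            return WARNING, f"WARNING - {url} answered {resp.status}"
    except urllib.error.HTTPError as exc:
        return CRITICAL, f"CRITICAL - {url} answered {exc.code}"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {url} unreachable: {exc}"


if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_webservice(sys.argv[1])
    print(message)   # Icinga displays this line in its UI.
    sys.exit(code)
```

Icinga runs the script, shows the printed line next to the service, and schedules notifications based on the exit code.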

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

What about the bots? Many of the current (pseudo-)monitoring service are for webservices (check 200, etc), but for bots, do we have anything at all?

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

@bd808

bd808 added a comment.Jan 1 2017, 5:34 PM

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

The best way to discuss adding something to Striker is in a phab ticket associated with the Striker project.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Striker is Python rather than PHP, but it does provide authentication for Labs users. Its current authorization layer only knows about Tool Labs tool membership, but that may be fixable. Wikitech supports OAuth authentication that could be used in a tool or Labs project, but an authorization layer would have to be developed separately.

The universe is full of FLOSS system monitoring tools. Nearly every one of them was started because the author found all other tools lacking and set out to create a better solution rather than improving an existing tool. I can see the utility in making some helper functionality to make configuring an existing monitoring system easier for Labs. I can not see the utility in adding to the total number of monitoring tools available in the universe.

scfc added a comment.Jan 1 2017, 10:57 PM

[…]

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

AFAIUI: No.

[…]
I define a monitor as a software check that determines whether a given piece of software is working correctly.
For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.
For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.

For Labs instances we already have an ssh check via Shinken which is effectively alive and ping.

[…]

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

One major problem with any self-service (monitoring) solution is that users must be treated as potentially hostile. So, for example, you can't just use simple Nagios plugins for webservices, but must check that the URL "belongs" to the user. Similarly, users must not be able to interfere with each other's tools.

When webservices were first introduced, on failure they would just stop working, with the idea that maintainers would then come along, fix any issues and restart the webservice. IIRC users then wanted webservices to restart automatically because that was all they would do anyway when they encountered a failed webservice.

I assume that bot operators would act in the same way, so I think that a pattern for bots would be more useful, e.g. start the bot with bigbrother, touch a file ~/.bot-watchdog on every edit, and have a cron job every hour/day that tests whether ~/.bot-watchdog has been touched in the past x hours and, if not, deletes the grid job and lets it be restarted by bigbrother.
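The cron-side half of that watchdog pattern can be sketched as follows. The watchdog path, the staleness threshold, and the use of `qdel` to delete the grid job (so bigbrother restarts it) are assumptions following the pattern described above:

```python
import os
import subprocess
import time


def watchdog_stale(path, max_age_hours):
    """True if the watchdog file is missing or has not been touched
    within the last max_age_hours hours."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return True
    return time.time() - mtime > max_age_hours * 3600


def restart_stuck_bot(job_name, watchdog_path, max_age_hours=6):
    """Cron entry point: if the bot looks stuck, delete its grid job
    and rely on bigbrother to start it again. Returns True if a
    restart was triggered."""
    if watchdog_stale(watchdog_path, max_age_hours):
        # check=False: qdel failing (job already gone) is not fatal.
        subprocess.run(["qdel", job_name], check=False)
        return True
    return False
```

The bot side is just `os.utime(watchdog_path)` (or a shell `touch`) after every successful edit.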

Hello!

My apologies for the delay.

Based on this information, I'm going to split this task into two parts. The first part will be just for Tool Labs, the second for Labs as a whole. I will begin with Tool Labs only, as it appears to be the less involved of the two...

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

@scfc my thought would be just start with monitoring. Automated restart can be handled down the line.

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.

The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.
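Consuming that JSON blob for an "is my job running" check might look like the sketch below. The exact shape of the gridengine-status output is an assumption here (a mapping of tool name to a list of job records with `name` and `state` fields); the 'E' flag in the state string is SGE's error marker.

```python
import json
import urllib.request

STATUS_URL = "https://tools.wmflabs.org/gridengine-status/"


def fetch_grid_status(url=STATUS_URL):
    """Fetch the JSON blob published by the gridengine-status tool."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def jobs_in_error_state(status):
    """Pick out jobs whose state string contains 'E' (SGE's error flag),
    assuming a {tool: [{"name": ..., "state": ...}, ...]} layout."""
    stuck = []
    for tool, jobs in status.items():
        for job in jobs:
            if "E" in job.get("state", ""):
                stuck.append((tool, job.get("name")))
    return stuck
```

A daily cron could run `jobs_in_error_state(fetch_grid_status())` and mail each affected tool's maintainers, which matches the centralized nag script suggested earlier in this thread.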

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

Done, see T157847: Preparation for api for community-labs-monitoring

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.
The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.

Sounds good! Thank you for the information.

Harej added a subscriber: Harej.Aug 19 2018, 7:41 PM

I'm looking for information on how tools-prometheus-01 and tools-prometheus-02 work. The only documentation I've found was this task and a small section in Wikitech about monitoring in the Kubernetes cluster.

I see both nodes are running and actively collecting metrics. Any help is welcome, and sorry to hijack this task to ask for information, but it seems the solution proposed here was already implemented to some extent.

I've found a presentation that says the Toolforge Prometheus instances were used as a testbed for ideas before implementing the production ones. So I think the main Prometheus page in Wikitech applies then. It doesn't talk a lot about Toolforge but I think it's a starting point. If anyone remembers something that's special/different about it when compared to Production, please let me know.

GTirloni removed a subscriber: GTirloni.Mar 21 2019, 9:11 PM

The Cortex project and community have been very active too; it looks like it could be a good fit for multi-tenant monitoring based on tools we already use.

https://github.com/cortexproject/cortex