monitor some things on all Cloud instances (discussion)
Closed, Duplicate (Public)

Description

Every now and then when doing maintenance tasks (e.g. yesterday's facter upgrades) I find myself stuck in a heap of sticky, broken labs VMs. I think that we can monitor for a few simple issues (specifically disk space and puppet failures) and intervene before these problems become too serious.

I don't want these things to alert, or even nag in an IRC channel. But I do want a big status board that shows ALL the VMs and how they're doing. That way when I have some free time (or better yet when the clinic duty person has time) we can go through and nag, delete files, and otherwise clean up.

I'm doing this anyway, better to do it when it's not an emergency.

I have no real opinion about what the right tool is for this.
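
For illustration only, here is a minimal sketch of the two checks named above (disk space and puppet failures) as a per-instance script. Everything specific in it is an assumption rather than something decided in this task: the 90% disk threshold, the one-day staleness window, and the summary file path (which varies by Puppet version and packaging). It also uses the age of the run summary file as a rough proxy for "puppet is failing or not running" instead of parsing the run results.

```python
import os
import shutil
import time

# Assumed thresholds; the task does not specify any.
DISK_USED_WARN = 0.90               # flag filesystems more than 90% full
PUPPET_STALE_SECONDS = 24 * 3600    # flag hosts whose run summary is over a day old

# Assumed path; it differs between Puppet versions and packaging.
PUPPET_SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def puppet_summary_age(path=PUPPET_SUMMARY, now=None):
    """Return seconds since the summary file was written, or None if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except OSError:
        return None


def check_host(root="/", summary=PUPPET_SUMMARY):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    used = disk_used_fraction(root)
    if used > DISK_USED_WARN:
        problems.append("disk {}: {:.0%} used".format(root, used))
    age = puppet_summary_age(summary)
    if age is None:
        problems.append("puppet: no run summary found")
    elif age > PUPPET_STALE_SECONDS:
        problems.append("puppet: last run summary is {:.0f}h old".format(age / 3600))
    return problems


if __name__ == "__main__":
    # Print one line per problem, or "ok" when nothing is flagged.
    for line in check_host() or ["ok"]:
        print(line)
```

Something like this only reports; it does not alert. The per-instance output would still need to be collected somewhere to make the status board described above.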

Event Timeline

Andrew triaged this task as Medium priority. Jun 1 2017, 11:12 PM
Andrew added projects: Cloud-VPS, Cloud-Services.

Like shinken?

Yes! But as I understand it Shinken currently only monitors select projects... I'd like to monitor a narrower set of things on every single instance.

SSH availability via whatever new and fancy Cumin things get set up seems ideal.

shinkengen has only a subset of projects listed in its config, yeah. I think we could easily change that though.

Change 374897 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] [WIP] shinkengen for all projects

https://gerrit.wikimedia.org/r/374897

Note: use cumin from labpuppetmaster*

Apparently shinken uses enough resources that we'd have to build out a bigger monitoring cluster to actually have it work cloud-wide. Instead we're going to try to have one-off or cron'd cumin tests in the short run.

I still sort of want this but I'm clearly not really working on it.

Change 374897 abandoned by Andrew Bogott:
[operations/puppet@production] shinkengen for all projects

Reason:
We've stopped using shinken entirely

https://gerrit.wikimedia.org/r/374897