
Enable Shinken monitoring for 'gratitude' Cloud VPS Project
Closed, Resolved · Public

Description

Per T237132, the best way forward for enabling server monitoring of the 'gratitude' Cloud VPS project is to use the existing shinken infrastructure. This task is to create a shinken account for me, or to find another way to enable that monitoring.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 15 2019, 5:44 PM
bd808 removed bd808 as the assignee of this task. · Nov 17 2019, 9:59 PM
bd808 moved this task from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.
bd808 added a subscriber: bd808.

Please excuse this BUMP, @bd808. Thanks, and happy beginning of wintertime.

aborrero assigned this task to Andrew. · Dec 10 2019, 9:48 AM
aborrero triaged this task as Medium priority.
aborrero added subscribers: Andrew, aborrero.

@Andrew is the clinic-duty person this week. Assigning this task to him.

Change 556389 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add shinken group and contacts for the 'gratitude' cloud-vps project

https://gerrit.wikimedia.org/r/556389

The attached patch should do some of what you want -- please confirm the list of contacts & emails looks reasonable to you.

That will get you some standard host checks (e.g. if puppet is failing, or if a host is down.) If you want specific service checks then that may be possible but I don't know as much about it.

Hey @Andrew, the details in that patch look correct. Thanks!

However, I don't seem to be able to log in to shinken.wmflabs.org using my Wikitech credentials.
Are you telling me that if puppet is failing or the Cloud VPS instance is down, then emails should go to the contacts listed in the patch?

Thanks for clarifying.

Change 556389 merged by Andrew Bogott:
[operations/puppet@production] Add shinken group and contacts for the 'gratitude' cloud-vps project

https://gerrit.wikimedia.org/r/556389

> Hey @Andrew the details in that patch look correct. Thanks!
>
> I don't seem to be able to log in to shinken.wmflabs.org using my Wikitech credentials however.

The status of VMs is public and the shinken UI read-only, so you can log in with username: guest password: guest.

> Are you telling me that if puppet is failing or the cloud-VPS is down then I should get emails to the contact information in the patch?

Yes, now that I've merged the patch that should happen. You could test by shutting something down on purpose for an hour and seeing what happens :)

Thanks for merging the patch and giving us the credentials to the shinken interface!

Just a few problems though.

  1. When I logged in to the shinken web interface, I did not see a host named "gratitude" after searching around.
  2. I shut down the gratitude cloud-vps instance for 5 minutes using the horizon interface, but didn't receive an email alert.
  3. Sorry to have forgotten this at patch time, but we have a further email address that sends push alerts to our phones, so it would be fantastic if you could add another contact with the email address < 1ghkk7qrmc@pomail.net > .

Thank you again for your hard work keeping cloud services running smoothly.

Hm, you're right, looks like something is missing. I'll investigate.

Oh, the VM is called 'gratsync', not 'gratitude' -- it has a page here: https://shinken.wmflabs.org/host/gratsync

I'm not sure why you didn't get alerted, though -- I'm going to add myself to the contact group, create a test VM, and see what I can see. Brace for alert spam :)

Change 558587 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add andrewbogott to the 'gratitude' contact list

https://gerrit.wikimedia.org/r/558587

Change 558588 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add contact for the 'gratitude' project that uses sms-via-email

https://gerrit.wikimedia.org/r/558588

Change 558587 merged by Andrew Bogott:
[operations/puppet@production] Add andrewbogott to the 'gratitude' contact list

https://gerrit.wikimedia.org/r/558587

With my test VM I was able to trigger an alert email by shutting down the VM. The check appears to run every five minutes. It's possible you didn't wait long enough, or possible that my restarting the service during testing caused something to get picked up that was missed before.

In any case, I'm going to merge the change that adds your sms alert and then you can test again.
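One way to reason about the "didn't wait long enough" possibility is a rough timing sketch. It assumes the host check fires on a fixed five-minute schedule, as observed above, and that a single failed check is enough to count as detection. Real shinken setups may also require several failed attempts before notifying, which this sketch ignores. The function name `outage_detected` and its parameters are hypothetical, for illustration only.

```python
import math


def outage_detected(start_s: float, duration_s: float, interval_s: float = 300.0) -> bool:
    """Return True if at least one periodic check lands inside the outage.

    Assumption: checks fire at t = 0, interval_s, 2 * interval_s, ...
    The outage covers the half-open window [start_s, start_s + duration_s).
    """
    # First check time at or after the moment the host went down.
    first_check = math.ceil(start_s / interval_s) * interval_s
    return first_check < start_s + duration_s


if __name__ == "__main__":
    # Host goes down 10 s after a check and comes back 200 s later:
    # the next check (at 300 s) finds it up again, so nothing is seen.
    print(outage_detected(start_s=10, duration_s=200))
    # Same start, but down for a full 300 s: the check at 300 s sees it.
    print(outage_detected(start_s=10, duration_s=300))
```

Under these assumptions, an outage shorter than the check interval can slip between two checks, so a test shutdown is only guaranteed to be noticed if it lasts at least one full interval.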

Change 558588 merged by Andrew Bogott:
[operations/puppet@production] Add contact for the 'gratitude' project that uses sms-via-email

https://gerrit.wikimedia.org/r/558588

Change 558609 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rename gratitude-team to gratitudeteam

https://gerrit.wikimedia.org/r/558609

Change 558609 merged by Andrew Bogott:
[operations/puppet@production] rename gratitude-team to gratitudeteam

https://gerrit.wikimedia.org/r/558609

@Andrew, thanks so much. You were right, the VM needs to be down for several minutes to trigger the shinken alert. I tried it for 5 minutes, and it worked (even got the push notification).
Thank you so much for all your help! Cloud services rock.

Andrew closed this task as Resolved. · Dec 19 2019, 1:14 PM

@Maximilianklein glad to hear it! Let me know if you run into any other issues.

btw, we have a bit of planned work on shinken coming up (T240969) which may produce some new alerts as we fix things.

Thanks @Andrew. I will keep an eye out for the shinken upgrade.

The ssh-check is working reliably, but I have a question about the free-disk-space check. I looked at how /etc/nagios/nrpe.d/check_disk_space.cfg was configured and saw that it should give us a warning notification at 6% free and a critical one at 3% free. I used fallocate to generate a dummy file to increase the disk usage, but couldn't trigger the free-disk-space shinken alert. Is there a non-obvious setting somewhere else that could help me troubleshoot why this alert doesn't fire?

Thanks again.
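When troubleshooting a threshold like this, it can help to confirm how far the filesystem actually is from the limits. The following is a minimal sketch, not the NRPE plugin itself. It assumes the 6% and 3% figures are percentages of free space and that anything strictly below a threshold trips it. The plugin may compute the percentage differently (for example, with respect to reserved blocks), so treat the numbers as approximate. All function names here are hypothetical.

```python
import shutil

WARN_FREE_PCT = 6.0  # warning threshold quoted in the thread
CRIT_FREE_PCT = 3.0  # critical threshold quoted in the thread


def classify(free_pct: float, warn: float = WARN_FREE_PCT, crit: float = CRIT_FREE_PCT) -> str:
    """Map a free-space percentage to a Nagios-style state (assumed semantics)."""
    if free_pct < crit:
        return "CRITICAL"
    if free_pct < warn:
        return "WARNING"
    return "OK"


def free_percent(path: str = "/") -> float:
    """Free space on the filesystem holding `path`, as a percentage of its total size."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def bytes_to_reach(total: int, free: int, target_free_pct: float) -> int:
    """Size a dummy file needs so that free space drops just below the target."""
    target_free = total * target_free_pct / 100.0
    return max(0, int(free - target_free) + 1)


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    pct = free_percent("/")
    print(f"free: {pct:.1f}% -> {classify(pct)}")
    print("bytes needed to cross the warning threshold:",
          bytes_to_reach(usage.total, usage.free, WARN_FREE_PCT))
```

If the computed size is larger than the dummy file that was created, the threshold was probably never crossed. If it is smaller, the next thing to look at would be the check interval and notification settings on the shinken side.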

I just got an alert that the disk space was at 100%.