shinken: all puppet reports showing as 'unknown'
Open, Medium, Public

Description

It looks like shinken has been failing to determine puppet status for several days. It might be that I'm misunderstanding the console though.

Event Timeline

Andrew created this task. Dec 17 2019, 4:56 PM
Restricted Application added a subscriber: Aklapper. Dec 17 2019, 4:56 PM
Andrew added a subscriber: Phamhi. Dec 18 2019, 7:56 AM

@Phamhi, shinken is watching checks like

https://graphite-labs.wikimedia.org!10!$HOSTNOTES$.$HOSTNAME$.puppetagent.time_since_last_run!3600!43200!10min!0min!1!--over

It looks like most of those pages are missing. Is this a result of some of your recent labmon/graphite work? (I could also believe that those tests were just removed entirely in favor of some Prometheus thing that shinken doesn't know about).
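To make the check definition above easier to read, here is a minimal Python sketch that splits it on `!` and expands the host macros. The field names in `ASSUMED_FIELDS` are my guesses at what each position means (the thread does not document them), and the project and host names in the usage note are hypothetical.

```python
# The check string quoted above, copied verbatim.
CHECK = (
    "https://graphite-labs.wikimedia.org!10!"
    "$HOSTNOTES$.$HOSTNAME$.puppetagent.time_since_last_run"
    "!3600!43200!10min!0min!1!--over"
)

# ASSUMPTION: these labels are illustrative guesses, not taken from the
# shinken configuration. Only the order and the values come from the thread.
ASSUMED_FIELDS = (
    "graphite_url",
    "timeout_s",
    "metric_template",
    "warning",
    "critical",
    "from_window",
    "until_window",
    "percent",
    "direction",
)


def parse_check(check):
    """Split a '!'-separated check definition into labelled fields."""
    parts = check.split("!")
    if len(parts) != len(ASSUMED_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(ASSUMED_FIELDS), len(parts))
        )
    return dict(zip(ASSUMED_FIELDS, parts))


def expand_metric(template, macros):
    """Replace $NAME$ macros in the metric template with concrete values."""
    for name, value in macros.items():
        template = template.replace("$" + name + "$", value)
    return template
```

For example, `expand_metric(parse_check(CHECK)["metric_template"], {"HOSTNOTES": "someproject", "HOSTNAME": "somehost"})` yields `someproject.somehost.puppetagent.time_since_last_run`. If Graphite no longer receives that series, a threshold check against it has no data to evaluate, which would be consistent with every host showing as 'unknown'.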

bd808 added a subscriber: bd808. Dec 18 2019, 5:41 PM

@Andrew, this was almost certainly broken by me with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/549241/, which replaced the Diamond state tracking with Prometheus for T210993: Deprecate Diamond collectors in Cloud VPS.

bd808 added a comment. Dec 18 2019, 5:46 PM

Maybe we could add something like https://github.com/prometheus/nagios_plugins to the shinken hosts so that they can poll prometheus for stats?
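As a rough illustration of what such a plugin would do, here is a minimal Python sketch that maps a Prometheus instant-query response onto Nagios-style exit codes. It is not the code from the linked repository. The `/api/v1/query` endpoint and the response shape follow the Prometheus HTTP API; the base URL and metric name in the usage note are hypothetical, and the thresholds simply reuse 3600 and 43200 from the Graphite check on the assumption that those were its warning and critical values.

```python
from urllib.parse import urlencode

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def evaluate(response, warn, crit):
    """Map a parsed /api/v1/query JSON response to (exit_code, message).

    Only the first sample of the result vector is inspected, which is
    enough for a query that selects a single host.
    """
    if response.get("status") != "success":
        return UNKNOWN, "UNKNOWN: query failed"
    result = response.get("data", {}).get("result", [])
    if not result:
        # No series matched: report UNKNOWN rather than guessing.
        return UNKNOWN, "UNKNOWN: no samples returned"
    # Each vector sample is [timestamp, "value-as-string"].
    value = float(result[0]["value"][1])
    if value >= crit:
        return CRITICAL, "CRITICAL: %.0fs since last run" % value
    if value >= warn:
        return WARNING, "WARNING: %.0fs since last run" % value
    return OK, "OK: %.0fs since last run" % value
```

A real plugin would fetch `build_query_url("http://prometheus.example", "some_puppet_age_metric")`, parse the JSON body, pass it to `evaluate(..., 3600, 43200)`, print the message, and exit with the returned code. Returning UNKNOWN when no series matches keeps the shinken console honest about missing data.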

I'm still a bit confused about what prometheus does and doesn't do. Some Prometheus docs mention that Prometheus can alert, which has me wondering if we need another alerting tool or can just use Prometheus directly?

bd808 added a comment. Wed, Jan 1, 5:52 PM

> I'm still a bit confused about what prometheus does and doesn't do. Some Prometheus docs mention that Prometheus can alert, which has me wondering if we need another alerting tool or can just use Prometheus directly?

https://prometheus.io/docs/alerting/overview/ mentions that this requires an "Alertmanager" service.

Krenair added a subscriber: Krenair. Wed, Jan 1, 5:59 PM
Bstorm added a subscriber: Bstorm. Thu, Jan 2, 7:15 PM

I believe we already run the Alertmanager service, but we don't have anything configured for it to talk to. It would send the alerts to an external setup (or to an email server).

Bstorm added a comment. Thu, Jan 2, 7:25 PM

I don't see it running, actually. It's pretty simple to set up if you have an email server for it to talk to (or PagerDuty and friends). But if we want the alerts to come through shinken or Icinga, we would need plugins and other services, AFAIK.

Andrew removed Andrew as the assignee of this task. Wed, Jan 15, 4:44 AM

The right fix for this is to build a new monitoring system, which I'm not going to dive into immediately.

Bstorm triaged this task as Medium priority. Wed, Jan 22, 10:18 PM