
Debug / fine tune puppet failed metrics and alerts on alert* hosts
Closed, Resolved, Public

Description

Yesterday puppet failed on alert* hosts, but the related PuppetFailure alert (in alerts.git) did not fire. On closer inspection, the underlying metric puppet_agent_failed never went to 1, even though the following happened:

Jan 19 17:48:55 alert1001 puppet-agent[4671]: Using configured environment 'production'
Jan 19 17:48:55 alert1001 puppet-agent[4671]: Retrieving pluginfacts
Jan 19 17:48:55 alert1001 puppet-agent[4671]: Retrieving plugin
Jan 19 17:48:55 alert1001 puppet-agent[4671]: Retrieving locales
Jan 19 17:48:55 alert1001 puppet-agent[4671]: Loading facts
Jan 19 17:49:27 alert1001 puppet-agent[4671]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Netops::Ripeatlas[eqiad]:
Jan 19 17:49:27 alert1001 puppet-agent[4671]:   parameter 'ipv4' expects a value of type Undef or String, got Tuple
Jan 19 17:49:27 alert1001 puppet-agent[4671]:   parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177) on node alert1001.wikimedia.org
Jan 19 17:49:27 alert1001 puppet-agent[4671]: Not using cache on failed catalog
Jan 19 17:49:27 alert1001 puppet-agent[4671]: Could not retrieve catalog; skipping run
Jan 19 17:52:16 alert1001 puppet-agent[15190]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Netops::Ripeatlas[eqiad]:
Jan 19 17:52:16 alert1001 puppet-agent[15190]:   parameter 'ipv4' expects a value of type Undef or String, got Tuple
Jan 19 17:52:16 alert1001 puppet-agent[15190]:   parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177) on node alert1001.wikimedia.org
Jan 19 17:52:16 alert1001 puppet-agent[15190]: Not using cache on failed catalog
Jan 19 17:52:16 alert1001 puppet-agent[15190]: Could not retrieve catalog; skipping run
Jan 19 18:10:03 alert1001 puppet-agent[31185]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Netops::Ripeatlas[eqiad]:
Jan 19 18:10:03 alert1001 puppet-agent[31185]:   parameter 'ipv4' expects a value of type Undef or String, got Tuple
Jan 19 18:10:03 alert1001 puppet-agent[31185]:   parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177) on node alert1001.wikimedia.org
Jan 19 18:10:03 alert1001 puppet-agent[31185]: Not using cache on failed catalog
Jan 19 18:10:03 alert1001 puppet-agent[31185]: Could not retrieve catalog; skipping run
Jan 19 18:11:10 alert1001 puppet-agent[9368]: Using configured environment 'production'
Jan 19 18:11:10 alert1001 puppet-agent[9368]: Retrieving pluginfacts
Jan 19 18:11:10 alert1001 puppet-agent[9368]: Retrieving plugin
Jan 19 18:11:10 alert1001 puppet-agent[9368]: Retrieving locales
Jan 19 18:11:10 alert1001 puppet-agent[9368]: Loading facts
Jan 19 18:11:41 alert1001 puppet-agent[9368]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Netops::Ripeatlas[eqiad]:
Jan 19 18:11:41 alert1001 puppet-agent[9368]:   parameter 'ipv4' expects a value of type Undef or String, got Tuple
Jan 19 18:11:41 alert1001 puppet-agent[9368]:   parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177) on node alert1001.wikimedia.org
Jan 19 18:11:41 alert1001 puppet-agent[9368]: Not using cache on failed catalog
Jan 19 18:11:41 alert1001 puppet-agent[9368]: Could not retrieve catalog; skipping run
Jan 19 18:18:02 alert1001 puppet-agent-cronjob: Sleeping 9 for random splay
Jan 19 18:18:12 alert1001 puppet-agent-cronjob: su: warning: cannot change directory to /nonexistent: No such file or directory
Jan 19 18:18:13 alert1001 puppet-agent-cronjob: INFO:debmonitor:Found 666 installed binary packages
Jan 19 18:18:13 alert1001 puppet-agent-cronjob: INFO:debmonitor:Found 9 upgradable binary packages (including new dependencies)
Jan 19 18:18:13 alert1001 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Jan 19 18:18:15 alert1001 puppet-agent[1994]: Using configured environment 'production'
Jan 19 18:18:15 alert1001 puppet-agent[1994]: Retrieving pluginfacts
Jan 19 18:18:15 alert1001 puppet-agent[1994]: Retrieving plugin
Jan 19 18:18:16 alert1001 puppet-agent[1994]: Retrieving locales
Jan 19 18:18:16 alert1001 puppet-agent[1994]: Loading facts
Jan 19 18:18:46 alert1001 puppet-agent[1994]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Netops::Ripeatlas[eqiad]:
Jan 19 18:18:46 alert1001 puppet-agent[1994]:   parameter 'ipv4' expects a value of type Undef or String, got Tuple
Jan 19 18:18:46 alert1001 puppet-agent[1994]:   parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177) on node alert1001.wikimedia.org
Jan 19 18:18:46 alert1001 puppet-agent[1994]: Not using cache on failed catalog
Jan 19 18:18:46 alert1001 puppet-agent[1994]: Could not retrieve catalog; skipping run
Jan 19 18:46:42 alert1001 puppet-agent[20615]: Using configured environment 'production'
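
To make the pattern in the log above easier to spot, here is a minimal Python sketch that lists the agent runs that were skipped because no catalog could be retrieved. It assumes syslog-style lines in the format shown above; the function name skipped_runs and the regular expression are illustrative and are not part of any existing tooling.

```python
import re

# Captures the agent PID and the message of a syslog-style puppet-agent line.
# Lines from "puppet-agent-cronjob" do not match because of the "[" after the name.
LINE_RE = re.compile(r"puppet-agent\[(?P<pid>\d+)\]: (?P<msg>.*)$")
SKIP_MSG = "Could not retrieve catalog; skipping run"


def skipped_runs(lines):
    """Return the PIDs of agent runs that were skipped for lack of a catalog."""
    pids = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("msg").startswith(SKIP_MSG):
            pids.append(int(match.group("pid")))
    return pids


# A few lines shortened from the log excerpt above.
sample = [
    "Jan 19 17:49:27 alert1001 puppet-agent[4671]: Not using cache on failed catalog",
    "Jan 19 17:49:27 alert1001 puppet-agent[4671]: Could not retrieve catalog; skipping run",
    "Jan 19 17:52:16 alert1001 puppet-agent[15190]: Could not retrieve catalog; skipping run",
    "Jan 19 18:18:02 alert1001 puppet-agent-cronjob: Sleeping 9 for random splay",
]
print(skipped_runs(sample))  # [4671, 15190]
```

Each PID returned corresponds to one run that ended with "skipping run", which is the case where puppet_agent_failed stayed at 0.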

Event Timeline

I've noticed that when puppet fails to compile catalog, it won't show as failed but will have 0 resources, which is what happened here too:

[Attached image: image.png]

We alert on puppet_agent_resources_total == 0 on Cloud VPS instances, maybe you want to do the same here?
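
A rough sketch of what checking both signals could look like, written as plain Python rather than as an actual alerting rule. The two metric names come from this task; the function name puppet_failure, the host names, and the sample values are made up for illustration, and the real rule in alerts.git may be expressed differently.

```python
def puppet_failure(agent_failed, resources_total):
    """Return True if either signal indicates a broken puppet run.

    agent_failed:    value of puppet_agent_failed (0 or 1)
    resources_total: value of puppet_agent_resources_total
    """
    return agent_failed == 1 or resources_total == 0


# Made-up sample values, for illustration only.
samples = {
    "host-a": (0, 0),    # catalog did not compile: not "failed", but 0 resources
    "host-b": (1, 120),  # run reported as failed
    "host-c": (0, 120),  # healthy run
}

alerting = sorted(h for h, (f, r) in samples.items() if puppet_failure(f, r))
print(alerting)  # ['host-a', 'host-b']
```

With only the puppet_agent_failed check, host-a would be missed, which matches what happened on the alert* hosts.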

Change 758860 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: check agent resources too in PuppetFailure

https://gerrit.wikimedia.org/r/758860

In T299628#7635655, @Majavah wrote:

I've noticed that when puppet fails to compile catalog, it won't show as failed but will have 0 resources, which is what happened here too:

[Attached image: image.png]

We alert on puppet_agent_resources_total == 0 on Cloud VPS instances, maybe you want to do the same here?

Thank you @Majavah, I'll give that a try!

Change 758860 merged by Filippo Giunchedi:

[operations/alerts@master] sre: check agent resources too in PuppetFailure

https://gerrit.wikimedia.org/r/758860

fgiunchedi claimed this task.

Tentatively resolving