check_puppetrun fails to run under certain conditions
Closed, Resolved, Public

Description

As part of T323557 we are running a cookbook that triggers a puppet run. On some hosts check_puppetrun crashes after the puppet run finishes, and the status gets reported as UNKNOWN. Running the check manually on an impacted host produces the following error:

vgutierrez@cp6010:~$ sudo /usr/local/lib/nagios/plugins/check_puppetrun -w 10800 -c 21600
Traceback (most recent call last):
        12: from /usr/local/lib/nagios/plugins/check_puppetrun:193:in `<main>'
        11: from /usr/lib/ruby/2.7.0/psych.rb:360:in `safe_load'
        10: from /usr/lib/ruby/2.7.0/psych/visitors/to_ruby.rb:32:in `accept'
         9: from /usr/lib/ruby/2.7.0/psych/visitors/visitor.rb:6:in `accept'
         8: from /usr/lib/ruby/2.7.0/psych/visitors/visitor.rb:16:in `visit'
         7: from /usr/lib/ruby/2.7.0/psych/visitors/to_ruby.rb:313:in `visit_Psych_Nodes_Document'
         6: from /usr/lib/ruby/2.7.0/psych/visitors/to_ruby.rb:32:in `accept'
         5: from /usr/lib/ruby/2.7.0/psych/visitors/visitor.rb:6:in `accept'
         4: from /usr/lib/ruby/2.7.0/psych/visitors/visitor.rb:16:in `visit'
         3: from /usr/lib/ruby/2.7.0/psych/visitors/to_ruby.rb:208:in `visit_Psych_Nodes_Mapping'
         2: from /usr/lib/ruby/2.7.0/psych/visitors/to_ruby.rb:411:in `resolve_class'
         1: from /usr/lib/ruby/2.7.0/psych/class_loader.rb:28:in `load'
/usr/lib/ruby/2.7.0/psych/class_loader.rb:97:in `find': Tried to load unspecified class: Puppet::Transaction::Report (Psych::DisallowedClass)
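
For anyone wanting to reproduce the error without the check itself, here is a minimal sketch. It assumes only Ruby's bundled YAML library; the helper name `safe_load_error` and the two-line sample document are made up for illustration and are not taken from check_puppetrun. It shows that `YAML.safe_load` rejects any document carrying a `!ruby/object:` tag for a class that was not explicitly permitted, while the same data without the tag loads fine.

```ruby
require 'yaml'

# Hypothetical helper: returns the error message if safe_load rejects
# the document because of a disallowed class, or nil if it loads cleanly.
def safe_load_error(yaml_text)
  YAML.safe_load(yaml_text)
  nil
rescue Psych::DisallowedClass => e
  e.message
end

# Tiny stand-in for a report: same leading tag, invented body.
tagged = "--- !ruby/object:Puppet::Transaction::Report\nstatus: failed\n"
plain  = "---\nstatus: failed\n"

puts safe_load_error(tagged)        # rejected because of the object tag
puts safe_load_error(plain).inspect # no tag, so nothing is rejected
```

The class does not even need to be defined for the error to trigger, since the tag is checked against the permitted list before any constant lookup happens.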

Event Timeline

If we could grab a copy of /var/lib/puppet/state/last_run_report.yaml and attach it here using a WMF-NDA protected paste, it would help with debugging.

@fgiunchedi before spending too much time on this I first wanted to check: do we still need this, or has it already been ported to alertmanager?

We do have the metrics (via prometheus::node_puppet_agent) and alerts on per-cluster widespread puppet failures (team-sre/puppet-agent.yaml) so IMHO yes in practice we can ditch check_puppetrun altogether

https://phabricator.wikimedia.org/P48862 has the last_run_report.yaml of cp3053 exhibiting this problem. Puppet is actually failing but the check crashes as well so it doesn't get reported :)

Flagging as low priority, given that in practice check_puppetrun can be ditched in favor of the existing Prometheus metrics and alerts.

lmata added subscribers: joanna_borun, lmata.

@joanna_borun we had a chat in team meeting and thought this was better suited for your team.

This seems to be also the reason why we were getting the 'nrpe unknown' alerts (might be the cause of some of https://alerts.wikimedia.org/?q=severity%3Dunknown).

FYI, just passing the class as allowed is not enough; it seems there are some values it does not like:

root@cloudmetrics1003:~# diff /usr/local/lib/nagios/plugins/check_puppetrun  dcaro_test.rb
192,193c192,193
< if failcount > 0 # rubocop:disable Style/NumericPredicate
<   report = YAML.safe_load(File.read(reportfile))
---
> if File.exists?(reportfile)
>   report = YAML.safe_load(File.read(reportfile), [Puppet::Transaction::Report])
root@cloudmetrics1003:~# ./dcaro_test.rb -w 10800 -c 21600
Traceback (most recent call last):
./dcaro_test.rb:194:in `<main>': undefined method `[]' for #<Puppet::Transaction::Report:0x000055d660104410> (NoMethodError)
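
A possible reading of that NoMethodError, sketched below under assumptions: once the class is permitted, `safe_load` hands back an instance of that class rather than a Hash, so hash-style `report[...]` lookups no longer apply. The `Puppet::Transaction::Report` defined here is a hypothetical stand-in with a single invented `status` reader (the real class ships with Puppet), and the keyword form `permitted_classes:` is used instead of the positional argument from the diff above.

```ruby
require 'yaml'

# Hypothetical stand-in for the real class, which is provided by Puppet.
module Puppet
  module Transaction
    class Report
      attr_reader :status
    end
  end
end

# Tiny stand-in for a report: same leading tag, invented body.
tagged = "--- !ruby/object:Puppet::Transaction::Report\nstatus: failed\n"

report = YAML.safe_load(tagged, permitted_classes: [Puppet::Transaction::Report])

puts report.class              # an object of the permitted class, not a Hash
puts report.respond_to?(:[])   # hash-style indexing is not available on it
puts report.status             # data is reached through the object's readers
```

So a check that permits the class would also have to switch from `report['key']` lookups to whatever methods the report object exposes.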

If anyone sees this happening again, please copy /var/lib/puppet/state/last_run_report.yaml to some location so we can try to recreate and troubleshoot. The error has already cleared. Also worth noting: all reports start with `--- !ruby/object:Puppet::Transaction::Report`

I have left a copy of the failing report + the script above (that forces the check to read it anyhow) under cloudmetrics1003.eqiad.wmnet:/root/puppet_check_bad_report.yaml and cloudmetrics1003.eqiad.wmnet:/root/dcaro_test.rb.

yep, the script only tries to read the report if there were any failures, so it's only "double" failing then xd
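
To illustrate why only hosts with failing runs hit the crash, here is a rough sketch of that gating. The helper `report_error` and its arguments are hypothetical and only mirror the `if failcount > 0` condition visible in the diff above, not the actual check_puppetrun code.

```ruby
require 'yaml'

# Hypothetical helper mirroring the gating: the report is parsed only when
# failures were counted. Returns the name of the error class raised while
# parsing, or nil when nothing went wrong (or nothing was parsed).
def report_error(failcount, report_yaml)
  return nil unless failcount.positive? # healthy run: report is never read

  YAML.safe_load(report_yaml)
  nil
rescue Psych::DisallowedClass => e
  e.class.name
end

# Tiny stand-in for a report: same leading tag, invented body.
tagged = "--- !ruby/object:Puppet::Transaction::Report\nstatus: failed\n"

puts report_error(0, tagged).inspect # no failures, so the tag is never seen
puts report_error(1, tagged).inspect # failures present, so parsing is attempted
```

With this shape, every report carries the problematic tag, but the parse error only surfaces on runs that already failed.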

Change 949959 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] check_puppetrun: update to use failed_resources Puppet::Transaction::Report

https://gerrit.wikimedia.org/r/949959

Thanks, I have sent a fix for review.

Change 949959 merged by Jbond:

[operations/puppet@production] check_puppetrun: update to use failed_resources Puppet::Transaction::Report

https://gerrit.wikimedia.org/r/949959

jbond claimed this task.

I merged the above change and will optimistically close this.