puppet_alert.py not working
Closed, Resolved (Public)

Description

There is a crontab entry in VMs that runs puppet_alert.py:

# Puppet Name: send_puppet_failure_emails
15 8 * * * /usr/local/sbin/puppet_alert.py

To trigger an email notification, I changed NAG_INTERVAL from 24h to 60s, but the script raised an error instead:

tools-services-02:~$ sudo /usr/local/sbin/puppet_alert.py
It has been 60 seconds since last puppet run.Sending nag emails.
Traceback (most recent call last):
  File "/usr/local/sbin/puppet_alert.py", line 83, in <module>
    main()
  File "/usr/local/sbin/puppet_alert.py", line 79, in main
    email_admins(subject, body)
  File "/usr/local/sbin/notify_maintainers.py", line 56, in email_admins
    ldap.SCOPE_BASE
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 552, in search_s
    return self.search_ext_s(base,scope,filterstr,attrlist,attrsonly,None,None,timeout=self.timeout)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 546, in search_ext_s
    return self.result(msgid,all=1,timeout=timeout)[1]
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 458, in result
    resp_type, resp_data, resp_msgid = self.result2(msgid,all,timeout)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 462, in result2
    resp_type, resp_data, resp_msgid, resp_ctrls = self.result3(msgid,all,timeout)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 469, in result3
    resp_ctrl_classes=resp_ctrl_classes
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 476, in result4
    ldap_result = self._ldap_call(self._l.result4,msgid,all,timeout,add_ctrls,add_intermediates,add_extop)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 99, in _ldap_call
    result = func(*args,**kwargs)
ldap.NO_SUCH_OBJECT: {'matched': 'cn=tools,ou=projects,dc=wikimedia,dc=org', 'desc': 'No such object'}

Same happens in every other Cloud VPS instance that I've tried.
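
For readers unfamiliar with the script, the kind of check that NAG_INTERVAL controls can be sketched as below. This is only an illustration under assumptions: the function name `should_nag` and its parameters are hypothetical, and the real puppet_alert.py may determine the last run time and make the decision differently.

```python
import time

# Assumed default of 24 hours, expressed in seconds. The task describes
# temporarily lowering the interval to 60 seconds to force a notification.
NAG_INTERVAL = 60 * 60 * 24


def should_nag(last_run_epoch, now=None, nag_interval=NAG_INTERVAL):
    """Return (elapsed_seconds, nag) for a hypothetical nag check.

    last_run_epoch: Unix timestamp of the last successful Puppet run.
    now: Unix timestamp to compare against (defaults to the current time).
    nag_interval: how many seconds may pass before a nag email is due.
    """
    if now is None:
        now = time.time()
    elapsed = int(now - last_run_epoch)
    return elapsed, elapsed >= nag_interval
```

With an interval this short, almost any instance qualifies for a nag, which is why lowering it is a convenient way to exercise the email path (and, here, to expose the LDAP error).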

Additionally, I knew shinken-02 had Puppet disabled for over a week, so it should have generated email notifications (or errors) every day at 08:15. The cron log shows the job running each day, but no mail was ever received:

Nov 25 08:15:01 shinken-02 CRON[4736]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 25 08:15:01 shinken-02 CRON[4738]: (root) CMD (/usr/local/sbin/puppet_alert.py)
Nov 25 08:15:01 shinken-02 CRON[4739]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
...
Nov 26 08:15:01 shinken-02 CRON[22243]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 26 08:15:01 shinken-02 CRON[22245]: (root) CMD (/usr/local/sbin/puppet_alert.py)
Nov 26 08:15:01 shinken-02 CRON[22244]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

This seems to be happening because the default exim4 configuration sends mail to /dev/null unless an alias for root is added to /etc/aliases, so the errors from puppet_alert.py are silently discarded.
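
A quick way to confirm this on an instance is to check whether the aliases file defines a target for root. The helper below is a minimal sketch, not part of puppet_alert.py: the name `has_root_alias` is hypothetical, and the parsing is deliberately simple (whole-line `#` comments and `name: target` entries only; continuation lines are skipped).

```python
def has_root_alias(aliases_text):
    """Return True if the given aliases text maps root to a non-empty target."""
    for raw in aliases_text.splitlines():
        # Skip blank lines, whole-line comments, and continuation lines
        # (lines that begin with whitespace).
        if not raw.strip() or raw.startswith("#") or raw[0].isspace():
            continue
        if ":" not in raw:
            continue
        name, target = raw.split(":", 1)
        if name.strip().lower() == "root" and target.strip():
            return True
    return False
```

On a VM, the function could be fed the contents of /etc/aliases. A `False` result would be consistent with cron output for root being dropped, as described above.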

Looking further to track where this work originated and whether we still depend on it, I found T121773.

Event Timeline

Whenever I find things that were probably broken for a while and nobody noticed, I must ask: do we want to keep this around?

We do probably want to resume alerting people about instances with broken Puppet. I have a TODO recorded to write a bot that ensures a Phabricator task is open for each instance with failing puppet.

Change 475875 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: Fix Puppet alerts

https://gerrit.wikimedia.org/r/475875

Change 475875 merged by GTirloni:
[operations/puppet@production] cloudvps: Fix Puppet alerts

https://gerrit.wikimedia.org/r/475875