
Any puppet failure on a labs instance should send an email to project admins
Closed, Resolved (Public)

Description

I have not thought about whether this is hard or easy. But ideally, puppet could punish project admins for neglected instances -- this might encourage people to fix or delete them.

Event Timeline

Andrew raised the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added a project: Cloud-Services.
Andrew subscribed.
Andrew set Security to None.

Change 262856 had a related patch set uploaded (by Andrew Bogott):
WIP: Send email to project admins when puppet fails.

https://gerrit.wikimedia.org/r/262856

Change 262856 merged by Andrew Bogott:
Send email to project admins if puppet runs are failing.

https://gerrit.wikimedia.org/r/262856

OK... in projects 'testlabs' and 'puppet' I can do this:

echo "this is text" | mail -s 'subject line' andrewbogott@gmail.com

and it sends me an email. That doesn't work on an instance in 'tools' though. Merlijn predicted that I would have this problem :(

Why does it work in some projects and not others? Note that if it's /only/ tools that prevents emails, that's just fine since the Tools people are getting shinken notifications about puppet anyway. I just want to confirm that nags will get sent from other projects.

This works for me in an interactive session on tools-bastion-01. For grid jobs, @Anomie found out that this can fail under some circumstances (cf. https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Mail_from_tools) and you have to use /usr/sbin/exim -odf -i instead. Could you have encountered a similar problem?
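For reference, the direct exim invocation could be wrapped roughly like this. This is only a sketch: the helper names `build_message` and `send_via_exim` are made up for illustration, and the only details taken from this thread are the binary path and the `-odf -i` flags.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal message (headers, blank line, body).
# $1 = recipient, $2 = subject, $3 = body text
build_message() {
    printf 'To: %s\nSubject: %s\n\n%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: hand the message to exim directly instead of mail(1).
# -odf asks for foreground delivery, so exim waits for the delivery attempt.
# -i keeps a line containing only "." from ending the message early.
send_via_exim() {
    build_message "$1" "$2" "$3" | /usr/sbin/exim -odf -i "$1"
}
```

With that in place, `send_via_exim someone@example.org 'subject line' 'this is text'` would be the equivalent of the `mail -s` test above, assuming exim is installed at that path on the instance.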

The messages seem to be queued on the test host (tools-puppet-is-broken-here-on-purpose.tools.eqiad.wmflabs), and are not actually being sent out:

2016-01-16 06:50:41 1aJzIg-0002oE-Aa no IP address found for host polonium.wikimedia.org
2016-01-16 06:50:41 1aJzIg-0002oE-Aa == root@wmflabs.org R=smart_route defer (-1): lookup of host "polonium.wikimedia.org" failed in smart_route router

and there are 60 emails to andrewbogott@gmail.com waiting in /var/spool/exim4/input.

I'm not sure why the default labs email config doesn't work, but applying a tools manifest (which should set tools-mail as router instead of polonium) might help for this specific issue.
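To check whether other instances are hitting the same deferral, something like the following could be run against the exim main log. It is a rough sketch: the function names are hypothetical, and the patterns are written only against the two log lines quoted above, so other deferral messages may need different matching.

```shell
#!/bin/sh
# Hypothetical helper: count deferred deliveries in exim log lines read from stdin.
# exim marks a deferred delivery with "==" after the message ID.
count_deferred() {
    grep -c ' == .* defer ' || true   # grep exits 1 on zero matches; still prints 0
}

# Hypothetical helper: list the hosts whose DNS lookup failed, one per line.
failed_lookup_hosts() {
    sed -n 's/.*lookup of host "\([^"]*\)" failed.*/\1/p' | sort -u
}
```

On a Debian-style install the log is usually /var/log/exim4/mainlog (an assumption for these instances), so `count_deferred < /var/log/exim4/mainlog` would give the number of deferrals, and `exim -bpc` should report how many messages are sitting in the queue.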

Ok, seems like tools is an outlier. I'm testing in 'testlabs' instead.

Change 264904 had a related patch set uploaded (by Andrew Bogott):
Only check puppet freshness once per day, not 60 times in a row.

https://gerrit.wikimedia.org/r/264904

Change 264904 merged by Andrew Bogott:
Only check puppet freshness once per day, not 60 times in a row.

https://gerrit.wikimedia.org/r/264904
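The once-per-day idea can be illustrated with a stamp file. This is not the merged patch, just a minimal sketch of one way to rate-limit a nag; the function names, the stamp file, and the 24-hour window expressed as 86400 seconds are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical guard: succeed (exit 0) only if no nag was recorded in the last 24 hours.
# $1 = stamp file holding the epoch time of the last nag, $2 = current epoch time
nag_is_due() {
    stamp_file=$1
    now=$2
    [ -f "$stamp_file" ] || return 0      # no stamp yet, so a nag is due
    last=$(cat "$stamp_file")
    [ $((now - last)) -ge 86400 ]
}

# Hypothetical helper: record that a nag was sent at epoch time $2.
record_nag() {
    printf '%s\n' "$2" > "$1"
}
```

A check script could then call `nag_is_due "$stamp" "$(date +%s)"` before sending mail and `record_nag "$stamp" "$(date +%s)"` afterwards, so repeated runs within the same day stay quiet.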