
Any puppet failure on a labs instance should send an email to project admins
Closed, Resolved · Public

Description

I have not thought about whether this is hard or easy. But ideally, puppet could punish project admins for neglected instances -- this might encourage people to fix or delete them.

Event Timeline

Andrew created this task. (Dec 17 2015, 4:43 PM)
Andrew raised the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added a project: Cloud-Services.
Andrew added a subscriber: Andrew.
Restricted Application added subscribers: StudiesWorld, Aklapper. (Dec 17 2015, 4:43 PM)
Andrew updated the task description. (Dec 17 2015, 6:22 PM)
Andrew set Security to None.

Change 262856 had a related patch set uploaded (by Andrew Bogott):
WIP: Send email to project admins when puppet fails.

https://gerrit.wikimedia.org/r/262856

Change 262856 merged by Andrew Bogott:
Send email to project admins if puppet runs are failing.

https://gerrit.wikimedia.org/r/262856
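
For readers who want a rough picture of what "puppet runs are failing" could mean in practice, here is a minimal sketch. It is not taken from the merged change. It assumes the agent writes a `last_run_summary.yaml` with a `last_run:` epoch timestamp, and the function names, the file path and the age threshold are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, not the code from change 262856.
# Assumes the summary file has a line of the form "  last_run: <epoch seconds>".

# Print the epoch time of the last puppet run recorded in summary file $1.
last_run_epoch() {
    awk '$1 == "last_run:" { print $2; exit }' "$1"
}

# Succeed when the last run is older than $3 seconds, or cannot be found.
# $1 = summary file, $2 = current epoch time, $3 = maximum allowed age
puppet_is_stale() {
    last=$(last_run_epoch "$1")
    if [ -z "$last" ]; then
        return 0
    fi
    [ $(( $2 - last )) -gt "$3" ]
}
```

A caller might run something like `puppet_is_stale /var/lib/puppet/state/last_run_summary.yaml "$(date +%s)" 86400` and only send mail when it succeeds. The path shown is an assumption and may differ per installation.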

Andrew claimed this task. (Jan 16 2016, 6:52 PM)

OK... in projects 'testlabs' and 'puppet' I can do this:

echo "this is text" | mail -s 'subject line' andrewbogott@gmail.com

and it sends me an email. That doesn't work on an instance in 'tools' though. Merlijn predicted that I would have this problem :(

Why does it work in some projects and not others? Note that if it's /only/ tools that prevents emails, that's just fine since the Tools people are getting shinken notifications about puppet anyway. I just want to confirm that nags will get sent from other projects.

scfc added subscribers: Anomie, scfc. (Jan 16 2016, 8:30 PM)

This works for me in an interactive session on tools-bastion-01. For grid jobs, @Anomie found out that this can fail under some circumstances (cf. https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Mail_from_tools) and you have to use /usr/sbin/exim -odf -i instead. Could you have encountered a similar problem?
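
A minimal sketch of that workaround follows, assuming the message is handed to exim with its headers already in place. The helper names `build_message` and `send_via_exim` are made up for illustration. `-odf` asks exim to deliver in the foreground, and `-i` stops a line containing only a dot from ending the message.

```shell
#!/bin/sh
# Hypothetical helpers; only the exim invocation comes from the wiki page.

# Print a minimal message: headers, a blank line, then the body.
# $1 = recipient, $2 = subject, $3 = body text
build_message() {
    printf 'To: %s\nSubject: %s\n\n%s\n' "$1" "$2" "$3"
}

# Pipe the message straight to exim instead of going through mail(1).
send_via_exim() {
    build_message "$1" "$2" "$3" | /usr/sbin/exim -odf -i "$1"
}
```

With these in place, `send_via_exim andrewbogott@gmail.com 'subject line' 'this is text'` would be the rough equivalent of the `mail -s` test above.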

@scfc thank you! I will try that.

The messages seem to be queued on the test host (tools-puppet-is-broken-here-on-purpose.tools.eqiad.wmflabs), and are not actually being sent out:

2016-01-16 06:50:41 1aJzIg-0002oE-Aa no IP address found for host polonium.wikimedia.org
2016-01-16 06:50:41 1aJzIg-0002oE-Aa == root@wmflabs.org R=smart_route defer (-1): lookup of host "polonium.wikimedia.org" failed in smart_route router

and there are 60 emails to andrewbogott@gmail.com waiting in /var/spool/exim4/input.

I'm not sure why the default labs email config doesn't work, but applying a tools manifest (which should set tools-mail as router instead of polonium) might help for this specific issue.
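
To spot this kind of failure on other instances, one could pull the unresolved smarthost names out of the exim log. The sketch below is only an illustration: the function name is hypothetical, and the pattern is based solely on the two log lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read exim log lines on stdin and print each host
# whose lookup failed, one per line, without duplicates.
failed_lookups() {
    sed -n 's/.*lookup of host "\([^"]*\)" failed.*/\1/p' | sort -u
}
```

Feeding it the log excerpt above would print `polonium.wikimedia.org`. The log file location is not given in this thread, so the caller has to supply it, for example `failed_lookups < /path/to/exim/mainlog`.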

Anomie removed a subscriber: Anomie. (Jan 16 2016, 8:59 PM)

Ok, seems like tools is an outlier. I'm testing in 'testlabs' instead.

Change 264904 had a related patch set uploaded (by Andrew Bogott):
Only check puppet freshness once per day, not 60 times in a row.

https://gerrit.wikimedia.org/r/264904

Change 264904 merged by Andrew Bogott:
Only check puppet freshness once per day, not 60 times in a row.

https://gerrit.wikimedia.org/r/264904
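
The general idea of nagging at most once per day can be sketched with a timestamp file. This is not the code from the change above, just one common way to rate-limit a notification. The function names, the stamp file and the interval are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of once-per-interval nagging, not the merged patch.

# Succeed when no nag has been recorded within the last $3 seconds.
# $1 = stamp file holding the epoch time of the last nag
# $2 = current epoch time, $3 = minimum interval in seconds
nag_is_due() {
    if [ ! -f "$1" ]; then
        return 0
    fi
    last=$(cat "$1")
    [ $(( $2 - last )) -ge "$3" ]
}

# Remember that a nag was sent at epoch time $2.
record_nag() {
    printf '%s\n' "$2" > "$1"
}
```

A periodic job could call `nag_is_due "$stamp" "$(date +%s)" 86400`, send the email only when it succeeds, and then call `record_nag`, so that repeated checks within the same day stay silent.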

Andrew closed this task as Resolved. (Jan 21 2016, 11:44 PM)