Today, right after a deployment sync was triggered, the salt-minions on both deployment servers stopped:
mira:
1555 Dec 28 21:18:36 mira puppet-agent[2411]: (/Stage[main]/Deployment::Deployment_server/Exec[deployment_server_sync_all]) Triggered 'refresh' from 1 events
1556 Dec 28 21:18:38 mira kernel: [2926852.979534] init: salt-minion main process (17705) terminated with status 1
tin:
3395 Dec 28 21:07:18 tin puppet-agent[31334]: (/Stage[main]/Deployment::Deployment_server/Exec[deployment_server_sync_all]) Triggered 'refresh' from 1 events
3396 Dec 28 21:07:20 tin kernel: [28902389.700485] init: salt-minion main process (22731) terminated with status 1
resulting in:
13:15 < icinga-wm> PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
13:26 < icinga-wm> PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
When I saw the Icinga messages, I went to both servers and found the salt-minion status to be "stop/waiting", along with the log entries above.
I started the service on both servers and we got recoveries.
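
For anyone who wants to check whether this has happened before, here is a minimal sketch that scans syslog lines for the "terminated" message quoted above. The regex is an assumption based only on the two sample lines in this report, and the function name `find_minion_terminations` is made up for illustration.

```python
import re

# Pattern derived from the two kernel/init lines quoted in this report.
# It is an assumption that other hosts log the message in the same shape.
TERMINATED = re.compile(
    r"(?P<host>\S+) kernel: \[\s*[\d.]+\] init: salt-minion main process "
    r"\((?P<pid>\d+)\) terminated with status (?P<status>\d+)"
)


def find_minion_terminations(lines):
    """Return a (host, pid, exit_status) tuple for each matching log line."""
    hits = []
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            hits.append(
                (match.group("host"), int(match.group("pid")), int(match.group("status")))
            )
    return hits
```

Feeding it the lines of a syslog file (for example via `open(path)`) gives one tuple per termination, which can then be compared against the timestamps of the preceding `deployment_server_sync_all` refresh.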