For a completely unrelated reason I took a look at tegmen and found several things wrong over there, in no particular order:
- Puppet was not running because of an Icinga configuration error
- Apparently it had been failing for a while: I couldn't find tegmen on Icinga (web), so there was no monitoring
- There were ~128k processes running, almost all of them /usr/sbin/nsca --daemon -c /etc/nsca.cfg
- After fixing the issue, tegmen is still not showing up on Icinga (web) (I didn't have time to check why, though)
I've killed all the nsca processes, stopped icinga and made puppet run. That seems to have cleaned up the situation, but it would be nice to know why this happened and why tegmen doesn't show up on Icinga (web), which is probably the reason why we didn't notice it.
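A process count like the ~128k above can be cross-checked by grouping running processes by their command line. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name `count_cmdlines` is hypothetical and not part of any existing tooling on tegmen.

```python
import collections
import os


def count_cmdlines(proc_root="/proc"):
    """Count running processes grouped by their full command line."""
    counts = collections.Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs available (assumption: Linux-only sketch)
    for entry in entries:
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        # Arguments are NUL-separated; kernel threads have an empty cmdline.
        parts = [p.decode(errors="replace") for p in raw.split(b"\0") if p]
        if parts:
            counts[" ".join(parts)] += 1
    return counts


if __name__ == "__main__":
    # Print the five most common command lines with their process counts.
    for cmdline, n in count_cmdlines().most_common(5):
        print(f"{n:>8}  {cmdline}")
```

On a host in the state described here, the top entry would be expected to be the nsca daemon command line, which makes a runaway like this easy to spot even without working monitoring.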
Puppet error:
Info: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]: Scheduling refresh of Service[icinga]
Error: /Stage[main]/Icinga/Service[icinga]: Failed to call refresh: Could not restart Service[icinga]: Execution of '/etc/init.d/icinga reload' returned 1: Reloading icinga configuration (via systemctl): icinga.serviceJob for icinga.service failed. See 'systemctl status icinga.service' and 'journalctl -xn' for details. failed!
Error: /Stage[main]/Icinga/Service[icinga]: Could not restart Service[icinga]: Execution of '/etc/init.d/icinga reload' returned 1: Reloading icinga configuration (via systemctl): icinga.serviceJob for icinga.service failed. See 'systemctl status icinga.service' and 'journalctl -xn' for details. failed!
Notice: Finished catalog run in 40.54 seconds
Icinga error:
[1492497008] Icinga 1.11.6 starting... (PID=929)
[1492497008] Local time is Tue Apr 18 06:30:08 UTC 2017
[1492497008] LOG VERSION: 2.0
[1492497009] Warning: Duplicate definition found for service 'keystone http' on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', starting on line 209871)
... [SNIP] ...
[1492497009] Warning: Duplicate definition found for service 'Varnishkafka log producer' on host 'cp1008' (config file '/etc/icinga/puppet_services.cfg', starting on line 27921)
[1492497010] Bailing out due to failure to daemonize. (PID=929)
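Since the snipped part of the log presumably contains many more of these warnings, it may help to extract them into a structured list. This is a rough sketch only: the regular expression is written against the two warning lines quoted above, and `find_duplicates` is a hypothetical helper name. Note that the log marks these as warnings; whether they are related to the "failure to daemonize" is not established here.

```python
import re
from collections import Counter

# Pattern derived from the warning lines quoted in this task.
DUPLICATE_RE = re.compile(
    r"Warning: Duplicate definition found for service '(?P<service>[^']*)' "
    r"on host '(?P<host>[^']*)' "
    r"\(config file '(?P<config>[^']*)', starting on line (?P<line>\d+)\)"
)


def find_duplicates(log_text):
    """Return (host, service, config file, line number) per duplicate warning."""
    return [
        (m["host"], m["service"], m["config"], int(m["line"]))
        for m in DUPLICATE_RE.finditer(log_text)
    ]


# The two warnings from the log excerpt above.
SAMPLE = (
    "[1492497009] Warning: Duplicate definition found for service 'keystone http' "
    "on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', "
    "starting on line 209871)\n"
    "[1492497009] Warning: Duplicate definition found for service "
    "'Varnishkafka log producer' on host 'cp1008' (config file "
    "'/etc/icinga/puppet_services.cfg', starting on line 27921)\n"
)

duplicates = find_duplicates(SAMPLE)
per_host = Counter(host for host, _service, _config, _line in duplicates)
for host, service, config, line in duplicates:
    print(f"{host}: {service!r} at {config}:{line}")
```

Feeding the full Icinga log into `find_duplicates` and looking at `per_host` would show whether the duplicates cluster on a few hosts or are spread across the whole generated file.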
Changes at every puppet run:
Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/group: group changed 'icinga' to 'www-data'
Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/mode: mode changed '0660' to '0664'
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/owner: owner changed 'root' to 'icinga'
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/group: group changed 'root' to 'icinga'
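To confirm which resources really flip on every run, the output of several consecutive puppet runs could be compared. The sketch below is an illustration under assumptions: `changes` and `recurring_changes` are hypothetical helper names, the pattern is based only on the four Notice lines above, and the same excerpt is reused for both "runs" purely to show the mechanics.

```python
import re

# Matches lines like ".../File[x]/group: group changed 'a' to 'b'".
CHANGE_RE = re.compile(
    r"Notice: (?P<resource>\S+)/(?P<attr>\w+): (?P=attr) changed "
    r"'(?P<old>[^']*)' to '(?P<new>[^']*)'"
)


def changes(run_output):
    """Return the set of (resource, attribute, old, new) changes in one run."""
    return {
        (m["resource"], m["attr"], m["old"], m["new"])
        for m in CHANGE_RE.finditer(run_output)
    }


def recurring_changes(runs):
    """Return the changes present in every run of a non-empty list of outputs."""
    return set.intersection(*(changes(run) for run in runs))


# The Notice lines from the excerpt above.
RUN = (
    "Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/group: "
    "group changed 'icinga' to 'www-data'\n"
    "Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/mode: "
    "mode changed '0660' to '0664'\n"
    "Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/owner: "
    "owner changed 'root' to 'icinga'\n"
    "Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/group: "
    "group changed 'root' to 'icinga'\n"
)

for resource, attr, old, new in sorted(recurring_changes([RUN, RUN])):
    print(f"{resource} {attr}: {old} -> {new}")
```

A change that survives the intersection across real consecutive runs suggests something else on the host is resetting that attribute between runs, which would be worth tracking down separately.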