
Tegmen: process spawn loop + failed icinga + failing puppet
Closed, Resolved, Public

Description

For a completely unrelated reason I took a look at tegmen and found that everything was wrong over there. In no particular order:

  • Puppet was not running because of an Icinga configuration error
  • Apparently it had been failing for a while: I couldn't find tegmen on Icinga (web), so no monitoring
  • There were ~128k processes running, almost all of them /usr/sbin/nsca --daemon -c /etc/nsca.cfg
  • After fixing the issue, it is still not showing up on Icinga web (I didn't have time to check why, though)

I killed all the nsca processes, stopped Icinga, and ran Puppet; that seems to have cleaned up the situation. But it would be nice to know why this happened, and why tegmen doesn't show up on Icinga (web), which is probably why we didn't notice it.

Puppet error:

Info: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]: Scheduling refresh of Service[icinga]
Error: /Stage[main]/Icinga/Service[icinga]: Failed to call refresh: Could not restart Service[icinga]: Execution of '/etc/init.d/icinga reload' returned 1: Reloading icinga configuration (via systemctl): icinga.serviceJob for icinga.service failed. See 'systemctl status icinga.service' and 'journalctl -xn' for details.
 failed!
Error: /Stage[main]/Icinga/Service[icinga]: Could not restart Service[icinga]: Execution of '/etc/init.d/icinga reload' returned 1: Reloading icinga configuration (via systemctl): icinga.serviceJob for icinga.service failed. See 'systemctl status icinga.service' and 'journalctl -xn' for details.
 failed!
Notice: Finished catalog run in 40.54 seconds

Icinga error:

[1492497008] Icinga 1.11.6 starting... (PID=929)
[1492497008] Local time is Tue Apr 18 06:30:08 UTC 2017
[1492497008] LOG VERSION: 2.0
[1492497009] Warning: Duplicate definition found for service 'keystone http' on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', starting on line 209871)

... [SNIP] ...

[1492497009] Warning: Duplicate definition found for service 'Varnishkafka log producer' on host 'cp1008' (config file '/etc/icinga/puppet_services.cfg', starting on line 27921)
[1492497010] Bailing out due to failure to daemonize. (PID=929)

Changes at every Puppet run:

Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/group: group changed 'icinga' to 'www-data'
Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/mode: mode changed '0660' to '0664'
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/owner: owner changed 'root' to 'icinga'
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/puppet_services.cfg]/group: group changed 'root' to 'icinga'

Event Timeline

Could it be that the crontab that runs every 10 minutes raced with a Puppet run and made all this mess? I don't see it wrapped in run-no-puppet:

*/10 * * * * /usr/local/sbin/sync_icinga_state >/dev/null 2>&1

and that script does indeed do an icinga stop/rsync/start.

Also, why do we do the stop/sync/start all the time, instead of just syncing the files to a safe location and having a make-icinga-primary script (or similar) that, wrapped in run-no-puppet, does the stop/mv (or safer cp)/start, and that we run manually only when needed?
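Something along these lines, as a rough sketch of that idea (hypothetical: the script name and the staging/live paths below are made up for illustration; run-no-puppet is the existing wrapper mentioned above):

#!/bin/bash
# make-icinga-primary -- hypothetical sketch of the proposed helper, not an existing script.
# The cron job would only rsync state into a staging directory; the disruptive
# stop/copy/start happens here, manually, and only when promoting this host.
set -euo pipefail

STAGING=/srv/icinga-state-sync    # assumed: where state is rsynced continuously
LIVE=/var/lib/icinga              # assumed: Icinga's live state directory

run-no-puppet bash -c "
  systemctl stop icinga
  cp -a '$STAGING/.' '$LIVE/'     # cp rather than mv, as suggested above (safer)
  systemctl start icinga
"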

Change 348898 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] icinga: wrap run-no-puppet around sync_icinga_state

https://gerrit.wikimedia.org/r/348898

Change 348898 merged by Dzahn:
[operations/puppet@production] icinga: wrap run-no-puppet around sync_icinga_state

https://gerrit.wikimedia.org/r/348898

Puppet was not running because of an Icinga configuration error

puppet runs alright now, no errors

couldn't find tegmen on Icinga (web), so no monitoring

I couldn't see the reason for this yet either: same roles, and the Hiera differences that make it the passive server don't seem to cause this, but I didn't have much time to look. @akosiaris, any ideas?

There were ~128k processes running, almost all of them /usr/sbin/nsca

Just 700 processes now and a single nsca.

After fixing the issue, it is still not showing up on Icinga web

Confirmed, still not there, see above.

Could it be that the crontab that runs every 10 minutes raced with a Puppet run and made all this mess? I don't see it wrapped in run-no-puppet:

Added that wrapper around it (https://gerrit.wikimedia.org/r/348898). Your theory seems plausible; it seems to have behaved since then.

Puppet was not running because of an Icinga configuration error

puppet runs alright now, no errors

What was that configuration error?

couldn't find tegmen on Icinga (web), so no monitoring

I couldn't see the reason for this yet either: same roles, and the Hiera differences that make it the passive server don't seem to cause this, but I didn't have much time to look. @akosiaris, any ideas?

It's because of the hack mentioned by @Volans above.

There were ~128k processes running, almost all of them /usr/sbin/nsca

Just 700 processes now and a single nsca.

I am very much perplexed by the number of processes (128k per the tegmen Grafana dashboard). How were all these processes spawned? It can't be Puppet runs: at one nsca per run it would have taken ~90 months, and the entire thing happened over 3 days. The only thing that comes to mind is some misconfiguration. Logs on tegmen indicate something was wrong in those 3 days:

...
Mar 23 11:00:06 tegmen icinga: Error: External command failed -> PROCESS_SERVICE_CHECK_RESULT;civi1001;check_procs;0;PROCS OK: 207 processes | procs=207;500;1000;0;
Mar 23 11:00:06 tegmen icinga: External command error: Command failed
Mar 23 11:00:06 tegmen icinga: Warning:  Passive check result was received for service 'check_puppetrun' on host 'civi1001', but the host could not be found!
...
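(As a quick sanity check of the "90 months" figure above, assuming roughly one Puppet run every 30 minutes, i.e. 48 runs per day -- the run interval is my assumption:)

$ echo $(( 128000 / 48 ))        # days needed if each Puppet run spawned one nsca
2666
$ echo $(( 128000 / 48 / 30 ))   # in months -- roughly the "90 months" above
88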

Could it be that the crontab that runs every 10 minutes raced with a Puppet run and made all this mess? I don't see it wrapped in run-no-puppet:

Added that wrapper around it (https://gerrit.wikimedia.org/r/348898). Your theory seems plausible; it seems to have behaved since then.

That script, though, only restarts icinga, not nsca. While there might indeed be a race, I currently doubt it's related.

@akosiaris: I've found that the catalog for tegmen doesn't have Nagios_Host and Nagios_Service resources, and I think this is because of this hack:

Yes, it is because of that. The reason for that hack is that we have a multitude of hosts and services that are effectively defined in the scope of the icinga host. Examples are all the LVS services, LVS IPs, PDUs and so on. Having those exported by more than one host causes duplicate definitions in Puppet, hence that hack. A side-effect of that hack is that the two Icinga hosts DO NOT monitor each other, which is not great. I've been thinking of ways to solve this (like tagging exported resources and filtering on the tag, or manually setting the "other" host in each Icinga server's config), but none are great.

What was that configuration error?

Sorry, my bad: I was probably too tired to copy, paste and SNIP the output correctly. I was misled by the ton of duplicate warnings and forgot to include the actual error (which of course I don't remember).

There were ~128k processes running, almost all of them /usr/sbin/nsca

Just 700 processes now and a single nsca.

I am very much perplexed by the number of processes (128k per the tegmen Grafana dashboard). How were all these processes spawned? It can't be Puppet runs: at one nsca per run it would have taken ~90 months, and the entire thing happened over 3 days. The only thing that comes to mind is some misconfiguration. Logs on tegmen indicate something was wrong in those 3 days.

No, it's not just 3 days: it started around 11:15-11:30 on March 23rd and continued ever since. Killing the processes without the parent just kept respawning nsca processes continuously; I had to killall -9 /usr/sbin/nsca AND systemctl stop icinga to get it to stop.
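For the record, as commands (per the above, both were needed before the respawning actually stopped):

$ sudo killall -9 /usr/sbin/nsca
$ sudo systemctl stop icinga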

@akosiaris: I've found that the catalog for tegmen doesn't have Nagios_Host and Nagios_Service resources, and I think this is because of this hack:

Yes, it is because of that. The reason for that hack is that we have a multitude of hosts and services that are effectively defined in the scope of the icinga host. Examples are all the LVS services, LVS IPs, PDUs and so on. Having those exported by more than one host causes duplicate definitions in Puppet, hence that hack. A side-effect of that hack is that the two Icinga hosts DO NOT monitor each other, which is not great. I've been thinking of ways to solve this (like tagging exported resources and filtering on the tag, or manually setting the "other" host in each Icinga server's config), but none are great.

Could it work to add some code that checks whether role::icinga::passive is true and, in that case, exports the Nagios_Host matching the host's hostname, and similarly for its services?

What was that configuration error?

Sorry, my bad: I was probably too tired to copy, paste and SNIP the output correctly. I was misled by the ton of duplicate warnings and forgot to include the actual error (which of course I don't remember).

But was it something not requiring Puppet changes? We should puppetize it, that's my point.

There were ~128k processes running, almost all of them /usr/sbin/nsca

Just 700 processes now and a single nsca.

I am very much perplexed by the number of processes (128k per the tegmen Grafana dashboard). How were all these processes spawned? It can't be Puppet runs: at one nsca per run it would have taken ~90 months, and the entire thing happened over 3 days. The only thing that comes to mind is some misconfiguration. Logs on tegmen indicate something was wrong in those 3 days.

No, it's not just 3 days: it started around 11:15-11:30 on March 23rd and continued ever since. Killing the processes without the parent just kept respawning nsca processes continuously; I had to killall -9 /usr/sbin/nsca AND systemctl stop icinga to get it to stop.

Sorry, I wasn't clear on that. I meant it took 3 days to go from ~600 processes to 128k processes. But yes, the issue went on for a long time. So what I gather from your comment is that icinga was spawning the nsca daemon? That does not sound correct.

@akosiaris: I've found that the catalog for tegmen doesn't have Nagios_Host and Nagios_Service resources, and I think this is because of this hack:

Yes, it is because of that. The reason for that hack is that we have a multitude of hosts and services that are effectively defined in the scope of the icinga host. Examples are all the LVS services, LVS IPs, PDUs and so on. Having those exported by more than one host causes duplicate definitions in Puppet, hence that hack. A side-effect of that hack is that the two Icinga hosts DO NOT monitor each other, which is not great. I've been thinking of ways to solve this (like tagging exported resources and filtering on the tag, or manually setting the "other" host in each Icinga server's config), but none are great.

Could it work to add some code that checks whether role::icinga::passive is true and, in that case, exports the Nagios_Host matching the host's hostname, and similarly for its services?

Yeah, that was part of how I was thinking of implementing the "add the other host" idea I mentioned above. The thing is, monitoring::host is buried deep in the code and called by multiple code paths, each with its own scope (the icinga hosts, the "extra" hosts, the rest of the hosts exporting themselves), so we'd end up with at least two weird ifs with sub-nested ifs, while at the same time I've tried to hide this complexity from the caller (clearly not very well). So it ends up not being great; maybe just sufficient, and waiting to bite us at some other point in time?

So after switching tegmen to active, we now have the issue on einsteinium:

 1012 ?        Ss    48:55 /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1042 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1043 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1044 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1045 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1046 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1047 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1048 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1049 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1050 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1051 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1052 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1053 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1054 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1055 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1056 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1057 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1058 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1059 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1060 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1061 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1072 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1073 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1074 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1075 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1076 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
 1077 ?        SL     0:00  \_ /usr/sbin/nsca --daemon -c /etc/nsca.cfg
....

The parent seems to be in an infinite loop:

$ sudo strace -fF -p 1012
Process 1012 attached
restart_syscall(<... resuming interrupted call ...>) = 0
accept(4, 0, NULL)                      = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x402e50, [], SA_RESTORER|SA_NOCLDSTOP, 0x7fc30e8e20e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fff3fcdef60)       = 0
accept(4, 0, NULL)                      = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x402e50, [], SA_RESTORER|SA_NOCLDSTOP, 0x7fc30e8e20e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fff3fcdef60)       = 0
accept(4, 0, NULL)                      = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x402e50, [], SA_RESTORER|SA_NOCLDSTOP, 0x7fc30e8e20e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, ^CProcess 1012 detached
 <detached ...>

Attaching to a child makes it exit:

$ sudo strace -fF -p 29891
Process 29891 attached
open("/rw/nagios.cmd", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
fstat(4, {st_mode=S_IFIFO|0660, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc30f2bf000
write(4, "[1492684217] PROCESS_SERVICE_CHE"..., 143) = 143
close(4)                                = 0
munmap(0x7fc30f2bf000, 4096)            = 0
recvfrom(5, "", 4304, 0, NULL, NULL)    = 0
munlock(0x180d7f0, 32)                  = 0
munlock(0x180d820, 1020)                = 0
munlock(0x180d7d0, 24)                  = 0
close(5)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

And open("/rw/nagios.cmd"... seems really wrong; @akosiaris and I are investigating.

I've finally figured out the sheer beauty of this bug. It's a race condition between three components (puppet + icinga + nsca).

What happens is that icinga gets restarted by the rsync cron job that keeps the two Icinga hosts in sync. Icinga is the one responsible for creating the /var/lib/nagios/rw/nagios.cmd file, and it creates it correctly as a pipe. Puppet, unfortunately, is configured in https://github.com/wikimedia/puppet/blame/production/modules/icinga/manifests/init.pp#L162 to ensure the file is present, so before https://gerrit.wikimedia.org/r/348898 Puppet would create the file as well. With that merge, that avenue of the race condition is fixed.

However, there is one more avenue: icinga being stopped while the fork of the nsca child process is ongoing. The file disappears right before the child actually manages to write to it, which (eventually?) leads to what @Volans pasted above:

open("/rw/nagios.cmd", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4

That open() call would also create the file (note the O_CREAT flag). And indeed, looking at:

-rw-r--r-- 1 icinga nagios    185 Apr 20 12:43 nagios.cmd

instead of the correct form, which is:

prw-rw---- 1 icinga icinga      0 Apr 20 12:44 nagios.cmd

Ignore the permissions for now; look at how the file is not a pipe. Clearly nsca writes to the file without detecting that it is not a pipe. The trigger is probably the icinga restart pulling the rug out from under its feet: removing the .cmd file and ending up blocking nsca. That is sort of documented in https://github.com/NagiosEnterprises/nsca/blob/433aa8e8df5bed1e9f03d8c5c567512a3f3bb138/src/nsca.c#L1213
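The failure mode is easy to reproduce in isolation from a shell (temporary directory, purely illustrative; nothing here touches the real command file): a consumer creates the command file as a named pipe, something removes it, and a writer that opens it with O_CREAT silently recreates it as a regular file, after which the passive results go nowhere useful.

$ cd "$(mktemp -d)"
$ mkfifo nagios.cmd
$ stat -c '%F' nagios.cmd
fifo
$ rm nagios.cmd                                          # e.g. the icinga stop removing its cmd pipe
$ echo 'PROCESS_SERVICE_CHECK_RESULT;...' > nagios.cmd   # the shell's open(2) uses O_CREAT, like nsca's child
$ stat -c '%F' nagios.cmd
regular file
$ test -p nagios.cmd || echo 'not a pipe any more'
not a pipe any more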

And open("/rw/nagios.cmd"... seems really wrong; @akosiaris and I are investigating.

We've also figured out that one: nsca is chrooted, so /rw/nagios.cmd is correct.
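That is, assuming the host's nsca.cfg sets something like the following (the exact values are my assumption, not copied from the host):

# nsca.cfg excerpt (hypothetical values)
#   nsca_chroot=/var/lib/nagios
#   command_file=/rw/nagios.cmd
# With that chroot in effect, the open("/rw/nagios.cmd", ...) seen in the strace
# output resolves to /var/lib/nagios/rw/nagios.cmd on the real filesystem:
$ ls -l /var/lib/nagios/rw/nagios.cmd
prw-rw---- 1 icinga icinga 0 Apr 20 12:44 /var/lib/nagios/rw/nagios.cmd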

I'll fix Puppet, and we'll follow @Volans' suggestion in T163286#3192354 to avoid unnecessary restarts.

Change 349221 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] icinga: Fix permissions for /var/lib/nagios/rw

https://gerrit.wikimedia.org/r/349221

Change 349224 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Make varnishkafka log producer descriptions unique

https://gerrit.wikimedia.org/r/349224

Change 349221 merged by Alexandros Kosiaris:
[operations/puppet@production] icinga: Fix permissions for /var/lib/nagios/rw

https://gerrit.wikimedia.org/r/349221

Change 349224 merged by Alexandros Kosiaris:
[operations/puppet@production] Make varnishkafka log producer descriptions unique

https://gerrit.wikimedia.org/r/349224

Change 349603 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] icinga: Adjust the frequency of sync-icinga-state

https://gerrit.wikimedia.org/r/349603

Change 349603 merged by Alexandros Kosiaris:
[operations/puppet@production] icinga: Adjust the frequency of sync-icinga-state

https://gerrit.wikimedia.org/r/349603

akosiaris claimed this task.

The trick above (https://gerrit.wikimedia.org/r/349603) seems to have solved the issue. What now effectively happens is that the icinga process restart no longer coincides with NSCA results coming in, avoiding the child-process spawning and thus the race. While the way this is done is crude, it looks like it is working, judging by the process counts on einsteinium.
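Illustrative only (the actual change is https://gerrit.wikimedia.org/r/349603; the schedule and the run-no-puppet path below are assumptions, not the merged values), the shape of the adjustment is roughly:

# Before: every 10 minutes, unwrapped -- plenty of chances to race with Puppet and NSCA:
#   */10 * * * * /usr/local/sbin/sync_icinga_state >/dev/null 2>&1
# After: far less frequent, and wrapped so it cannot overlap a Puppet run:
30 */6 * * * /usr/local/sbin/run-no-puppet /usr/local/sbin/sync_icinga_state >/dev/null 2>&1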

I'll resolve the task for now; we can always reopen.

fgiunchedi subscribed.

Reopening: looks like tegmen is suffering from lots of nsca processes again :(

root@tegmen:~# ps fwwwaux | grep -c nsca
118
root@tegmen:~# uptime
 09:45:27 up  1:29,  2 users,  load average: 0.56, 2.21, 8.65
Volans triaged this task as Medium priority.

Since this task's last update we've migrated Icinga to new hosts (jessie -> stretch) and a slightly different version of Icinga. Resolving.