Page MenuHomePhabricator

Huge log files on icinga machines
Closed, ResolvedPublic

Description

Today einsteinum was going to run out of disk space, turns out this is due to puppet spam (!) in daemon.log. There's icinga config files diffs emitted at each puppet run causing churn, e.g:

 define service {
-# --PUPPET_NAME-- cp1059 varnish http maps-frontend - port 3123
+# --PUPPET_NAME-- elastic2010 puppet_checkpuppetrun
        active_checks_enabled          1
-       check_command                  check_http_varnish!varnishcheck!3123
+       check_command                  nrpe_check!check_puppet_checkpuppetrun!10
        check_freshness                0
        check_interval                 1
        check_period                   24x7
        contact_groups                 admins
-       host_name                      cp1059
+       host_name                      elastic2010
        is_volatile                    0
        max_check_attempts             3
        notification_interval          0
@@ -386662,19 +386662,19 @@
        notification_period            24x7
        passive_checks_enabled         1
        retry_interval                 1
-       service_description            Varnish HTTP maps-frontend - port 3123
-       servicegroups                  cache_maps_eqiad
+       service_description            puppet last run
+       servicegroups                  elasticsearch_codfw
 
 }

I've archived the previous big daemon.log in /srv so there's no immediate danger for the next few days.

Aside from this issue there seem to be organically big log files, syslog messages and user.log have mostly the same contents (i.e. logs from icinga)

root@einsteinium:/var/log# du -hcs * | grep G
9.2G    account
2.0G    daemon.log.1
2.1G    icinga
1.4G    messages
2.7G    messages.1
2.0G    syslog
3.2G    syslog.1
1.4G    user.log
2.2G    user.log.1
29G     total

Details

Related Gerrit Patches:
operations/puppet : productionnaggen2: order resources in python

Related Objects

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2016-11-04T22:43:54Z] <godog> stop puppet on einsteinium and tegment to avoid log spam - T150061

Dzahn added a comment.EditedNov 5 2016, 2:25 AM

9.2G account

Do we really want process accounting on?

root@einsteinium:/var/log/account# lastcomm | head
icinga            F    icinga   __         0.00 secs Sat Nov  5 02:21
check_nrpe             icinga   __         0.00 secs Sat Nov  5 02:21
icinga            F    icinga   __         0.00 secs Sat Nov  5 02:21
check_nrpe             icinga   __         0.00 secs Sat Nov  5 02:21
icinga            F    icinga   __         0.00 secs Sat Nov  5 02:21
check_nrpe             icinga   __         0.00 secs Sat Nov  5 02:21
icinga            F    icinga   __         0.00 secs Sat Nov  5 02:21
...

root@einsteinium:/var/log/account# sa 
47607542  287709.00re    2264.91cp         0avio     20722k
  403370    2099.85re    1013.85cp         0avio     15364k   check_ssl
22868826   98505.70re     338.18cp         0avio     38345k   icinga*
  236712     262.77re     181.56cp         0avio     13213k   python
..
 1387842    2385.35re      46.15cp         0avio      4358k   check_ntp_time
..
16167908   56517.17re      14.13cp         0avio      3888k   check_nrpe
..
2275875    4759.28re       0.55cp         0avio      5380k   check_http
..

We could just turn accounting off.

Volans added subscribers: akosiaris, Joe, Volans.EditedNov 7 2016, 1:59 PM

@fgiunchedi @akosiaris @Joe
From the looks of it (I've take just a quick look, correct me if I'm wrong):

  • Puppet spam is caused by a missing sorting in the generated file from naggen2. The previous MySQL query was sorted, while the current PuppetDB query is not, hence the generated files have diffs at each run. It can be fixed adding the sorting in the PuppetDB query with order-by or directly in Python afterwards.
  • messages spam is caused by Icinga that reach the max concurrent checks and log the message Max concurrent service checks (3200) has been reached. for each of them. The value can be tuned but it was left as is to keep the load of the hosts under control AFAIK.
  • the load of both einsteinium and tegmen dropped significantly on Nov. 1st, I dunno what was changed but looks that now it could use some more concurrency.
Joe added a comment.Nov 7 2016, 3:20 PM

@fgiunchedi @akosiaris @Joe
From the looks of it (I've take just a quick look, correct me if I'm wrong):

  • Puppet spam is caused by a missing sorting in the generated file from naggen2. The previous MySQL query was sorted, while the current PuppetDB query is not, hence the generated files have diffs at each run. It can be fixed adding the sorting in the PuppetDB query with order-by or directly in Python afterwards.
  • messages spam is caused by Icinga that reach the max concurrent checks and log the message Max concurrent service checks (3200) has been reached. for each of them. The value can be tuned but it was left as is to keep the load of the hosts under control AFAIK.
  • the load of both einsteinium and tegmen dropped significantly on Nov. 1st, I dunno what was changed but looks that now it could use some more concurrency.

The order-by was removed from naggen2 because it was causing a lot of pain to puppetdb. The resource ordering used to be consistent across runs though.

FTR I am reenabling puppet now, this is not really an host we can avoid running puppet on.

@fgiunchedi @akosiaris @Joe
From the looks of it (I've take just a quick look, correct me if I'm wrong):

  • Puppet spam is caused by a missing sorting in the generated file from naggen2. The previous MySQL query was sorted, while the current PuppetDB query is not, hence the generated files have diffs at each run. It can be fixed adding the sorting in the PuppetDB query with order-by or directly in Python afterwards.

The sorting in PuppetDB was added in the beginning and then removed. https://gerrit.wikimedia.org/r/#/c/312186/ has the details. We can retry it or do it in python indeed.

  • messages spam is caused by Icinga that reach the max concurrent checks and log the message Max concurrent service checks (3200) has been reached. for each of them. The value can be tuned but it was left as is to keep the load of the hosts under control AFAIK.

Yes, that one was practically up to Nov 1st impossible to avoid due to the load on the boxes. It was a well placed choke that allowed the service to have a predictable cpu/memory usage to continue to exist. We can start increasing it way more aggressively now

  • the load of both einsteinium and tegmen dropped significantly on Nov. 1st, I dunno what was changed but looks that now it could use some more concurrency.

check_ssl changes in https://gerrit.wikimedia.org/r/#/q/topic:check_ssl_stuff practically. The one that changed everything so dramatically was https://gerrit.wikimedia.org/r/#/c/318971/

Joe added a comment.Nov 7 2016, 3:43 PM

Ok, after a few puppet runs it seems clear to me that ordering is not constant anymore when coming from puppetdb.

I will try to set the order in the puppetdb query first.

Dzahn added a comment.Nov 8 2016, 2:54 AM

Is it possible to re-enable puppet for a single run so that "gallium" gets removed from it (i ran node deactivate earlier today).

Peachey88 updated the task description. (Show Details)Nov 8 2016, 3:23 AM
Joe claimed this task.Nov 8 2016, 9:50 AM
Joe added a project: User-Joe.
Joe removed a subscriber: JohnLewis.

Change 320359 had a related patch set uploaded (by Giuseppe Lavagetto):
naggen2: order resources in python

https://gerrit.wikimedia.org/r/320359

Joe moved this task from Backlog to Doing on the User-Joe board.Nov 8 2016, 10:12 AM

Change 320359 merged by Giuseppe Lavagetto:
naggen2: order resources in python

https://gerrit.wikimedia.org/r/320359

Joe added a comment.Nov 8 2016, 11:22 AM

With the current naggen2 update the output files are stabe again.

I reenabled puppet on both hosts and it's running without issues.

Dzahn awarded a token.Nov 8 2016, 3:46 PM

https://gerrit.wikimedia.org/r/#/c/320582/ increase the max concurrent checks to 5000, which einsteinium seems to have no problem handling (which was to be expected given the lag was ~0). So with that both issues seem resolved. I 'd leave the task open for a while for monitoring purposes before resolving though

Joe closed this task as Resolved.Nov 22 2016, 8:22 AM