
improve icinga performance / solve general load issues on neon
Closed, Resolved (Public)

Description

Tracking ticket to improve icinga performance. For obvious reasons, the biggest improvements can be obtained by tweaking the icinga active checks.
The machine hosting icinga (neon) routinely suffers from high load. Occasionally icinga children fail to clean up their own children, leaving behind zombies, and the machine eventually starts swapping.

http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=neon.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS

Also, given the rate at which we spawn checks, fork() overhead is significant, so checks that spawn other commands are incredibly inefficient (e.g. check_ssl_cert spawns perl at each invocation).
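To get a rough feel for the spawn overhead mentioned above, here is a minimal sketch that compares running a trivial task in a freshly spawned interpreter against running it in-process. It does not use any real icinga check: `trivial_check`, `time_spawned` and `time_in_process` are hypothetical names, and the child command is just the current Python interpreter standing in for "a check that spawns perl". Absolute numbers will vary a lot between machines.

```python
import subprocess
import sys
import time


def trivial_check() -> int:
    """Hypothetical stand-in for the actual work a check does."""
    return 0


def time_spawned(n: int) -> float:
    """Seconds spent running a no-op child interpreter n times.

    Each iteration pays for process creation plus interpreter startup,
    which is the kind of cost a check pays when it spawns another command.
    """
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process(n: int) -> float:
    """Seconds spent calling the same trivial work n times without spawning."""
    start = time.perf_counter()
    for _ in range(n):
        trivial_check()
    return time.perf_counter() - start


if __name__ == "__main__":
    runs = 20  # arbitrary small sample, an assumption for illustration
    spawned = time_spawned(runs)
    inline = time_in_process(runs)
    print(f"spawned:    {spawned / runs * 1000:.3f} ms per run")
    print(f"in-process: {inline / runs * 1000:.6f} ms per run")
```

The per-run spawn cost is small in isolation, but it is paid once per check execution, so it adds up at the rate icinga schedules active checks.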

@BBlack also comments on T110822:

Neon is routinely at 0% idle CPU when looking at realtime info. Even after manually turning down some of the most CPU-expensive checks, the remaining load still routinely spiked the machine to 0% idle CPU, so in its normal config it's definitely well past the line.

Also, the main icinga process itself is single-threaded and routinely locks up a single CPU core, effectively running out of processing power to keep up with its own demands, even if other cores are idle.

I noticed this while debugging the intermittent ipv6 monitor failures. I really don't know if this is causal or even related, but I figure solving this basic issue seems prudent...

Event Timeline

fgiunchedi raised the priority of this task from to High.
fgiunchedi updated the task description.
fgiunchedi added projects: acl*sre-team, Icinga.
fgiunchedi added a subscriber: fgiunchedi.
chasemp lowered the priority of this task from High to Medium. Jan 6 2015, 11:25 PM
chasemp added a subscriber: chasemp.

Reducing to normal as it doesn't have an assignee :)

jcrespo added subscribers: jcrespo, JohnLewis, Matanya, BBlack.

Merging because I think both proposals are about the same thing.

jcrespo renamed this task from improve icinga performance to improve icinga performance / solve general load issues. Sep 9 2015, 6:49 PM
jcrespo renamed this task from improve icinga performance / solve general load issues to improve icinga performance / solve general load issues on neon.
jcrespo updated the task description.
jcrespo set Security to None.

I am for sure not the most experienced person here, but I'll see where I can help.

So, I took a look at http://docs.icinga.org/latest/en/tuning.html. It has 17 points, and actually most of them (except 1, 10 and I guess two others + 16?) are already in place here. @faidon disabled embedded Perl one year ago in https://gerrit.wikimedia.org/r/#/c/183416/.

Looking at neon, the oldest SAL entry involving it dates back to October 3, 2011 (https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_19), which means that neon is almost 4.5 years old (possibly even older!). According to Ganglia's node view (which wouldn't show incorrect info, right? :)), neon is equipped with 8 cores @ 1.20 GHz (is that 8 cores with or without hyperthreading enabled?), a 50GB disk and 16GB of RAM, so I think that the age estimate can't be wrong.

Icinga undoubtedly pushes neon to its limits, with an average CPU usage of 80% and an average load(avg) of 18.3 - which suggests that the disk is a large bottleneck too. Only the RAM usage looks fine. Looking at the yearly load graphs, the load was actually going down until around the end of January 2016, after which it suddenly climbed above 20 again. A load of 80 (2hr - 4hr graphs) is not that uncommon but does not always show up, so Icinga seems to stress this server irregularly?
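As a quick sanity check on those load figures, here is a minimal sketch that normalizes a load average by core count. The helper name `load_per_core` is made up for illustration, and the core count of 8 is taken from the Ganglia figure quoted above (whether that counts hyperthreads is still an open question). Note that on Linux the load average counts tasks in uninterruptible sleep (typically waiting on I/O) as well as runnable ones, so a high value does not by itself say whether CPU or disk is the bottleneck.

```python
def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of cores.

    Values well above 1.0 mean more tasks are runnable (or, on Linux,
    blocked in uninterruptible sleep) than there are cores to serve them.
    """
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_avg / cores


if __name__ == "__main__":
    cores = 8  # assumption: the core count Ganglia reports for neon
    for label, load in (("average", 18.3), ("peak", 80.0)):
        print(f"{label}: load {load} -> {load_per_core(load, cores):.1f} per core")
```

Under those assumptions the average load works out to roughly 2.3 tasks per core and the peaks to 10.0 per core.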

Again, it seems that you have already tuned Icinga a lot: there are not that many things you can still do. My very first suggestion would be allocating new hardware (more and faster cores, and perhaps an SSD; an SSD would be cool, but I think it might not be needed much at all). The second suggestion (which is harder but also better) is investigating which checks are the most resource intensive and looking at how we could improve them.

The third one is making Icinga a platform with multiple servers: http://docs.icinga.org/latest/en/distributed.html. That is even harder than the other two, I guess [the basic "this is production and we're not just monitoring a few servers"], but still a nice project.

I'd like to know what others think. Comments are welcome.

Thanks @Southparkfan for the thorough analysis. You're pretty much right on all counts :)

Looking forward, we've thought for a while of replacing the server with a more powerful one (as well as upgrading it to a newer distribution, as this is still one of the few remaining Ubuntu 12.04 ones).

However, @akosiaris has been working on a next-generation alerting infrastructure, using Shinken, that will be distributed from the get-go and be provisioned on current hardware. So we've kinda put the neon upgrade/replacement on hold until this happens, as it will make this whole issue redundant.

If you're interested, we might be able to use your help on tuning *that* infrastructure instead, once it gets deployed. :)

faidon changed the task status from Open to Stalled. Feb 17 2016, 2:08 PM
faidon assigned this task to akosiaris.
faidon lowered the priority of this task from Medium to Low.

This has more or less been resolved for now, mostly due to the migration of neon to einsteinium and the deprecation of check_sslXNN.