
Investigate icinga (einsteinium) load
Closed, ResolvedPublic

Description

Einsteinium has a relatively high load, with the icinga parent process consuming lots of CPU. Presumably this is contributing to lag in the web UI. Creating a task to investigate potential tuning steps.

Event Timeline

Potential tunable - Number of packets sent by each call to check_ping (moved to https://phabricator.wikimedia.org/T173315)
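For illustration, the packet count is the -p flag of check_ping, so lowering it directly shortens each host check. A minimal sketch (the thresholds below are made up, not our production command definition):

# hypothetical host check command; check_ping sends 5 packets by default, -p 3 sends three
/usr/lib/nagios/plugins/check_ping -H $HOSTADDRESS$ -w 500,20% -c 2000,100% -p 3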

max_concurrent_checks is set to a high value (6000).

Maybe there are checks that could be run on a less frequent interval?

Could also experiment with lowering this value to explore the tradeoff between scheduling latency and concurrency.
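For reference, this is a directive in the main icinga.cfg; a minimal sketch of what experimenting with it might look like (6000 is the current value, the lower alternative is purely illustrative):

# current setting; 0 would mean no limit on parallel checks
max_concurrent_checks=6000
# e.g. a lower cap to trade some scheduling latency for less concurrency
#max_concurrent_checks=1000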

Quite interesting to me is the periodic nature of the load increase and decrease, namely https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=12&fullscreen&orgId=1&var-server=einsteinium&var-network=bond0&from=1503286018566&to=1503321869466. It seems to me like the icinga scheduler co-schedules a lot of slow checks next to each other and a minor queueing effect takes place that increases the load for a while.

FWIW I don't consider a load of 8 on average much for a machine with 32 CPU threads. The box is not really loaded IMHO. That can also be observed in the low CPU usage (~30% on average, with no real spikes).

I agree -- this doesn't look very loaded. That said, investigating whether our check intervals make sense (in either direction) is still worthwhile. @herron, is your investigation (the one that resulted in those three subtasks above) done?

What seems off to me is the icinga parent process sitting at ~100% CPU. It feels like there is a bottleneck here and I would like to understand it better.

top - 17:48:42 up 134 days,  4:10,  3 users,  load average: 4.40, 6.46, 7.60
Tasks: 1108 total,   3 running, 1100 sleeping,   0 stopped,   5 zombie
%Cpu(s):  0.4 us, 11.3 sy,  6.2 ni, 82.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem:  65929380 total, 24151916 used, 41777464 free,   213244 buffers
KiB Swap:   976316 total,        0 used,   976316 free. 21685092 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5626 icinga    25   5  177480  76272   2628 R 100.0  0.1  67280:20 icinga

@faidon yes, done with initial investigation. Looking for consensus on whether or not this is worth digging deeper into.

T173315 host check tuning looks like low-hanging fruit for improving the average host check execution time.

My guess is that this is mostly because, aside from being responsible for spawning the checks, it is also responsible for reaping/processing the results (this is done via temporary files on disk - tmpfs actually in our case) and for writing the status files. It's also a single process, so it's always going to be bound to one CPU. It's not always at 100% CPU, but it does usually sit there for several seconds. I don't think it's a bottleneck yet, but it's good that you caught it, as it is going to become one at some point since our checks are only going to increase in number. Tuning some checks might indeed help.
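For context, the reaping described above is driven by a few main icinga.cfg directives; a hedged sketch (the path shown is illustrative, not necessarily the directory we actually mount on tmpfs):

# directory the parent process scans for temporary check result files
check_result_path=/var/lib/icinga/spool/checkresults
# how often (seconds) the parent reaps result files, and how long each reaping pass may run
check_result_reaper_frequency=10
max_check_result_reaper_time=30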

Dzahn claimed this task.
Dzahn subscribed.

This ticket doesn't seem very actionable in its current state. It seems more like there was mild consensus that the host isn't actually that overloaded: it was good to take a look, but things are OK, and since then the task has been idling.

Given that the subtasks are all closed, I will therefore also call this resolved. Of course, reopen if you disagree...

ArielGlenn subscribed.

I'm going to re-open this; https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=9&fullscreen&orgId=1&var-server=einsteinium&var-datasource=eqiad%20prometheus%2Fops&from=1524151895118&to=1526903931086 shows the load has gone up significantly since May 2nd. Maybe folks still think this is well within the bounds of OK, in which case feel free to close again.

Thanks @ArielGlenn for re-opening this. From a quick look we had two big increases, one on May 2nd and one on May 8th. I think they are related to two changes that each basically add one check per host.

What I found is that both of them are checked with the default interval of 1m, which I think is way overkill for those kinds of checks. I'm sending a patch to increase the check interval to 30 minutes and the retry interval to 5 minutes, and to reduce the retries before a HARD state from 5 to 3.
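In plain Icinga 1.x object syntax the proposal amounts to something like the following (the real change goes through our puppet monitoring definitions, so the service name and layout here are only illustrative):

define service {
    service_description   EDAC errors   ; illustrative name
    check_interval        30            ; minutes between checks, up from the 1m default
    retry_interval        5             ; minutes between retries while in a SOFT state
    max_check_attempts    3             ; checks before a HARD state, down from 5
    ; ... other required directives unchanged ...
}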

@akosiaris @fgiunchedi thoughts?

Change 434455 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: add check and retry intervals for prometheus

https://gerrit.wikimedia.org/r/434455

Change 434456 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: reduce checks frequency for SMART and EDAC

https://gerrit.wikimedia.org/r/434456


I have no prior experience with EDAC checks, so maybe 30 mins is fine. For SMART data I would even say one check every 4h (for a notification every 12h and an entry in the Icinga UI immediately) is fine.

Change 434455 merged by Volans:
[operations/puppet@production] Icinga: add check and retry intervals for prometheus

https://gerrit.wikimedia.org/r/434455

Change 434456 merged by Volans:
[operations/puppet@production] Icinga: reduce checks frequency for SMART and EDAC

https://gerrit.wikimedia.org/r/434456

The CPU usage is already back to 40%; we can decide tomorrow if we want to increase the check_interval further.

@akosiaris the current EDAC check is sum(increase($metric[4d])), so it is checking the increase over the last 4 days; I'd say it is not time-sensitive at all.

CPU usage is back to where it was before, but we could reduce the frequency further to 4h, maybe with a retry every 30m. I don't have a strong opinion on this.
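With Icinga's default interval_length of 60 seconds, that alternative would translate to something like the following (again just a sketch, not a merged change):

    check_interval        240   ; one check every 4h
    retry_interval        30    ; retry every 30m while in a SOFT state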

Thoughts?

I guess that if we have managed to bring the load back down to reasonable levels, it's improbable we are going to see much gain on that front from lowering the frequency more. Being a bit more pedantic, it does sound better to lower the frequency, since we don't gain much from the increased frequency anyway. But I have no really strong opinion either.

Ack, I propose to leave it as is for now and re-evaluate once Filippo is also back. Resolving for now, feel free to re-open.