
icinga hp raid check timeout on busy ms-be and db machines
Closed, Resolved (Public)

Description

During disk-intensive operations (e.g. a swift rebalance or a mysql alter table) we've seen the check_hpssacli NRPE check time out on newer HP machines. The timeout has already been increased to 40s, but sometimes even that isn't sufficient.
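To make the failure mode concrete, here is a minimal sketch of a wrapper that bounds a slow command and reports a timeout as UNKNOWN instead of hanging. It is not the actual check_hpssacli code: the helper name `run_with_timeout` is hypothetical, and the 40s default only mirrors the value mentioned above. The exit codes are the standard Nagios/Icinga plugin ones.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_with_timeout(cmd, timeout=40.0):
    """Run cmd and return (exit_code, message) in plugin style.

    Hypothetical helper: a timeout or a missing binary maps to UNKNOWN,
    a non-zero exit status maps to CRITICAL.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return UNKNOWN, f"UNKNOWN: command timed out after {timeout}s"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: could not run command: {exc}"
    if proc.returncode != 0:
        return CRITICAL, f"CRITICAL: command exited with {proc.returncode}"
    return OK, proc.stdout
```

Raising the timeout only moves the threshold. On a busy machine the command itself still has to finish, so doing less work per run is the other lever.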

Swift machines are a bit peculiar in that they have all disks in RAID0, so for each disk there's both a physical volume and a logical volume to be checked.

MySQL machines have no such peculiarity, yet the timeout also happens there when there is high I/O activity.

Event Timeline

jcrespo renamed this task from icinga hp raid check timeout on busy ms-be machines to icinga hp raid check timeout on busy ms-be and db machines. Mar 22 2017, 1:00 PM
jcrespo added a project: DBA.
jcrespo updated the task description.

Good example of a db server where that happens with big alter tables: dbstore2001

I've checked, and the check currently in use does too much. We probably do not need such a thorough check every time icinga runs, and a lighter one would solve the timeout issues. Just returning the full output of the configuration and status is not that heavy, so we may be able to compromise here.
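As a rough sketch of what such a lighter check could look like, the snippet below takes one already-collected configuration dump and derives a single plugin result from it, rather than querying each drive separately. Everything here is an assumption for illustration: the function name `summarize`, the regex, and the `SAMPLE` text, which is hand-written to resemble a controller configuration listing and is not real output from any machine.

```python
import re

# Nagios/Icinga plugin exit codes used by this sketch.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Matches lines such as "logicaldrive 1 (136.7 GB, RAID 0, OK)".
# Assumption: the last comma-separated field in parentheses is the status.
DRIVE_RE = re.compile(
    r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$"
)

# Hand-written, illustrative input; not captured from a real controller.
SAMPLE = """
   array A (SAS, Unused Space: 0  MB)
      logicaldrive 1 (136.7 GB, RAID 0, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
   array B (SAS, Unused Space: 0  MB)
      logicaldrive 2 (136.7 GB, RAID 0, Failed)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, Failed)
"""


def summarize(config_text):
    """Return (exit_code, message) from one pass over a config dump."""
    drives = []
    for line in config_text.splitlines():
        match = DRIVE_RE.match(line)
        if match:
            kind, ident, fields = match.groups()
            status = fields.split(",")[-1].strip()
            drives.append((kind, ident, status))
    if not drives:
        return UNKNOWN, "UNKNOWN: no drives found in output"
    bad = [f"{k} {i}: {s}" for k, i, s in drives if s != "OK"]
    if bad:
        return CRITICAL, "CRITICAL: " + "; ".join(bad)
    return OK, f"OK: {len(drives)} drives checked"
```

With this shape, a swift host with one logical drive per disk still costs a single command invocation per check run, and the per-drive work is only string parsing.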

Should we close this too, or is it too early to say? @herron

herron claimed this task.

Sure, sounds good to me. We could always reopen and evaluate if the issue occurs again in the future.