icinga hp raid check timeout on busy ms-be and db machines
Closed, ResolvedPublic

Assigned To

Authored By

	fgiunchedi
	Jul 25 2016, 10:15 AM

Description

During disk-intensive operations (e.g. a swift rebalance or a mysql alter table) we've seen the check_hpssacli nrpe check time out on newer HP machines. The timeout has already been increased to 40s, but sometimes that is not sufficient.

Swift machines are a bit peculiar in that they have all disks in raid0, so for each disk there's a physical volume and a logical volume to be checked.

MySQL machines are not peculiar in that way, but the timeout also happens there when I/O activity is high.
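One way to keep a slow RAID query from surfacing as an opaque service check timeout is to have the plugin enforce its own deadline and report UNKNOWN itself. The following is a minimal sketch, not the actual check_hpssacli code: `run_check` and its parameters are hypothetical names, and the command to run is passed in by the caller rather than assumed. Only the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are taken as given.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(cmd, timeout_s):
    """Run cmd (a list of arguments) with a hard deadline.

    Returns (status, message). A command that exceeds timeout_s is
    killed and reported as UNKNOWN instead of hanging the caller.
    This is a hypothetical helper for illustration only.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"UNKNOWN: command did not finish within {timeout_s}s"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: could not run command: {exc}"
    if proc.returncode != 0:
        return UNKNOWN, f"UNKNOWN: command exited with {proc.returncode}"
    return OK, proc.stdout.strip()
```

If the deadline passed to such a wrapper is set slightly below the nrpe timeout, a busy host would yield an explicit UNKNOWN with a reason rather than a generic timeout.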

Related Objects

Status	Assigned	Task
Open	None	T294906 Puppet Improvements
Duplicate	jbond	T265138 Work required to prepare for puppet 7
Resolved	SLyngshede-WMF	T273673 replace all puppet crons with systemd timers
Open	None	T132324 Tracking and Reducing cron-spam to root@
Resolved	jcrespo	T84178 investigate RAID BBU auto-learn on db hosts
Resolved	faidon	T84050 Refactor RAID checks (check-raid)
Resolved	faidon	T97998 Add RAID monitoring for HP servers
Resolved	herron	T141252 icinga hp raid check timeout on busy ms-be and db machines
Resolved	herron	T172921 Nrpe command_timeout and "Service Check Timed Out" errors

Event Timeline

fgiunchedi created this task. Jul 25 2016, 10:15 AM

fgiunchedi merged a task: T147916: Investigate check_hpssacli number of calls / efficiency. Oct 12 2016, 11:25 AM

fgiunchedi merged a task: T138597: investigate speeding up hp raid checks. Oct 12 2016, 11:32 AM

fgiunchedi added a subscriber: Zppix.

jcrespo renamed this task from "icinga hp raid check timeout on busy ms-be machines" to "icinga hp raid check timeout on busy ms-be and db machines". Mar 22 2017, 1:00 PM

jcrespo added a project: DBA.

jcrespo updated the task description.

A good example of a db server where this happens during big alter tables: dbstore2001

jcrespo moved this task from Triage to Meta/Epic on the DBA board. Mar 22 2017, 7:51 PM

I've checked, and the check currently in use does too much. We probably do not need such a thorough check every time icinga runs, and doing less would solve the timeout issues. Just returning the full output of the configuration and status is not that heavy, so we may be able to compromise here.
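To illustrate the compromise, a lighter check could fetch the full configuration and status once and evaluate every drive from that single output, instead of querying each volume separately. This is only a sketch under stated assumptions: the function name `summarize` is hypothetical, and the line format it parses (`logicaldrive 1 (..., OK)`, with the status as the last comma-separated field) is an assumed example of what the RAID tool prints, not something confirmed in this task.

```python
import re

# Standard Nagios/Icinga plugin exit codes used here.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Assumed line format, e.g. "logicaldrive 1 (3.6 TB, RAID 0, OK)".
DRIVE_RE = re.compile(
    r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$"
)


def summarize(output):
    """Evaluate all drives from one captured status dump.

    Returns (status, message). The status of each drive is taken to be
    the last comma-separated field inside the parentheses (assumption).
    """
    total = 0
    bad = []
    for line in output.splitlines():
        match = DRIVE_RE.match(line)
        if not match:
            continue  # skip headers, blank lines, controller info
        total += 1
        kind, ident, fields = match.groups()
        status = fields.split(",")[-1].strip()
        if status != "OK":
            bad.append(f"{kind} {ident}: {status}")
    if total == 0:
        return UNKNOWN, "UNKNOWN: no drives found in output"
    if bad:
        return CRITICAL, "CRITICAL: " + "; ".join(bad)
    return OK, f"OK: {total} drives checked"
```

With this shape, a swift host with many single-disk raid0 volumes would cost one call to the RAID tool per check run rather than one or more calls per disk.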

herron added a subtask: T172921: Nrpe command_timeout and "Service Check Timed Out" errors. Aug 10 2017, 6:52 PM

herron closed subtask T172921: Nrpe command_timeout and "Service Check Timed Out" errors as Resolved. Aug 16 2017, 4:14 PM

Should we close this too, or is it too early to say? @herron

Sure, sounds good to me. We could always reopen and evaluate if the issue occurs again in the future.
