
Icinga raid handler improvements
Closed, Resolved · Public

Description

In T174777#3572694 it was noticed that the RAID handler didn't properly open the task for that specific host on failure.

Upon looking at the Icinga logs (see below), it's clear that there are two issues to fix here:

  1. An exception was raised while gathering the RAID status. At that point all the safety checks that avoid opening tasks for intermittent issues (connection errors, timeouts) have already passed, so we should catch the exception and open the task and ack the Icinga alarm anyway, adding a fixed message to the task stating that gathering the RAID status failed.
  2. For ms-be2023 the hpssacli RAID-gathering script returns such verbose output (~8k) that, even once compressed, it exceeds the hardcoded NRPE limit of 1024 bytes by a few bytes.
2017-08-31 22:36:07 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=1, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:36:07 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:40:58 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=2, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:40:58 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:46:06 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=3, service_description='HP RAID', service_state='CRITICAL', service_state_type='HARD', skip_nrpe=False)
2017-08-31 22:46:09 [ERROR] raid_handler::<module>: Unable to handle RAID check alert
Traceback (most recent call last):
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 273, in <module>
    main()
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 250, in main
    raid_status += get_raid_status(args.host_address, args.raid_type)
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 116, in get_raid_status
    status = zlib.decompress(stdout.replace('###NULL###', '\x00'))
error: Error -5 while decompressing data: incomplete or truncated stream
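The error above can be reproduced independently: NRPE truncates the payload at its hardcoded limit, so any compressed status larger than 1024 bytes arrives as an incomplete zlib stream. A minimal sketch (sizes and data are illustrative, not taken from the host involved):

```python
import os
import zlib

# Simulate a verbose RAID report whose compressed form is larger than
# the 1024-byte NRPE payload limit (random data compresses poorly, which
# guarantees an oversize payload for this demonstration).
report = os.urandom(8 * 1024)
compressed = zlib.compress(report)
assert len(compressed) > 1024

# NRPE silently cuts the payload at its hardcoded limit...
truncated = compressed[:1024]

# ...and decompressing the remainder fails like the traceback above.
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    print(exc)  # e.g. "Error -5 while decompressing data: ..."
```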


Event Timeline

Change 375755 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: raid handler, catch zlib exceptions

https://gerrit.wikimedia.org/r/375755
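The first change boils down to treating a decompression failure as a recoverable condition. A hypothetical sketch of `get_raid_status` with that fix (Python 3; the fallback wording is assumed, and the `###NULL###` replacement mirrors the line shown in the traceback):

```python
import zlib

# Assumed fallback text; the actual wording in change 375755 may differ.
FALLBACK_MESSAGE = '\nGathering the RAID status failed; check the host manually.\n'

def get_raid_status(stdout: bytes) -> str:
    """Decode the NRPE-transported, zlib-compressed RAID status."""
    try:
        # The handler encodes NUL bytes as ###NULL### for transport.
        return zlib.decompress(stdout.replace(b'###NULL###', b'\x00')).decode()
    except zlib.error:
        # Truncated or corrupted payload: still open the task and ack
        # the alarm, with a fixed message instead of the detailed status.
        return FALLBACK_MESSAGE
```

With this in place the handler reaches the task-creation step even when NRPE truncates the payload.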

Change 375756 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Raid: optimize get raid status for HP controllers

https://gerrit.wikimedia.org/r/375756
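The second change addresses the payload size itself. One way to do that (purely a sketch; the actual optimization in change 375756 is not shown here) is to drop the verbose per-drive detail lines before compressing, keeping only the lines the task needs:

```python
import zlib

def compact_hp_status(raw_output: str) -> bytes:
    """Keep only the relevant lines of a verbose hpssacli report.

    Hypothetical filter: the keyword list and line format are assumptions,
    chosen so the compressed payload stays under the 1024-byte NRPE limit.
    """
    keywords = ('Slot', 'Failed', 'Controller', 'Battery', 'Capacitor')
    lines = [line.strip() for line in raw_output.splitlines()
             if any(word in line for word in keywords)]
    return zlib.compress('\n'.join(lines).encode())
```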

Change 375755 merged by Volans:
[operations/puppet@production] Icinga: raid handler, catch zlib exceptions

https://gerrit.wikimedia.org/r/375755

Change 375756 merged by Volans:
[operations/puppet@production] Raid: optimize get raid status for HP controllers

https://gerrit.wikimedia.org/r/375756

Volans removed a project: Patch-For-Review.