In T174777#3572694 it was noticed that the RAID handler didn't properly open the task for that specific host on failure.
Upon looking at the Icinga logs (see below), it's clear that there are two issues to fix here:
- An exception was raised while gathering the RAID status. At that point we've already passed all the safety checks that avoid opening tasks for intermittent issues (connection, timeout), so we should catch that exception and go ahead with opening the task and acking the Icinga alarm anyway, adding a fixed message to the task stating that gathering the RAID status failed.
- For ms-be2023 the hpssacli RAID gathering script returns overly verbose output (~8k) that, once compressed, exceeds the hardcoded NRPE limit of 1024 bytes by a few bytes.
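
A minimal sketch of the first fix, in Python 3 and not taken from the actual handler code. `safe_raid_status` and `FALLBACK_MESSAGE` are hypothetical names, and the status-gathering function is passed in as a parameter so the example stays self-contained. The idea is only that any failure while gathering the status is logged and replaced by a fixed message, so that the task can still be opened and the alarm acked:

```python
import logging

logger = logging.getLogger('raid_handler')

# Hypothetical fixed message; the real wording is up to the handler's authors.
FALLBACK_MESSAGE = 'Failed to gather the RAID status, please check the host manually.'


def safe_raid_status(get_status, host_address, raid_type):
    """Return the RAID status, or a fixed message if gathering it fails.

    get_status stands in for the handler's own status-gathering function.
    """
    try:
        return get_status(host_address, raid_type)
    except Exception:
        # Broad on purpose: the safety checks have already passed at this
        # point, so the caller should still open the task and ack the alarm.
        logger.exception('Unable to gather the RAID status')
        return FALLBACK_MESSAGE
```

With something like this, the caller would append the returned string to the task description whether or not the gathering succeeded.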
2017-08-31 22:36:07 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=1, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:36:07 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:40:58 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=2, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:40:58 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:46:06 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=3, service_description='HP RAID', service_state='CRITICAL', service_state_type='HARD', skip_nrpe=False)
2017-08-31 22:46:09 [ERROR] raid_handler::<module>: Unable to handle RAID check alert
Traceback (most recent call last):
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 273, in <module>
    main()
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 250, in main
    raid_status += get_raid_status(args.host_address, args.raid_type)
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 116, in get_raid_status
    status = zlib.decompress(stdout.replace('###NULL###', '\x00'))
error: Error -5 while decompressing data: incomplete or truncated stream
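
The `Error -5` at the end of the traceback is what zlib reports when a compressed stream is cut short, which is consistent with the compressed output being slightly over the 1024-byte limit. The following Python 3 sketch reproduces that failure with made-up data and shows one possible size check. `NRPE_MAX_BYTES`, `fits_in_nrpe` and `decompress_or_none` are hypothetical names for illustration, not part of the handler or of the gathering script:

```python
import zlib

NRPE_MAX_BYTES = 1024  # the hardcoded NRPE limit mentioned in this task


def fits_in_nrpe(raw_output, limit=NRPE_MAX_BYTES):
    """Return True if the zlib-compressed output fits within the limit."""
    return len(zlib.compress(raw_output)) <= limit


def decompress_or_none(payload):
    """Decompress payload, returning None if the stream is truncated or corrupt."""
    try:
        return zlib.decompress(payload)
    except zlib.error:
        return None


# Made-up sample data, not real hpssacli output.
sample = b'example RAID status line\n' * 200
compressed = zlib.compress(sample)

# Dropping the last few bytes simulates a payload cut off at the size limit.
truncated = compressed[:-4]

print(decompress_or_none(compressed) == sample)
print(decompress_or_none(truncated) is None)
```

A check like `fits_in_nrpe` could run on the sending side, so that output which would not fit is trimmed before compression instead of being truncated in transit.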