
Icinga raid handler improvements
Closed, Resolved · Public

Description

In T174777#3572694 it was noticed that the RAID handler didn't properly open the task for that specific host on failure.

Upon looking at the Icinga logs (see below), it's clear that there are two issues to fix here:

  1. An exception was raised while gathering the RAID status. At that point all the safety checks that avoid opening tasks for intermittent issues (connection errors, timeouts) have already passed, so we should catch the exception and open the task and ack the Icinga alarm anyway, adding a fixed message to the task stating that gathering the RAID status failed.
  2. For ms-be2023 the hpssacli RAID-gathering script returns such verbose output (~8k) that, even once compressed, it exceeds the hardcoded NRPE limit of 1024 bytes by a few bytes.
2017-08-31 22:36:07 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=1, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:36:07 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:40:58 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=2, service_description='HP RAID', service_state='CRITICAL', service_state_type='SOFT', skip_nrpe=False)
2017-08-31 22:40:58 [DEBUG] raid_handler::main: Nothing to do, exiting
2017-08-31 22:46:06 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2023', message='CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK', message_remain='', raid_type='hpssacli', service_attempts=3, service_description='HP RAID', service_state='CRITICAL', service_state_type='HARD', skip_nrpe=False)
2017-08-31 22:46:09 [ERROR] raid_handler::<module>: Unable to handle RAID check alert
Traceback (most recent call last):
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 273, in <module>
    main()
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 250, in main
    raid_status += get_raid_status(args.host_address, args.raid_type)
  File "/usr/lib/nagios/plugins/eventhandlers/raid_handler", line 116, in get_raid_status
    status = zlib.decompress(stdout.replace('###NULL###', '\x00'))
error: Error -5 while decompressing data: incomplete or truncated stream
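The error above can be reproduced independently: NRPE truncates the payload at its hardcoded limit, so any compressed status larger than 1024 bytes arrives as an incomplete zlib stream. A minimal sketch (sizes and data are illustrative, not taken from the host involved):

```python
import os
import zlib

# Simulate a verbose RAID report whose compressed form is larger than
# the 1024-byte NRPE payload limit (random data compresses poorly, which
# guarantees an oversize payload for this demonstration).
report = os.urandom(8 * 1024)
compressed = zlib.compress(report)
assert len(compressed) > 1024

# NRPE silently cuts the payload at its hardcoded limit...
truncated = compressed[:1024]

# ...and decompressing the remainder fails like the traceback above.
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    print(exc)  # e.g. "Error -5 while decompressing data: ..."
```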


Event Timeline

Change 375755 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: raid handler, catch zlib exceptions

https://gerrit.wikimedia.org/r/375755
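The first change boils down to treating a decompression failure as a recoverable condition. A hypothetical sketch of `get_raid_status` with that fix (Python 3; the fallback wording is assumed, and the `###NULL###` replacement mirrors the line shown in the traceback):

```python
import zlib

# Assumed fallback text; the actual wording in change 375755 may differ.
FALLBACK_MESSAGE = '\nGathering the RAID status failed; check the host manually.\n'

def get_raid_status(stdout: bytes) -> str:
    """Decode the NRPE-transported, zlib-compressed RAID status."""
    try:
        # The handler encodes NUL bytes as ###NULL### for transport.
        return zlib.decompress(stdout.replace(b'###NULL###', b'\x00')).decode()
    except zlib.error:
        # Truncated or corrupted payload: still open the task and ack
        # the alarm, with a fixed message instead of the detailed status.
        return FALLBACK_MESSAGE
```

With this in place the handler reaches the task-creation step even when NRPE truncates the payload.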

Change 375756 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Raid: optimize get raid status for HP controllers

https://gerrit.wikimedia.org/r/375756
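The second change addresses the payload size itself. One way to do that (purely a sketch; the actual optimization in change 375756 is not shown here) is to drop the verbose per-drive detail lines before compressing, keeping only the lines the task needs:

```python
import zlib

def compact_hp_status(raw_output: str) -> bytes:
    """Keep only the relevant lines of a verbose hpssacli report.

    Hypothetical filter: the keyword list and line format are assumptions,
    chosen so the compressed payload stays under the 1024-byte NRPE limit.
    """
    keywords = ('Slot', 'Failed', 'Controller', 'Battery', 'Capacitor')
    lines = [line.strip() for line in raw_output.splitlines()
             if any(word in line for word in keywords)]
    return zlib.compress('\n'.join(lines).encode())
```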

Change 375755 merged by Volans:
[operations/puppet@production] Icinga: raid handler, catch zlib exceptions

https://gerrit.wikimedia.org/r/375755

Change 375756 merged by Volans:
[operations/puppet@production] Raid: optimize get raid status for HP controllers

https://gerrit.wikimedia.org/r/375756

Volans removed a project: Patch-For-Review.