
shinken checks broken with "UNKNOWN: execution of the check script exited with exception ..."
Closed, Resolved · Public

Description

on shinken labs monitoring (T88142 is about describing and puppetizing it),

most (all?) checks don't work.

there are 4 different types of failure:

  1. UNKNOWN: execution of the check script exited with exception 'UNKNOWN'
  2. 100.00% of data above the critical threshold [0.0]
  3. No valid datapoints found
  4. UNKNOWN: execution of the check script exited with exception list index out of range

2 could be real instance problems (or just a failure to connect to graphite?). 1, 3 and 4 all look like problems with the check scripts themselves.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Krenair subscribed.

I think I'm responsible for some of these

Change 330329 had a related patch set uploaded (by Alex Monk):
check_graphite: Fix some KeyError exceptions in SeriesThreshold.format_message

https://gerrit.wikimedia.org/r/330329

Change 330332 had a related patch set uploaded (by Alex Monk):
check_graphite: Fix some IndexError exceptions in Threshold.parse_result

https://gerrit.wikimedia.org/r/330332

The first patch handles #1. #2 are errors to be dealt with by project admins. #3 should be investigated by project admins (for some reason the instances are not reporting the expected data to graphite, maybe check on diamond). #4 is actually #3 in disguise, which the second patch should fix.
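To illustrate how "no valid datapoints" can surface as "list index out of range", here is a minimal sketch. It is not the actual check_graphite code: the function names, the sample payloads and the returned status tuples are all hypothetical. The only thing taken as given is the JSON shape returned by the Graphite render API (a list of series, each with a "target" and a list of [value, timestamp] datapoints).

```python
import json

# Hypothetical render-API payloads (format=json). These are made-up samples,
# not data from the affected instances.
EMPTY_RESPONSE = "[]"
NULL_ONLY_RESPONSE = (
    '[{"target": "host.cpu", "datapoints": '
    "[[null, 1483488000], [null, 1483488060]]}]"
)
GOOD_RESPONSE = (
    '[{"target": "host.cpu", "datapoints": '
    "[[1.5, 1483488000], [null, 1483488060], [3.0, 1483488120]]}]"
)


def naive_last_value(raw):
    """Unguarded parsing: raises IndexError when no usable data comes back."""
    series = json.loads(raw)
    # series[0] fails on an empty series list; values[-1] fails when every
    # datapoint is null. Both produce "list index out of range".
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    return values[-1]


def guarded_last_value(raw):
    """Guarded parsing: report missing data explicitly instead of crashing."""
    series = json.loads(raw)
    if not series:
        return ("UNKNOWN", "No valid datapoints found")
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return ("UNKNOWN", "No valid datapoints found")
    return ("OK", values[-1])


if __name__ == "__main__":
    for name, raw in [
        ("empty", EMPTY_RESPONSE),
        ("null-only", NULL_ONLY_RESPONSE),
        ("good", GOOD_RESPONSE),
    ]:
        try:
            print(name, "naive:", naive_last_value(raw))
        except IndexError as exc:
            print(name, "naive raised IndexError:", exc)
        print(name, "guarded:", guarded_last_value(raw))
```

With the guard in place, both the empty and the null-only responses report the missing data directly, so the two failure types collapse into one that points at the reporting side rather than at the check script.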

Krenair renamed this task from shinken checks broken with "UNKNOWN"/no valid datapoints to shinken checks broken with "UNKNOWN: execution of the check script exited with exception ...". · Jan 4 2017, 12:01 AM
Krenair updated the task description.

Change 330329 merged by Dzahn:
check_graphite: Fix some KeyError exceptions in SeriesThreshold.format_message

https://gerrit.wikimedia.org/r/330329

Change 330332 merged by Dzahn:
check_graphite: Fix some IndexError exceptions in Threshold.parse_result

https://gerrit.wikimedia.org/r/330332

follow-up for T122332 T105218 (kind of), merged

thank you for the fixes!

in prod icinga we have 1 UNKNOWN of the "execution of the check script exited with exception list index out of range" type. that is also using check_graphite

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=Uploads+HTTP+5xx+reqs%2Fmin#comments