
check_graphite - "UNKNOWN: More than half of the datapoints are undefined "
Closed, Resolved, Public

Description

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader

833 x "UNKNOWN: More than half of the datapoints are undefined " or "UNKNOWN: No valid datapoints found "

for all monitoring using check_graphite

modules/nagios_common/files/check_commands/check_graphite: 'UNKNOWN', 'More than half of the datapoints are undefined')

mostly affects the HHVM monitoring, such as HHVM busy threads and HHVM queue size, but also other checks such as those on labstore hosts

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn subscribed.
Dzahn triaged this task as Medium priority. Jul 9 2015, 2:18 AM

this seems to happen when graphite (or statsd) shows null for recent values. check_graphite looks back 10m by default, so 5 null datapoints would trigger the UNKNOWN. reportedly it usually clears within the next icinga check cycle, but why there's so much lag at times is still unknown.
it could be data flushed by statsd not making it to graphite, graphite caches lagging behind and not returning recent data, or data not making it into statsd at all. the latter is unlikely, since hhvm data doesn't show up with holes, so it eventually catches up
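
To make the two UNKNOWN messages concrete, here is a minimal sketch of how a check could classify a window of datapoints. It is not the actual check_graphite code: the function name `classify_datapoints` is hypothetical, undefined values are assumed to arrive as `None`, and treating exactly half undefined as still acceptable is an assumption that may differ from the real script.

```python
def classify_datapoints(datapoints):
    """Classify one look-back window of graphite values.

    datapoints: list of numbers, with None standing in for an
    undefined (null) value. Returns a (status, message) tuple.
    """
    valid = [value for value in datapoints if value is not None]

    # Nothing usable at all, e.g. the metric is not being pushed.
    if not valid:
        return ("UNKNOWN", "No valid datapoints found")

    # Assumption: strictly more than half undefined triggers UNKNOWN.
    undefined = len(datapoints) - len(valid)
    if undefined > len(datapoints) / 2:
        return ("UNKNOWN", "More than half of the datapoints are undefined")

    return ("OK", "%d of %d datapoints defined" % (len(valid), len(datapoints)))


if __name__ == "__main__":
    # Recent values lagging behind: the newest samples are still null.
    lagging_window = [3.0, 4.0, 2.0, 5.0, None, None, None, None, None, None]
    print(classify_datapoints(lagging_window))
```

Under this sketch, a window whose most recent values have not yet arrived flips to UNKNOWN and recovers on its own once graphite returns them, which would match the behaviour of clearing on the next check cycle.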

another related case for UNKNOWN is when datapoints are not being pushed at all. for example, "mediawiki memcached error rate" is taken from logstash; in that case no log lines mean no datapoints, I think, and thus the alarm goes UNKNOWN.

taking another look at this, I'm going to mark it as blocked by T101141: UDP rcvbuferrors and inerrors on graphite hosts, so that the inbound UDP errors on graphite get fixed first, since they might be the root cause
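
For context on the counters named in T101141: on Linux, UDP `InErrors` and `RcvbufErrors` are exposed in `/proc/net/snmp` as a header line and a value line, both prefixed with `Udp:`. The following is a rough sketch of reading them, not the tooling actually used on the graphite hosts; the function names are hypothetical.

```python
def parse_udp_counters(snmp_text):
    """Return a dict of UDP counters from /proc/net/snmp-style text.

    The file holds pairs of lines per protocol: one with counter
    names and one with values. An empty dict is returned when the
    Udp pair is missing.
    """
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        return {}
    header, values = udp_lines[0], udp_lines[1]
    # Skip the leading "Udp:" token on both lines.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def read_local_udp_counters(path="/proc/net/snmp"):
    """Read and parse the counters on the local (Linux) host."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())
```

Sampling `RcvbufErrors` twice and comparing the values would show whether the receive buffer is overflowing during the interval, which is one way datapoints could be lost before they reach graphite.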

Change 274716 had a related patch set uploaded (by Filippo Giunchedi):
graphite: switch carbon-c-relay to carbon_ch hash

https://gerrit.wikimedia.org/r/274716

Change 274716 merged by Filippo Giunchedi:
graphite: switch carbon-c-relay to carbon_ch hash

https://gerrit.wikimedia.org/r/274716

as expected, the number of UNKNOWN dropped significantly, to about 1/4th (including soft and hard states)

neon:~$ fgrep -c UNKNOWN /var/log/icinga/icinga.log*
/var/log/icinga/icinga.log:176
/var/log/icinga/icinga.log.1:132
/var/log/icinga/icinga.log.2:218
/var/log/icinga/icinga.log.3:799
/var/log/icinga/icinga.log.4:754
/var/log/icinga/icinga.log.5:750
/var/log/icinga/icinga.log.6:781
/var/log/icinga/icinga.log.7:850
neon:~$ ls -latr /var/log/icinga/icinga.log*
-rw-r--r-- 1 icinga adm 11903466 Mar  3 06:25 /var/log/icinga/icinga.log.7
-rw-r--r-- 1 icinga adm 11759027 Mar  4 06:25 /var/log/icinga/icinga.log.6
-rw-r--r-- 1 icinga adm 11760248 Mar  5 06:25 /var/log/icinga/icinga.log.5
-rw-r--r-- 1 icinga adm 11763997 Mar  6 06:25 /var/log/icinga/icinga.log.4
-rw-r--r-- 1 icinga adm 11749754 Mar  7 06:25 /var/log/icinga/icinga.log.3
-rw-r--r-- 1 icinga adm 11572056 Mar  8 06:25 /var/log/icinga/icinga.log.2
-rw-r--r-- 1 icinga adm 11568072 Mar  9 06:25 /var/log/icinga/icinga.log.1
-rw-r--r-- 1 icinga adm  8412209 Mar  9 12:15 /var/log/icinga/icinga.log
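
The size of the drop can be checked from the counts above. This is a back-of-the-envelope sketch: it assumes icinga.log.3 through icinga.log.7 represent the period before the change and icinga.log.1 and icinga.log.2 the period after, and it leaves out the current icinga.log because it only covers part of a day.

```python
# UNKNOWN line counts copied from the fgrep output above.
unknown_counts = {
    "icinga.log.1": 132,
    "icinga.log.2": 218,
    "icinga.log.3": 799,
    "icinga.log.4": 754,
    "icinga.log.5": 750,
    "icinga.log.6": 781,
    "icinga.log.7": 850,
}

# Assumption: the split between "before" and "after" is inferred from
# where the counts drop, not from the exact merge time of the change.
before = [unknown_counts["icinga.log.%d" % n] for n in range(3, 8)]
after = [unknown_counts["icinga.log.%d" % n] for n in range(1, 3)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


ratio = mean(after) / mean(before)
print("daily UNKNOWN before: %.1f, after: %.1f, ratio: %.2f"
      % (mean(before), mean(after), ratio))
```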

ATM there are two outstanding UNKNOWN with "no valid datapoints found", both active for >30d

Parsoid HTTP 5xx reqs/min
UNKNOWN	2016-04-11 13:30:01	48d 1h 26m 42s	3/3	UNKNOWN: No valid datapoints found 	
Throughput of EventLogging NavigationTiming events
UNKNOWN	2016-04-11 13:29:58	76d 3h 58m 14s	3/3	UNKNOWN: No valid datapoints found

and we're down to a few UNKNOWN in the logs

neon:~$ fgrep -c UNKNOWN /var/log/icinga/icinga.log*
/var/log/icinga/icinga.log:43
/var/log/icinga/icinga.log.1:85
/var/log/icinga/icinga.log.2:72
/var/log/icinga/icinga.log.3:30
/var/log/icinga/icinga.log.4:69
/var/log/icinga/icinga.log.5:71
/var/log/icinga/icinga.log.6:35
/var/log/icinga/icinga.log.7:428