I noticed this because Icinga was showing an UNKNOWN for excessive RX traffic on lvs2007, where the Prometheus query was evaluating to NaN. This was because the query happened to hit prometheus2004 several times in a row.
✔️ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2003.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2003.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2007:9100'}, 'value': [1583292619.18, '54.86336437861125']}, {'metric': {'instance': 'lvs2008:9100'}, 'value': [1583292619.18, '49.30169935226441']}, {'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292619.18, '162.0370532353719']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292619.18, '18.21561772028605']}]
All metrics within thresholds.
vs
❌ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2004.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2004.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292646.276, '162.03705183664957']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292646.276, '18.30215384165446']}]
All metrics within thresholds.
I tried curling lvs2007:9100 directly from both Prometheus servers and that worked fine. Nothing obvious in the logs either.
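To make the discrepancy concrete, the two debug vectors above can be diffed programmatically. A minimal sketch, using the sample data from the transcripts (the variable names and helper are hypothetical, not part of check_prometheus_metric.py):

```python
# Result vectors as printed by --debug, one per Prometheus server.
prom2003 = [
    {'metric': {'instance': 'lvs2007:9100'}, 'value': [1583292619.18, '54.86336437861125']},
    {'metric': {'instance': 'lvs2008:9100'}, 'value': [1583292619.18, '49.30169935226441']},
    {'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292619.18, '162.0370532353719']},
    {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292619.18, '18.21561772028605']},
]
prom2004 = [
    {'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292646.276, '162.03705183664957']},
    {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292646.276, '18.30215384165446']},
]

def instances(vector):
    """Extract the set of instance labels from an instant-query result vector."""
    return {sample['metric']['instance'] for sample in vector}

# Instances present in prometheus2003's answer but absent from prometheus2004's.
missing = sorted(instances(prom2003) - instances(prom2004))
print(missing)  # ['lvs2007:9100', 'lvs2008:9100']
```

So prometheus2004 is silently dropping the lvs2007 and lvs2008 series for the same query, which is why the check returns UNKNOWN whenever it happens to hit that server.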