icinga-exporter failing on alert hosts
Closed, Resolved (Public)

Description

Looks like the icinga-exporter is failing on the new Buster alert* hosts:

Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]: Exception happened during processing of request from ('10.64.16.38', 42734)
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]: ----------------------------------------
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]: Exception happened during processing of request from ('10.64.0.123', 45122)
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]: Traceback (most recent call last):
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3.7/socketserver.py", line 650, in process_request_thread
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     self.finish_request(request, client_address)
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     self.RequestHandlerClass(request, client_address, self)
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3.7/socketserver.py", line 720, in __init__
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     self.handle()
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3.7/http/server.py", line 426, in handle
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     self.handle_one_request()
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3.7/http/server.py", line 414, in handle_one_request
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     method()
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     output = encoder(registry)
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     for metric in registry.collect():
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3/dist-packages/prometheus_client/registry.py", line 75, in collect
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     for metric in collector.collect():
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:   File "/usr/lib/python3/dist-packages/prometheus_icinga_exporter/exporter.py", line 243, in collect
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]:     "Retry limit reached.  Status file reads were incomplete."
Aug 17 07:08:14 alert1001 prometheus-icinga-exporter[27522]: RuntimeError: Retry limit reached.  Status file reads were incomplete.

Event Timeline

Looks like the problem is that /var/icinga-tmpfs is full, so the status.dat file can't be fully written and therefore can't be fully read by icinga-exporter.
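For anyone wanting to check for this condition quickly, here is a minimal sketch that reports how full the filesystem behind a path is. The function names and the 0.9 threshold are made up for illustration; nothing here comes from icinga-exporter or the Puppet repo.

```python
import shutil


def tmpfs_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Avoid dividing by zero on pseudo-filesystems that report no size.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.9):
    """True when the filesystem holding path is at or above threshold.

    The 0.9 default is an arbitrary illustrative value, not a tuned one.
    """
    return tmpfs_usage_fraction(path) >= threshold
```

On an alert host this would be called as `is_nearly_full("/var/icinga-tmpfs")`; a full tmpfs would match the incomplete status.dat reads reported by the exporter.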

I've cleaned up alert1001's /var/icinga-tmpfs and left alert2001 alone for inspection. Looks like lots of partial status files are being written and never cleaned up:

-rw-------  1 nagios nagios 10014720 Aug  6 06:33 icinga.tmp0fbZnM
-rw-------  1 nagios nagios 17862656 Aug 12 11:33 icinga.tmpFqVLZ1
-rw-------  1 nagios nagios 46108672 Aug  7 14:33 icinga.tmpIWXBHb
-rw-------  1 nagios nagios 24342528 Aug 14 02:33 icinga.tmpJ9rRHP
-rw-------  1 nagios nagios  5001216 Aug 11 23:33 icinga.tmpJSB6ha
-rw-------  1 nagios nagios 77697024 Aug 10 05:33 icinga.tmpQwHlGE
-rw-------  1 nagios nagios  8515584 Aug 15 21:33 icinga.tmpSL2iqx
-rw-------  1 nagios nagios  9932800 Aug 11 21:33 icinga.tmpSbYPsE
-rw-------  1 nagios nagios 49369088 Aug  9 20:33 icinga.tmpUOioSu
-rw-------  1 nagios nagios  4792320 Aug  9 17:33 icinga.tmpXLXIRY
-rw-------  1 nagios nagios 78831616 Aug  6 18:33 icinga.tmpZagZwv
-rw-------  1 nagios nagios 31854592 Aug 14 01:33 icinga.tmpbZ9qbi
-rw-------  1 nagios nagios 63598592 Aug 16 13:33 icinga.tmpdcN4Zt
-rw-------  1 nagios nagios 33959936 Aug 17 04:33 icinga.tmph5Z3SJ
-rw-------  1 nagios nagios 54128640 Aug 14 12:33 icinga.tmpkoAJXQ
-rw-------  1 nagios nagios  1638400 Aug 11 12:33 icinga.tmpmKnC70
-rw-------  1 nagios nagios 15581184 Aug 11 02:33 icinga.tmpn1UN9B
-rw-------  1 nagios nagios 53465088 Aug  8 20:33 icinga.tmpowt8T5
-rw-------  1 nagios nagios  2711552 Aug  8 05:33 icinga.tmpyEUzks
-rw-------  1 nagios nagios 73932800 Aug  9 04:33 icinga.tmpzXjhtq
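A rough sketch of how leftover files like the ones listed above could be found by age. This is not the content of change 620710; the function name, the `icinga.tmp*` glob (taken from the file names in the listing) and the one-hour cutoff are assumptions for illustration only.

```python
import stat
import time
from pathlib import Path


def find_stale_temp_files(directory, pattern="icinga.tmp*",
                          max_age_seconds=3600, now=None):
    """Return regular files in directory matching pattern whose mtime is
    older than max_age_seconds, sorted by name.

    The pattern and the cutoff are illustrative assumptions.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(directory).glob(pattern):
        try:
            st = path.stat()
        except FileNotFoundError:
            # The file vanished between listing and stat; skip it.
            continue
        if not stat.S_ISREG(st.st_mode):
            continue
        if now - st.st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)
```

The function only reports candidates and deletes nothing, so the result can be reviewed before any cleanup is done. A freshly created temp file may still be in use by Icinga, which is why the sketch filters on age.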

Change 620710 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: ensure tmpfs cleanup

https://gerrit.wikimedia.org/r/620710

Change 620710 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: ensure tmpfs cleanup

https://gerrit.wikimedia.org/r/620710

fgiunchedi claimed this task.

Cleaning the tmpfs is Good enough™.