Summary:
The prometheus-icinga-am service fails to start on the passive host alert2001, while the service is operational on the active host alert1001.
The error messages indicate a potential hostname resolution failure.
Troubleshooting Information:
Systemd status (alert2001):
× prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder
Loaded: loaded (/lib/systemd/system/prometheus-icinga-am.service; disabled; preset: enabled)
Active: failed (Result: exit-code) since Fri 2024-03-01 01:07:52 UTC; 6h ago
Duration: 204ms
Process: 1802990 ExecStart=/usr/bin/prometheus-icinga-am $ARGS (code=exited, status=1/FAILURE)
Main PID: 1802990 (code=exited, status=1/FAILURE)
CPU: 202ms
Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Scheduled restart job, restart counter is at 5.
Mar 01 01:07:52 alert2001 systemd[1]: Stopped prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Start request repeated too quickly.
Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Failed with result 'exit-code'.
Mar 01 01:07:52 alert2001 systemd[1]: Failed to start prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Journalctl Output (alert2001):
Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Failed with result 'exit-code'.
Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Scheduled restart job, restart counter is at 4.
Mar 01 01:07:51 alert2001 systemd[1]: Stopped prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:51 alert2001 systemd[1]: Started prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: Traceback (most recent call last):
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: File "/usr/bin/prometheus-icinga-am", line 11, in <module>
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: load_entry_point('prometheus-icinga-exporter==0.20', 'console_scripts', 'prometheus-icinga-am')()
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: File "/usr/lib/python3/dist-packages/prometheus_icinga_exporter/am.py", line 532, in main
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: start_http_server(int(port), addr=address)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 169, in start_wsgi_server
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: TmpServer.address_family, addr = _get_best_family(addr, port)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 158, in _get_best_family
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: infos = socket.getaddrinfo(address, port)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: socket.gaierror: [Errno -2] Name or service not known
Analysis:
- The logs suggest a hostname resolution issue based on the socket.gaierror: [Errno -2] Name or service not known error.
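- The unit's ExecStart line is /usr/bin/prometheus-icinga-am $ARGS. The traceback shows the failing lookup is for the address passed to start_http_server, so the value of $ARGS on alert2001 needs further investigation; the variable may not be set or expanded correctly on this host.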
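To narrow this down, the failing lookup can be reproduced outside the service. The traceback shows prometheus_client calling socket.getaddrinfo(address, port), so the sketch below makes the same call for a given listen address. The function name check_bind_address and the example values are hypothetical; the real address and port are whatever $ARGS expands to on alert2001, which this ticket does not show.

```python
import socket


def check_bind_address(address, port):
    """Try the same lookup that fails in the traceback.

    Returns (True, [resolved addresses]) on success, or
    (False, error message) if getaddrinfo raises gaierror.
    """
    try:
        infos = socket.getaddrinfo(address, port)
    except socket.gaierror as exc:
        return False, f"{address!r}: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    return True, sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical values; substitute the address and port from $ARGS.
    print(check_bind_address("127.0.0.1", 9100))
    print(check_bind_address("nonexistent.invalid", 9100))
```

Running this on both alert1001 and alert2001 with the address each host is configured to use should show whether the name simply does not resolve on the passive host.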
Questions:
Regarding prometheus-icinga-am.service: a comment in the service override says the service runs with Group=www-data so that it can execute status.cgi. However, /usr/lib/cgi-bin/icinga/status.cgi is owned by root:root. Is it still necessary for the service to run with Group=www-data?
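One way to gather evidence for this question is to look at the permission bits on status.cgi. The sketch below is only a starting point, and describe_access is a hypothetical helper: it reports the numeric owner, group, mode string, and whether users outside the owner and group can execute the file. It assumes the group was only ever needed for executing the CGI itself; it does not check other files that status.cgi may read at runtime, which could still require www-data.

```python
import os
import stat


def describe_access(path):
    """Summarize ownership and execute permission for a file."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),
        # True if users who are neither owner nor in the group may execute.
        "other_can_execute": bool(st.st_mode & stat.S_IXOTH),
    }


if __name__ == "__main__":
    # Path taken from the question above.
    print(describe_access("/usr/lib/cgi-bin/icinga/status.cgi"))
```

If other_can_execute is true for a root:root file, group membership is not what grants execute access to the CGI itself.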