prometheus-icinga-am.service Fails to Start on alert2001
Open, Medium, Public, BUG REPORT

Description

Summary:

The prometheus-icinga-am service fails to start on the passive host alert2001, while the service is operational on the active host alert1001.

The error messages indicate a potential hostname resolution failure.

Troubleshooting Information:

Systemd status (alert2001):

× prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder
     Loaded: loaded (/lib/systemd/system/prometheus-icinga-am.service; disabled; preset: enabled)
     Active: failed (Result: exit-code) since Fri 2024-03-01 01:07:52 UTC; 6h ago
   Duration: 204ms
    Process: 1802990 ExecStart=/usr/bin/prometheus-icinga-am $ARGS (code=exited, status=1/FAILURE)
   Main PID: 1802990 (code=exited, status=1/FAILURE)
        CPU: 202ms

Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Scheduled restart job, restart counter is at 5.
Mar 01 01:07:52 alert2001 systemd[1]: Stopped prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Start request repeated too quickly.
Mar 01 01:07:52 alert2001 systemd[1]: prometheus-icinga-am.service: Failed with result 'exit-code'.
Mar 01 01:07:52 alert2001 systemd[1]: Failed to start prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.

Journalctl Output (alert2001):

Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Failed with result 'exit-code'.
Mar 01 01:07:51 alert2001 systemd[1]: prometheus-icinga-am.service: Scheduled restart job, restart counter is at 4.
Mar 01 01:07:51 alert2001 systemd[1]: Stopped prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:51 alert2001 systemd[1]: Started prometheus-icinga-am.service - Prometheus Icinga AlertManager Forwarder.
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: Traceback (most recent call last):
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:   File "/usr/bin/prometheus-icinga-am", line 11, in <module>
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:     load_entry_point('prometheus-icinga-exporter==0.20', 'console_scripts', 'prometheus-icinga-am')()
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:   File "/usr/lib/python3/dist-packages/prometheus_icinga_exporter/am.py", line 532, in main
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:     start_http_server(int(port), addr=address)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:   File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 169, in start_wsgi_server
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:     TmpServer.address_family, addr = _get_best_family(addr, port)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:   File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 158, in _get_best_family
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:     infos = socket.getaddrinfo(address, port)
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:   File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 01 01:07:52 alert2001 prometheus-icinga-am[1802990]: socket.gaierror: [Errno -2] Name or service not known

Analysis:

  • The logs suggest a hostname resolution issue based on the socket.gaierror: [Errno -2] Name or service not known error.
  • The service unit's ExecStart line is /usr/bin/prometheus-icinga-am $ARGS. This requires further investigation: the $ARGS variable may not be expanded correctly.
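
One way to narrow this down is to feed candidate address values straight to socket.getaddrinfo, the call that raises in the traceback. The sketch below is hypothetical: the function name check_listen_address, the port 8080, and the sample values are illustrative only, since the real contents of $ARGS are not shown in this task. It passes AI_NUMERICHOST so that only literal IP addresses are accepted and no DNS lookup takes place.

```python
import socket


def check_listen_address(address, port):
    """Return None if getaddrinfo accepts the literal address, else the error text.

    Hypothetical helper; AI_NUMERICHOST restricts the check to literal IPs,
    which is stricter than the plain getaddrinfo call in the traceback.
    """
    try:
        socket.getaddrinfo(address, port, flags=socket.AI_NUMERICHOST)
    except socket.gaierror as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Sample values are assumptions, not taken from the unit's $ARGS.
    for candidate in ("127.0.0.1", "::1", "[::1]"):
        print(repr(candidate), "->", check_listen_address(candidate, 8080))
```

A value that fails this check would not prove what happened on alert2001; it would only show that such a value produces the same class of error.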

Questions:
Related to prometheus-icinga-am.service: a comment on the service override states that the service runs with Group=www-data so that it can execute status.cgi. However, /usr/lib/cgi-bin/icinga/status.cgi is owned by root:root. Is it still necessary for the service to run with Group=www-data?

Event Timeline

The issue rang a bell, and indeed we've fixed it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/981407. However, on the standby host the override file with the fix is never deployed, because icinga-am is set to not run there (and rightfully so).

The proper fix is to do the right thing within the Python code and rebuild the Debian package. In the interim, though, I've put the configs in place on alert2001 so the daemon starts (and I've stopped it, since it isn't supposed to run). Note that Puppet would have fixed this anyway when we switched over to alert2001 as the active host. I think we can repurpose this task to fix the prometheus-icinga-exporter repo/package.

Hi @fgiunchedi, thanks for sharing your insights on this task. I'm taking a look at it again and I agree that repurposing this task to fix prometheus-icinga-exporter is a good idea.

Looking at the prometheus_icinga_exporter/exporter.py code, I think IPv6 address support breaks on line #337, because we split the address from the port using the : character as the separator (a common convention). However, : also appears inside IPv6 addresses, which ends up breaking the split.
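
To illustrate the ambiguity, here is a minimal sketch. It assumes the code splits on every : character; the actual statement on line #337 is not quoted in this task, so naive_split is only a stand-in, and the port 8080 is an example value.

```python
def naive_split(listen):
    """Split on every ':' (an assumed stand-in for the exporter's parsing)."""
    return listen.split(":")


# An IPv4 address and port split cleanly into two fields.
ipv4_parts = naive_split("127.0.0.1:8080")
print(ipv4_parts)  # ['127.0.0.1', '8080']

# A bracketed IPv6 literal contains ':' itself, so it falls apart.
ipv6_parts = naive_split("[::1]:8080")
print(ipv6_parts)  # ['[', '', '1]', '8080']
```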

RFC 3986 ("Uniform Resource Identifier (URI): Generic Syntax"), Section 3.2.2, says the following: "A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]")." This means that IPv6 addresses passed as arguments must be in the form [IPv6_address]:port.

To achieve this, I think we should implement a function (maybe a lambda) that runs before the address:port split and correctly parses either IPv4 or IPv6 addresses.
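
A minimal sketch of what such a function could look like, using only the standard library. The name parse_listen_address and its error messages are hypothetical and not part of prometheus-icinga-exporter. It accepts [IPv6_address]:port and host:port, and rejects unbracketed IPv6 literals because they are ambiguous.

```python
import ipaddress


def parse_listen_address(value):
    """Parse 'host:port' or '[IPv6_address]:port' into a (host, port) tuple.

    Hypothetical helper: raises ValueError for malformed input.
    """
    if value.startswith("["):
        # Bracketed form: everything up to ']' is the IPv6 literal.
        host, closing, rest = value[1:].partition("]")
        if not closing or not rest.startswith(":"):
            raise ValueError(f"expected [IPv6_address]:port, got {value!r}")
        ipaddress.IPv6Address(host)  # raises a ValueError subclass if invalid
        port_text = rest[1:]
    else:
        # Unbracketed form: the last ':' separates host and port.
        host, separator, port_text = value.rpartition(":")
        if not separator:
            raise ValueError(f"missing port in {value!r}")
        if ":" in host:
            raise ValueError(f"IPv6 literals must be bracketed: {value!r}")
    port = int(port_text)  # raises ValueError if not a number
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range in {value!r}")
    return host, port


if __name__ == "__main__":
    print(parse_listen_address("127.0.0.1:8080"))  # ('127.0.0.1', 8080)
    print(parse_listen_address("[::1]:8080"))      # ('::1', 8080)
```

The brackets are stripped from the returned host, so it can be passed directly to calls such as socket.getaddrinfo, which expect the bare literal.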

I think we would need a separate HTTP server instance to serve IPv6 addresses, that is, one server instance per IP address family. Starting one thread per address family seems like the ideal way to solve this. Please let me know what you think.
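
As a rough sketch of the one-server-per-family idea, using only the standard library's http.server rather than the prometheus_client server the exporter actually starts. All names here (PingHandler, HTTPServerV6, server_class_for, serve_in_threads) are hypothetical. The IPv6 server sets IPV6_V6ONLY so that it can share a port with an IPv4 wildcard listener.

```python
import ipaddress
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Placeholder handler; a real exporter would serve metrics instead."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


class HTTPServerV6(HTTPServer):
    """HTTPServer bound to an IPv6 socket that accepts IPv6 traffic only."""

    address_family = socket.AF_INET6

    def server_bind(self):
        # Without V6ONLY, a wildcard IPv6 bind may also claim the IPv4 port.
        self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
        super().server_bind()


def server_class_for(host):
    """Pick the server class from the literal IP address given in host."""
    version = ipaddress.ip_address(host).version
    return HTTPServerV6 if version == 6 else HTTPServer


def serve_in_threads(hosts, port):
    """Start one server per address, each in its own daemon thread."""
    servers = []
    for host in hosts:
        server = server_class_for(host)((host, port), PingHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        servers.append(server)
    return servers
```

Calling serve_in_threads(["0.0.0.0", "::"], 8080) would then listen on both families, with 8080 as an example port. Whether a second instance is needed at all depends on how the underlying server binds its socket, which I have not verified for prometheus_client.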