ferm fails to start at boot in some cases
Closed, Duplicate (Public)

Description

In some cases ferm fails to start at boot because a DNS lookup for a host in its configuration fails, for example:

Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query timed out
Oct 18 15:53:04 db2042 ferm[837]:  (warning).
Oct 18 15:53:04 db2042 systemd[1]: ferm.service: Main process exited, code=exited, status=255/n/a
Oct 18 15:53:04 db2042 systemd[1]: Failed to start ferm firewall configuration.
Oct 18 15:53:04 db2042 systemd[1]: ferm.service: Unit entered failed state.

This was on db2042 right after a reboot, but I've already seen this happen on other hosts too.
The subsequent puppet run didn't start the service either, although manually starting the service fixes the issue.

So we should investigate our puppet+systemd integration for ferm to make sure that ferm is able to resolve the hosts in its configuration when it starts. On failure (at least at boot) the start should be retried a couple of times, and/or puppet should ensure that the service is running and start it if it is not.
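
As a rough illustration of the "retry a couple of times" idea, here is a minimal Python sketch of a pre-start check that waits until the hostnames resolve. It is not how ferm or our puppet module works today: the function name `wait_for_dns`, its parameters, and the default retry values are all hypothetical, and the hostname list would have to come from the ferm configuration.

```python
import socket
import time


def wait_for_dns(hostnames, attempts=5, delay=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once every hostname resolves, False if attempts run out.

    Hypothetical helper: `resolve` and `sleep` are injectable so the
    retry logic can be exercised without real DNS or real waiting.
    """
    pending = list(hostnames)
    for attempt in range(1, attempts + 1):
        still_failing = []
        for name in pending:
            try:
                # socket.gaierror (lookup failure) is a subclass of OSError.
                resolve(name, None)
            except OSError:
                still_failing.append(name)
        # Only retry the names that have not resolved yet.
        pending = still_failing
        if not pending:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A wrapper could call `wait_for_dns(["prometheus2003.codfw.wmnet"])` and start ferm only when it returns True. Whether this belongs in a pre-start step, in a systemd restart policy, or in puppet is exactly what remains to be investigated.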