Page MenuHomePhabricator

nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/
Closed, ResolvedPublic

Description

I have seen the following error in the journal:

May 18 07:31:24 cp2040 systemd[1]: /lib/systemd/system/nagios-nrpe-server.service:12: PIDFile= references path below legacy directory /var/run/, updating /var/run/nagios/nrpe.pid → /run/nagios/nrpe.pid; please update the unit file accordingly.

The warning seems easy to fix by updating the PIDFile directive in modules/nrpe/templates/initscripts/nagios-nrpe-server.systemd.erb.

Event Timeline

ema created this task.May 18 2020, 8:12 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 18 2020, 8:12 AM
ema triaged this task as Low priority.May 18 2020, 8:13 AM
ema updated the task description. (Show Details)
fgiunchedi moved this task from Inbox to Backlog on the observability board.Jul 20 2020, 1:10 PM

Change 621967 had a related patch set uploaded (by Southparkfan; owner: Southparkfan):
[operations/puppet@production] nagios-nrpe-server systemd unit: change PIDFile based on running OS In Buster, the PID directory was changed to /run. While using the legacy /var/run won't break anything, it generates warnings in the logs.

https://gerrit.wikimedia.org/r/621967

I have uploaded a new patch using /run on all servers (regardless of OS). However, what about removing the unit and using the built-in one? That reduces the likelihood of units conflicting with intended behavior from upstream. An apt-get source for nagios-nrpe=3.2.1-2 shows the following unit (debian/nagios-nrpe-server.service):

[Unit]
Description=Nagios Remote Plugin Executor
Documentation=http://www.nagios.org/documentation
After=var-run.mount nss-lookup.target network.target local-fs.target remote-fs.target time-sync.target
Before=getty@tty1.service plymouth-quit.service xdm.service
Conflicts=nrpe.socket

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
Restart=on-abort
PIDFile=/var/run/nagios/nrpe.pid
EnvironmentFile=-/etc/default/nagios-nrpe-server
ExecStart=/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f $NRPE_OPTS
ExecReload=/bin/kill -HUP $MAINPID
ExecStopPost=/bin/rm -f /var/run/nagios/nrpe.pid
TimeoutStopSec=60
User=nagios
Group=nagios
PrivateTmp=true
OOMScoreAdjust=-500

The differences between the Buster and WMF unit are https://gerrit.wikimedia.org/r/c/operations/puppet/+/355109/1/modules/nrpe/templates/initscripts/nagios-nrpe-server.systemd.erb, https://gerrit.wikimedia.org/r/c/operations/puppet/+/355130/1/modules/nrpe/templates/initscripts/nagios-nrpe-server.systemd.erb and the lack of "-f $NRPE_OPTS" inside ExecStart.

I'd bet $NRPE_OPTS is coming from some /etc/default file. The default file (debian/nagios-nrpe-server.default) doesn't do anything, though:

# defaults file for nagios-nrpe-server
# (this file is a /bin/sh compatible fragment)

# NRPE_OPTS are any extra cmdline parameters you'd like to pass along to the
# nrpe daemon.
#
# The -n option disables SSL support.
#NRPE_OPTS="-n"

# NICENESS is if you want to run the server at a different nice() priority.
# (only used by the init script)
#NICENESS=5

# INETD is if you want to run the server via inetd (default=0, run as daemon).
# (only used by the init script)
#INETD=0

The change in Type, -d in ExecStart and removal of User/Group were made by @akosiaris. Are these changes mandatory for a proper working daemon? If so, shouldn't we submit a bug upstream?


The PID directory has been changed to /run since Buster: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932353#10

This is where things become interesting. According to this comment, the PID directory was changed in 3.2.1-3 (released July 2019), a version that exists in the unstable changelog. However, stable seems stuck on 3.2.1-2. Any idea why that's the case?

Looking at git history, this service unit is shipped via Puppet since nagios-nrpe-server in jessie did not yet provide a systemd unit.

We still need to support jessie on a handful of legacy systems, but we could simply

  • wait until all jessie systems are gone (T224549) and then switch to simply using the native systemd unit shipped in the nagios-nrpe-server package
  • make the application of the puppetised systemd service dependent on os_version == jessie

The change in Type, -d in ExecStart and removal of User/Group were made by @akosiaris. Are these changes mandatory for a proper working daemon? If so, shouldn't we submit a bug upstream?

The puppetised service is based on the version in stretch, so these are present in the Debian upstream. Or did you mean Nagios NRPE as an upstream project? They in fact ship a different systemd service in startup/default-service.in (which e.g. doesn't use -d, but rather -f).

ema added a comment.Aug 25 2020, 6:57 AM

I have uploaded a new patch using /run on all servers (regardless of OS). However, what about removing the unit and using the built-in one? That reduces the likelihood of units conflicting with intended behavior from upstream. An apt-get source for nagios-nrpe=3.2.1-2 shows the following unit (debian/nagios-nrpe-server.service):

[...]

[Service]
Type=simple
Restart=on-abort
PIDFile=/var/run/nagios/nrpe.pid

This still uses the legacy /var/run though, hence it does not fix the issue.

We still need to support jessie on a handful of legacy systems, but we could simply

  • wait until all jessie systems are gone (T224549) and then switch to simply using the native systemd unit shipped in the nagios-nrpe-server package
  • make the application of the puppetised systemd service dependent on os_version == jessie

I propose option (c): change our unit to hardcode /run/nagios/nrpe.pid on all os versions, and file a Debian bug to use /run/ instead of /var/run.

This still uses the legacy /var/run though, hence it does not fix the issue.

Correct, see this:

The PID directory has been changed to /run since Buster: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932353#10

This is where things become interesting. According to this comment, the PID directory was changed in 3.2.1-3 (released July 2019), a version that exists in the unstable changelog. However, stable seems stuck on 3.2.1-2. Any idea why that's the case?

A change from /var/run to /run was already applied to the package's unit, but for some reason that version (3.2.1-3) never made it to Buster (3.2.1-2 being the newest). You're a Debian developer, do you know to happen the reasoning behind that? Was it an oversight? If so, can 3.2.1-3 be released for Buster?

In that case, Moritz' plan seems the best one to me:

We still need to support jessie on a handful of legacy systems, but we could simply

  • make the application of the puppetised systemd service dependent on os_version == jessie

In that case we'll end up with the following:

  • Jessie systems use the puppetised unit, mandatory since Jessie doesn't ship with a built-in one
  • Stretch systems use the built-in unit, using /var/run (won't generate warnings in Stretch, no problem)
  • Buster systems use the unit from nrpe-server=3.1.2-3, with the /var/run fix

Reducing the usage of custom units will help reducing technical debt in the future, it is just that 3.2.1-3 thing being an issue right now. :)

A change from /var/run to /run was already applied to the package's unit, but for some reason that version (3.2.1-3) never made it to Buster (3.2.1-2 being the newest). You're a Debian developer, do you know to happen the reasoning behind that? Was it an oversight? If so, can 3.2.1-3 be released for Buster?

We're using the stable releases of Debian, the next release which includes 3.2.1-3 (or later) will be the subsequent Release (named Bullseye, expected for next year). See https://en.wikipedia.org/wiki/Debian#Branches

Only security fixes and important changes (see Criteria at https://lists.debian.org/debian-devel-announce/2019/08/msg00000.html) are backported to stable releases (and this change won't cut it, since it's just a bit of log spam)

@MoritzMuehlenhoff understood. Patch set 4 will use the custom unit (with /run) on systems older than bullseye. On bullseye systems we'll be using the built-in unit.

Change 621967 merged by Herron:
[operations/puppet@production] nagios-nrpe-server systemd unit: use /run for PID files

https://gerrit.wikimedia.org/r/621967

ema closed this task as Resolved.Sep 15 2020, 3:13 PM

Change 621967 merged by Herron:
[operations/puppet@production] nagios-nrpe-server systemd unit: use /run for PID files

https://gerrit.wikimedia.org/r/621967

That was it, the warning is gone! Thanks @Southparkfan et al.