Page MenuHomePhabricator

rsyslog service should fail on configuration errors
Open, MediumPublic

Description

While working on T290305 I've modified the rsyslog configuration of cache nodes so that it skips local syslog for most ATS messages. Before merging my changes, I tested them in a pontoon environment. The patch did not work as planned due to a missing "then":

if $programname contains "trafficserver" {
# should have been
if $programname contains "trafficserver" then {

Here's what systemd journal had to say when restarting rsyslog.service:

Sep 13 08:09:58 cptext systemd[1]: Starting System Logging Service...
Sep 13 08:09:58 cptext systemd[1]: Started System Logging Service.
Sep 13 08:09:58 cptext rsyslogd[28111]: error during parsing file /etc/rsyslog.d/20-trafficserver.conf, on or before line 1: invalid character '{' in expression - is there an invalid escape sequence somewhere? [v8.1901.0 try https://www.rsyslog.com/e/2207 ]
Sep 13 08:09:58 cptext rsyslogd[28111]: error during parsing file /etc/rsyslog.d/20-trafficserver.conf, on or before line 3: syntax error on token 'if' [v8.1901.0 try https://www.rsyslog.com/e/2207 ]
Sep 13 08:09:58 cptext rsyslogd[28111]: could not interpret master config file '/etc/rsyslog.conf'. [v8.1901.0 try https://www.rsyslog.com/e/2207 ]
Sep 13 08:09:58 cptext rsyslogd[28111]: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.1901.0]
Sep 13 08:09:58 cptext rsyslogd[28111]:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="28111" x-info="https://www.rsyslog.com"] start
Sep 13 08:09:58 cptext rsyslog_exporter[28112]: 2021/09/13 08:09:58 Listening on 172.16.3.80:9105

Despite the error and the fact that rsyslog was quite broken (no local delivery), rsyslog.service started just fine as far as systemd was concerned. The unit was marked as "active (running)" and thus of course not listed in systemctl --failed.

The error is however detected properly when checking the configuration with rsyslogd -N1 -f /etc/rsyslog.conf, the command exits with status 1. We should:

  1. fix the current rsyslog errors: T292175, T292180
  2. ensure puppet tests the validity of rsyslog's configuration with /usr/sbin/rsyslogd -N1 -f /etc/rsyslog.conf before restarting it
  3. add $AbortOnUncleanConfig on

Event Timeline

Change 720913 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] rsyslog: config sanity check as systemd override

https://gerrit.wikimedia.org/r/720913

By going through rsyslog upstream bugs I found out about $AbortOnUncleanConfig, "off" by default:

This parameter permits to prevent rsyslog from running when the configuration file is not clean. “Not Clean” means there are errors or some other annoyances that rsyslogd reports on startup. This is a user-requested feature to have a strict startup mode. Note that with the current code base it is not always possible to differentiate between an real error and a warning-like condition. As such, the startup will also prevented if warnings are present. I consider this a good thing in being “strict”, but I admit there also currently is no other way of doing it.

It seems surprising to me that a syslog implementation does not stick to the usual UNIX mantra of failing loud and early in the face of errors, though there may be nuances I'm not aware of here. In particular, I'm not sure what exactly is considered as a "warning" vs an "error". Anyways, I've tried adding a /etc/rsyslog.d/00-abort-on-errors.conf file with $AbortOnUncleanConfig on on a test cache node, and by doing so rsyslog startup behaves as I expect it to: adding a syntax error to /etc/rsyslog.d/20-trafficserver.conf the rsyslog service fails to start, fixing the error rsyslog starts properly. I'm not sure whether we can apply this approach fleet-wide. Any thoughts from SRE Observability most welcome.

Change 720921 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] rsyslog: abort on unclean config

https://gerrit.wikimedia.org/r/720921

Thank you for the investigation and the reviews! I 100% agree that rsyslog config deploys must be safer than they are now. I think either approach will work as a bandaid, intuitively a rsyslog-native solution seems better to me (assuming the option is available starting with 8.1901.0-1~bpo9+wmf2 which is what we use on stretch-wikimedia). The downside I see though (in both solutions) is that rsyslog won't be running, therefore reducing availability compared to now.

Ideally we could invoke rsyslogd -N1 -f % as validate_cmd for the snippets but that doesn't seem possible in simple way (rsyslogd -f wants the main config file, not a snippet). I see at least a couple of potential solutions to ensure both safety and availability:

  1. Ask puppet to exec the config validation command before attempting to restart rsyslog, and fail the run if config validation doesn't pass. The downside is that the invalid configuration is already on disk, so a rsyslog restart will try to load, but that's not different to what we have now.
  2. Build a wrapper to be used as validate_cmd that takes the new (temporary) config snippet, assembles the existing and new configuration, and asks rsyslog to validate that.

I feel solution 1 is easier to implement and simpler overall, while solution 2 is more "proper" but trickier to get right (e.g. the wrapper will also need to know which snippet it is supposed to replace) so I'm leaning towards solution 1. There's likely other solutions I'm missing though!

I think either approach will work as a bandaid, intuitively a rsyslog-native solution seems better to me (assuming the option is available starting with 8.1901.0-1~bpo9+wmf2 which is what we use on stretch-wikimedia).

Good point! I checked on debmonitor and 8.1901.0-1 is the lowest and by far the most common version we've got deployed (that's the version in Buster too), so it seems to me that we should be fine in that regard.

The downside I see though (in both solutions) is that rsyslog won't be running, therefore reducing availability compared to now.

At least in the case of the issue I ran into, rsyslog was running but not logging, which is probably worse than not running at all and getting a critical alert.

I see at least a couple of potential solutions to ensure both safety and availability:

  1. Ask puppet to exec the config validation command before attempting to restart rsyslog, and fail the run if config validation doesn't pass. The downside is that the invalid configuration is already on disk, so a rsyslog restart will try to load, but that's not different to what we have now.
  2. Build a wrapper to be used as validate_cmd that takes the new (temporary) config snippet, assembles the existing and new configuration, and asks rsyslog to validate that.

I feel solution 1 is easier to implement and simpler overall, while solution 2 is more "proper" but trickier to get right (e.g. the wrapper will also need to know which snippet it is supposed to replace) so I'm leaning towards solution 1. There's likely other solutions I'm missing though!

I agree both on the fact that it would be very useful to validate the configuration before restarting rsyslog and that it may be tricky to do it right. :)

If rsyslog configuration snippets where static files instead of ERB templates, another useful thing to do would be implementing some sort of repo-local unit tests for them, similarly to what we do for mtail, ATS Lua config and so on. Alas, I see that's not the case (eg: ./modules/mediawiki/templates/rsyslog.conf.erb).

@fgiunchedi I'm gonna wait for your review before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/720921, and drop the alternative, systemd-based approach.

ema triaged this task as Medium priority.Sep 17 2021, 10:07 AM

Change 720913 abandoned by Ema:

[operations/puppet@production] rsyslog: config sanity check as systemd override

Reason:

rsyslog-native solution deemed more desirable: https://gerrit.wikimedia.org/r/c/operations/puppet/+/720921

https://gerrit.wikimedia.org/r/720913

Change 720921 merged by Ema:

[operations/puppet@production] rsyslog: abort on unclean config

https://gerrit.wikimedia.org/r/720921

Change 720921 merged by Ema:

[operations/puppet@production] rsyslog: abort on unclean config

https://gerrit.wikimedia.org/r/720921

We merged this today and rsyslog failed restarting on multiple hosts due to unclean rsyslog configurations. On the bright side, this means that the change works as expected. On the other hand, this also means that rsyslog is currently misconfigured on many hosts (for some definition of misconfigured). The change was reverted to get rsyslog back up on the affected nodes.

Some of the failures include:

Sep 29 08:24:34 thanos-fe1002 rsyslogd[301933]: error during parsing file /etc/rsyslog.d/30-output-kafka.conf, on or before line 105: queue 
            directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used. This is not possible. Please make it unique. [v8.2102.0 try 
            https://www.rsyslog.com/e/2207 ]

And, on multiple nodes:

Sep 29 08:24:20 elastic1056 rsyslogd[144761]: module 'mmjsonparse' already in this config, cannot be added  [v8.1901.0 try https://www.rsyslog.com/e/2221 ]
Sep 29 08:24:20 elastic1056 rsyslogd[144761]: module 'omkafka' already in this config, cannot be added  [v8.1901.0 try https://www.rsyslog.com/e/2221 ]

The errors can be detected by running /usr/sbin/rsyslogd -N1 -f /etc/rsyslog.conf.

Change 724707 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] rsyslog: remove 00-abort-unclean-config.conf

https://gerrit.wikimedia.org/r/724707

Change 724707 merged by Ema:

[operations/puppet@production] rsyslog: remove 00-abort-unclean-config.conf

https://gerrit.wikimedia.org/r/724707