rsyslog occasional segfault on centrallog hosts
Closed, Resolved (Public)

Description

Since around 16:00 on 2020/08/05, rsyslog appears to have been failing intermittently on centrallog hosts, e.g.:

Aug 06 08:27:25 centrallog1001 rsyslogd[28353]: action 'fwd_centrallog2001.codfw.wmnet:6514' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. 
Aug 06 08:27:31 centrallog1001 rsyslogd[28353]: main Q:Reg: high activity - starting 1 additional worker thread(s), currently 1 active worker threads. [v8.1901.0 try https://www.rsyslog.com/e/2439 ]
Aug 06 08:27:31 centrallog1001 rsyslog_exporter[28354]: 2020/08/06 08:27:31 input ended, exiting normally
Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: Main process exited, code=killed, status=11/SEGV
Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: Failed with result 'signal'.
Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: Service RestartSec=100ms expired, scheduling restart.
Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 13.
Aug 06 08:27:31 centrallog1001 systemd[1]: Stopped System Logging Service.
Aug 06 08:27:31 centrallog1001 systemd[1]: Starting System Logging Service...
Aug 06 08:27:31 centrallog1001 rsyslog_exporter[31213]: 2020/08/06 08:27:31 Listening on :9105
Aug 06 08:27:32 centrallog1001 rsyslogd[31212]: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.1901.0]
Aug 06 08:27:32 centrallog1001 rsyslogd[31212]:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="31212" x-info="https://www.rsyslog.com"] start
Aug 06 08:27:32 centrallog1001 systemd[1]: Started System Logging Service.
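
To get a rough sense of how often the service is crashing, the systemd lines in an excerpt like the one above can be tallied mechanically. The following is a minimal sketch, not tooling used in this task: the function name `summarize_rsyslog_crashes` and the `SAMPLE` list are made up for illustration, and the patterns only match the exact message wording seen in the excerpt.

```python
import re

# Patterns copied from the systemd messages in the excerpt above.
SEGV_RE = re.compile(
    r"rsyslog\.service: Main process exited, code=killed, status=11/SEGV"
)
RESTART_RE = re.compile(
    r"rsyslog\.service: Scheduled restart job, restart counter is at (\d+)\."
)


def summarize_rsyslog_crashes(lines):
    """Return (number of SEGV exits seen, last restart counter or None)."""
    segv_count = 0
    last_restart_counter = None
    for line in lines:
        if SEGV_RE.search(line):
            segv_count += 1
        match = RESTART_RE.search(line)
        if match:
            last_restart_counter = int(match.group(1))
    return segv_count, last_restart_counter


# Hypothetical sample: three lines taken from the excerpt above.
SAMPLE = [
    "Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: "
    "Main process exited, code=killed, status=11/SEGV",
    "Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: "
    "Failed with result 'signal'.",
    "Aug 06 08:27:31 centrallog1001 systemd[1]: rsyslog.service: "
    "Scheduled restart job, restart counter is at 13.",
]

if __name__ == "__main__":
    print(summarize_rsyslog_crashes(SAMPLE))
```

In practice the input could be the saved output of `journalctl -u rsyslog.service`, read line by line.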

Looks like the latest rsyslog version is significantly more stable on centrallog1001! We'll therefore run 8.2008.0-1~bpo10+1 on centrallog hosts only. Implementation-wise, these are the steps:

  • Upgrade centrallog2001
  • Create an APT component for rsyslog
  • Use the component on centrallog hosts only (os_version <= buster) and make sure the component's APT priorities are correct, so that on reimage the component's version is installed
  • Upload rsyslog 8.2008.0-1~bpo10+1 and librelp 1.7.0-1~bpo10+1 to said component
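
The "APT priorities" step above amounts to pinning the affected packages to the new component. As a rough illustration only, the sketch below renders an APT preferences stanza. The helper `render_apt_pin`, the binary package names `rsyslog` and `librelp0`, the priority `1002`, and the use of `release c=` to match the component are all assumptions, not taken from the merged Puppet changes.

```python
def render_apt_pin(packages, component, priority):
    """Render one APT preferences stanza pinning packages to a component.

    Illustrative helper only; the real configuration lives in Puppet.
    """
    if not packages:
        raise ValueError("at least one package name is required")
    return (
        f"Package: {' '.join(packages)}\n"
        f"Pin: release c={component}\n"
        f"Pin-Priority: {priority}\n"
    )


if __name__ == "__main__":
    # Assumed values: package names and priority are placeholders.
    print(render_apt_pin(["rsyslog", "librelp0"], "component/rsyslog", 1002))
```

A priority above 1000 makes APT prefer the pinned version even when that means a downgrade, which is one way to make sure the component's version wins on reimage. Whether that is the value actually used here is not stated in this task.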

Event Timeline

I've captured core dumps on centrallog2001:/root/T259780 for this issue; the root cause is not yet clear. The trigger was a big influx of firewall drop logs for NRPE (port 5666) from all hosts in the fleet: the new alert* hosts were not in monitoring_hosts and thus were not allowed to talk to nrped (fixed in https://gerrit.wikimedia.org/r/618719).

akosiaris triaged this task as Medium priority. Aug 7 2020, 9:35 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-07T15:12:46Z] <godog> upgrade rsyslog to 8.2008.0-1~bpo10+1 on centrallog1001 - T259780

This comment was removed by fgiunchedi.

Mentioned in SAL (#wikimedia-operations) [2020-10-19T08:37:10Z] <godog> upgrade rsyslog to 8.2008.0-1~bpo10+1 on centrallog2001 - T259780

Change 634922 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add component/rsyslog

https://gerrit.wikimedia.org/r/634922

Change 634922 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add component/rsyslog

https://gerrit.wikimedia.org/r/634922

This particular failure mode seems to be fixed with rsyslog 8.2008.0-1~bpo10+1; I can't find any further rsyslog segmentation faults since deploying the new version.

Change 656953 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: install rsyslog from component/rsyslog on Buster

https://gerrit.wikimedia.org/r/656953

Change 656953 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: install rsyslog from component/rsyslog on Buster

https://gerrit.wikimedia.org/r/656953

fgiunchedi claimed this task.
fgiunchedi updated the task description.

Resolving since we're running a fixed rsyslog version now.