
rsyslog disk spool files filled the filesystem on thanos-fe2001
Closed, Resolved (Public)

Description

Over the weekend thanos-fe2001 ran out of disk space on /. It turned out that /var/spool/rsyslog/ was full of queue files such as:

-rw------- 1 root root 1049237 May 16 11:14 centrallog1001.eqiad.wmnet:6514.00093952
-rw------- 1 root root  606208 May 16 12:11 centrallog2002.codfw.wmnet:6514.00000977

The same thing was happening on thanos-fe2002, though the filesystem wasn't full yet. It looked like the transport to the centrallog hosts was stuck somehow; I'm assuming this started after the reboot for the Bullseye upgrades. I've restarted rsyslog on thanos-fe2002 and it looks like it is flushing its queue now.
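
To see which destination is accumulating queue files, the spool directory can be totaled per queue. The following is a minimal sketch, not something used in this task: it assumes queue files are named `<queue name>.<8-digit sequence>` as in the listing above, and the function name `spool_usage` is hypothetical.

```python
import os
import re
from collections import defaultdict

# Assumption: spool files look like "<queue name>.<8 digits>", matching
# the two file names quoted in the description. Other files are skipped.
QUEUE_FILE = re.compile(r"^(?P<queue>.+)\.(?P<seq>\d{8})$")


def spool_usage(spool_dir="/var/spool/rsyslog"):
    """Return {queue name: (file count, total bytes)} for a spool directory."""
    usage = defaultdict(lambda: [0, 0])
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            match = QUEUE_FILE.match(entry.name)
            if match is None or not entry.is_file(follow_symlinks=False):
                continue
            stats = usage[match.group("queue")]
            stats[0] += 1
            stats[1] += entry.stat(follow_symlinks=False).st_size
    return {queue: tuple(stats) for queue, stats in usage.items()}
```

Calling `spool_usage()` as root on an affected host and sorting the result by total bytes would show whether one centrallog destination or both are backing up.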

Event Timeline

Also, thanos-fe2003 doesn't seem to be affected (i.e. no spool files). For thanos-fe2001 and thanos-fe2002, rsyslog also logged this at startup:

thanos-fe2001/syslog.log-20220512.gz:May 11 10:09:28 thanos-fe2001 rsyslogd: fwd_centrallog1001.eqiad.wmnet:6514 queue[DA]: queue files exist on disk, re-starting with 1673 messages. This will keep the disk queue file open, details: https://rainer.gerhards.net/2013/07/rsyslog-why-disk-assisted-queues-keep-a-file-open.html [v8.2102.0 try https://www.rsyslog.com/e/0 ]
thanos-fe2002/syslog.log:May 11 10:29:30 thanos-fe2002 rsyslogd: fwd_centrallog2002.codfw.wmnet:6514 queue[DA]: queue files exist on disk, re-starting with 384 messages. This will keep the disk queue file open, details: https://rainer.gerhards.net/2013/07/rsyslog-why-disk-assisted-queues-keep-a-file-open.html [v8.2102.0 try https://www.rsyslog.com/e/0 ]
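
To check other hosts for the same startup message, the host, queue name, and message count can be pulled out of each line. This is a rough sketch that assumes the message wording shown above (rsyslog v8.2102.0); the helper name `parse_da_restart` is hypothetical.

```python
import re

# Assumption: the message text matches the two log lines quoted above.
DA_RESTART = re.compile(
    r"(?P<host>\S+) rsyslogd: (?P<queue>\S+) queue\[DA\]: "
    r"queue files exist on disk, re-starting with (?P<messages>\d+) messages"
)


def parse_da_restart(line):
    """Return host, queue and message count, or None if the line does not match."""
    match = DA_RESTART.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "queue": match.group("queue"),
        "messages": int(match.group("messages")),
    }
```

Feeding archived syslog lines through this helper gives a quick list of hosts that restarted with a non-empty disk queue.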

Change 799870 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: bound disk-assisted queues

https://gerrit.wikimedia.org/r/799870

Change 799870 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: bound disk-assisted queues

https://gerrit.wikimedia.org/r/799870

Mentioned in SAL (#wikimedia-operations) [2022-06-07T08:26:32Z] <godog> set on-disk max queue size for rsyslog fleet wide - T308439

fgiunchedi claimed this task.

The change is merged and deploying; optimistically resolving.