Page MenuHomePhabricator

rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts
Open, HighPublic

Description

It happened on lithium a couple of times, and this morning on wezen. In icinga it is usually listed with the rsyslog service in UNKNOWN status with description Service timeout. The pattern is the following:

top -h shows:

 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
6925 root      20   0  706056  70056   4892 R 99.9  0.4   2441:03 in:imtcp

strace -p 6925 shows:

recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(706, 0x7fc477691650, 31, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[...]

Then from /proc/:

elukey@wezen:~$ sudo file /proc/6925/fd/706
/proc/6925/fd/706: broken symbolic link to socket:[118444005]

And finally:

elukey@wezen:~$ sudo lsof | grep 118444005
rsyslogd   6920                   root  706u     IPv4          118444005         0t0        TCP in:imuxso  6920  6923             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imklog  6920  6924             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imtcp   6920  6925             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imudp   6920  6926             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
rs:main    6920  6927             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imtcp   6920  6928             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imtcp   6920  6929             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imtcp   6920  6930             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)
in:imtcp   6920  6931             root  706u     IPv4          118444005         0t0        TCP wezen.codfw.wmnet:syslog-tls->tegmen.wikimedia.org:44892 (ESTABLISHED)

gdb shows (didn't find debug symbols to install):

(gdb) thread apply all bt

Thread 1 (Thread 0x7fc483fff700 (LWP 6925)):
#0  0x00007fc48dd3cb17 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmnetstrms.so
#1  0x00007fc48db36ff3 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#2  0x00007fc48db372c7 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#3  0x00005608e3302970 in ?? ()
#4  0x00007fc48ff67064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fc48f07862d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Usually a restart fixes the problem. Seems very similar to https://github.com/rsyslog/rsyslog/issues/318.

elukey@wezen:~$ dpkg -l | grep rsyslog
ii  rsyslog                         8.23.0-2~bpo8+1                  amd64        reliable system and kernel logging daemon
ii  rsyslog-gnutls                  8.23.0-2~bpo8+1                  amd64        TLS protocol support for rsyslog

Event Timeline

elukey triaged this task as High priority.Jul 12 2018, 6:25 AM
elukey created this task.
elukey updated the task description. (Show Details)

Interesting! I thought it was something similar to we've experienced this problem before in T136312: Encrypt syslog traffic and https://github.com/rsyslog/rsyslog/issues/1728. Supposedly the latest rsyslog version fixes it but this seems a different issue altogether?

Mentioned in SAL (#wikimedia-operations) [2018-10-05T09:37:48Z] <elukey> restart rsyslog on lithium - broken connection to tegmen - T199406

Mentioned in SAL (#wikimedia-operations) [2018-12-20T07:10:16Z] <elukey> restart rsyslog on lithium - in:imtcp stuck in recvfrom ms-be2047.codfw.wmnet - T199406

Mentioned in SAL (#wikimedia-operations) [2019-05-06T14:57:25Z] <godog> capture strace / core for rsyslog on wezen / lithium and restart - T199406

Same issue today, rsyslog was stuck on lithium and wezen, strace shows a whole lot of this:

37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)

And sure enough, fd 837 is the connection to cp1083 which died around the same time rsyslog got stuck:

rsyslogd 37653 root  837u  IPv4           64101317        0t0        TCP lithium.eqiad.wmnet:syslog-tls->cp1083.eqiad.wmnet:23981 (ESTABLISHED)

While annoying, I think before digging further we should focus on upgrading to stretch central syslog servers instead.

While annoying, I think before digging further we should focus on upgrading to stretch central syslog servers instead.

+1

Mentioned in SAL (#wikimedia-operations) [2019-05-23T09:22:30Z] <godog> bounce rsyslog on lithium - listener stuck /T199406

Change 520207 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] (WIP) toil: add rsyslog TLS remedy

https://gerrit.wikimedia.org/r/520207

This failure happened on Sun as well, with wezen and centrallog1001 "stuck" on recvfrom from cloudvirt1015, e.g. on centrallog1001

21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21108 recvfrom(1465, 0x7f5e053eee63, 278, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
rsyslogd 21094 root 1465u     IPv4           28025343        0t0        TCP centrallog1001.eqiad.wmnet:syslog-tls->cloudvirt1015.eqiad.wmnet:34854 (ESTABLISHED)
fgiunchedi renamed this task from rsyslog's in:imtcp thread stuck on old sockets to rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts.Jul 29 2019, 11:08 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-30T16:22:45Z] <godog> bounce rsyslog on centrallog1001 - T199406

Change 520207 merged by Filippo Giunchedi:
[operations/puppet@production] toil: add rsyslog TLS remedy

https://gerrit.wikimedia.org/r/520207

Change 528395 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] toil: rsyslog_tls_remedy log to journald

https://gerrit.wikimedia.org/r/528395

Change 528395 merged by Filippo Giunchedi:
[operations/puppet@production] toil: rsyslog_tls_remedy log to journald

https://gerrit.wikimedia.org/r/528395

fgiunchedi changed the task status from Open to Stalled.Oct 2 2019, 5:09 PM

Setting as stalled for now, the immediate issue has been bandaided

Note that with the latest rsyslog version (cfr T259780: rsyslog occasional segfault on centrallog hosts) this bug might be fixed too, and thus we can remove the bandaid

Change 657052 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: remove rsyslog TLS remedy

https://gerrit.wikimedia.org/r/657052

Change 657052 merged by Filippo Giunchedi:
[operations/puppet@production] role: remove rsyslog TLS remedy

https://gerrit.wikimedia.org/r/657052

Change 657293 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] toil: remove rsyslog_tls_remedy

https://gerrit.wikimedia.org/r/657293

Change 657293 merged by Filippo Giunchedi:
[operations/puppet@production] toil: remove rsyslog_tls_remedy

https://gerrit.wikimedia.org/r/657293

fgiunchedi claimed this task.

Resolving, with recent rsyslog on centrallog hosts we haven't experienced this bug

Unfortunately I spoke too fast, the bug is still there, e.g.

4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
25511 <... poll resumed> )              = 0 (Timeout)
25511 poll([{fd=98, events=POLLIN}, {fd=86, events=POLLIN}], 2, 999 <unfinished ...>
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
4370  recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

So the remedy is going back in :(

Change 661720 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Revert "toil: remove rsyslog_tls_remedy"

https://gerrit.wikimedia.org/r/661720

Change 661720 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "toil: remove rsyslog_tls_remedy"

https://gerrit.wikimedia.org/r/661720