Page MenuHomePhabricator

rsyslog segfault on an-test-presto1001
Closed, ResolvedPublic

Description

I created last week an-test-presto1001, it is running Debian 10.7 (the last Buster snapshot in theory). I get the following Segfault from rsyslog:

elukey@an-test-presto1001:~$ sudo gdb /usr/sbin/rsyslogd
[..]
Reading symbols from /usr/sbin/rsyslogd...Reading symbols from /usr/lib/debug/.build-id/2d/263bb26159abb608a8ea5b9bc297715da3250c.debug...done.
done.
(gdb) run -n -iNONE
[..]

Thread 3 "rs:fwd_centrall" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff66e8700 (LWP 11625)]
0x000055555559b0c7 in rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0001b40) at stringbuf.c:220
220	stringbuf.c: No such file or directory.
(gdb) bt
#0  0x000055555559b0c7 in rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0001b40) at stringbuf.c:220
#1  0x00005555555ab128 in varDestruct (ppThis=0x7ffff66e79a0) at var.c:72
#2  0x00005555555965d5 in MsgDeserialize (pMsg=0x7ffff00018b0, pStrm=pStrm@entry=0x55555566c220) at msg.c:1422
#3  0x00005555555a235a in objDeserializeWithMethods (ppObj=0x7ffff66e7ae8, pszTypeExpected=pszTypeExpected@entry=0x5555555d6e93 "msg", lenTypeExpected=lenTypeExpected@entry=3, 
    pStrm=0x55555566c220, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x555555594200 <msgConstructForDeserializer>, objConstructFinalize=0x0, 
    objDeserialize=0x555555596560 <MsgDeserialize>) at obj.c:930
#4  0x00005555555ae56a in qDeqDisk (pThis=0x555555665340, ppMsg=<optimized out>) at queue.c:1029
#5  0x00005555555afabd in qqueueDeq (ppMsg=0x7ffff66e7ae8, pThis=0x555555665340) at queue.c:1804
#6  DequeueConsumableElements (pSkippedMsgs=0x7ffff66e7b54, piRemainingQueueSize=<synthetic pointer>, pWti=<optimized out>, pThis=0x555555665340) at queue.c:1804
#7  DequeueConsumable (pThis=pThis@entry=0x555555665340, pWti=pWti@entry=0x555555665b80, pSkippedMsgs=pSkippedMsgs@entry=0x7ffff66e7b54) at queue.c:1879
#8  0x00005555555aff00 in DequeueForConsumer (pSkippedMsgs=0x7ffff66e7b54, pWti=0x555555665b80, pThis=0x555555665340) at queue.c:2081
#9  ConsumerReg (pThis=0x555555665340, pWti=0x555555665b80) at queue.c:2081
#10 0x00005555555acd82 in wtiWorker (pThis=pThis@entry=0x555555665b80) at wti.c:429
#11 0x00005555555ab63a in wtpWorker (arg=0x555555665b80) at wtp.c:431
#12 0x00007ffff7d8dfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007ffff77f44cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) 
(gdb) thread apply all bt

Thread 5 (Thread 0x7ffff5ae6700 (LWP 11627)):
#0  0x00007ffff77f44c1 in clone () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7d8deb0 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff5ae6700 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7ffff5ee7700 (LWP 11626)):
#0  0x00007ffff77e9819 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff754b882 in poll (__timeout=-1, __nfds=1, __fds=0x7fffe8000b40) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  runInput (pThrd=<optimized out>) at imuxsock.c:1523
#3  0x00005555555bdb6d in thrdStarter (arg=0x555555671940) at ../threads.c:239
#4  0x00007ffff7d8dfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007ffff77f44cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7ffff66e8700 (LWP 11625)):
#0  0x000055555559b0c7 in rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0001b40) at stringbuf.c:220
#1  0x00005555555ab128 in varDestruct (ppThis=0x7ffff66e79a0) at var.c:72
#2  0x00005555555965d5 in MsgDeserialize (pMsg=0x7ffff00018b0, pStrm=pStrm@entry=0x55555566c220) at msg.c:1422
#3  0x00005555555a235a in objDeserializeWithMethods (ppObj=0x7ffff66e7ae8, pszTypeExpected=pszTypeExpected@entry=0x5555555d6e93 "msg", lenTypeExpected=lenTypeExpected@entry=3, 
    pStrm=0x55555566c220, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x555555594200 <msgConstructForDeserializer>, objConstructFinalize=0x0, 
    objDeserialize=0x555555596560 <MsgDeserialize>) at obj.c:930
#4  0x00005555555ae56a in qDeqDisk (pThis=0x555555665340, ppMsg=<optimized out>) at queue.c:1029
#5  0x00005555555afabd in qqueueDeq (ppMsg=0x7ffff66e7ae8, pThis=0x555555665340) at queue.c:1804
#6  DequeueConsumableElements (pSkippedMsgs=0x7ffff66e7b54, piRemainingQueueSize=<synthetic pointer>, pWti=<optimized out>, pThis=0x555555665340) at queue.c:1804
#7  DequeueConsumable (pThis=pThis@entry=0x555555665340, pWti=pWti@entry=0x555555665b80, pSkippedMsgs=pSkippedMsgs@entry=0x7ffff66e7b54) at queue.c:1879
#8  0x00005555555aff00 in DequeueForConsumer (pSkippedMsgs=0x7ffff66e7b54, pWti=0x555555665b80, pThis=0x555555665340) at queue.c:2081
#9  ConsumerReg (pThis=0x555555665340, pWti=0x555555665b80) at queue.c:2081
#10 0x00005555555acd82 in wtiWorker (pThis=pThis@entry=0x555555665b80) at wti.c:429
#11 0x00005555555ab63a in wtpWorker (arg=0x555555665b80) at wtp.c:431
#12 0x00007ffff7d8dfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007ffff77f44cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7ffff7536700 (LWP 11618)):
#0  0x00007ffff7d9400c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
--Type <RET> for more, q to quit, c to continue without paging--
#1  0x00005555555b70a3 in lookupTableReloader (self=0x55555562cc70) at lookup.c:855
#2  0x00007ffff7d8dfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff77f44cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7ffff756fc40 (LWP 11613)):
#0  0x00007ffff77f44c1 in clone () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7d8cddf in create_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff7d8e80e in pthread_create@@GLIBC_2.2.5 () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00005555555bddec in thrdCreate (thrdMain=0x7ffff756ad80 <runInput>, afterRun=0x7ffff756ad20 <afterRun>, bNeedsCancel=bNeedsCancel@entry=1 '\001', 
    name=0x55555561b850 "imklog") at ../threads.c:285
#4  0x000055555558e77e in runInputModules () at rsconf.c:879
#5  activate (cnf=0x555555606fd0) at rsconf.c:879
#6  activate (cnf=0x555555606fd0) at rsconf.c:847
#7  0x000055555556ca30 in initAll (argc=<optimized out>, argv=<optimized out>) at rsyslogd.c:1624
#8  0x000055555556a049 in main (argc=3, argv=0x7fffffffec78) at rsyslogd.c:2116
Feb 01 08:20:21 an-test-presto1001 systemd[1]: Starting System Logging Service...
Feb 01 08:20:21 an-test-presto1001 rsyslogd[14149]: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.1901.0]
Feb 01 08:20:21 an-test-presto1001 rsyslogd[14149]: error starting up disk queue, using pure in-memory mode [v8.1901.0 try https://www.rsyslog.com/e/2026 ]
Feb 01 08:20:21 an-test-presto1001 systemd[1]: Started System Logging Service.
Feb 01 08:20:21 an-test-presto1001 rsyslogd[14149]: fwd_centrallog1001.eqiad.wmnet:6514 queue: error creating disk queue - giving up. [v8.1901.0 try https://www.rsyslog.com/e/2026 ]
Feb 01 08:20:21 an-test-presto1001 rsyslogd[14149]:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="14149" x-info="https://www.rsyslog.com"] start
Feb 01 08:20:21 an-test-presto1001 systemd[1]: rsyslog.service: Main process exited, code=killed, status=11/SEGV
elukey@an-test-presto1001:~$ dpkg -l rsyslog
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=========================================
ii  rsyslog        8.1901.0-1   amd64        reliable system and kernel logging daemon

Event Timeline

I was able to make rsyslog to start deleting the content of /var/spool/rsyslog/* on the host (looks similar to https://github.com/rsyslog/rsyslog/issues/2654). The old files were backed up in /home/elukey/rsyslog for later inspection, likely something got corrupted? Did it happen in the past as well?

Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? If there are other similar hosts to be reimaged/installed we should definitely keep an eye on those as well

Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? If there are other similar hosts to be reimaged/installed we should definitely keep an eye on those as well

I think so yes, didn't spot anything weird on this front (I created the VM with the usual workflow..)

elukey claimed this task.

After a chat with Filippo we concluded that the issue was originated due to the temporary root partition being full (it happened for a bit due to presto logging errors too much), ending up in corrupted/incomplete files under the rsyslog spool dir. This seems to be a good theory (so partially my PEBCAK), I am inclined to close and re-open if the issue re-occurs (it would be nice to avoid a segfault in these cases, but there is already a GH issue opened for this).