Page MenuHomePhabricator

uwsgi's logsocket_plugin.so causes segfaults during log rotation
Open, NormalPublic

Description

Didn't find anything opened in Phab, apologies if this is a duplicate :)

Icinga complains about netbox often during the last days, it seems that log-rotation triggers a segfault:

root@netmon1002:~# grep segfault /var/log/kern.log* | sort
/var/log/kern.log.1:Dec 24 06:25:37 netmon1002 kernel: [316905.732867] uwsgi[970]: segfault at 55be46c0a680 ip 00007f3b6f2c2c99 sp 00007ffcb0211170 error 6 in logsocket_plugin.so[7f3b6f2c2000+1000]
/var/log/kern.log.1:Dec 25 06:25:38 netmon1002 kernel: [403311.343778] uwsgi[40701]: segfault at 55918315ea70 ip 00007f61c2d48c99 sp 00007fff314c9630 error 6 in logsocket_plugin.so[7f61c2d48000+1000]
/var/log/kern.log.1:Dec 25 06:37:19 netmon1002 kernel: [404012.567719] uwsgi[36281]: segfault at 55dcebe54090 ip 00007f6eb4633c99 sp 00007ffcc2fb1df0 error 6 in logsocket_plugin.so[7f6eb4633000+1000]
/var/log/kern.log.1:Dec 28 06:25:35 netmon1002 kernel: [662521.325553] uwsgi[37829]: segfault at 555dc53e7010 ip 00007f438b5dbc99 sp 00007fff4d27cff0 error 6 in logsocket_plugin.so[7f438b5db000+1000]
/var/log/kern.log.1:Dec 29 06:25:37 netmon1002 kernel: [748926.888305] uwsgi[21700]: segfault at 562cc1bee450 ip 00007ff0dacf7c99 sp 00007ffc0135d330 error 6 in logsocket_plugin.so[7ff0dacf7000+1000]
/var/log/kern.log.1:Dec 29 06:36:57 netmon1002 kernel: [749607.180131] uwsgi[18950]: segfault at 55daab7fc0f0 ip 00007fb9cb777c99 sp 00007ffd570b01f0 error 6 in logsocket_plugin.so[7fb9cb777000+1000]
/var/log/kern.log.1:Dec 29 07:07:11 netmon1002 kernel: [751421.405018] uwsgi[22578]: segfault at 563d7d1a0f60 ip 00007f041c2a9c99 sp 00007ffe578d3e70 error 6 in logsocket_plugin.so[7f041c2a9000+1000]
/var/log/kern.log:Dec 30 06:25:36 netmon1002 kernel: [835330.168809] uwsgi[21279]: segfault at 5604838e72a0 ip 00007fb43b81ec99 sp 00007ffe6bf9ed70 error 6 in logsocket_plugin.so[7fb43b81e000+1000]

Details

Related Gerrit Patches:

Event Timeline

elukey created this task.Dec 30 2018, 8:03 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 30 2018, 8:03 AM
Volans added a subscriber: ayounsi.Dec 31 2018, 5:11 PM

Thanks @elukey, I'll have a look in the next days, at first look it seems that:

  • the logrotate config is the default one from the uwsgi package and doesn't point to where we actually log and where they are rotated independently (/srv/log/netbox).
  • we don't use options like log-reopen in uwsgi, something to look into to fine tune the configuration so that it doesn't trigger this error.
herron assigned this task to Volans.Jan 2 2019, 5:40 PM
herron triaged this task as High priority.
jijiki added a subscriber: jijiki.Jan 2 2019, 5:47 PM

Ah hah you beat me to merging it.

I am looking at this a bit, and I find it curious that logsocket plugin specifically is where the crash is happening since that is what is logging to logstash presumably (and not involved with the on-disk logs).

Side note, I have not been able to reproduce this in our test instance of netmon, even though it happens regularly in production.

Also the position in the library is the same in all of these segfaults (and I don't know enough x86 assembly to really understand whats happening there).

Joe added a subscriber: Joe.Apr 8 2019, 6:36 AM

FWIW, this happened again today. Can we please try to get to the bottom of this?

elukey added a comment.May 5 2019, 2:39 PM

I am wondering if we could do the following:

  1. make sure that uwsgi is able to dump a core when segfaulting (afaics it can't atm due to ulimits). Maybe temporary adding LimitCORE=infinity to the uwsgi unit and reloading?
  2. wait for a segfault to happen (either manually trying to reload as logrotate does or waiting for the next occurrence) and then use gdb with the uwsgi-dbg debian symbols package to get a meaningful stacktrace.

@Volans @crusnov what do you think?

I agree with this approach, and it's what I was pursuing some months ago. I have merged some time back support for uwsgi::app to set its LimitCORE for this very purpose[1]. Putting this into production should be trivial.

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/493294

Change 508173 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] service::uwsgi: add the core_limit parameter

https://gerrit.wikimedia.org/r/508173

Change 508174 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] netbox: set the uwsgi's systemd LimitCore setting to 30G

https://gerrit.wikimedia.org/r/508174

elukey added a comment.May 6 2019, 7:00 AM

Added a couple of code reviews as attempt to add the LimitCore to the netbox's systemd unit. If this is not the idea that you guys had, please feel free to discard them :)

Change 508173 merged by Elukey:
[operations/puppet@production] service::uwsgi: add the core_limit parameter

https://gerrit.wikimedia.org/r/508173

Change 508174 merged by Elukey:
[operations/puppet@production] netbox: set the uwsgi's systemd LimitCore setting to 30G

https://gerrit.wikimedia.org/r/508174

Mentioned in SAL (#wikimedia-operations) [2019-05-06T09:35:52Z] <elukey> restart netbox on netmon1002 (trying to reproduce the segfault) - T212697

elukey added a comment.May 6 2019, 9:42 AM

A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says:

Core was generated by `/usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/netbox.ini'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ffa725bac99 in uwsgi_socket_logger (ul=0x5645dde12c00,
    message=0x5645ddff3df0 "{\"@timestamp\":\"2019-05-06T09:34:10\",\"type\":\"netbox\",\"logger_name\":\"uwsgi\",\"host\":\"netmon1002\",\"level\":\"INFO\",\"message\":\"compiled with version: 6.3.0 20170516 on 17 March 2018 15:41:47\"}s\027\336EV", len=185)
    at plugins/logsocket/logsocket_plugin.c:56
56	plugins/logsocket/logsocket_plugin.c: No such file or directory.
(gdb) thread apply all bt

Thread 1 (Thread 0x7ffa78d9b780 (LWP 31292)):
#0  0x00007ffa725bac99 in uwsgi_socket_logger (ul=0x5645dde12c00,
    message=0x5645ddff3df0 "{\"@timestamp\":\"2019-05-06T09:34:10\",\"type\":\"netbox\",\"logger_name\":\"uwsgi\",\"host\":\"netmon1002\",\"level\":\"INFO\",\"message\":\"compiled with version: 6.3.0 20170516 on 17 March 2018 15:41:47\"}s\027\336EV", len=185)
    at plugins/logsocket/logsocket_plugin.c:56
#1  0x00005645dd369ce6 in uwsgi_log_func_do (encoders=<optimized out>, ul=0x5645dde12c00,
    msg=0x5645de178560 "compiled with version: 6.3.0 20170516 on 17 March 2018 15:41:47\n\300}\027\336EV", len=len@entry=64) at core/logging.c:1377
#2  0x00005645dd36cc0e in uwsgi_master_log () at core/logging.c:1421
#3  0x00005645dd3a8f25 in uwsgi_flush_logs () at core/uwsgi.c:1891
#4  0x00007ffa7697c940 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffa7697c99a in exit () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00005645dd35da6f in uwsgi_exit (status=<optimized out>) at core/utils.c:4411
#7  0x00005645dd3aaf83 in uwsgi_segfault (signum=11) at core/uwsgi.c:1839
#8  <signal handler called>
#9  0x00007ffa725bac99 in uwsgi_socket_logger (ul=0x5645dde12c00,
    message=0x5645ddeb2c10 "{\"@timestamp\":\"2019-05-06T09:34:10\",\"type\":\"netbox\",\"logger_name\":\"uwsgi\",\"host\":\"netmon1002\",\"level\":\"INFO\",\"message\":\"*** Starting uWSGI 2.0.14-debian (64bit) on [Mon May  6 09:34:09 2019] ***\"}", len=196)
    at plugins/logsocket/logsocket_plugin.c:56
#10 0x00005645dd369ce6 in uwsgi_log_func_do (encoders=<optimized out>, ul=0x5645dde12c00,
    msg=0x5645df3f51e0 "*** Starting uWSGI 2.0.14-debian (64bit) on [Mon May  6 09:34:09 2019] ***\n", len=len@entry=75) at core/logging.c:1377
#11 0x00005645dd36cc0e in uwsgi_master_log () at core/logging.c:1421
#12 0x00005645dd39cb16 in uwsgi_master_manage_events (interesting_fd=4) at core/master_events.c:11
#13 0x00005645dd36f875 in master_loop (argv=0x5645dde080c0, environ=<optimized out>) at core/master.c:757
#14 0x00005645dd3aedfa in uwsgi_run () at core/uwsgi.c:3280
#15 0x00005645dd35a38e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at core/uwsgi.c:2012

Highlight to:

#7  0x00005645dd3aaf83 in uwsgi_segfault (signum=11) at core/uwsgi.c:1839
#8  <signal handler called>
#9  0x00007ffa725bac99 in uwsgi_socket_logger (ul=0x5645dde12c00,
    message=0x5645ddeb2c10 "{\"@timestamp\":\"2019-05-06T09:34:10\",\"type\":\"netbox\",\"logger_name\":\"uwsgi\",\"host\":\"netmon1002\",\"level\":\"INFO\",\"message\":\"*** Starting uWSGI 2.0.14-debian (64bit) on [Mon May  6 09:34:09 2019] ***\"}", len=196)
    at plugins/logsocket/logsocket_plugin.c:56

IIUC the segfault happens at frame 9, and then the related handler is called. This is the problematic code:

https://github.com/unbit/uwsgi/blob/uwsgi-2.0/plugins/logsocket/logsocket_plugin.c#L56

With gdb, I used frame 9 to print the state of the ul argument:

(gdb) frame
#9  0x00007ffa725bac99 in uwsgi_socket_logger (ul=0x5645dde12c00,
    message=0x5645ddeb2c10 "{\"@timestamp\":\"2019-05-06T09:34:10\",\"type\":\"netbox\",\"logger_name\":\"uwsgi\",\"host\":\"netmon1002\",\"level\":\"INFO\",\"message\":\"*** Starting uWSGI 2.0.14-debian (64bit) on [Mon May  6 09:34:09 2019] ***\"}", len=196)
    at plugins/logsocket/logsocket_plugin.c:56
56	in plugins/logsocket/logsocket_plugin.c
(gdb) print *ul
$15 = {name = 0x7ffa725badd9 "socket", id = 0x5645dde12be0 "logstash", func = 0x7ffa725bab40 <uwsgi_socket_logger>, configured = 1, fd = 91,
  data = 0x0, addr = {sa = {sa_family = 2, sa_data = ",\372\n\002\002$\000\000\000\000\000\000\000"}, sa_in = {sin_family = 2, sin_port = 64044,
      sin_addr = {s_addr = 604111370}, sin_zero = "\000\000\000\000\000\000\000"}, sa_un = {sun_family = 2,
      sun_path = ",\372\n\002\002$\000\000\000\000\000\000\000\000\260\217:\335EV", '\000' <repeats 11 times>, " \000\000\000\000\000\000\374\314<\335EV\000\000\000\000\000\000\000\000\000\000\300^=\335EV\000\000`\204:\335EV\000\000|\354`\335EV\000\000\000\000\000\000\000\000\000\000\r\315<\335EV\000\000\002\000\000\000\000\000\000\000\370^=\335EV"}, sa_in6 = {sin6_family = 2, sin6_port = 64044, sin6_flowinfo = 604111370, sin6_addr = {
        __in6_u = {__u6_addr8 = "\000\000\000\000\000\000\000\000\260\217:\335EV\000", __u6_addr16 = {0, 0, 0, 0, 36784, 56634, 22085, 0},
          __u6_addr32 = {0, 0, 3711602608, 22085}}}, sin6_scope_id = 0}}, addr_len = 16, count = 22085, msg = {msg_name = 0x5645dde12c28,
    msg_namelen = 16, msg_iov = 0x5645df3c0ad0, msg_iovlen = 1, msg_control = 0x0, msg_controllen = 0, msg_flags = 0}, buf = 0x0,
  arg = 0x5645dde128b0 "logstash.svc.eqiad.wmnet:11514", next = 0x0}

Some notes:

  • ul->configured seems set to 1
  • ul->data is 0x0
  • ul->count is 22085

When we get to line 56, is it clear that accessing the ul->msg.msg_iov[22085] doesn't work:

(gdb) print ul->msg.msg_iov[22085]
Cannot access memory at address 0x5645df416f20

Not an expert in iov so this could end up straightforward to fix solve, I have no idea :) If nobody comes up with a solution I'd open a task to upstream asking for their suggestions.

Volans added a comment.May 6 2019, 6:30 PM

@elukey thanks a lot for deep dive and the bug upstream!

@MoritzMuehlenhoff we have an interesting segfault that happens for uwsgi when systemctl restart the netbox unit, but only in production and not in labs. Cas and I tried several times to reproduce the bug without any luck. After opening https://github.com/unbit/uwsgi/issues/2010, there is a promising patch that may fix the issue, so in order to keep mental sanity I propose the following:

  • download the source of the uwsgi deb from Debian
  • apply the path outlined in https://github.com/unbit/uwsgi/issues/2010 (one liner in the logsocket_plugin.so lib)
  • build and deploy the resulting uwsgi-core deb (that contains the .so afaics) to netmon1002 and see if the segfault is fixed.

Does it sound acceptable? Or do you prefer another way?

The patch seems sane and simple. I concur with this plan fwiw.

elukey added a comment.May 7 2019, 5:50 PM

On boron I have built uwsgi-core_2.0.14+20161117-3+deb9u2~wmf1_amd64.deb with the following patch:

elukey@boron:~/uwsgi-2.0.14+20161117$ cat debian/patches/1011_fix_logsocket_plugin.patch
Index: uwsgi-2.0.14+20161117/plugins/logsocket/logsocket_plugin.c
===================================================================
--- uwsgi-2.0.14+20161117.orig/plugins/logsocket/logsocket_plugin.c
+++ uwsgi-2.0.14+20161117/plugins/logsocket/logsocket_plugin.c
@@ -43,6 +43,7 @@ ssize_t uwsgi_socket_logger(struct uwsgi
                else {
                        ul->msg.msg_iov = uwsgi_malloc(sizeof(struct iovec));
                        ul->msg.msg_iovlen = 1;
+                        ul->count = 0;
                }

                if (comma) {

Does it sound acceptable? Or do you prefer another way?

Sounds great!

Mentioned in SAL (#wikimedia-operations) [2019-05-08T07:45:23Z] <elukey> install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon1002 to test a uwsgi bug fix - T212697

elukey added a comment.May 8 2019, 7:51 AM

The fix looks very promising, I have restarted 3 times in a row uwsgi-netbox and no trace of the segfault. Let's wait for tomorrow's round of logrotate restarts to confirm that we are good.

elukey added a comment.May 8 2019, 7:53 AM

Side note: I don't find any message in Kibana related to netmon1002 or netbox, not sure if normal or not.

Mentioned in SAL (#wikimedia-operations) [2019-05-08T09:24:32Z] <elukey> install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon2001 to test a uwsgi bug fix - T212697

elukey added a comment.May 9 2019, 7:02 AM

No segfaults today, logrotate ran fine! I don't see any log in logstash though, worth to investigate? (Maybe in a separate task if confirmed by others).

I updated the github issue, once a commit is done to uwsgi's master I'll open a bug report to debian.

Mentioned in SAL (#wikimedia-operations) [2019-05-09T08:23:58Z] <elukey> upload uwsgi 2.0.14+20161117-3+deb9u2+wmf1 packages to stretch-wikimedia - T212697

Change 509009 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] netbox: Mask the default uwsgi service for Netbox

https://gerrit.wikimedia.org/r/509009

Change 509009 merged by Muehlenhoff:
[operations/puppet@production] netbox: Mask the default uwsgi service for Netbox

https://gerrit.wikimedia.org/r/509009

Upstream commit: https://github.com/unbit/uwsgi/commit/d642e635b3d558ce91e80442c74f4d16b9d81146

Next step is to open a bug to Debian to ask a patch to the uwsgi package!

elukey added a comment.EditedMay 13 2019, 8:22 AM

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928927

Once this gets resolved and a new version of uwsgi is released for Debian, we will need to roll it our where needed.

elukey lowered the priority of this task from High to Normal.May 13 2019, 8:23 AM