
RESTBase shutting down spontaneously
Closed, Resolved · Public

Description

This has happened half a dozen times by now: a RESTBase master just shuts down. Latest example, from restbase1012:

root@restbase1012:~# systemctl status restbase
● restbase.service - "restbase service"
   Loaded: loaded (/lib/systemd/system/restbase.service; enabled)
   Active: inactive (dead) since Fri 2016-06-03 16:31:45 UTC; 12min ago
  Process: 11607 ExecStart=/usr/bin/firejail --blacklist=root --blacklist=/home/* --tmpfs=/tmp --caps --seccomp /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 11607 (code=exited, status=0/SUCCESS)

Jun 03 16:31:44 restbase1012 firejail[11607]: Parent pid 11607, child pid 11608
Jun 03 16:31:44 restbase1012 firejail[11607]: parent is shutting down, bye...

Nothing visible in the logs for this period. Is this firejail spontaneously shutting it down?

Event Timeline

Eevans subscribed.

This occurred again just now on restbase1013:

eevans@restbase1013:~$ sudo systemctl status -l restbase
● restbase.service - "restbase service"
   Loaded: loaded (/lib/systemd/system/restbase.service; enabled)
   Active: inactive (dead) since Wed 2016-07-13 17:12:31 UTC; 7min ago
 Main PID: 16945 (code=exited, status=0/SUCCESS)

Jul 13 17:12:30 restbase1013 firejail[16945]: Parent pid 16945, child pid 16946
Jul 13 17:12:30 restbase1013 firejail[16945]: parent is shutting down, bye...

The parent / child pid line seems to come from https://github.com/netblue30/firejail/blob/a344c555ff282c23a8274d10ad0f75eb4fae6836/src/firejail/main.c#L2180, and seems to be related to a child process starting. The exit happens right after in the logs, so perhaps the waitpid() further down returns immediately, which then calls myexit(0)?
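
To illustrate the suspected flow, here is a rough Python analog (a simplified sketch based on a reading of the linked main.c, not firejail's actual code): the parent forks the child, prints the pids, waits for the child once, and then exits 0 regardless of how the child terminated.

import os
import sys
import time

# Simplified Python analog of the suspected firejail parent flow,
# for illustration only.
pid = os.fork()
if pid == 0:
    # child: stands in for the sandboxed nodejs process
    time.sleep(1)
    sys.exit(1)

print "Parent pid %d, child pid %d" % (os.getpid(), pid)
_, status = os.waitpid(pid, 0)   # returns as soon as the child exits
print "parent is shutting down, bye..."
sys.exit(0)                      # the child's non-zero status is discarded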

> The parent / child pid line seems to come from https://github.com/netblue30/firejail/blob/a344c555ff282c23a8274d10ad0f75eb4fae6836/src/firejail/main.c#L2180, and seems to be related to a child process starting. The exit happens right after in the logs, so perhaps the waitpid() further down returns immediately, which then calls myexit(0)?

FWIW, firejail exits with the status passed to myexit(int), and I think systemd is telling us that is 0.

Main PID: 16945 (code=exited, status=0/SUCCESS)

Another thing to add here. The timing of these log entries is very suspicious:

Jul 13 17:12:30 restbase1013 firejail[16945]: Parent pid 16945, child pid 16946
Jul 13 17:12:30 restbase1013 firejail[16945]: parent is shutting down, bye...

I think the first line should be logged at startup. But nothing is (I can't find the startup logged prior to this shutdown, and it logged nothing when I started it back up). I suspect the first message has been sitting buffered in stdout ([[ https://github.com/netblue30/firejail/blob/a344c555ff282c23a8274d10ad0f75eb4fae6836/src/firejail/main.c#L2180 | all of this is the result of printfs ]]) and only appears at the same time as the "bye".
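
To make the buffering theory concrete, here is a minimal sketch (generic stdio behaviour, not firejail itself): output written to a pipe is block-buffered, so an early message can sit in the buffer for the lifetime of the process and only show up when the buffer is flushed at exit.

import sys
import time

# When stdout is a pipe (e.g. captured by the journal), it is block-buffered,
# so this startup message sits in the buffer instead of being logged now...
sys.stdout.write("Parent pid ..., child pid ...\n")

time.sleep(30)   # ...the process then runs for a while...

# ...and both messages only appear together when the buffer is flushed at exit.
sys.stdout.write("parent is shutting down, bye...\n")

Run directly on a terminal, the two lines appear as they are written (line buffering); run piped through cat (python demo.py | cat), they both appear only at the end.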

> FWIW, firejail exits with the status passed to myexit(int), and I think systemd is telling us that is 0.

Yup, which is why I suspected that we are hitting the fall-through case, which just calls myexit(0).

elukey triaged this task as High priority. Jul 19 2016, 12:50 PM

@Eevans, @GWicke: How should we proceed? I don't see an official owner for this task and it seems really important. Ops can definitely help, but the effort should probably be driven by Services?

Thanks!

mobrovac added a subscriber: MoritzMuehlenhoff.

Judging by the log output, we suspect this is caused by an interaction between systemd and firejail.

@MoritzMuehlenhoff would it be worth dropping firejail altogether on Jessie boxes at this point and using systemd's cgroup capabilities to contain the services?

This happened en masse today:

(15:43:50) icinga-wm: PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.110, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
(15:43:50) icinga-wm: PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
(15:43:51) icinga-wm: PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
(15:43:51) icinga-wm: PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec
(15:43:59) icinga-wm: PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200)
(15:43:59) icinga-wm: PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec
(15:44:19) icinga-wm: PROBLEM - restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
(15:44:41) icinga-wm: PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
(15:44:41) icinga-wm: PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.134, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))

I will disable firejail for RB for the time being until we come up with a real solution.

Any production service running on systemd and not having

Restart=always

is a large liability as shown by the outage we just experienced.

This behaviour was introduced in I9dfc37a5 and I think it was largely an error.
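
For illustration, the relevant unit stanza would look roughly like this (indicative values, not the actual restbase unit):

[Service]
# restart the service whenever it exits, regardless of exit status
Restart=always
# wait a couple of seconds before restarting, to avoid a tight crash loop
RestartSec=2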

Change 300275 had a related patch set uploaded (by Giuseppe Lavagetto):
restbase: have systemd restart failed nodes

https://gerrit.wikimedia.org/r/300275

Change 300276 had a related patch set uploaded (by Mobrovac):
RESTBase: disable firejail

https://gerrit.wikimedia.org/r/300276

> Any production service running on systemd and not having
>
> Restart=always
>
> is a large liability as shown by the outage we just experienced.
>
> This behaviour was introduced in I9dfc37a5 and I think it was largely an error.

I beg to differ. Is it really tolerable to have services killed spontaneously in prod? I don't think so.

Crashes happen. We need to be able to survive a mass crash (we can on the appservers precisely because upstart restarts the services).

This is not an argument against solving the underlying problem (be it firejail or whatever else); it's just about using the init system instead of a human to restart a crashed application.

From https://gerrit.wikimedia.org/r/#/c/300276/ by @Joe:

> We still have no proof it's firejail killing restbase and we have no idea of a root cause.
> Also happening everywhere at the same time suggests something different than "firejail randomly killing rb".

root@restbase1008:~# journalctl -u restbase
Jul 21 13:41:08 restbase1008 firejail[45822]: Parent pid 45822, child pid 45824
Jul 21 13:41:08 restbase1008 firejail[45822]: parent is shutting down, bye...

... from https://github.com/netblue30/firejail/blob/6462230ff0e5cf09a1c35ee5c737c86765edaa09/src/firejail/main.c#L129

After investigating this for a while, I am now fairly certain that the master process exit was indeed caused by a DNS resolution failure. I have pushed changes to the kad and limitation libraries to better handle transient DNS resolution errors during UDP writes. I also updated service-runner master to explicitly pull in limitation v0.1.9, although older versions will also pull it in, as it is semver-compatible with the specified pattern ^0.1.8. Overall, these changes should let service-runner services handle DNS outages gracefully, without affecting the master process.
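
As a rough Python analog of what those library changes aim for (the actual fixes are in the node kad and limitation libraries; the helper below is hypothetical and for illustration only): a DNS lookup failure during a UDP write is caught and treated as transient instead of propagating up and taking down the master process.

import socket

def send_datagram(host, port, payload):
    # hypothetical best-effort UDP send, for illustration only
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # name resolution happens as part of the send and can fail transiently
        sock.sendto(payload, (host, port))
    except socket.gaierror as err:
        # transient DNS failure: drop this datagram and keep the process up
        print "DNS lookup for %s failed (%s); dropping datagram" % (host, err)
    finally:
        sock.close()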

To investigate why we didn't get any error messages or status codes in the logs, I simulated an uncaught exception in the service-runner master process:

{"name":"restbase","hostname":"pad","pid":23393,"level":60,"err":{"message":"Test","name":"Error","stack":"Error: Test\n    at null._onTimeout (/home/gabriel/src/restbase/node_modules/service-runner/lib/master.js:15:11)\n    at Timer.listOnTimeout (timers.js:92:15)","levelPath":"fatal/service-runner/unhandled"},"msg":"Test","time":"2016-07-21T21:53:18.131Z","v":0}
gabriel@pad:~/src/restbase$ echo $?
1

This suggests that there are several issues in firejail:

  • Error return status codes (like -1) are turned into status 0.
  • stdout is apparently ignored & does not make it into the systemd journal.

Change 300275 merged by Giuseppe Lavagetto:
restbase: have systemd restart failed nodes

https://gerrit.wikimedia.org/r/300275

> • stdout is apparently ignored & does not make it into the systemd journal.

This is unrelated to firejail and needs to be fixed in service::node; systemd only redirects stdout/stderr when StandardOutput and StandardError are set in the systemd unit (see the systemd.exec manpage for the details of the format).
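
For reference, those directives look like this in a unit file (illustrative snippet):

[Service]
# route the process's stdout and stderr to the systemd journal
StandardOutput=journal
StandardError=journal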

> • Error return status codes (like -1) are turned into status 0.

I can't reproduce this (https://phabricator.wikimedia.org/P3569) and I'm sure such a grave bug would have been noticed earlier. Can you provide more details on the conditions under which that would occur?

After yesterday's deployment of the changes mentioned in T136957#2485532, today's network issues did not result in any RB master processes exiting. This strongly suggests that we did indeed identify and address the root cause.

>> • stdout is apparently ignored & does not make it into the systemd journal.
>
> This is unrelated to firejail and needs to be fixed in service::node; systemd only redirects stdout/stderr when StandardOutput and StandardError are set in the systemd unit (see the systemd.exec manpage for the details of the format).

I see, good point. Given that production services normally don't log anything to stdout, we should probably capture stderr & stdout by default.

Edit: Patch at https://gerrit.wikimedia.org/r/301309.

>> • Error return status codes (like -1) are turned into status 0.
>
> I can't reproduce this (https://phabricator.wikimedia.org/P3569) and I'm sure such a grave bug would have been noticed earlier. Can you provide more details on the conditions under which that would occur?

We don't have much information beyond the minimal lines returned by firejail. When I reproduced the error that pretty certainly led to the master process exiting, I consistently got a return status of 1 from RB. I am pretty sure that there is nothing catching errors & turning them into 0 exits. That said, we don't have conclusive proof that RB did not exit with 0.

About the only thing I could imagine, based on previous experience with upstart, is that firejail could be confused about which process is actually the master. As far as I know it is not (ab)using ptrace, though, so this seems less likely.

Change 301309 had a related patch set uploaded (by GWicke):
Service::node: Capture stdout and stderr in journal

https://gerrit.wikimedia.org/r/301309

> This suggests that there are several issues in firejail:
>
> • Error return status codes (like -1) are turned into status 0.

I can confirm this:

root@restbase1008:~# /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp  false
Parent pid 27055, child pid 27056
The new log directory is /proc/27056/root/var/log
Interface           IP                  Mask                Status              
lo                  127.0.0.1           255.0.0.0           UP                  
lo:LVS              10.2.2.17           255.255.255.255     UP                  
eth0                10.64.32.178        255.255.252.0       UP                  
eth0                10.64.32.187        255.255.255.255     UP                  
eth0                10.64.32.195        255.255.255.255     UP                  
eth0                10.64.32.196        255.255.255.255     UP                  

Child process initialized

parent is shutting down, bye...
root@restbase1008:~# echo $?
0

This is on restbase1008, with firejail 0.9.26. On my stretch system running firejail 0.9.40-3, though, I can't reproduce it. So this was probably fixed in a newer upstream release.

> • stdout is apparently ignored & does not make it into the systemd journal.

I can also confirm this one:

root@restbase1008:~# /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp "echo foo"
Parent pid 29417, child pid 29418
The new log directory is /proc/29418/root/var/log
Interface           IP                  Mask                Status              
lo                  127.0.0.1           255.0.0.0           UP                  
lo:LVS              10.2.2.17           255.255.255.255     UP                  
eth0                10.64.32.178        255.255.252.0       UP                  
eth0                10.64.32.187        255.255.255.255     UP                  
eth0                10.64.32.195        255.255.255.255     UP                  
eth0                10.64.32.196        255.255.255.255     UP                  

Child process initialized
foo

parent is shutting down, bye...

"foo" was printed in stdout indeed. The same behavior applies to stderr as well.

Looking in the journal for restbase, we can find the initial message from firejail ("Parent pid ...") but not the rest ("The new log directory..."). This probably means that the output of firejail's child is not being logged, which would include everything printed by the executed process as well.

So yes, this is probably some systemd/firejail interaction issue and not RESTBase or node's fault.

firejail 0.9.38 is available in jessie-wikimedia, but restbase* isn't upgraded to the new version yet (I haven't checked how 0.9.38 compares to 0.9.40).

> So yes, this is probably some systemd/firejail interaction issue and not RESTBase or node's fault.

I tried to confirm this with the following systemd unit on a jessie system

[Unit]
Description="test-firejail service"
After=network.target

[Service]
# up ulimit -n a bit
LimitNOFILE=10000
WorkingDirectory=/tmp
ExecStart=/usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp "echo foo" 

[Install]
WantedBy=multi-user.target

and got this as a result

Jul 28 16:17:26 scb2001 systemd[1]: Starting "test-firejail service"...
Jul 28 16:17:26 scb2001 systemd[1]: Started "test-firejail service".
Jul 28 16:17:26 scb2001 firejail[119054]: foo
Jul 28 16:17:26 scb2001 firejail[119054]: Parent pid 119054, child pid 119055
Jul 28 16:17:26 scb2001 firejail[119054]: The new log directory is /proc/119055/root/var/log
Jul 28 16:17:26 scb2001 firejail[119054]: parent is shutting down, bye...

So when a program exits cleanly, nothing devious happens and the logs seem to make it to the journal. I then launched, in the same way, a simple Python script that prints a line every second and then quits with sys.exit(1); again, all of its output was reported:

import time
import sys
i = 1
while i:
    print "Iteration %d" % i
    i += 1
    if i == 5:
        sys.exit(1)
    time.sleep(1)

And I can see output in the journal:

Jul 28 16:25:27 scb2001 systemd[1]: Starting "test-firejail service"...
Jul 28 16:25:27 scb2001 systemd[1]: Started "test-firejail service".
Jul 28 16:25:30 scb2001 firejail[119732]: Iteration 1
Jul 28 16:25:30 scb2001 firejail[119732]: Iteration 2
Jul 28 16:25:30 scb2001 firejail[119732]: Iteration 3
Jul 28 16:25:30 scb2001 firejail[119732]: Iteration 4
Jul 28 16:25:30 scb2001 firejail[119732]: Parent pid 119732, child pid 119733
Jul 28 16:25:30 scb2001 firejail[119732]: The new log directory is /proc/119733/root/var/log
Jul 28 16:25:30 scb2001 firejail[119732]: parent is shutting down, bye...

Finally, I tried the same Python script, letting it run continuously and killing it with kill -9. In this case I got:

Jul 28 16:28:54 scb2001 systemd[1]: Starting "test-firejail service"...
Jul 28 16:28:54 scb2001 systemd[1]: Started "test-firejail service".
Jul 28 16:29:21 scb2001 firejail[121488]: Parent pid 121488, child pid 121489
Jul 28 16:29:21 scb2001 firejail[121488]: The new log directory is /proc/121489/root/var/log
Jul 28 16:29:21 scb2001 firejail[121488]: parent is shutting down, bye...

No output from the process! The same happens when I use systemctl stop test-firejail.service; no output is reported:

Jul 28 16:30:13 scb2001 systemd[1]: Started "test-firejail service".
Jul 28 16:30:21 scb2001 systemd[1]: Stopping "test-firejail service"...
Jul 28 16:30:21 scb2001 systemd[1]: test-firejail.service: main process exited, code=exited, status=1/FAILURE
Jul 28 16:30:21 scb2001 firejail[121605]: Parent pid 121605, child pid 121606
Jul 28 16:30:21 scb2001 firejail[121605]: The new log directory is /proc/121606/root/var/log
Jul 28 16:30:21 scb2001 firejail[121605]: Signal 15 caught, shutting down the child process
Jul 28 16:30:21 scb2001 firejail[121605]: parent is shutting down, bye...
Jul 28 16:30:21 scb2001 systemd[1]: Stopped "test-firejail service".
Jul 28 16:30:21 scb2001 systemd[1]: Unit test-firejail.service entered failed state.

Notice the failed state (because the child doesn't handle any signal itself, I guess).

So I added

StandardOutput=tty

to the unit, and magically everything was being logged in real time to /dev/console (the default destination when you set StandardOutput to tty).

This seems to confirm that firejail, when not attached to a tty, buffers its stdout and stderr output, and for some reason that buffered output doesn't appear in the journal when it exits.

EDIT: I confirmed the output is buffered; what is amazing is that whatever is still in the buffer when firejail exits doesn't seem to make it to the journal at all.
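
One way to double-check the buffering theory with the same test script (a suggested variation, not something tried above) is to flush stdout explicitly after every line, or to start the interpreter unbuffered with python -u:

import sys
import time

i = 1
while True:
    print "Iteration %d" % i
    # push each line out immediately, so nothing is left sitting
    # in the buffer if the process is killed
    sys.stdout.flush()
    i += 1
    time.sleep(1)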

Change 301309 abandoned by GWicke:
Service::node: Capture stdout and stderr in journal

Reason:
We narrowed down this issue to firejail buffering stdout / stderr. The systemd default is already to redirect to the journal, so this patch is not actually needed.

https://gerrit.wikimedia.org/r/301309

The restbase cluster has been upgraded to firejail 0.9.40-3, which addresses the incorrect exit-code handling. The log passthrough problem is still TBD.

Change 300276 abandoned by Mobrovac:
RESTBase: disable firejail

https://gerrit.wikimedia.org/r/300276

I think the log passthrough problem has been resolved as well; services now get a syslog.log file next to the main log file. So, should we resolve this?

mobrovac claimed this task.
mobrovac removed a project: Patch-For-Review.
mobrovac removed a subscriber: gerritbot.

Yup. We haven't seen this happening since upgrading firejail and pushing logs to a file. Closing.