
thumbor memory limits for main process and subprocesses
Closed, Declined (Public)

Description

Thumbor is running with MemoryLimit=512M at the moment and the limit is hit frequently:

thumbor1001:~$ sudo head -2 /var/log/syslog
Sep 14 06:25:04 thumbor1001 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1131" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 14 06:25:42 thumbor1001 prometheus-node-exporter[96676]: time="2016-09-14T06:25:42Z" level=error msg="ERROR: mdadm collector failed after 0.000215s: error parsing mdstatus: error parsing mdstat: too few matches found in buildline:       \tresync=PENDING" source="node_exporter.go:91"
thumbor1001:~$ sudo fgrep 'Kill process' -c /var/log/syslog
116
thumbor1001:~$ date
Wed Sep 14 09:07:24 UTC 2016
thumbor1001:~$

(the prometheus-node-exporter issue has been reported upstream)

Since the limit also includes external programs spawned by thumbor I'll try raising the limit to 1G and see if that helps.

Additionally, we should limit the memory used by thumbor's subprocesses separately, so that a subprocess OOMing doesn't take down the thumbor instance.

Event Timeline

Change 310500 had a related patch set uploaded (by Filippo Giunchedi):
thumbor: set MemoryLimit to 1G

https://gerrit.wikimedia.org/r/310500

Change 310500 merged by Filippo Giunchedi:
thumbor: set MemoryLimit to 1G

https://gerrit.wikimedia.org/r/310500

Is there any way to know which requests were killed by that?

We could probably look at the access log from thumbor and correlate the time. e.g. the last time it happened at 09:55:45 for thumbor@8823

Sep 14 09:55:34 thumbor1001 thumbor@8823[57889]: tornado.access:WARNING 404 GET /wikipedia/commons/thumb/d/db/StPhotios.jpg/220px-StPhotios.jpg (127.0.0.1) 212.77ms
Sep 14 09:55:41 thumbor1001 thumbor@8823[57889]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-en-local-public.20/2/20/Tango-video-x-generic.png: HTTP 404: Not Found
Sep 14 09:55:41 thumbor1001 thumbor@8823[57889]: tornado.access:WARNING 404 GET /wikipedia/en/thumb/2/20/Tango-video-x-generic.png/40px-Tango-video-x-generic.png (127.0.0.1) 74.56ms
Sep 14 09:55:45 thumbor1001 thumbor@8823[57889]: Parent pid 57889, child pid 57891
Sep 14 09:55:45 thumbor1001 thumbor@8823[57889]: Parent is shutting down, bye...
Sep 14 09:55:46 thumbor1001 thumbor@8823[82366]: Reading profile /etc/firejail/thumbor.profile
Sep 14 09:55:46 thumbor1001 thumbor@8823[82366]: Parent pid 82366, child pid 82368
Sep 14 09:55:47 thumbor1001 thumbor@8823[82366]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.94/9/94/Provincia_di_Brescia-Stemma.png: HTTP 404: Not Found
Sep 14 09:55:47 thumbor1001 thumbor@8823[82366]: tornado.access:WARNING 404 GET /wikipedia/commons/thumb/9/94/Provincia_di_Brescia-Stemma.png/20px-Provincia_di_Brescia-Stemma.png (127.0.0.1) 72.47ms

It looks like it's dying without an error. I.e. it might have been processing a request and everything was fine from thumbor's perspective. I'll turn the debug logging on to see if I can catch a live one.
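The correlation described above can be scripted. What follows is a minimal sketch, not part of Thumbor: it assumes the syslog line layout seen in the excerpts in this task, and the function name and the `context` parameter are hypothetical. For each "Parent is shutting down" line it reports the last few messages logged by the same unit and pid.

```python
import re
from collections import defaultdict, deque

# Assumed layout, taken from the excerpts in this task:
# "Sep 14 09:55:45 thumbor1001 thumbor@8823[57889]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ "
    r"(?P<unit>thumbor@\d+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
SHUTDOWN_MARKER = "Parent is shutting down"


def last_lines_before_shutdown(lines, context=3):
    """Return (unit, pid, timestamp, preceding messages) for each shutdown."""
    # One bounded buffer per (unit, pid), so interleaved instances stay apart.
    recent = defaultdict(lambda: deque(maxlen=context))
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        key = (match.group("unit"), match.group("pid"))
        msg = match.group("msg")
        if SHUTDOWN_MARKER in msg:
            events.append((key[0], key[1], match.group("ts"), list(recent[key])))
            recent.pop(key, None)
        elif not msg.startswith("Parent pid"):
            # Skip the firejail pid banner; it says nothing about the request.
            recent[key].append(msg)
    return events
```

The shutdown message presumably also appears on an ordinary restart, so the timestamps still need to be checked against the 'Kill process' entries before blaming a given request.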

It seems to happen around 17 times/hour on average, meaning 0.007% of requests would die that way. If only thumb hits are considered, that's 0.12% of thumb hits.
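As a sanity check on those percentages, the hourly volumes they imply can be derived from the three figures quoted in this comment. This is only arithmetic on those numbers, not a measurement, and the variable names are made up for illustration.

```python
# Figures quoted in the comment above.
kills_per_hour = 17
share_of_all_requests = 0.007 / 100   # 0.007%
share_of_thumb_hits = 0.12 / 100      # 0.12%

# kills = share * volume, so volume = kills / share.
implied_requests_per_hour = kills_per_hour / share_of_all_requests
implied_thumb_hits_per_hour = kills_per_hour / share_of_thumb_hits

print("requests/hour:   %d" % round(implied_requests_per_hour))
print("thumb hits/hour: %d" % round(implied_thumb_hits_per_hour))
```

That works out to roughly 243,000 requests and 14,000 thumb hits per hour.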

I've now turned on DEBUG logging and am waiting for one of those to happen.

First instance of it seems to be while trying to create a huge PDF thumbnail:

Sep 14 12:01:56 thumbor1001 thumbor@8827[23855]: [2016-09-14 12:01:56,289 - DEBUG - images] [ImagesHandler] translate: {'lang': None, 'end': u'\u041c\u0435\u0442\u0440\u0438\u0447\u043d\u0430_\u043a\u043d\u0438\u0433\u0430_\u0441\u0435\u043b\u0430_\u0412\u043e\u0441\u043a\u0440\u0435\u0441\u0435\u043d\u0441\u044c\u043a_1842_\u0440\u043e\u043a\u0443.pdf', 'language': u'commons', 'extension': u'pdf', 'format': u'jpg', 'shard2': u'1d', 'filename': u'\u041c\u0435\u0442\u0440\u0438\u0447\u043d\u0430_\u043a\u043d\u0438\u0433\u0430_\u0441\u0435\u043b\u0430_\u0412\u043e\u0441\u043a\u0440\u0435\u0441\u0435\u043d\u0441\u044c\u043a_1842_\u0440\u043e\u043a\u0443', 'project': u'wikipedia', 'width': u'5537', 'lossy': None, 'shard1': u'1', 'seek': None, 'page': u'1', 'qlow': None}

[...]

Sep 14 12:02:49 thumbor1001 thumbor@8827[23855]: [2016-09-14 12:02:49,466 - DEBUG - imagemagick] [IM] resize: 5537.0 8398.0
Sep 14 12:02:50 thumbor1001 thumbor@8827[23855]: Parent pid 23855, child pid 23908
Sep 14 12:02:50 thumbor1001 thumbor@8827[23855]: Parent is shutting down, bye...

So far, that makes sense. Requesting the same thumbnail on mediawiki production: https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/%D0%9C%D0%B5%D1%82%D1%80%D0%B8%D1%87%D0%BD%D0%B0_%D0%BA%D0%BD%D0%B8%D0%B3%D0%B0_%D1%81%D0%B5%D0%BB%D0%B0_%D0%92%D0%BE%D1%81%D0%BA%D1%80%D0%B5%D1%81%D0%B5%D0%BD%D1%81%D1%8C%D0%BA_1842_%D1%80%D0%BE%D0%BA%D1%83.pdf/page1-5537px-%D0%9C%D0%B5%D1%82%D1%80%D0%B8%D1%87%D0%BD%D0%B0_%D0%BA%D0%BD%D0%B8%D0%B3%D0%B0_%D1%81%D0%B5%D0%BB%D0%B0_%D0%92%D0%BE%D1%81%D0%BA%D1%80%D0%B5%D1%81%D0%B5%D0%BD%D1%81%D1%8C%D0%BA_1842_%D1%80%D0%BE%D0%BA%D1%83.pdf.jpg spins for a long time and eventually errors, stating that it's been killed. Presumably for the exact same OOM reasons.

Let's wait for more examples, but it seems to be a legitimate reason to die so far.

However it's a situation where thumbor dying could kill other requests that were pending. That's the downside of using a native library like IM's Wand instead of a subprocess that could die on its own. I'll check if it's possible to have Wand error cleanly when it runs out of memory. This suggests it might be possible to set a limit but doesn't say what happens from Python's perspective when it's hit: http://www.imagemagick.org/discourse-server/viewtopic.php?t=16563

Second example, this time it dies trying to process a 17MB PNG extracted from a PDF:

Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,387 - DEBUG - images] [ImagesHandler] translate: {'lang': None, 'end': u'Thresher_Display.PDF', 'language': u'commons', 'extension': u'PDF', 'format': u'jpg', 'shard2': u'65', 'filename': u'Thresher_Display', 'project': u'wikipedia', 'width': u'120', 'lossy': None, 'shard1': u'6', 'seek': None, 'page': u'1', 'qlow': None}
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,387 - DEBUG - swift] [Swift] get
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,426 - DEBUG - request] [REQUEST_STORAGE] get: http%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.65/6/65/Thresher_Display.PDF
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,427 - DEBUG - request] [REQUEST_STORAGE] found
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,427 - DEBUG - request] [REQUEST_STORAGE] missing
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,581 - DEBUG - proxy] [Proxy] Looking for a pdf engine
Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:28,583 - DEBUG - __init__] [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '60', '/usr/bin/gs', '-sDEVICE=png16m', '-sOutputFile=%stdout', '-dFirstPage=1', '-dLastPage=1', '-r150', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-q', '-f/tmp/tmpunc6I1/source_file']
Sep 14 12:23:53 thumbor1001 thumbor@8821[76424]: [2016-09-14 12:23:53,719 - DEBUG - client] RESP HEADERS: {u'content-length': u'0', u'last-modified': u'Wed, 14 Sep 2016 12:23:54 GMT', u'etag': u'8b0d10594bd5d47765d8910b8673160a', u'x-trans-id': u'txf203b86af88242b386639-0057d94159', u'date': u'Wed, 14 Sep 2016 12:23:53 GMT', u'content-type': u'text/html; charset=UTF-8'}
Sep 14 12:24:01 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:24:01,983 - DEBUG - __init__] Stdout: <too long to display (17450065 bytes)>
Sep 14 12:24:01 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:24:01,983 - DEBUG - __init__] Stderr:
Sep 14 12:24:01 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:24:01,983 - DEBUG - __init__] Return code: 0
Sep 14 12:24:01 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:24:01,983 - DEBUG - __init__] Duration: 33399.907
Sep 14 12:24:04 thumbor1001 thumbor@8824[23870]: Parent pid 23870, child pid 23915
Sep 14 12:24:04 thumbor1001 thumbor@8824[23870]: Parent is shutting down, bye...

Also makes sense, although the PNG is an intermediary step. I originally picked PNG for quality purposes, since it gets turned into a JPG right after, but wrangling a PNG that size is very inefficient compared to resizing a huge JPG with the jpeg:size option.
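For reference, jpeg:size is an ImageMagick define that lets the JPEG decoder downscale while reading, so the full-resolution bitmap need not be held in memory. Below is a minimal sketch of an equivalent command line. It assumes the stock `convert` CLI rather than the Wand bindings Thumbor uses, and the helper name and the 2x oversampling factor are illustrative assumptions.

```python
def jpeg_thumbnail_command(source, target, width, oversample=2):
    """Build a convert argv that hints the decoder before reading the file."""
    hint = width * oversample
    return [
        "convert",
        # Settings must come before the input file to affect decoding.
        "-define", "jpeg:size=%dx%d" % (hint, hint),
        source,
        "-thumbnail", "%dx" % width,   # width-only geometry keeps the aspect ratio
        target,
    ]


print(jpeg_thumbnail_command("huge.jpg", "thumb.jpg", 120))
```

The hint is deliberately larger than the target so that the final resize still has extra pixels to work from.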

That being said, production mediawiki is incapable of resizing that PDF either: https://commons.wikimedia.org/wiki/File:Thresher_Display.PDF due to its insane resolution. The idea of using an intermediary JPG may or may not work; we'll have to try it to find out. I'll file that as a separate task.

Actually I now realize that the first one was also a PDF and died for the same reason; it just got slightly further in the IM logging output before it died.

I've found a different kind, this time it's VIPS dying because we ask it for a giant PNG made from a giant TIFF:

Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,233 - DEBUG - images] [ImagesHandler] translate: {'lang': None, 'end': u'Old_Fire_Department_Building,_South_Pine_Street,_Mount_Holly,_Burlington_County,_NJ_HABS_NJ,3-MOUHO,9-_(sheet_0_of_2).tif', 'language': u'commons', 'extension': u'tif', 'format': u'jpg', 'shard2': u'6f', 'filename': u'Old_Fire_Department_Building,_South_Pine_Street,_Mount_Holly,_Burlington_County,_NJ_HABS_NJ,3-MOUHO,9-_(sheet_0_of_2)', 'project': u'wikipedia', 'width': u'9312', 'lossy': u'lossy-', 'shard1': u'6', 'seek': None, 'page': u'1', 'qlow': None}
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,234 - DEBUG - swift] [Swift] get
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,297 - DEBUG - request] [REQUEST_STORAGE] get: http%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.6f/6/6f/Old_Fire_Department_Building%2C_South_Pine_Street%2C_Mount_Holly%2C_Burlington_County%2C_NJ_HABS_NJ%2C3-MOUHO%2C9-_%28sheet_0_of_2%29.tif
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,297 - DEBUG - request] [REQUEST_STORAGE] found
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,297 - DEBUG - request] [REQUEST_STORAGE] missing
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,400 - DEBUG - proxy] [Proxy] Looking for a tiff engine
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,400 - DEBUG - __init__] [ExiftoolRunner] command: ['/usr/bin/exiftool', '-ImageSize', '-s', '-s', '-s', '/tmp/tmplpRtmY']
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,400 - DEBUG - __init__] [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '60', '/usr/bin/exiftool', '-ImageSize', '-s', '-s', '-s', '/tmp/tmplpRtmY']
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,532 - DEBUG - __init__] Stdout: 9312x7584
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,532 - DEBUG - __init__] Stderr:
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,532 - DEBUG - __init__] Return code: 0
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,532 - DEBUG - __init__] Duration: 131.441
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,533 - DEBUG - vips] [VIPS] Shrinking with command
Sep 14 12:39:15 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:15,534 - DEBUG - __init__] [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '60', '/usr/bin/vips', 'shrink', '/tmp/tmplgowi_/source_file[page=0]', '/tmp/tmplgowi_/vips_result.png', '1', '1']
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:17,008 - DEBUG - __init__] Stdout:
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:17,008 - DEBUG - __init__] Stderr:
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:17,008 - DEBUG - __init__] Return code: 0
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: [2016-09-14 12:39:17,009 - DEBUG - __init__] Duration: 1474.358
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: Parent pid 71498, child pid 71501
Sep 14 12:39:17 thumbor1001 thumbor@8834[71498]: Parent is shutting down, bye...

This time, though, mediawiki fares better: https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Old_Fire_Department_Building%2C_South_Pine_Street%2C_Mount_Holly%2C_Burlington_County%2C_NJ_HABS_NJ%2C3-MOUHO%2C9-_(sheet_0_of_2).tif/lossy-page1-9312px-Old_Fire_Department_Building%2C_South_Pine_Street%2C_Mount_Holly%2C_Burlington_County%2C_NJ_HABS_NJ%2C3-MOUHO%2C9-_(sheet_0_of_2).tif.jpg

Looking at the Thumbor code, it seems like the mistake is that the VIPS engine generates a PNG no matter what, even though for TIFFs the final output is a JPG. This intermediary PNG wastes memory; we should have VIPS generate a JPG directly and not process it in any way subsequently. I'll file that as a task.
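The change proposed here amounts to picking the VIPS output file's extension from the final format, since the vips CLI selects its saver from the file suffix. A minimal sketch, modelled on the ShellRunner command in the log above; the function name and parameters are hypothetical and this is not the actual Thumbor plugin code.

```python
import os


def vips_shrink_command(workdir, final_format, page=0, timeout=60):
    """Build a vips shrink argv whose output matches the final format."""
    # Write a JPG directly when that is the final output; otherwise keep PNG.
    extension = "jpg" if final_format in ("jpg", "jpeg") else "png"
    source = os.path.join(workdir, "source_file[page=%d]" % page)
    result = os.path.join(workdir, "vips_result.%s" % extension)
    return [
        "/usr/bin/timeout", "--foreground", str(timeout),
        "/usr/bin/vips", "shrink", source, result, "1", "1",
    ]


print(vips_shrink_command("/tmp/tmplgowi_", "jpg"))
```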

Another find was an OOM on a large JPG (45MB) with a target width of 0, which will be caught early as an invalid request in the future thanks to T145614.

Change 310539 had a related patch set uploaded (by Filippo Giunchedi):
thumbor: increase icinga retries for service units

https://gerrit.wikimedia.org/r/310539

Trying to resize a 53MB JPG OOMs in a predictable fashion, and also fails in production: https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/1857_Bird%27s_Eye_View_of_Chicago_-_LOC.jpg/120px-1857_Bird%27s_Eye_View_of_Chicago_-_LOC.jpg

Nothing to improve there, we're already using jpeg:size correctly. I guess even that technique has a limit in how much memory it can save.

I'll turn DEBUG logging back off. It seems like most, if not all OOMs are legit, and that I've found the areas I can improve. I.e. avoiding intermediary PNGs: T145637 T145638

Change 310539 merged by Filippo Giunchedi:
thumbor: increase icinga retries for service units

https://gerrit.wikimedia.org/r/310539

fgiunchedi renamed this task from thumbor cgroup OOM to thumbor memory limits for main process and subprocesses. Sep 15 2016, 3:00 PM
fgiunchedi updated the task description.

@ori suggested looking into tmpreaper, which might be useful in general to avoid ancient tmp files lying around, now that we'll be using them more than we originally expected.

@fgiunchedi would there be an easy way to track those OOM kills as a metric? I can't look at syslog myself and check if my latest round of updates improved the situation.

@Gilles good question, I don't think we have a good way to pull metrics from logs yet. I was meaning to try https://github.com/google/mtail though, so this might be a good occasion.

That looks like it could do the job, I'll try it out and see if it can be easily backported.
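Until a log-to-metrics tool is in place, a rough per-hour count can be pulled out of syslog with a few lines of Python. This is a minimal sketch and not mtail: it assumes the standard syslog timestamp prefix shown in this task and reuses the 'Kill process' string from the fgrep in the description. The function name is hypothetical.

```python
from collections import Counter


def oom_kills_per_hour(lines, marker="Kill process"):
    """Count lines containing marker, bucketed by the syslog hour prefix."""
    counts = Counter()
    for line in lines:
        if marker in line:
            # "Sep 14 06:25:42 ..." -> "Sep 14 06" (month, day, hour).
            counts[line[:9]] += 1
    return counts
```

It can be fed any iterable of lines, for example an open file handle on /var/log/syslog, and the resulting counts compared before and after a deployment.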

This comment was removed by Gilles.

OK, now that I've implemented this in a different way, I've figured out why cgexec wasn't working. Firejail blocks access to /sys/fs which means that the cgroup VFS, and by extension any cgroup operations, is blocked inside a firejailed process. I couldn't find an option to turn that off, so I filed an issue on github: https://github.com/netblue30/firejail/issues/862

In the meantime, is there an alternative to firejail I could look into?

The blocking of /sys/fs is currently hard-coded in the fs_proc_sys_dev_boot() function. We could temporarily run a patched build until this is properly sorted out upstream. There are a few other solutions (e.g. minijail), but we'll run into other limitations and it's useful to run a uniform tool across various services.
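Whether a given sandbox configuration actually exposes the cgroup filesystem can be checked from inside the sandboxed process. The following is a minimal sketch under the assumption that the target is a cgroup v1 tasks file such as the one used elsewhere in this task; the function name and the returned labels are made up for illustration.

```python
import errno
import os


def probe_tasks_file(path):
    """Report whether a cgroup tasks file can be opened for writing."""
    try:
        # No O_CREAT: the probe must never create the file it is testing.
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "missing"    # path hidden or cgroup not mounted
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "denied"     # path visible but blocked
        raise
    os.close(fd)
    return "writable"


print(probe_tasks_file("/sys/fs/cgroup/memory/thumbor-subprocesses/tasks"))
```

Running it once outside and once inside firejail shows quickly whether a patched build or new option changes anything.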

Is this only needed for setting memory limits on externally spawned processes or also for anything else?

Change 316305 had a related patch set uploaded (by Gilles):
Upgrade to 0.1.28

https://gerrit.wikimedia.org/r/316305

Is this only needed for setting memory limits on externally spawned processes or also for anything else?

Yes, it's only for setting the memory limits of the spawned subprocesses.

An alternative is to use cgrulesengd with a configuration for each program used as a subprocess. But my initial attempt to use the existing puppet classes for cgrulesengd (currently only used on trusty) failed. It complains about cgconfig.conf not being present, and when I attempt to generate one based on the current cgroup config with cgsnapshot, it complains about a syntax error in the generated file. I gave up at that point: the whole thing seemed messy, with a config file seemingly having to map the existing cgroups, and it looked like it would be fragile.

It's not the end of the world for the initial launch if thumbor dies every time it OOMs because it has the same limit as its subprocesses. Nginx will retry on other instances. Maybe we can wait to see what the firejail author says? He seems quite responsive on other issues filed by folks, often implementing requested options.

Agreed, let's wait for firejail upstream to comment, they're fairly responsive in general.

Change 316305 merged by Filippo Giunchedi:
Upgrade to 0.1.28

https://gerrit.wikimedia.org/r/316305

I can build firejail test packages with the cherrypicked patch tomorrow

I've just realized that this can't work for another reason: the pids are namespaced under firejail, so there's no way to add the correct pid to the cgroup tasks file :(

gilles@deployment-imagescaler01:~$ cat /usr/bin/getpid.py
import os
import subprocess

def preexec():
  pid = os.getpid()
  print pid
  f = open('/sys/fs/cgroup/memory/thumbor-subprocesses/tasks','a+')
  f.write('%s\n' % pid)
  f.close()

p = subprocess.Popen(['/usr/bin/whoami'], preexec_fn=preexec)
p.wait()
gilles@deployment-imagescaler01:~$ sudo -u thumbor python /usr/bin/getpid.py 
20014
thumbor
gilles@deployment-imagescaler01:~$ sudo -u thumbor firejail -- python /usr/bin/getpid.py 
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-programs.inc
Reading profile /etc/firejail/disable-passwdmgr.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 20022, child pid 20023
Child process initialized
3
Traceback (most recent call last):
  File "/usr/bin/getpid.py", line 11, in <module>
    p = subprocess.Popen(['/usr/bin/whoami'], preexec_fn=preexec)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
IOError: [Errno 13] Permission denied: '/sys/fs/cgroup/memory/thumbor-subprocesses/tasks'

Parent is shutting down, bye...
gilles@deployment-imagescaler01:~$

Well, the above was blocked by the sys/fs thing, but since pid is shown as "3" it obviously doesn't work any better with the new option. Which doesn't quite work anyway:

gilles@deployment-imagescaler01:~$ sudo -u thumbor firejail --noblacklist=/sys/fs -- python /usr/bin/getpid.py 
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-programs.inc
Reading profile /etc/firejail/disable-passwdmgr.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 20065, child pid 20066
Child process initialized
3
Traceback (most recent call last):
  File "/usr/bin/getpid.py", line 11, in <module>
    p = subprocess.Popen(['/usr/bin/whoami'], preexec_fn=preexec)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
IOError: [Errno 2] No such file or directory: '/sys/fs/cgroup/memory/thumbor-subprocesses/tasks'

Parent is shutting down, bye...

gilles@deployment-imagescaler01:~$ sudo -u thumbor firejail --noblacklist=/sys/fs -- ls -al /sys/fs/cgroup/
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-programs.inc
Reading profile /etc/firejail/disable-passwdmgr.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 20155, child pid 20156
Child process initialized
total 0
dr-xr-xr-x 2 nobody nogroup 0 Oct 11 09:35 .
drwxr-xr-x 6 nobody nogroup 0 Oct 11 09:35 ..

Parent is shutting down, bye...
gilles@deployment-imagescaler01:~$

I think that the namespaced pid is the real dealbreaker here. I don't think it's possible to assign the subprocesses to a different cgroup under firejail. I guess security trumps everything else and we'll have to live with thumbor dying regularly on offending files.
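For readers who want to reproduce the experiment on a current interpreter, here is a Python 3 sketch of the same preexec approach as getpid.py, with the tasks path passed in. It is illustrative only: the helper names are hypothetical, and it does nothing to work around the sandbox restrictions described in this thread.

```python
import os
import subprocess
import sys


def cgroup_joiner(tasks_path):
    """Build a preexec_fn that appends the child's pid to tasks_path."""
    def join():
        # Runs in the forked child just before exec. Under firejail,
        # os.getpid() is the namespaced pid (e.g. 3), not the host pid.
        with open(tasks_path, "a") as tasks:
            tasks.write("%d\n" % os.getpid())
    return join


def run_in_cgroup(argv, tasks_path):
    """Run argv after asking the child to join the cgroup; return its exit code."""
    proc = subprocess.Popen(argv, preexec_fn=cgroup_joiner(tasks_path))
    return proc.wait()


if __name__ == "__main__":
    # Usage: python3 join_cgroup.py <tasks file> <command> [args...]
    sys.exit(run_in_cgroup(sys.argv[2:], sys.argv[1]))
```

As with the original script, a failure to open the tasks file surfaces as an exception raised from Popen in the parent.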

For completeness I also tried cgexec, which doesn't fare any better, probably because the noblacklist option doesn't quite work:

gilles@deployment-imagescaler01:~$ sudo -u thumbor firejail --noblacklist=/sys/fs -- cgexec -g memory:thumbor-subprocesses whoami
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-programs.inc
Reading profile /etc/firejail/disable-passwdmgr.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 22856, child pid 22857
Child process initialized
libcgroup initialization failed: Cgroup is not mounted

Parent is shutting down, bye...
gilles@deployment-imagescaler01:~$

Since cgexec forks, though, I'm pretty sure that it will try to do just what my Python code did and fail anyway because of the namespaced pid:

22938 pts/0    S+     0:00 sudo -u thumbor cgexec -g memory:thumbor-subprocesses sleep 30
22939 pts/0    S+     0:00 sleep 30

I don't really see any way out of this problem.

It seems like the only viable solution is to get cgrulesengd to work, with rules for the expected processes.

Gilles moved this task from Backlog to Doing on the Thumbor board.

Actually I've just realized that cgrulesengd cannot work either, because it would assign all instances of a given command to the same cgroup, when they actually each need to be in a cgroup of their own.

Which is the case for the thumbor processes:

CGroup: /system.slice/system-thumbor.slice/thumbor@8802.service

I'm not aware of cgrulesengd being able to create a unique cgroup for each subprocess.

I believe we're out of options and that firejail prevents us from having a different limit for subprocesses because of the pid namespacing.