Page MenuHomePhabricator

thumbor imagemagick filling up /tmp on thumbor1002
Closed, ResolvedPublic

Description

Looks like thumbor/imagemagick was writing to a temporary file (/tmp/magick-23f_-WTiCz0G3) with no end in sight, filling up /tmp

root@thumbor1002:/tmp# du -hcs *
36G     systemd-private-c1b8c8e934cd4a0d9662ff263e893d93-thumbor@8834.service-RGT6hN
37G     total

thumbor   45099 86.5  0.2 3130872 196412 ?      Dl   12:16 165:34      \_ /usr/bin/python /usr/bin/thumbor --port 8834 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/
thumbor  132107  0.0  0.0  10076   828 ?        S    12:57   0:00          \_ /usr/bin/timeout --foreground 60 /usr/bin/ffmpeg -ss 41 -i http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.ab/a/ab/Lipka_igrashok_01.webm -y -vframes 1 -an -f image2 -nostats -loglevel error -
thumbor  132108  0.0  0.1 1980908 122704 ?      Sl   12:57   0:00          |   \_ /usr/bin/ffmpeg -ss 41 -i http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.ab/a/ab/Lipka_igrashok_01.webm -y -vframes 1 -an -f image2 -nostats -loglevel error -
thumbor   38150  0.0  0.0  10076   832 ?        S    13:19   0:00          \_ /usr/bin/timeout --foreground 60 /usr/bin/ffmpeg -ss 44 -i http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.15/1/15/A-calibrated-agent-based-computer-model-of-stochastic-cell-dynamics-in-normal-human-colon-crypts-1742-4682-10-66-S2.ogv -y -vframes 1 -an -f image2 -nostats -loglevel error -

strace and lsof, note the 8 bytes write

[pid 46324] pwrite(18, "=(E._E\0\0", 8, 3257092011952) = 8
[pid 46324] pwrite(18, "\311#\n.:I\0\0", 8, 3277576465912) = 8
[pid 46324] pwrite(18, "\302\34t(CD\0\0", 8, 3298060919872) = 8
[pid 46324] pwrite(18, "\353\31\234%kA\0\0", 8, 3318545373832) = 8
[pid 46324] pwrite(18, "\272\37\24,\222F\0\0", 8, 3339029827792) = 8
[pid 46324] pwrite(18, "\321&\0221BL\0\0", 8, 3359514281752) = 8
[pid 46324] pwrite(18, "a#\254.\267F\0\0", 8, 3379998735712) = 8
[pid 46324] pwrite(18, "}\33\277&\345>\0\0", 8, 3400483189672) = 8
[pid 46324] pwrite(18, "F\31\247%\270;\0\0", 8, 3420967643632) = 8
[pid 46324] pwrite(18, "6!\25+\242A\0\0", 8, 3441452097592) = 8
[pid 46324] pwrite(18, "\r'U1\247F\0\0", 8, 3461936551552) = 8
[pid 46324] pwrite(18, "\210(\3172\"H\0\0", 8, 3482421005512) = 8
[pid 46324] pwrite(18, "F)\2153\340H\0\0", 8, 3502905459472) = 8
[pid 46324] pwrite(18, "#.k8\177L\0\0", 8, 3523389913432) = 8
[pid 46324] pwrite(18, "\0054L>`R\0\0", 8, 3543874367392^C) = 8
Process 45099 detached
Process 46324 detached
Process 83392 detached
Process 120899 detached
root@thumbor1002:/tmp# ^C
thumbor 45099 thumbor    0r   CHR                1,3            0t0      1028 /dev/null
thumbor 45099 thumbor    1u  unix 0xffff880854d227c0            0t0 121825146 socket
thumbor 45099 thumbor    2u  unix 0xffff880854d227c0            0t0 121825146 socket
thumbor 45099 thumbor    3u  IPv4          121839348            0t0       TCP localhost:8834 (LISTEN)
thumbor 45099 thumbor    4u  0000               0,11              0      9224 anon_inode
thumbor 45099 thumbor    5r  FIFO               0,10            0t0 121839349 pipe
thumbor 45099 thumbor    6w  FIFO               0,10            0t0 121839349 pipe
thumbor 45099 thumbor    7u  IPv4          123901036            0t0       TCP localhost:8834->localhost:57018 (CLOSE_WAIT)
thumbor 45099 thumbor    8u  IPv4          123863415            0t0       TCP thumbor1002.eqiad.wmnet:36504->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor    9u  IPv4          123863417            0t0       TCP thumbor1002.eqiad.wmnet:36506->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   10u  IPv4          121926421            0t0       UDP *:34547 
thumbor 45099 thumbor   11r   CHR                1,9            0t0 121897705 /dev/urandom
thumbor 45099 thumbor   12u  IPv4          123872482            0t0       TCP thumbor1002.eqiad.wmnet:43619->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   13u  IPv4          123870612            0t0       TCP thumbor1002.eqiad.wmnet:43624->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   14u  IPv4          123888258            0t0       TCP thumbor1002.eqiad.wmnet:49946->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   15u  IPv4          123036569            0t0       TCP thumbor1002.eqiad.wmnet:56903->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   16r  FIFO               0,10            0t0 123122801 pipe
thumbor 45099 thumbor   17u  IPv4          123901040            0t0       TCP thumbor1002.eqiad.wmnet:57034->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   18u   REG                9,0 12290672376000   1053876 /tmp/magick-23f_-WTiCz0G3
thumbor 45099 thumbor   19r  FIFO               0,10            0t0 123117518 pipe
thumbor 45099 thumbor   23r  FIFO               0,10            0t0 122645743 pipe
thumbor 45099 thumbor   27r  FIFO               0,10            0t0 123310690 pipe
thumbor 45099 thumbor   29u  IPv4          123748627            0t0       TCP thumbor1002.eqiad.wmnet:60138->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   30r  FIFO               0,10            0t0 123126956 pipe
thumbor 45099 thumbor   31r  FIFO               0,10            0t0 123730903 pipe
thumbor 45099 thumbor   35r  FIFO               0,10            0t0 122315707 pipe
thumbor 45099 thumbor   38u  IPv4          123851742            0t0       TCP thumbor1002.eqiad.wmnet:59000->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   40r  FIFO               0,10            0t0 122870280 pipe
thumbor 45099 thumbor   43u  IPv4          123888256            0t0       TCP thumbor1002.eqiad.wmnet:49943->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)
thumbor 45099 thumbor   45r  FIFO               0,10            0t0 123089727 pipe
thumbor 45099 thumbor   47u  IPv4          123901038            0t0       TCP thumbor1002.eqiad.wmnet:57032->ms-fe.svc.eqiad.wmnet:http (ESTABLISHED)

Event Timeline

That sucks, it's not a temp file I deliberately create, it seems to be something ffmpeg creates on its own...

Why didn't "timeout" kill it after a minute of doing whatever it was doing? Do you know for certain if this was going on for more than a minute?

Looking at the error mediawiki returns when trying to render Lipka_igrashok_01.webm, the symptoms are identical to the case found in T145612

The ogv one works fine in mediawiki though. Do you know which of these commands was causing the issue?

@Gilles no I don't know if the processes were going on for more than a minute, though I'm not sure those ffmpeg are the root cause.

Judging by thumbor1002 dashboard grafana the heavy disk writing started at around 14:10

This should be easier to figure out if it reoccurs with manhole in place

@fgiunchedi for this situation as well, it would be easier for me to debug if I was able to run 'du' on the temp folders, but my user can't.

Gilles claimed this task.

Reopening, I've seen a reoccurence today on thumbor1002 where /tmp was going out of disk space. Thumbor recovered by itself though and the tmpfile got cleaned. ATM there's another instance though:

root@thumbor1002:/tmp# lsof -p 84950 | grep -i tmp
thumbor 84950 thumbor    4u  unix 0xffff8810513e9000          0t0 264606872 /srv/thumbor/tmp/thumbor@8827/manhole-8827
thumbor 84950 thumbor  223u   REG                9,0 320640320000    924259 /tmp/magick-2cR4EE91oAnab

root@thumbor1002:/tmp# cat /proc/84950/environ 
SHELL=/bin/shTMPDIR=/srv/thumbor/tmp/thumbor@8827USER=thumborPULSE_CLIENTCONFIG=/run/firejail/mnt/pulse/client.confQT_X11_NO_MITSHM=1PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binPWD=/LANG=en_US.UTF-8SHLVL=1HOME=/var/lib/thumborLOGNAME=thumborPROMPT_COMMAND=export PS1="\[\e[1;32m\][\u@\h \W]\$\[\e[0m\] "container=firejail_=/usr/bin/thumborroot@thumbor1002:/tmp# 

root@thumbor1002:/tmp# du -hcs *8827*
836M	systemd-private-14e6859f69824c18aeeea4c7f02624e4-thumbor@8827.service-shljgt

via manhole

root@thumbor1002:~# sudo -u thumbor socat - unix-connect:/srv/thumbor/tmp/thumbor@8827/manhole-8827

######### ProcessID=2, ThreadID=139964673640192 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 225, in run
  self.handle(self.client, self.locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 268, in handle
  run_repl(locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 313, in run_repl
  dump_stacktraces()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 578, in dump_stacktraces
  for filename, lineno, name, line in traceback.extract_stack(stack):

######### ProcessID=2, ThreadID=139965036824320 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 198, in run
  client.join()
File: "/usr/lib/python2.7/threading.py", line 949, in join
  self.__block.wait()
File: "/usr/lib/python2.7/threading.py", line 340, in wait
  waiter.acquire()

######### ProcessID=2, ThreadID=139965205939968 #########
File: "/usr/bin/thumbor", line 9, in <module>
  load_entry_point('thumbor==6.0.1', 'console_scripts', 'thumbor')()
File: "/usr/lib/python2.7/dist-packages/thumbor/server.py", line 134, in main
  tornado.ioloop.IOLoop.instance().start()
File: "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 810, in start
  self._run_callback(callback)
File: "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
  ret = callback()
File: "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 314, in wrapped
  ret = fn(*args, **kwargs)
File: "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 963, in <lambda>
  self.future, lambda f: self.run())
File: "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 879, in run
  yielded = self.gen.send(value)
File: "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 180, in get_image
  self.filters_runner.apply_filters(thumbor.filters.PHASE_AFTER_LOAD, transform)
File: "/usr/lib/python2.7/dist-packages/thumbor/filters/__init__.py", line 81, in apply_filters
  callback()
File: "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 178, in transform
  self.context.transformer.transform(after_transform_cb)
File: "/usr/lib/python2.7/dist-packages/thumbor/transformer.py", line 102, in transform
  self.smart_detect()
File: "/usr/lib/python2.7/dist-packages/thumbor/transformer.py", line 137, in smart_detect
  self.do_image_operations()
File: "/usr/lib/python2.7/dist-packages/thumbor/transformer.py", line 230, in do_image_operations
  callback=inner
File: "/usr/lib/python2.7/dist-packages/thumbor/context.py", line 276, in queue
  self._execute_in_foreground(operation, callback)
File: "/usr/lib/python2.7/dist-packages/thumbor/context.py", line 263, in _execute_in_foreground
  result.set_result(operation())
File: "/usr/lib/python2.7/dist-packages/thumbor/transformer.py", line 214, in img_operation_worker
  self.resize()
File: "/usr/lib/python2.7/dist-packages/thumbor/transformer.py", line 311, in resize
  self.engine.resize(self.target_width or 1, self.target_height or 1)  # avoiding 0px images
File: "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/proxy/proxy.py", line 172, in resize
  return self.__getattr__('resize')(width, height)
File: "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/imagemagick/imagemagick.py", line 302, in resize
  self.image.resize(width=int(width), height=int(height))
File: "/usr/lib/python2.7/dist-packages/wand/image.py", line 419, in wrapped
  result = function(self, *args, **kwargs)
File: "/usr/lib/python2.7/dist-packages/wand/image.py", line 1239, in resize
  filter, blur)
#############################################


Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(ManholeConsole)

cgroups don't offer any facility for disk quotas (the blkio subsystem can only be used to throttle the rate of I/O). But if you can call setrlimit() (or use ulimit), then you can use RLIMIT_FSIZE to specify the maximum size of files that the process may create.

Just like the previous time, it seems to be IM creating this temp file and going crazy. I agree that ideally we would cap the disk usage and thumbor would fail more gracefully.

In this case, though, it's a call to Wand inside the IM engine, so not a subprocess. We'll probably have to wrap thumbor itself in a ulimit call if possible.

It would be nice to find out what image caused this, because what happens to thumbor with an rlimit in that situation is a bit hard to predict.

Ah, turns out that IM has a number of limit mechanisms: http://www.imagemagick.org/script/command-line-options.php#limit

This is probably worth investigating first, as it will probably give a more graceful failure.

The defaults are terrible: unlimited time and unlimited disk...

Actually I think that the systemd limit is undesirable as the limit would be over the lifetime of the Thumbor process. I.e. it would cause Thumbor processes that have been running for a while and happened to write X GB worth of temp files over a long period of time to be killed.

I think this means that I'm going to need to rewrite the IM engine to use a subprocess, for which we'll be able to set a resource limit.

RLIMIT_FSIZE

Actually I think that the systemd limit is undesirable as the limit would be over the lifetime of the Thumbor process. I.e. it would cause Thumbor processes that have been running for a while and happened to write X GB worth of temp files over a long period of time to be killed.

You can call setrlimit( RLIMIT_FSIZE, ... ) from Python to set the limit for the thumbor process, and continue to use wand. RLIMIT_FSIZE is a limit on the maximum size of any one file, so it does not accumulate over the lifetime of the process, assuming different jobs use distinct temporary files.

I think I'll go read the sytemd code and/or write a test app creating junk files. The systemd docs mentioned the feature is exactly the same as setrlimit. It feels safer to me to set that at the systemd level if possible.

I'll give the Wand thing one last shot, maybe I wasn't calling the function at the right time. I figure that IM might be able to work differently with a given limit, whereas if the limit is imposed from outside it just dies when trying to create a huge file. I'll test that too, since a simple PNG decompression bug allows me to reproduce the conditions encountered here.

The systemd docs mentioned the feature is exactly the same as setrlimit. It feels safer to me to set that at the systemd level if possible.

Yes, this seems correct:

https://github.com/systemd/systemd/blob/dd8352659c9428b196706d04399eec106a8917ed/src/basic/rlimit-util.c#L304

Change 319802 had a related patch set uploaded (by Gilles):
Prevent Thumbor from creating files bigger than 1GB

https://gerrit.wikimedia.org/r/319802

So, IM isn't being smart or anything, setting a limit makes it bail and error instead of trying to work within that limit:

gilles@ubuntu:~/Documents/Mediawiki$ convert  -limit disk 1GB thumbor-plugins/tests/integration/originals/spark.png -thumbnail 100x foo.png
convert: DistributedPixelCache '127.0.0.1' @ error/distribute-cache.c/ConnectPixelCacheServer/217.
convert: cache resources exhausted `thumbor-plugins/tests/integration/originals/spark.png' @ error/cache.c/OpenPixelCache/3669.
convert: Memory allocation failed `thumbor-plugins/tests/integration/originals/spark.png' @ error/png.c/MagickPNGErrorHandler/1645.
convert: corrupt image `thumbor-plugins/tests/integration/originals/spark.png' @ error/png.c/ReadPNGImage/3978.
convert: no images defined `foo.png' @ error/convert.c/ConvertImageCommand/3210.

Making the limit work at the IM level would just let us avoid a Thumbor process death.

IM limits are worth pursuing regardless for the time limit, which the IM engine is currently exempt of. I'll file another task for that.

Change 319807 had a related patch set uploaded (by Gilles):
Set environment variables for ImageMagick running inside Thumbor

https://gerrit.wikimedia.org/r/319807

Change 319802 merged by Filippo Giunchedi:
Prevent Thumbor from creating files bigger than 1GB

https://gerrit.wikimedia.org/r/319802

Change 320216 had a related patch set uploaded (by Gilles):
Define Thumbor file size rlimit in firejail, not systemd

https://gerrit.wikimedia.org/r/320216

Change 320216 merged by Filippo Giunchedi:
Define Thumbor file size rlimit in firejail, not systemd

https://gerrit.wikimedia.org/r/320216

Change 319807 merged by Filippo Giunchedi:
Set environment variables for ImageMagick running inside Thumbor

https://gerrit.wikimedia.org/r/319807

Change 320636 had a related patch set uploaded (by Filippo Giunchedi):
thumbor: revert MAGICK_TIME_LIMIT=60

https://gerrit.wikimedia.org/r/320636

Change 320636 merged by Filippo Giunchedi:
thumbor: revert MAGICK_TIME_LIMIT=60

https://gerrit.wikimedia.org/r/320636

The original disk issue this task was about should be fixed now.