
Figure out a way to live-debug running production thumbor processes
Closed, Resolved (Public)

Description

In situations like T145878 it would be nice to be able to poke into the live Python code of the affected process to figure out what's going on, or at least to find out which thumbnail URL was requested.

I gave attaching gdb to a live process a first try on deployment-imagescaler01 and it doesn't seem to work properly: it complains that it's "Unable to locate python frame" when I ask for py-locals. I couldn't find a solution to that problem. I wonder if it needs more debugging symbols than what's currently installed on deployment-imagescaler01.

An alternative seems to be to pre-wire thumbor with a signal handler (http://blog.devork.be/2009/07/how-to-bring-running-python-program.html). This might be better, as it would run the actual Python debugger, which is probably a better tool for the task than gdb.

Anyway, this requires investigation and experimentation, as it wasn't immediately obvious what the best approach was when looking into T145878.

Event Timeline

I've had good luck in the past with Pyrasite and manhole. manhole is less hacky but it needs to be explicitly loaded by the application in question. Since we own the code for the Thumbor service, I think it could be a good solution.

Thanks @ori! Yeah, manhole seems like a good option. I don't see it packaged for Debian, so we'll need to find a way to get it onto the machine or package it ourselves.

I've packaged manhole: https://github.com/gi11es/thumbor-debian/tree/master/python-manhole

I couldn't get the tests to run. They're rigged to be run as a bunch of venvs inside tox, which pybuild didn't seem to be happy with (and which requires downloading things from pip during the package build). Running the test classes directly didn't work either. Anyway, this is kind of an odd module operating in the bowels of Python, so it doesn't surprise me that the tests are hard to run during the package build.

@fgiunchedi can you build it and put it on jessie-wikimedia?

Sigh, firejail prevents manhole from working properly.

Without firejail:

2016-09-28 12:37:51 thumbor:DEBUG Installing manhole
Manhole[1475066271.4709]: Manhole UDS path: /tmp/manhole-19400
Manhole[1475066271.4712]: Waiting for new connection (in pid:19400) ...
Manhole[1475066271.4728]: Patched <built-in function fork> and <built-in function fork>.

With firejail:

2016-09-28 12:38:27 thumbor:DEBUG Installing manhole
Manhole[1475066307.6348]: Manhole UDS path: /tmp/manhole-2
Manhole[1475066307.6351]: Waiting for new connection (in pid:2) ...
Manhole[1475066307.6360]: Patched <built-in function fork> and <built-in function fork>.

The path always ends in "2", which isn't the actual pid of the process, and the socket appears empty: netcatting to it returns immediately with no output.

I'm not sure whether the unix socket is being blocked or whether it's something else, like access to the pid.

It seems to actually work after all, albeit with the namespaced pid 2. I need to compensate for that though, as all thumbor processes will try to create the same /tmp/manhole-2.

Works now, with each thumbor process creating its socket by thumbor port:

vagrant@mediawiki-vagrant:~$ ls -al /tmp/manhole*
srwxr-xr-x 1 thumbor thumbor 0 Sep 28 13:47 /tmp/manhole-8889
srwxr-xr-x 1 thumbor thumbor 0 Sep 28 13:47 /tmp/manhole-8890
srwxr-xr-x 1 thumbor thumbor 0 Sep 28 13:47 /tmp/manhole-8891
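A minimal sketch of how a per-port socket path can be wired up, assuming manhole's documented `socket_path` argument to `manhole.install()`. The helper names and the `base_dir` parameter are hypothetical, and this is not necessarily how the actual change was made.

```python
import os


def manhole_socket_path(port, base_dir="/tmp"):
    """Build a socket path that is unique per thumbor port, e.g. /tmp/manhole-8889."""
    return os.path.join(base_dir, "manhole-%d" % int(port))


def install_manhole(port, base_dir="/tmp"):
    """Install manhole on a port-based socket; returns the path, or None if manhole is missing."""
    path = manhole_socket_path(port, base_dir)
    try:
        import manhole  # third-party module, may not be installed everywhere
    except ImportError:
        return None
    # An explicit socket_path replaces the default /tmp/manhole-<pid> naming,
    # which collides when every process sees itself as pid 2.
    manhole.install(socket_path=path)
    return path
```

Keying on the port works here because each thumbor process listens on its own port, so the port identifies a process just as well as the pid would outside the namespace.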

@fgiunchedi can you build it and put it on jessie-wikimedia?

yep, uploaded now to jessie-wikimedia

looks like this is working in production now:

root@thumbor1001:/tmp/systemd-private-df157af6e95c486c8c2f94c895a96346-thumbor@8838.service-4EmbLf/tmp# su thumbor -c 'socat - unix-connect:manhole-8838'

######### ProcessID=2, ThreadID=140259175155456 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 225, in run
  self.handle(self.client, self.locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 268, in handle
  run_repl(locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 313, in run_repl
  dump_stacktraces()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 578, in dump_stacktraces
  for filename, lineno, name, line in traceback.extract_stack(stack):

######### ProcessID=2, ThreadID=140259284186880 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 198, in run
  client.join()
File: "/usr/lib/python2.7/threading.py", line 949, in join
  self.__block.wait()
File: "/usr/lib/python2.7/threading.py", line 340, in wait
  waiter.acquire()

######### ProcessID=2, ThreadID=140259453302528 #########
File: "/usr/bin/thumbor", line 9, in <module>
  load_entry_point('thumbor==6.0.1', 'console_scripts', 'thumbor')()
File: "/usr/lib/python2.7/dist-packages/thumbor/server.py", line 134, in main
  tornado.ioloop.IOLoop.instance().start()
File: "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 841, in start
  event_pairs = self._impl.poll(poll_timeout)
#############################################


Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(ManholeConsole)
>>>

There are some caveats, though: socat needs to be run as user thumbor. Running it as root won't work, since manhole doesn't see the connection as coming from uid 0 (the peer shows up as UID 65534 instead):

Oct  4 10:39:07 thumbor1001 thumbor@8838[46320]: SuspiciousClient: Can't accept client with PID:0 UID:65534 GID:65534. It doesn't match the current EUID:998 or ROOT.
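For context, this is the kind of peer-credential check a Unix socket server can do on Linux. It is a sketch using the standard SO_PEERCRED socket option, not manhole's actual code, and the function names are made up for the example. A peer whose IDs can't be mapped into the server's namespace is reported with the overflow IDs (65534) and PID 0, which presumably explains the log line above.

```python
import os
import socket
import struct


def peer_credentials(sock):
    """Return (pid, uid, gid) of the process on the other end of a Unix socket (Linux only)."""
    fmt = "3i"  # struct ucred: pid, uid, gid as three C ints
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
    return struct.unpack(fmt, raw)


def is_trusted_peer(sock):
    """Accept only root or a peer running as the same effective user as this process."""
    _pid, uid, _gid = peer_credentials(sock)
    return uid in (0, os.geteuid())
```

The kernel reports these values as seen from the server's side of the socket, so a client can't fake them, but they also won't match uid 0 when the client's root isn't visible as root inside the sandbox.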

Yeah, it had the same limitation when I tried locally.

@fgiunchedi is there any way I could get rights to access those temp folders and the manhole files inside them? Even if I guess the manhole file path, I still get:

gilles@thumbor1001:/tmp$ sudo -u thumbor socat - unix-connect:/tmp/systemd-private-df157af6e95c486c8c2f94c895a96346-thumbor@8840.service-SC1foT/tmp/manhole-8840
2016/10/05 12:33:42 socat[98150] E connect(5, AF=1 "/tmp/systemd-private-df157af6e95c486c8c2f94c895a96346-thumbor@8840.service-SC1foT/tmp/manhole-8840", 100): Permission denied

I now have access to manhole since we moved the temp content to /srv/thumbor/tmp/, which is owned by the thumbor user:

gilles@thumbor1001:/srv/thumbor/tmp/thumbor@8831$ ls -al
total 24
drwxr-xr-x  2 thumbor thumbor 4096 Oct 12 11:39 .
drwxr-xr-x 42 thumbor thumbor 4096 Oct 12 11:36 ..
srwxr-xr-x  1 thumbor thumbor    0 Oct 12 11:16 manhole-8831
-rw-------  1 thumbor thumbor 2901 Oct 12 11:38 tmp4tlcb1
-rw-------  1 thumbor thumbor  560 Oct 12 11:35 tmphfLTD6
-rw-------  1 thumbor thumbor  524 Oct 12 11:34 tmpKRs9iY
-rw-------  1 thumbor thumbor  524 Oct 12 11:38 tmpzRAgtV
gilles@thumbor1001:/srv/thumbor/tmp/thumbor@8831$ sudo -u thumbor socat - unix-connect:manhole-8831 

######### ProcessID=2, ThreadID=139896162645760 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 225, in run
  self.handle(self.client, self.locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 268, in handle
  run_repl(locals)
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 313, in run_repl
  dump_stacktraces()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 578, in dump_stacktraces
  for filename, lineno, name, line in traceback.extract_stack(stack):

######### ProcessID=2, ThreadID=139896618731264 #########
File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/dist-packages/manhole.py", line 198, in run
  client.join()
File: "/usr/lib/python2.7/threading.py", line 949, in join
  self.__block.wait()
File: "/usr/lib/python2.7/threading.py", line 340, in wait
  waiter.acquire()

######### ProcessID=2, ThreadID=139896787846912 #########
File: "/usr/bin/thumbor", line 9, in <module>
  load_entry_point('thumbor==6.0.1', 'console_scripts', 'thumbor')()
File: "/usr/lib/python2.7/dist-packages/thumbor/server.py", line 134, in main
  tornado.ioloop.IOLoop.instance().start()
File: "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 841, in start
  event_pairs = self._impl.poll(poll_timeout)
#############################################


Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(ManholeConsole)
>>>
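The same connection can be scripted instead of using socat, for example to collect the stack dump from several processes. The sketch below only uses the standard library; the function name, timeout and size limit are arbitrary choices for the example, and it has to run as the thumbor user for the reason described earlier.

```python
import socket


def read_manhole_banner(socket_path, timeout=2.0, max_bytes=65536):
    """Connect to a manhole Unix socket and return what it prints on connect.

    Reads until the peer closes, goes quiet for `timeout` seconds,
    or `max_bytes` have been received.
    """
    chunks = []
    total = 0
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.settimeout(timeout)
        client.connect(socket_path)
        while total < max_bytes:
            try:
                data = client.recv(4096)
            except socket.timeout:
                break  # nothing more is coming, e.g. the REPL is waiting at its prompt
            if not data:
                break  # peer closed the connection
            chunks.append(data)
            total += len(data)
    finally:
        client.close()
    return b"".join(chunks).decode("utf-8", "replace")
```

Called with a path like the manhole-8831 socket above, this should return the stack traces and the console banner as one string.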