Every now and then, CropTool locks up and has to be restarted manually. If I'm late restarting it, reports show up on Commons or GitHub. (I haven't yet found a way to restart the server automatically, since all cron jobs are submitted to the grid and I wasn't able to restart the webservice from the grid.)
For some time I believed it was all due to T104799, but then I started doubting that. Then T182070#4305541 brought the issue to my attention again.
Today I asked in #wikimedia-cloud for help looking at the processes before restarting the webservice. Here are the findings of @zhuyifei1999 and @bd808:
Lots of open connections:
# lsof -p 2592 | grep TCP | wc -l
187
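The same check could be scripted so it can be repeated before each restart. The following is only a rough sketch: the function names are made up for illustration, and it assumes `lsof` is on the PATH and that the caller is allowed to inspect the process. It counts lines containing "TCP", just like the `grep TCP | wc -l` pipeline above.

```python
import subprocess


def count_tcp_lines(lsof_output: str) -> int:
    """Count lines mentioning TCP, mirroring `grep TCP | wc -l`."""
    return sum(1 for line in lsof_output.splitlines() if "TCP" in line)


def count_tcp_connections(pid: int) -> int:
    """Run `lsof -p <pid>` and count its TCP lines.

    Hypothetical helper: assumes lsof is installed and that we have
    permission to list the open files of the given process.
    """
    result = subprocess.run(
        ["lsof", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=False,  # lsof exits non-zero if the process is not found
    )
    return count_tcp_lines(result.stdout)
```

Calling `count_tcp_connections(2592)` on the host would correspond to the command above; whether a given count is "too many" is a judgment call that this sketch does not make.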
No CPU usage:
# ps uf -u tools.croptool
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.c+  2592  0.1  0.0  54208  5444 ?        Ss   Jun28   1:39 /usr/sbin/lighttpd -f /var/run/lighttpd/croptool -D
tools.c+  2598  0.0  0.2 340556 20764 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2600  0.2  0.6 647472 51796 ?        Sl   Jun28   2:04  |   \_ /usr/bin/php-cgi
tools.c+  2601  0.2  0.7 656452 60020 ?        Sl   Jun28   2:19  |   \_ /usr/bin/php-cgi
tools.c+  2599  0.0  0.2 340556 20760 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2602  0.4  0.7 655884 62368 ?        Sl   Jun28   4:32      \_ /usr/bin/php-cgi
tools.c+  2603  0.5  0.8 662800 67344 ?        Sl   Jun28   5:35      \_ /usr/bin/php-cgi
Stack trace indicating PHP is blocked in malloc: https://www.irccloud.com/pastebin/Kf3rlR6T/
[17:20:02] <+bd808> the last stack trace I see as a paste from you looks like -- php ran out of memory while trying to create a backtrace for an exception and then tried to start handling that OOM error when it hit the deadlock.
[17:20:41] <+bd808> my guess is that xdebug's tracing is holding a non-reentrant lock
[17:26:27] <zhuyifei1999_> bd808: makes sense. libc itself is holding the lock
[17:28:06] <zhuyifei1999_> so malloc ran out of memory, grid sends php a sigint, php's signal handler gets called and tries to malloc again, non-reentrant
[17:28:44] <zhuyifei1999_> so it just deadlocks on itself
[17:28:54] <+bd808> zhuyifei1999_: yeah, I think we could search the web a bit and find that this is a known problem in php 5.x error handling
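To illustrate the self-deadlock described in the log, here is a small Python analogy, not the actual PHP or libc code path. It assumes a plain `threading.Lock` as a stand-in for the allocator's internal non-reentrant lock, and the function name is made up. The handler's acquire uses a timeout only so the sketch can report the outcome; a real handler calling malloc would simply block forever.

```python
import signal
import threading
import time

# Stand-in for the allocator's internal, non-reentrant lock.
# (An assumption for illustration; the real lock lives inside libc.)
heap_lock = threading.Lock()


def handler_can_take_lock(timeout=0.2):
    """Return whether a signal handler can take a lock the interrupted code holds."""
    outcome = {}

    def on_sigint(signum, frame):
        # The handler needs the same lock that the interrupted code holds.
        got_it = heap_lock.acquire(timeout=timeout)
        outcome["acquired"] = got_it
        if got_it:
            heap_lock.release()

    previous = signal.signal(signal.SIGINT, on_sigint)
    try:
        with heap_lock:  # "malloc" is in progress and holds the lock
            signal.raise_signal(signal.SIGINT)  # the "grid" sends SIGINT
            # Wait until the handler has run while the lock is still held.
            deadline = time.monotonic() + 2.0
            while "acquired" not in outcome and time.monotonic() < deadline:
                time.sleep(0.01)
    finally:
        signal.signal(signal.SIGINT, previous)
    return outcome.get("acquired")
```

In this sketch the handler never gets the lock, because the only code that could release it is the code the handler interrupted. It must be run from the main thread, since Python only allows signal handlers to be installed there.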