
CropTool sometimes locks up and has to be manually restarted
Open, MediumPublic


Every now and then, CropTool locks up and has to be manually restarted. If I'm late restarting it, reports show up on Commons or GitHub. (I haven't yet found a way to automatically restart the server, since all cron jobs are submitted to the grid and I wasn't able to restart the webservice from the grid.)

For some time I believed it was all due to T104799, but then I started to doubt that. Then T182070#4305541 brought the issue to my attention again.

Today, I asked in #wikimedia-cloud for help looking at the processes before restarting the webservice. Here are the findings of @zhuyifei1999 and @bd808:

Lots of open connections:

# lsof -p 2592 | grep TCP | wc -l

No CPU usage:

# ps uf -u tools.croptool
tools.c+  2592  0.1  0.0  54208  5444 ?        Ss   Jun28   1:39 /usr/sbin/lighttpd -f /var/run/lighttpd/croptool -D
tools.c+  2598  0.0  0.2 340556 20764 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2600  0.2  0.6 647472 51796 ?        Sl   Jun28   2:04  |   \_ /usr/bin/php-cgi
tools.c+  2601  0.2  0.7 656452 60020 ?        Sl   Jun28   2:19  |   \_ /usr/bin/php-cgi
tools.c+  2599  0.0  0.2 340556 20760 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2602  0.4  0.7 655884 62368 ?        Sl   Jun28   4:32      \_ /usr/bin/php-cgi
tools.c+  2603  0.5  0.8 662800 67344 ?        Sl   Jun28   5:35      \_ /usr/bin/php-cgi

Stack trace indicating PHP is blocked in malloc:

[17:20:02] <+bd808> the last stack trace I see as a paste from you looks like -- php ran out of memory while trying to create a backtrace for an exception and then tried to start handling that OOM error when it hit the deadlock.
[17:20:41] <+bd808> my guess is that xdebug's tracing is holding a non-reentrant lock
[17:26:27] <zhuyifei1999_> bd808: makes sense. libc itself is holding the lock
[17:28:06] <zhuyifei1999_> so malloc ran out of memory, grid sends php a sigint, php's signal handler gets called and tries to malloc again, non-reentrant
[17:28:44] <zhuyifei1999_> so it just deadlocks on itself
[17:28:54] <+bd808> zhuyifei1999_: yeah, I think we could search the web a bit and find that this is a known problem in php 5.x error handling
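The self-deadlock described above can be illustrated with a minimal analogue (a sketch, not PHP or glibc internals): malloc guards its heap with a non-reentrant lock, so if a signal handler fires while that lock is held and the handler allocates again, the thread blocks on a lock it already holds.

```python
import threading

# Stand-in for malloc's internal, non-reentrant heap lock.
inner_lock = threading.Lock()

inner_lock.acquire()  # "malloc" is mid-allocation, lock held
# A blocking re-acquire here would hang forever (the actual deadlock),
# so probe with a non-blocking acquire to observe it safely:
reentered = inner_lock.acquire(blocking=False)
print("re-entry succeeded:", reentered)  # False: the thread deadlocks on itself

# A reentrant lock would not self-deadlock, but malloc's lock is not one:
rlock = threading.RLock()
rlock.acquire()
print("reentrant re-entry:", rlock.acquire(blocking=False))  # True
```

This is why POSIX restricts signal handlers to async-signal-safe functions, which malloc is not.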

Event Timeline

Lots of open connections:

Probably just lighttpd waiting on php to be available.
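To tell whether those open connections are live or just piling up behind a stuck backend, one can tally the TCP states in the lsof output. A minimal sketch, using made-up sample lines (field layout as lsof prints it; a real run would pipe `lsof -p 2592 | grep TCP` in instead):

```python
import re
from collections import Counter

# Hypothetical sample of lsof TCP lines; the connection state is in parentheses.
sample = """lighttpd 2592 tools.croptool  7u IPv4 123 0t0 TCP *:8000 (LISTEN)
lighttpd 2592 tools.croptool  9u IPv4 124 0t0 TCP 10.0.0.1:8000->10.0.0.2:40001 (ESTABLISHED)
lighttpd 2592 tools.croptool 10u IPv4 125 0t0 TCP 10.0.0.1:8000->10.0.0.3:40002 (CLOSE_WAIT)"""

def tcp_states(lsof_output):
    """Count TCP connection states, e.g. many ESTABLISHED with no progress
    suggests clients queued behind an unresponsive php-cgi pool."""
    return Counter(m.group(1) for m in re.finditer(r"\((\w+)\)", lsof_output))

print(tcp_states(sample))
```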

For the record, the php parent processes are all waiting forever (on a signal), and the child processes other than 2603 are all waiting on an flock possibly held by 2603.
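On Linux, which process holds a given flock can be read out of /proc/locks. A minimal parsing sketch against a made-up sample (a real run would read the file; the PID is the fifth field of each entry):

```python
# Hypothetical /proc/locks content; field order matches the real file:
# index, type, mode, access, pid, device:inode, start, end.
sample = """1: FLOCK  ADVISORY  WRITE 2603 08:01:1234 0 EOF
2: POSIX  ADVISORY  READ  2592 08:01:5678 128 128"""

def flock_holders(text):
    """Return the PIDs holding FLOCK-type locks."""
    holders = []
    for line in text.splitlines():
        parts = line.split()
        if parts[1] == "FLOCK":
            holders.append(int(parts[4]))
    return holders

print(flock_holders(sample))  # [2603]
```

Cross-referencing the holder PID against the blocked children confirms which worker the others are queued behind.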

Is it still stuck? All the php workers seem to be doing accept(0, .

Finding and fixing the problem would be great; automating the restart would be a close second.
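Until the root cause is fixed, a watchdog could automate the restart. A hedged sketch (the health-check URL and the restart command are assumptions, not anything configured for this tool today; on Toolforge the restart would presumably be `webservice restart` run from a context that is allowed to do so):

```python
import subprocess
import urllib.request

def is_healthy(url, timeout=10):
    """Return True if the tool answers HTTP within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except Exception:
        return False

def should_restart(results, threshold=3):
    """Restart only after `threshold` consecutive failed checks,
    to avoid flapping on a single slow response."""
    return len(results) >= threshold and not any(results[-threshold:])

if __name__ == "__main__":
    # Hypothetical endpoint; replace with the tool's real URL.
    checks = [is_healthy("https://tools.wmflabs.org/croptool/")]
    if should_restart([False, False, False]):
        # Assumed Toolforge command, commented out in this sketch:
        # subprocess.run(["webservice", "restart"])
        print("restart")
```

The consecutive-failure threshold matters here: a single timed-out check during a slow crop should not bounce the service.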