Maniphest T198503

CropTool sometimes locks up and has to be manually restarted
Open, Medium, Public

Assigned To

None

Authored By

	Danmichaelo
	Jun 29 2018, 5:02 PM

Description

Every now and then, CropTool locks up and has to be manually restarted. If I'm late restarting it, there are reports on Commons or GitHub. (I haven't yet found a way to restart the server automatically, since all cronjobs are submitted to the grid and I wasn't able to restart the webservice from the grid.)

For some time I believed it was all due to T104799, but then I started doubting that. Then there was T182070#4305541, which brought the issue to my attention again.

Today, I asked #wikimedia-cloud for help looking at the processes before restarting the webservice. Here are the findings of @zhuyifei1999 and @bd808:

Lots of open connections:

# lsof -p 2592 | grep TCP | wc -l
187
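To see whether those connections pile up in one particular state (T104799 is about CLOSE_WAIT), the lsof output could be tallied by TCP state. This is only a minimal sketch: the function name `count_tcp_states` and the sample lines are made up for illustration and were not captured from the affected host.

```python
import re
from collections import Counter

# lsof prints the TCP state in parentheses at the end of each socket line,
# e.g. "... TCP 10.0.0.1:4242->10.0.0.2:51000 (CLOSE_WAIT)".
STATE_RE = re.compile(r"\(([A-Z_0-9]+)\)\s*$")


def count_tcp_states(lsof_output: str) -> Counter:
    """Count TCP sockets per connection state in `lsof -p PID` output."""
    counts = Counter()
    for line in lsof_output.splitlines():
        if " TCP " not in line:
            continue  # skip regular files, pipes, unix sockets, ...
        match = STATE_RE.search(line)
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts


# Hypothetical sample, NOT real output from the CropTool host.
SAMPLE = """\
lighttpd 2592 tools.croptool 7u IPv4 1001 0t0 TCP *:4242 (LISTEN)
lighttpd 2592 tools.croptool 9u IPv4 1002 0t0 TCP 10.0.0.1:4242->10.0.0.2:51000 (CLOSE_WAIT)
lighttpd 2592 tools.croptool 10u IPv4 1003 0t0 TCP 10.0.0.1:4242->10.0.0.2:51001 (CLOSE_WAIT)
lighttpd 2592 tools.croptool 11u IPv4 1004 0t0 TCP 10.0.0.1:4242->10.0.0.2:51002 (ESTABLISHED)
"""

if __name__ == "__main__":
    for state, number in count_tcp_states(SAMPLE).most_common():
        print(state, number)
```

A large CLOSE_WAIT share would point towards T104799, while mostly ESTABLISHED connections would fit lighttpd waiting on a stuck PHP backend.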

No CPU usage:

# ps uf -u tools.croptool
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.c+  2592  0.1  0.0  54208  5444 ?        Ss   Jun28   1:39 /usr/sbin/lighttpd -f /var/run/lighttpd/croptool -D
tools.c+  2598  0.0  0.2 340556 20764 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2600  0.2  0.6 647472 51796 ?        Sl   Jun28   2:04  |   \_ /usr/bin/php-cgi
tools.c+  2601  0.2  0.7 656452 60020 ?        Sl   Jun28   2:19  |   \_ /usr/bin/php-cgi
tools.c+  2599  0.0  0.2 340556 20760 ?        Ss   Jun28   0:00  \_ /usr/bin/php-cgi
tools.c+  2602  0.4  0.7 655884 62368 ?        Sl   Jun28   4:32      \_ /usr/bin/php-cgi
tools.c+  2603  0.5  0.8 662800 67344 ?        Sl   Jun28   5:35      \_ /usr/bin/php-cgi

Stack trace indicating PHP is blocked by malloc: https://www.irccloud.com/pastebin/Kf3rlR6T/

[17:20:02] <+bd808> the last stack trace I see as a paste from you looks like -- php ran out of memory while trying to create a backtrace for an exception and then tried to start handling that OOM error when it hit the deadlock.
[17:20:41] <+bd808> my guess is that xdebug's tracing is holding a non-reentrant lock
[17:26:27] <zhuyifei1999_> bd808: makes sense. libc itself is holding the lock
[17:28:06] <zhuyifei1999_> so malloc ran out of memory, grid sends php a sigint, php's signal handler gets called and tries to malloc again, non-reentrant
[17:28:44] <zhuyifei1999_> so it just deadlocks on itself
[17:28:54] <+bd808> zhuyifei1999_: yeah, I think we could search the web a bit and find that this is a known problem in php 5.x error handling
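The self-deadlock described in the log can be mimicked in a few lines of Python. This is an analogy only, not the PHP or libc code path: `heap_lock` stands in for the allocator's internal non-reentrant lock, and `allocate` and `signal_handler` are hypothetical names. A timeout is used so the demo finishes instead of hanging the way the real process does.

```python
import threading

# Stand-in for the allocator's internal lock. threading.Lock is
# non-reentrant: the thread that holds it cannot acquire it a second time.
heap_lock = threading.Lock()


def signal_handler() -> str:
    """Simulated handler that needs to allocate while the lock is held."""
    # The real handler would block forever here. The timeout exists only
    # so that this sketch terminates.
    acquired = heap_lock.acquire(timeout=0.1)
    if acquired:
        heap_lock.release()
        return "handler allocated"
    return "deadlock"


def allocate(interrupt=None) -> str:
    """Simulated malloc: takes the lock, optionally 'interrupted' inside it."""
    with heap_lock:
        if interrupt is not None:
            # The signal arrives while the lock is still held.
            return interrupt()
        return "allocated"


if __name__ == "__main__":
    print(allocate())                # normal path, lock is free
    print(allocate(signal_handler))  # handler re-enters the locked section
```

This is the general reason signal handlers are expected to avoid functions that are not async-signal-safe, malloc among them.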

Related Objects

Mentioned In: T114401: allow tool users to attach strace to their processes (at least on exec hosts)
T192788: CropTool broken: 502 Bad Gateway
Mentioned Here: T104799: lighttpd does not correctly close connections (CLOSE_WAIT)
T182070: tools-webgrid-lighttpd have ~ 90 procs stuck at 100% CPU time (mostly tools.jembot)

Event Timeline

Danmichaelo created this task. Jun 29 2018, 5:02 PM

Restricted Application added a subscriber: Aklapper. Jun 29 2018, 5:02 PM

zhuyifei1999 updated the task description. Jun 29 2018, 5:08 PM

Lots of open connections:

Probably just lighttpd waiting on php to be available.

For the record, the php parent processes are all waiting forever (on a signal), and the child processes other than 2603 are all waiting on an flock possibly held by 2603.
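As an illustration of why one stuck lock holder stalls every other worker, the sketch below takes an exclusive `flock` on a temporary file and then tries to take it again through a second file handle. It assumes a Linux host, and the helper name `try_lock_while_held` is made up. Which file the php-cgi workers actually lock is not stated in this task, so nothing here refers to it.

```python
import fcntl
import tempfile


def try_lock_while_held() -> bool:
    """Return True if a second exclusive flock succeeds while one is held."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Two independent opens of the same file, like two worker processes.
        holder = open(tmp.name, "w")
        waiter = open(tmp.name, "w")
        try:
            fcntl.flock(holder, fcntl.LOCK_EX)
            try:
                # LOCK_NB makes the call fail immediately instead of blocking.
                # Without it, this call would wait until `holder` unlocks,
                # which never happens if the holder itself is deadlocked.
                fcntl.flock(waiter, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return False
            return True
        finally:
            waiter.close()
            holder.close()


if __name__ == "__main__":
    print("second lock acquired:", try_lock_while_held())
```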

zhuyifei1999 updated the task description. Jun 29 2018, 5:14 PM

bd808 added a project: Tools. Jun 29 2018, 5:29 PM

Framawiki subscribed. Jun 30 2018, 4:13 PM

Danmichaelo mentioned this in T192788: CropTool broken: 502 Bad Gateway. Jun 30 2018, 4:53 PM

Nicolas_Raoul awarded a token. Jul 3 2018, 4:06 AM

Happened again today, but it doesn't seem to be exactly the same issue: https://commons.wikimedia.org/wiki/Commons_talk:CropTool#Croptool_down_again

Jeff_G subscribed. Jul 21 2018, 11:28 PM

And again: https://github.com/danmichaelo/croptool/issues/116

Is it still stuck? All the php workers seem to be doing accept(0, .

No, I restarted it

Finding and fixing the problem would be great; automating the restart would be a close second.
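One possible shape for an automated restart is a small watchdog that probes the tool over HTTP and restarts it after several consecutive failures. The following is a rough sketch under stated assumptions, not a tested solution. The names `is_healthy`, `next_state` and `restart_webservice` are hypothetical, the URL must be supplied by the caller, and it assumes `webservice restart` can be run from wherever the watchdog lives, which the description says did not work from the grid.

```python
import http.client
import subprocess
import time
import urllib.request


def is_healthy(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections and HTTP errors such as 502.
        return False


def next_state(failures: int, healthy: bool, threshold: int = 3) -> tuple[int, bool]:
    """Return (new failure count, whether a restart is due)."""
    if healthy:
        return 0, False
    failures += 1
    if failures >= threshold:
        return 0, True  # reset the counter once a restart is triggered
    return failures, False


def restart_webservice() -> None:
    # ASSUMPTION: the `webservice` command is available and permitted here.
    subprocess.run(["webservice", "restart"], check=True)


def watch(url: str, interval: float = 60.0) -> None:
    """Probe forever, restarting after repeated consecutive failures."""
    failures = 0
    while True:
        failures, restart = next_state(failures, is_healthy(url))
        if restart:
            restart_webservice()
        time.sleep(interval)
```

Requiring several failed probes in a row avoids restarting on a single slow request. The threshold of 3 and the 60 second interval are arbitrary placeholders.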

zhuyifei1999 mentioned this in T114401: allow tool users to attach strace to their processes (at least on exec hosts). Oct 22 2018, 4:22 PM

Aklapper changed the subtype of this task from "Deadline" to "Task". Apr 26 2023, 8:43 AM
