Change Details

Ori noticed that mw1114 was running more than one HHVM process, so I investigated. It turns out that HHVM had spawned child processes:

<source>
$ ps -C hhvm -o pid,ppid,cmd
  PID  PPID CMD
 1027 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
12497     1 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
25988 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
</source>

So those are effectively two child processes of the main process (PID 12497) that upstart launches. Both children are simply waiting on the same futex: futex(0x7fce03707990, FUTEX_WAIT_PRIVATE, 2, NULL,...

Child processes stuck on the same futex looked like a classic case of a fork() race condition, which I recently read about on rachelbythebay: https://rachelbythebay.com/w/2014/08/16/forkenv/

Looking at the stack traces obtained with quickstack, both processes seem like threads gone wild. Since we had no RAM left to get a full stack trace with gdb, I simply killed one of them (also, it's still Sunday morning after all, and we're running thin on memory, as these processes apparently have all of the main process's memory copied over).

This really seems like some kind of bug in HHVM that we should care about, and probably ask our Facebook friends about if we can't figure out more.

The first thing we should probably do is make the Nagios alarm on HHVM processes more restrictive (catching only the --config fastcgi ones, for instance), so that we don't only notice this because someone happened to run "ps".

The stack traces are in /root/children_of_hhvm on mw1114.
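
As a rough illustration of what a stricter check could look for, here is a minimal Python sketch that parses `ps -C hhvm -o pid,ppid,cmd` output and reports HHVM processes whose parent is itself an HHVM process. This is not the existing Nagios check; the names `Proc`, `parse_ps` and `stray_children` are hypothetical, and `SAMPLE` is simply the ps output quoted above.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    cmd: str


def parse_ps(output: str) -> List[Proc]:
    """Parse the output of `ps -C hhvm -o pid,ppid,cmd` into Proc records."""
    procs = []
    for line in output.strip().splitlines():
        fields = line.split(None, 2)  # pid, ppid, rest of the command line
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header row and anything malformed
        procs.append(Proc(int(fields[0]), int(fields[1]), fields[2]))
    return procs


def stray_children(procs: List[Proc]) -> List[Proc]:
    """Return HHVM processes whose parent is another HHVM process."""
    hhvm_pids = {p.pid for p in procs}
    return [p for p in procs if p.ppid in hhvm_pids]


# Sample input: the ps output observed on mw1114.
SAMPLE = """\
  PID  PPID CMD
 1027 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
12497     1 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
25988 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
"""

if __name__ == "__main__":
    for proc in stray_children(parse_ps(SAMPLE)):
        print(f"stray child pid={proc.pid} parent={proc.ppid}")
```

A real check would read live `ps` output instead of `SAMPLE` and would need to decide whether short-lived children are expected before alerting on them.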