Ori noticed mw1114 was running more than one HHVM process, so I investigated.
Turns out the extra processes are children spawned by the main HHVM process:
$ ps -C hhvm -o pid,ppid,cmd
PID PPID CMD
1027 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
12497 1 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
25988 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
So PIDs 1027 and 25988 are effectively child processes of 12497, the
main process that upstart launches.
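
For reference, the parent/child relationship can be read off the ps output mechanically. This is only a minimal sketch: the helper names parse_ps and forked_children are made up for illustration, and the sample text is the output pasted above.

```python
# Sample copied from the `ps -C hhvm -o pid,ppid,cmd` output above.
PS_OUTPUT = """\
PID PPID CMD
1027 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
12497 1 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
25988 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server
"""


def parse_ps(text):
    """Return a list of (pid, ppid, cmd) tuples, skipping the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        pid, ppid, cmd = line.split(None, 2)
        rows.append((int(pid), int(ppid), cmd))
    return rows


def forked_children(rows):
    """Return the PIDs whose parent is itself one of the listed processes."""
    listed = {pid for pid, _, _ in rows}
    return sorted(pid for pid, ppid, _ in rows if ppid in listed)


if __name__ == "__main__":
    # Prints the hhvm PIDs that are parented by another hhvm process.
    print(forked_children(parse_ps(PS_OUTPUT)))
```

On the sample above this reports 1027 and 25988, i.e. everything except the upstart-launched process.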
Both children are simply waiting on the same futex:
futex(0x7fce03707990, FUTEX_WAIT_PRIVATE, 2, NULL,...
Child processes getting stuck on the same futex looked like a classic
case of a fork() race condition, which I recently read about on
rachelbythebay:
https://rachelbythebay.com/w/2014/08/16/forkenv/
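
To illustrate the general failure mode (not what HHVM actually does, which we have not established), here is a minimal sketch of forking while another thread holds a lock. The function name is made up for illustration. The child gets a copy of the locked lock but not the thread that would release it, so it can wait forever; the sketch uses a timeout so it terminates instead of hanging.

```python
import os
import threading


def child_can_take_lock_after_fork(timeout=0.5):
    """Fork while another thread holds a lock; report whether the child got it."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        # Hold the lock until the parent tells us to let go.
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()  # make sure the lock is taken before we fork

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody will ever
        # release the copied lock. Without the timeout this would block.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_take_lock_after_fork())
```

On CPython on Linux I would expect the child to fail to take the lock. A native process without the timeout would sit in a futex wait much like the one above.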
Looking at the stack traces obtained with quickstack, both processes
seem like threads gone wild. We had no RAM left to get a full stack
trace with gdb, so I simply killed one of them (it's still Sunday
morning after all, and we're running thin on memory because these
processes apparently have all of the main process's memory copied
over). This really seems like some kind of bug in HHVM that we should
care about, and probably ask our Facebook friends about if we can't
figure out more.
The first thing we should probably do is make the Nagios alarm on hhvm
processes more restrictive (only catching the --config fastcgi ones,
for instance), so that we don't only see this because you happened to
run "ps".
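
A rough sketch of what such a check could look like, assuming we match on the "--config /etc/hhvm/fcgi.ini" string from the command line and expect exactly one server process. The function name, the marker and the expected count are assumptions for illustration, not the existing Nagios check. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_hhvm_servers(ps_text, marker="--config /etc/hhvm/fcgi.ini", expected=1):
    """Count hhvm processes whose command line contains `marker`.

    `ps_text` is the output of `ps -C hhvm -o pid,ppid,cmd`, header included.
    Returns a (status, message) tuple in Nagios plugin style.
    """
    count = 0
    for line in ps_text.splitlines()[1:]:
        fields = line.split(None, 2)
        # Only count complete rows whose command matches the server config.
        if len(fields) == 3 and marker in fields[2]:
            count += 1
    if count == expected:
        return OK, "OK: %d hhvm server process(es)" % count
    return CRITICAL, "CRITICAL: %d hhvm server process(es), expected %d" % (count, expected)


if __name__ == "__main__":
    sample = (
        "PID PPID CMD\n"
        "1027 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server\n"
        "12497 1 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server\n"
        "25988 12497 /usr/bin/hhvm --config /etc/hhvm/fcgi.ini --mode server\n"
    )
    status, message = check_hhvm_servers(sample)
    print(status, message)
```

Fed the ps output from above, this goes CRITICAL because three processes match. A real plugin would run ps itself and exit with the returned status.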
The stack traces are in /root/children_of_hhvm on mw1114.