TLDR workaround
Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "' 2>&1 | grep -v 'exited with exit code 1'
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
For some time, I had a cron job running that listed and killed orphaned php-cgi processes (clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'while [ -n "$(pgrep -P 1 php-cgi | tee >(xargs -r sudo kill -HUP) /dev/stderr)" ]; do :; done'). The problem with these orphaned processes is that lighttpd only starts n php-cgi processes in total per user, i.e. if there are already n stale processes when lighttpd starts up, it will not spawn any new ones and thus cannot process any requests.
I decided to take a closer look:
scfc@tools-puppetmaster-02:~$ clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'pgrep -P 1 php-cgi || true'
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: 31771
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13238
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13242
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: 7446
scfc@tools-puppetmaster-02:~$
scfc@tools-webgrid-lighttpd-1401:~$ sudo cat /proc/31771/environ; echo
PATH=/tmp/697935.1.webgrid-lighttpd:/usr/local/bin:/bin:/usr/binSHELL=/bin/bashUSER=tools.wsexportPHP_FCGI_CHILDREN=2PHP_FCGI_MAX_REQUESTS=500
scfc@tools-webgrid-lighttpd-1401:~$ fgrep 697935 /var/spool/gridengine/execd/tools-webgrid-lighttpd-1401/messages
12/05/2016 19:32:00| main|tools-webgrid-lighttpd-1401|W|job 697935 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" (4307968000.00000 > limit:4294967296.00000) - sending SIGKILL
scfc@tools-webgrid-lighttpd-1401:~$
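The environ output above runs the variables together because /proc/PID/environ separates entries with NUL bytes, which the terminal does not display. A minimal sketch of a more readable variant follows; environ_lines is a hypothetical helper name, not something deployed on the grid nodes.

```shell
# environ_lines (hypothetical helper): read NUL-separated environment data
# on stdin and print one VAR=value entry per line.
environ_lines() {
    tr '\0' '\n'
}
```

It could be used as: sudo cat /proc/31771/environ | environ_lines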
So the job OOMed (it exceeded its h_vmem hard limit), the grid killed the main process (lighttpd) with SIGKILL, and somehow at least one php-cgi process got left behind (on tools-webgrid-lighttpd-1412 there are two, for job 716152).
This should probably be changed to use jobkill, which sends SIGINT, pauses 10 s, sends SIGTERM, pauses 10 s, and then sends SIGKILL. The delay shouldn't be a problem for OOMed jobs, but it will allow a cleaner shutdown on SIGTERM.
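For illustration, the escalation sequence described above (SIGINT, pause, SIGTERM, pause, SIGKILL) can be sketched in POSIX shell. This is not the actual jobkill script; escalate_kill and its arguments are hypothetical names, and the 10 s default pause is taken from the description above.

```shell
# escalate_kill PID [PAUSE] (hypothetical helper, not the real jobkill):
# send SIGINT, SIGTERM and SIGKILL in turn, waiting up to PAUSE seconds
# (default 10) after each signal for the process to exit.
escalate_kill() {
    target=$1
    pause=${2:-10}
    for sig in INT TERM KILL; do
        # If the signal cannot be delivered, assume the process is gone.
        kill -s "$sig" "$target" 2>/dev/null || return 0
        waited=0
        while [ "$waited" -lt "$pause" ]; do
            # kill -0 only checks whether the process still exists.
            kill -0 "$target" 2>/dev/null || return 0
            sleep 1
            waited=$((waited + 1))
        done
    done
    # Succeed only if the process no longer exists after SIGKILL.
    ! kill -0 "$target" 2>/dev/null
}
```

Under these assumptions, a process that handles SIGINT or SIGTERM gets up to 20 s to shut down cleanly before SIGKILL is sent, and the helper returns as soon as the process has exited.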
jobkill is currently not deployed to webgrid nodes, so that needs to be puppetized as well.