TLDR workaround
----
Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:
```name=find-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "'
```
```name=kill-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
```
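Note that `grep -E " 1 "` matches any line containing ` 1 `, including one where it only appears in the command's arguments. A stricter filter can compare the PPID column directly. The following is only a sketch: `filter_orphans` is a hypothetical helper name, the user list is abbreviated, it matches user names exactly rather than by prefix, and it does not reproduce the `grep -v perl` step.

```shell
#!/bin/bash
# Hypothetical helper, not part of the commands above.
# Reads `ps axwo user:20,ppid,pid,cmd` output on stdin and prints only the
# lines whose PPID column is exactly 1 and whose user is not excluded.
# $1: "|"-separated list of user names to skip (abbreviated default here).
filter_orphans() {
    local skip=${1:-root|daemon|www-data|sgeadmin}
    # $1 is the user column, $2 the PPID column of the ps output.
    # The ps header line is dropped because "PPID" is not equal to 1.
    awk -v skip="^(${skip})\$" '$2 == 1 && $1 !~ skip'
}
```

It would be used as `ps axwo user:20,ppid,pid,cmd | filter_orphans "$USER|root|daemon"`.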
---
For some time, I had a `cron` job running that listed and killed orphaned `php-cgi` processes (`clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'while [ -n "$(pgrep -P 1 php-cgi | tee >(xargs -r sudo kill -HUP) /dev/stderr)" ]; do :; done'`). These orphaned processes are a problem because `lighttpd` starts only n `php-cgi` processes in total per user, i.e. if n stale processes already exist when `lighttpd` starts up, it will not spawn any new ones and thus cannot process any requests.
I decided to take a closer look:
```
scfc@tools-puppetmaster-02:~$ clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'pgrep -P 1 php-cgi || true'
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: 31771
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13238
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13242
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: 7446
scfc@tools-puppetmaster-02:~$
```
```
scfc@tools-webgrid-lighttpd-1401:~$ sudo cat /proc/31771/environ; echo
PATH=/tmp/697935.1.webgrid-lighttpd:/usr/local/bin:/bin:/usr/binSHELL=/bin/bashUSER=tools.wsexportPHP_FCGI_CHILDREN=2PHP_FCGI_MAX_REQUESTS=500
scfc@tools-webgrid-lighttpd-1401:~$ fgrep 697935 /var/spool/gridengine/execd/tools-webgrid-lighttpd-1401/messages
12/05/2016 19:32:00| main|tools-webgrid-lighttpd-1401|W|job 697935 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" (4307968000.00000 > limit:4294967296.00000) - sending SIGKILL
scfc@tools-webgrid-lighttpd-1401:~$
```
So the job OOMed (it exceeded its `h_vmem` limit), the grid killed the main process (`lighttpd`) with `SIGKILL`, and somehow at least one `php-cgi` process got left behind (on `tools-webgrid-lighttpd-1412`, two were left behind for job 716152).
The grid should probably terminate such jobs with `jobkill` instead, which sends `SIGINT`, pauses 10 s, sends `SIGTERM`, pauses 10 s, and then sends `SIGKILL`. The delay shouldn't be a problem for OOMed jobs, and it gives the processes a chance to shut down cleanly on `SIGTERM`.
`jobkill` is currently not deployed to webgrid nodes, so that needs to be puppetized as well.
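For illustration, the escalation sequence described above could look roughly like this. This is a minimal sketch and not the actual `jobkill` script: the function name `escalating_kill` and its optional pause argument are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch of the SIGINT -> SIGTERM -> SIGKILL escalation.
# Usage: escalating_kill PID [PAUSE_SECONDS]   (pause defaults to 10 s)
escalating_kill() {
    local pid=$1
    local pause=${2:-10}
    local sig
    for sig in INT TERM KILL; do
        # kill -0 sends no signal; it only checks that the process still
        # exists and may be signalled. Stop once the process is gone.
        kill -0 "$pid" 2>/dev/null || return 0
        kill -s "$sig" "$pid" 2>/dev/null
        # Give the process time to shut down before escalating further.
        # Nothing follows SIGKILL, so no pause is needed after it.
        [ "$sig" = KILL ] || sleep "$pause"
    done
    return 0
}
```

A process that handles `SIGINT` or `SIGTERM` gets up to two pauses to clean up its children before `SIGKILL` is sent.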