Change Details

TLDR workaround
----
Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:

```name=find-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "'
```

```name=kill-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
```

---

For some time, I had a `cron` job running that listed and killed orphaned `php-cgi` processes (`clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'while [ -n "$(pgrep -P 1 php-cgi | tee >(xargs -r sudo kill -HUP) /dev/stderr)" ]; do :; done'`). The problem with these orphaned processes is that `lighttpd` only starts n `php-cgi` processes in total per user, i.e., if there are already n stale processes when `lighttpd` starts up, it will not spawn any new ones and thus cannot process any requests.

I decided to take a closer look:

```
scfc@tools-puppetmaster-02:~$ clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'pgrep -P 1 php-cgi || true'
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: 31771
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13238
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13242
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: 7446
scfc@tools-puppetmaster-02:~$
```

```
scfc@tools-webgrid-lighttpd-1401:~$ sudo cat /proc/31771/environ; echo
PATH=/tmp/697935.1.webgrid-lighttpd:/usr/local/bin:/bin:/usr/binSHELL=/bin/bashUSER=tools.wsexportPHP_FCGI_CHILDREN=2PHP_FCGI_MAX_REQUESTS=500
scfc@tools-webgrid-lighttpd-1401:~$ fgrep 697935 /var/spool/gridengine/execd/tools-webgrid-lighttpd-1401/messages
12/05/2016 19:32:00| main|tools-webgrid-lighttpd-1401|W|job 697935 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" (4307968000.00000 > limit:4294967296.00000) - sending SIGKILL
scfc@tools-webgrid-lighttpd-1401:~$
```

So the job OOMed, the grid killed the main process (`lighttpd`) with `SIGKILL`, and somehow at least one `php-cgi` process got left behind (on `tools-webgrid-lighttpd-1412` there are two, for job 716152).

This should probably be changed to use `jobkill`, which sends `SIGINT`, pauses 10 s, sends `SIGTERM`, pauses 10 s, and then sends `SIGKILL`. The delay shouldn't be a problem for OOMed jobs, but it will enable a cleaner shutdown on `SIGTERM`. `jobkill` is currently not deployed to webgrid nodes, so that needs to be puppetized as well.
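
For illustration, the escalation described above (`SIGINT`, pause, `SIGTERM`, pause, `SIGKILL`) could be sketched roughly as follows. This is not the actual `jobkill` script, whose implementation is not shown here; `escalate_kill` and its arguments are hypothetical names, and the sketch assumes the caller is allowed to signal the target process.

```shell
#!/bin/sh
# Hypothetical helper, NOT the real jobkill: send INT, then TERM, then KILL,
# giving the process up to PAUSE seconds to exit after each signal.
# Usage: escalate_kill PID [PAUSE_SECONDS]   (pause defaults to 10)
escalate_kill() {
    local pid="$1" pause="${2:-10}" sig i
    for sig in INT TERM KILL; do
        # If the signal cannot be delivered, the process is already gone.
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        # Poll once per second so we return as soon as the process exits.
        i=0
        while [ "$i" -lt "$pause" ]; do
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 1
            i=$((i + 1))
        done
    done
    # Succeed only if the process no longer exists after SIGKILL.
    ! kill -0 "$pid" 2>/dev/null
}
```

Calling `escalate_kill 31771` would give the process 10 s to react to each signal before the next one is sent. On the grid nodes the orphans belong to tool accounts, so the `kill` calls would presumably have to run as root or as the owning user.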