webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind
Open, Normal, Public


TLDR workaround

Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:

$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "' 2>&1 | grep -v 'exited with exit code 1'
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "|awk "{print \$3}"|xargs -r sudo kill -9'
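
The grep for "     1 " in the commands above depends on the column padding that ps emits. A minimal sketch of the same check keyed on the PPID field instead, assuming the same ps axwo user:20,ppid,pid,cmd column order. The function name filter_orphans and its skip-list argument are made up for illustration and are not part of the commands above, and the sketch does not reproduce the grep -v perl exclusion:

```shell
# Hypothetical helper (not part of the original workaround).
# Reads `ps axwo user:20,ppid,pid,cmd` output on stdin and prints
# "user pid cmd" for every process whose parent is init (PPID 1) and
# whose owner is not in the space-separated skip list passed as $1.
filter_orphans() {
  awk -v skip="$1" '
    BEGIN {
      # Turn the skip list into a lookup table of account names.
      n = split(skip, names, " ")
      for (i = 1; i <= n; i++) sys[names[i]] = 1
    }
    NR == 1 && $1 == "USER" { next }   # drop the ps header line
    $2 == 1 && !($1 in sys) { print $1, $3, $4 }
  '
}
```

On a single node this could be fed with something like ps axwo user:20,ppid,pid,cmd | filter_orphans "root daemon www-data sgeadmin", with the list of system accounts adjusted to match the one used above.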

For some time, I had a cron job running that listed and killed orphaned php-cgi processes (clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'while [ -n "$(pgrep -P 1 php-cgi | tee >(xargs -r sudo kill -HUP) /dev/stderr)" ]; do :; done'). The problem with these orphaned processes is that lighttpd only starts n php-cgi processes in total per user, i.e., if there are already n stale processes when lighttpd starts up, it will not spawn any new ones and thus cannot process any requests.

I decided to take a closer look:

scfc@tools-puppetmaster-02:~$ clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'pgrep -P 1 php-cgi || true'
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: 31771
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13238
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13242
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: 7446
scfc@tools-webgrid-lighttpd-1401:~$ sudo cat /proc/31771/environ; echo
scfc@tools-webgrid-lighttpd-1401:~$ fgrep 697935 /var/spool/gridengine/execd/tools-webgrid-lighttpd-1401/messages 
12/05/2016 19:32:00|  main|tools-webgrid-lighttpd-1401|W|job 697935 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" (4307968000.00000 > limit:4294967296.00000) - sending SIGKILL

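So the job OOMed (it exceeded its h_vmem limit), the grid killed the main process (lighttpd) with SIGKILL, and somehow at least one php-cgi process got left behind (on tools-webgrid-lighttpd-1412 there are two, for job 716152). Since SIGKILL cannot be caught, lighttpd gets no chance to shut down its php-cgi children before it dies.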
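
To tie a leaked process back to its job, the environment dump is one place to look, since grid engine normally exports the job number to the job as JOB_ID. A small sketch, assuming JOB_ID is still present in the leaked process's environment; the helper name is hypothetical:

```shell
# Hypothetical helper: print the value of JOB_ID from a NUL-separated
# environment file such as /proc/<pid>/environ.
job_id_from_environ() {
  # environ entries are separated by NUL bytes; turn them into lines,
  # then print only the value of the JOB_ID entry (if any).
  tr '\0' '\n' < "$1" | sed -n 's/^JOB_ID=//p'
}
```

Reading another user's /proc/<pid>/environ requires root, so on an exec node this would have to run under sudo.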

This should probably be changed to use jobkill, which sends SIGINT, pauses 10 s, sends SIGTERM, pauses another 10 s, and only then sends SIGKILL. The delay shouldn't be a problem for OOMed jobs, but it will enable a cleaner shutdown on SIGTERM.
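
The actual jobkill script is not shown in this task. The following is only a sketch of the escalation described above, under the assumption that it targets a single PID (a real job cleanup would more likely need to signal the whole process group). The function name and the optional pause argument are hypothetical:

```shell
# Hypothetical sketch of an escalating kill; NOT the deployed jobkill script.
# Usage: escalate_kill PID [PAUSE_SECONDS]   (pause defaults to 10)
escalate_kill() {
  pid=$1
  pause=${2:-10}
  for sig in INT TERM KILL; do
    # If the signal cannot be delivered, the process is already gone.
    kill -s "$sig" "$pid" 2>/dev/null || return 0
    if [ "$sig" = KILL ]; then
      break
    fi
    # Wait up to $pause seconds, returning early once the process exits.
    i=0
    while [ "$i" -lt "$pause" ]; do
      kill -0 "$pid" 2>/dev/null || return 0
      sleep 1
      i=$((i + 1))
    done
  done
}
```

The point of the two softer signals is that a process like lighttpd can catch them and stop its children itself; SIGKILL is only the last resort.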

jobkill is currently not deployed to webgrid nodes, so that needs to be puppetized as well.

scfc created this task. Dec 15 2016, 3:33 AM
Restricted Application added a project: Cloud-Services. Dec 15 2016, 3:33 AM
Restricted Application added a subscriber: Aklapper.
scfc updated the task description. Dec 15 2016, 3:38 AM
bd808 updated the task description.
bd808 updated the task description. Sep 21 2018, 8:22 PM
Bstorm added a subscriber: Bstorm. Sep 24 2018, 4:14 PM

Adding a note because I just found it: this is where it would need to change if it were all in puppet: https://github.com/wikimedia/puppet/blob/production/modules/toollabs/templates/gridengine/queue-webgrid.erb#L14

However, I have zero faith that this will actually take effect without also changing it in grid engine directly. I'm happy to try it and see what happens, though: the puppetization of these things looks incomplete, but looks can be deceiving. Adding the jobkill script will at least be a proper puppet thing.

Change 469129 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: webgrid exec nodes should use the jobkill script


Change 469129 merged by Bstorm:
[operations/puppet@production] gridengine: webgrid exec nodes should use the jobkill script


I was right about the puppet bit not actually updating the queue configurations "enough" for whatever reason. However, I altered the configs sufficiently that jobkill is now set on the web queues just like it is on the others.

Mentioned in SAL (#wikimedia-cloud) [2018-10-29T17:00:06Z] <bd808> Ran grid engine orphan process kill script from T153281

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T21:15:58Z] <bd808> Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.

Bstorm added a comment. Wed, Feb 6, 1:11 AM

Ok, since all queues now use jobkill, I looked into why some orphans still come up when running these commands. It seems they appear for totally different reasons (and it also turns out that people run some screen and tmux sessions on the exec nodes, since a couple of those were picked up).

One example is:

01/15/2019 16:49:40|  main|tools-webgrid-lighttpd-1427|E|recursive rmdir(/tmp/516747.1.webgrid-lighttpd): opendir(/tmp/516747.1.webgrid-lighttpd) failed: No such file or directory

That job ended up in the orphaned state. The process is:

tools.tools-info         1 17953 /usr/bin/php-cgi

So, something different happened there with a similar result.