webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind
Closed, Declined (Public)

Assigned To

None

Authored By

	scfc
	Dec 15 2016, 3:33 AM

Description

TLDR workaround

Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:

find-orphans

$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "' 2>&1 | grep -v 'exited with exit code 1'

kill-orphans

$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "|awk "{print \$3}"|xargs sudo kill -9'
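The `grep -E "     1 "` step in the pipelines above selects processes whose parent is PID 1 by relying on the column padding of `ps`. A minimal sketch of the same selection done on the PPID field itself might look like the following. `filter_orphans` and `SYSTEM_USERS` are hypothetical names used only for illustration, and the user list is shortened; this is not what was actually run on the grid.

```shell
#!/bin/bash
# Hypothetical filter, assuming input in the format of
# `ps axwo user:20,ppid,pid,cmd` (columns: USER PPID PID CMD).
# SYSTEM_USERS is an illustrative subset of the exclusion list used above.
SYSTEM_USERS='root|daemon|www-data|sgeadmin'

filter_orphans() {
    awk -v skip="^(${SYSTEM_USERS})\$" '
        # Skip the header line that ps prints first.
        NR == 1 && $1 == "USER" { next }
        # Keep processes reparented to PID 1 that do not belong to a
        # system user and do not mention perl; print only the PID.
        $2 == 1 && $1 !~ skip && $0 !~ /perl/ { print $3 }
    '
}
```

On a single node this could be used as `ps axwo user:20,ppid,pid,cmd | filter_orphans` to list candidate PIDs before deciding whether to kill them.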

For some time, I had a cron job running that listed and killed orphaned php-cgi processes (clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'while [ -n "$(pgrep -P 1 php-cgi | tee >(xargs -r sudo kill -HUP) /dev/stderr)" ]; do :; done'). The problem with these orphaned processes is that lighttpd starts only n php-cgi processes in total per user, i.e., if there are already n stale processes when lighttpd starts up, it will not spawn any new ones and thus cannot process any requests.

I decided to take a closer look:

scfc@tools-puppetmaster-02:~$ clush -g webgrid-lighttpd-precise -g webgrid-lighttpd-trusty 'pgrep -P 1 php-cgi || true'
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: 31771
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13238
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: 13242
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: 7446
scfc@tools-puppetmaster-02:~$

scfc@tools-webgrid-lighttpd-1401:~$ sudo cat /proc/31771/environ; echo
PATH=/tmp/697935.1.webgrid-lighttpd:/usr/local/bin:/bin:/usr/binSHELL=/bin/bashUSER=tools.wsexportPHP_FCGI_CHILDREN=2PHP_FCGI_MAX_REQUESTS=500
scfc@tools-webgrid-lighttpd-1401:~$ fgrep 697935 /var/spool/gridengine/execd/tools-webgrid-lighttpd-1401/messages 
12/05/2016 19:32:00|  main|tools-webgrid-lighttpd-1401|W|job 697935 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" (4307968000.00000 > limit:4294967296.00000) - sending SIGKILL
scfc@tools-webgrid-lighttpd-1401:~$
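The entries in /proc/PID/environ are separated by NUL bytes, which is why the output above appears run together. A small sketch for pulling the job ID out of such a file follows. `job_id_from_environ` is a hypothetical helper, and it assumes that the first PATH component has the form /tmp/JOBID.TASKID.QUEUE, as it does in the transcript above.

```shell
#!/bin/bash
# Hypothetical helper: print the grid engine job ID found in an environ file.
# Assumes PATH starts with /tmp/<jobid>.<taskid>.<queue>, as seen above.
job_id_from_environ() {
    # Turn the NUL separators into newlines, then extract the leading
    # number from the first PATH component.
    tr '\0' '\n' < "$1" |
        sed -n 's|^PATH=/tmp/\([0-9][0-9]*\)\.[0-9][0-9]*\..*|\1|p'
}
```

For the process above this would be called as `sudo bash -c 'source ./helper.sh; job_id_from_environ /proc/31771/environ'` or similar (helper.sh being wherever the function is saved), and the resulting ID can then be looked up in the execd messages file.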

So the job exceeded its h_vmem limit, the grid killed the main process (lighttpd) with SIGKILL, and somehow at least one php-cgi process was left behind (on tools-webgrid-lighttpd-1412, two were left behind for job 716152).

This should probably be changed to use jobkill, which sends SIGINT, waits 10 s, sends SIGTERM, waits another 10 s, and only then sends SIGKILL. The delay shouldn't be a problem for OOMed jobs, but it gives lighttpd a chance to shut down cleanly on SIGTERM.

jobkill is currently not deployed to webgrid nodes, so that needs to be puppetized as well.
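To illustrate the escalation described above, here is a minimal sketch of a SIGINT, SIGTERM, SIGKILL sequence with pauses in between. It is not the deployed jobkill script: `escalate_kill` is a hypothetical name, and whether the real script signals a single PID or a whole process group is not stated in this task.

```shell
#!/bin/bash
# Hypothetical helper, NOT the deployed jobkill script.
# Usage: escalate_kill PID [PAUSE_SECONDS]   (pause defaults to 10)
escalate_kill() {
    local pid=$1 pause=${2:-10} sig
    for sig in INT TERM KILL; do
        # kill fails once the PID no longer exists, so stop escalating then.
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        # Give the process time to exit; nothing to wait for after SIGKILL.
        [ "$sig" = KILL ] || sleep "$pause"
    done
}
```

A process that handles SIGINT or SIGTERM gets up to two pauses to clean up its children before the final SIGKILL, which is the behaviour the proposal relies on.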

Details

	Subject	Repo	Branch	Lines +/-
	gridengine: webgrid exec nodes should use the jobkill script	operations/puppet	production	+1 -1

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T153281 webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind
		Open		None	T132880 tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly

Event Timeline

scfc created this task. Dec 15 2016, 3:33 AM

Restricted Application added a project: Cloud-Services. Dec 15 2016, 3:33 AM

Restricted Application added a subscriber: Aklapper.

scfc updated the task description. Dec 15 2016, 3:38 AM

• bd808 merged a task: T182070: tools-webgrid-lighttpd have ~ 90 procs stuck at 100% CPU time (mostly tools.jembot). Jul 12 2018, 11:42 PM

• bd808 added subscribers: hashar, Stashbot, JJMC89 and 9 others.

• bd808 removed a project: Cloud-Services. Jul 12 2018, 11:44 PM

• bd808 updated the task description.

• bd808 added a subtask: T132880: tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly. Sep 21 2018, 8:19 PM

• bd808 mentioned this in T132880: tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly.

• bd808 updated the task description. Sep 21 2018, 8:22 PM

hashar awarded a token. Sep 21 2018, 9:35 PM

Adding a note because I just found it. This is where it would need to change if it were all in puppet: https://github.com/wikimedia/puppet/blob/production/modules/toollabs/templates/gridengine/queue-webgrid.erb#L14

However, I have zero faith that will actually take effect without changing that in grid engine directly. I'm happy to try it and see what happens, though, since while the puppetization of these things looks incomplete, looks can be deceiving. Adding the jobkill script will at least be a proper puppet thing.

• Bstorm added a project: cloud-services-team (Kanban). Oct 18 2018, 4:54 PM

Change 469129 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: webgrid exec nodes should use the jobkill script

https://gerrit.wikimedia.org/r/469129

Change 469129 merged by Bstorm:
[operations/puppet@production] gridengine: webgrid exec nodes should use the jobkill script

https://gerrit.wikimedia.org/r/469129

I was right about the puppet bit not actually updating the queue configurations "enough" for whatever reason. However, I altered the configs sufficiently that jobkill is now set on the web queues just like it is on the others.

Mentioned in SAL (#wikimedia-cloud) [2018-10-29T17:00:06Z] <bd808> Ran grid engine orphan process kill script from T153281

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T21:15:58Z] <bd808> Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.

Ok, since all queues now use jobkill, I looked into why some orphans still come up when running these commands. It seems they appear for entirely different reasons (and some people run screen and tmux sessions on the exec setup, since a couple of those were picked up too).

One example is

01/15/2019 16:49:40|  main|tools-webgrid-lighttpd-1427|E|recursive rmdir(/tmp/516747.1.webgrid-lighttpd): opendir(/tmp/516747.1.webgrid-lighttpd) failed: No such file or directory

That job ended up in the orphaned state. The process is

tools.tools-info         1 17953 /usr/bin/php-cgi

So, something different happened there with a similar result.

• Bstorm moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board. Dec 18 2019, 4:33 PM

hashar unsubscribed. Mar 9 2020, 8:03 AM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 6:53 PM

fnegri moved this task from Kanban to Graveyard on the cloud-services-team board.

Pppery removed a project: Patch-For-Review. Apr 2 2023, 12:52 AM

No more work is going to be done in the Grid.
