
tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly
Open, HighPublic

Assigned To
None
Authored By
valhallasw
Apr 17 2016, 4:28 PM
Referenced Files
F28409789: toolshighcpu.png
Mar 18 2019, 1:10 PM
F26142307: graphite-labs.wikimedia.org.png
Sep 21 2018, 8:02 PM
F14439402: Screen Shot 2018-03-05 at 13.52.46.png
Mar 5 2018, 10:46 PM
Tokens
"Burninate" token, awarded by hashar.

Description

Graphite quick check: Top 8 webgrid-lighttpd instances by average CPU over one hour

It seems the faulty webgrid jobs have piled up. If someone could kill the stuck /usr/bin/php-cgi processes owned by tools.jembot, that would be nice :]

Confirmed, I see tons of leakage via clush -w @all 'sudo pidstat -U tools.jembot' | grep jem

Culled things with:

clush -w @all 'sudo /usr/bin/pkill --signal 9 -u tools.jembot'
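Roughly the same check, as a sketch that just counts the strays per host before culling (assumes the same clush @all alias as above):

# Tally stray jembot php-cgi processes per exec host.
clush -w @all 'sudo pidstat -U tools.jembot' | grep php-cgi \
    | awk -F: '{print $1}' | sort | uniq -c | sort -rn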

A recent snapshot of activity shows 1146 restarts of the webservice in the last 7 calendar days:

Screen Shot 2018-03-05 at 13.52.46.png (224×485 px, 34 KB)

The error.log data seems to indicate that the main lighttpd process is being killed on a regular basis for exceeding its memory limit:

2018-03-05 21:30:23: (log.c.166) server started
2018-03-05 21:40:21: (server.c.1558) server stopped by UID = 0 PID = 25145
2018-03-05 21:40:24: (log.c.166) server started
2018-03-05 21:50:20: (server.c.1558) server stopped by UID = 0 PID = 30322
2018-03-05 21:50:23: (log.c.166) server started
2018-03-05 22:00:21: (server.c.1558) server stopped by UID = 0 PID = 7331
2018-03-05 22:00:23: (log.c.166) server started
2018-03-05 22:10:20: (server.c.1558) server stopped by UID = 0 PID = 19091
2018-03-05 22:10:22: (log.c.166) server started
2018-03-05 22:20:20: (server.c.1558) server stopped by UID = 0 PID = 7240
2018-03-05 22:20:22: (log.c.166) server started
2018-03-05 22:30:21: (server.c.1558) server stopped by UID = 0 PID = 10593
2018-03-05 22:30:23: (log.c.166) server started
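A quick way to quantify that churn from the same log; the error.log path below is the usual Toolforge location for this tool and is an assumption here:

# Count lighttpd restarts per day from the "server started" lines.
grep 'server started' /data/project/jembot/error.log | cut -d' ' -f1 | sort | uniq -c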

See also:

Event Timeline

bd808 renamed this task from tools.jembot spawns many php-cgi processes in busy loop (?) to tools.jembot crashes and leaves orphan php-cgi processes regularly. Mar 5 2018, 10:46 PM
bd808 reassigned this task from valhallasw to -jem-.
bd808 triaged this task as High priority.
bd808 updated the task description. (Show Details)
bd808 added subscribers: hashar, chasemp.

There seem to be a large number of things that this single tool does, but the one that looks most likely to cause memory issues is the creation of annotated image thumbnails that are shared on Facebook (e.g. https://tools.wmflabs.org/jembot/ef/pub/20180305/1918-Joan-Josep%20Tharrats.png).
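If thumbnail rendering is indeed the culprit, one quick check would be to measure the peak memory of a single render. This is only a hypothetical sketch: the script path and argument are placeholders, not the tool's real entry point.

# Hypothetical: peak memory of one thumbnail render (thumb.php is a placeholder name).
/usr/bin/time -v php /data/project/jembot/public_html/ef/thumb.php '1918-Joan-Josep Tharrats.png' \
    2>&1 | grep -i 'maximum resident'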

$ clush -w @all 'sudo pidstat -U tools.jembot' | grep jem
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10915    6.72    0.00    0.00    6.72     0  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10916    6.72    0.00    0.00    6.72     0  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10918    6.72    0.00    0.00    6.72     2  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10919    6.72    0.00    0.00    6.72     2  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9192    3.44    0.00    0.00    3.45     3  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9193    3.44    0.00    0.00    3.44     0  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9196    3.44    0.00    0.00    3.45     2  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9197    3.45    0.00    0.00    3.45     1  php-cgi
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20794    0.00    0.00    0.00    0.00     0  lighttpd
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20799    0.00    0.00    0.00    0.00     1  php-cgi
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20802    0.00    0.00    0.00    0.00     1  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5072    3.14    0.00    0.00    3.14     2  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5073    3.13    0.00    0.00    3.14     3  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5075    3.14    0.00    0.00    3.14     0  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5076    3.14    0.00    0.00    3.14     1  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3821    7.33    0.01    0.00    7.34     2  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3822    7.33    0.01    0.00    7.33     1  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3824    7.33    0.01    0.00    7.34     0  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3825    7.33    0.01    0.00    7.34     3  php-cgi

This bot was just now gobbling up CPU throughout the cluster.

I stopped the webservice, then ran 'sudo cumin --force --timeout 500 -o json "project:tools" "/usr/bin/pkill --signal 9 -u tools.jembot"' and then restarted the webservice.

Things seem better, for now... but we need a better fix.
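For posterity, the same sequence as a small script; the sudo -i -u invocations are an assumption about how an admin would run the tool-account commands, while the cumin call is the one quoted above:

# Stop the webservice, cull leftover processes project-wide, restart.
sudo -i -u tools.jembot webservice stop
sudo cumin --force --timeout 500 -o json 'project:tools' '/usr/bin/pkill --signal 9 -u tools.jembot'
sudo -i -u tools.jembot webservice start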

Mentioned in SAL (#wikimedia-cloud) [2018-09-04T14:33:51Z] <andrewbogott> restarted webservice, purged stray processes. Details on T132880

I have added a link to the task description to check the busiest webgrid-lighttpd instance: https://graphite-labs.wikimedia.org/render/?width=648&height=396&_salt=1537559933.891&hideLegend=false&target=cactiStyle(highestAverage(tools.*webgrid-lighttpd*.cpu.total.user%2C8))&from=-1hours

There are again a bunch of blocked processes, including new ones (several tools.wsexport for 30-90 days); a sketch of one way to produce such a listing follows the table:

webgrid-lighttpd  days-h:m:s   cpu%  process           account
1402              9-01:50:02   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:50:38   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:51:59   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:52:37   96.0  /usr/bin/php-cgi  tools.jembot
1406              9-17:54:33   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:55:01   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:57:15   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:57:33   59.8  /usr/bin/php-cgi  tools.jembot
1407              56-11:11:57  99.2  /usr/bin/php-cgi  tools.wsexport
1410              55-00:29:48  99.7  /usr/bin/php-cgi  tools.wsexport
1411              7-01:03:16   83.9  /usr/bin/php-cgi  tools.jembot
1411              7-01:06:02   84.0  /usr/bin/php-cgi  tools.jembot
1411              7-01:06:35   84.0  /usr/bin/php-cgi  tools.jembot
1411              7-01:07:25   84.0  /usr/bin/php-cgi  tools.jembot
1412              9-20:31:50   99.6  /usr/bin/php-cgi  tools.wsexport
1413              83-16:52:07  78.1  /usr/bin/php-cgi  tools.blockcalc
1415              30-14:00:15  99.6  /usr/bin/php-cgi  tools.wsexport
1416              59-00:18:33  99.7  /usr/bin/php-cgi  tools.dupdet
1416              62-20:26:47  98.7  /usr/bin/php-cgi  tools.dupdet
1416              63-08:27:51  99.8  /usr/bin/php-cgi  tools.wsexport
1417              5-15:52:49   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:52:54   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:53:46   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:55:16   94.0  /usr/bin/php-cgi  tools.jembot
1418              36-15:52:20  99.7  /usr/bin/php-cgi  tools.wsexport
1419              1-10:14:14   99.6  /usr/bin/php-cgi  tools.wsexport
1419              43-03:15:37  97.8  /usr/bin/php-cgi  tools.sourcemd
1420              2-18:55:39   52.7  /usr/bin/php-cgi  tools.jembot
1420              4-19:06:20   90.7  /usr/bin/php-cgi  tools.jembot
1420              4-20:17:48   91.6  /usr/bin/php-cgi  tools.jembot
1420              4-21:03:20   92.2  /usr/bin/php-cgi  tools.jembot
1421              46-10:47:46  99.9  /usr/bin/php-cgi  tools.wsexport
1422              59-07:32:06  99.8  /usr/bin/php-cgi  tools.wsexport
1422              59-07:33:50  99.8  /usr/bin/php-cgi  tools.wsexport
1425              45-21:33:24  80.0  /usr/bin/php-cgi  tools.dupdet
1426              83-11:51:28  99.4  /usr/bin/php-cgi  tools.spellcheck
1427              22-20:55:54  99.3  /usr/bin/php-cgi  tools.sourcemd
1427              52-01:36:00  54.1  /usr/bin/php-cgi  tools.tools-info
1428              5-21:03:37   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:04:34   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:04:53   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:05:30   85.7  /usr/bin/php-cgi  tools.jembot
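A sketch of one way to produce a listing like the one above (assumed, not the reporter's exact command; the ps format is a guess at matching columns):

# Long-running, high-CPU php-cgi processes on every webgrid host.
clush -w @all 'ps -eo user:30,etime,pcpu,comm --sort=-pcpu --no-headers | grep "^tools\..*php-cgi"'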

The new parent task (T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind) describes the general problem that leads to leaked processes on the job grid. It also includes workaround instructions for clearing the orphan procs to free up CPU.
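A sketch of the kind of per-host cleanup such instructions describe; the PPID check is an assumption that orphaned php-cgi workers get reparented to init once their lighttpd parent has been SIGKILLed:

# Kill php-cgi processes owned by tool accounts that have been orphaned (PPID 1).
# Run with sudo on a webgrid node.
ps -eo pid,ppid,user:30,comm --no-headers | while read -r pid ppid user comm; do
    if [ "$ppid" -eq 1 ] && [ "$comm" = "php-cgi" ] && [[ "$user" == tools.* ]]; then
        echo "killing orphan php-cgi $pid owned by $user"
        kill -9 "$pid"
    fi
done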

bd808 renamed this task from tools.jembot crashes and leaves orphan php-cgi processes regularly to tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly. Sep 21 2018, 9:00 PM

@-jem- have you ever tried moving your webservice from the grid engine to our Kubernetes cluster? I'm wondering if Kubernetes would be better at cleaning up when the memory limit kills things. We have both PHP 5.6 and PHP 7.2 available there.
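For reference, the switch itself should only be a couple of commands from a bastion; this reflects the webservice syntax as documented around that time, so treat the exact type names as assumptions:

become jembot                                  # switch to the tool account
webservice stop                                # stop the grid engine webservice
webservice --backend=kubernetes php7.2 start   # start it on Kubernetes (php5.6 also available)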

Mentioned in SAL (#wikimedia-cloud) [2018-10-25T20:14:31Z] <andrewbogott> stopping/starting service in hopes of cleaning up stray processes, re: T132880

I just now killed off all jembot processes and restarted again.

I did do this: https://phabricator.wikimedia.org/T153281#4687515
Strictly speaking, it should not happen after a certain point. If it continues happening after that change, maybe we'll have to create a new job killer script; however, this should be a long-term fix.
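Should a new killer script turn out to be necessary, a minimal sketch of the "SIGTERM first, SIGKILL later" pattern it would presumably follow (the function name and grace period are hypothetical):

# Ask a tool's php-cgi processes to exit, then force-kill any stragglers.
terminate_tool_procs() {
    local tool="$1"
    pkill --signal TERM -u "$tool" php-cgi || return 0  # nothing to do if none matched
    sleep 30                                            # grace period for clean shutdown
    pkill --signal KILL -u "$tool" php-cgi
}
terminate_tool_procs tools.jembot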

I confess that I didn't have a super strong case for killing things just now; one of the labvirts was under strain and I saw several (maybe 4-5?) jembot processes running there and ran straight for the hatchet. It would be useful to know what a normal number of procs looks like.

This is really the same problem as T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind from the point of view of the job grid. It would be awesome if the tool could be updated/changed/tweaked to avoid hitting the job grid bug, but in the larger scope of things we need to fix the grid or add a cleanup system. It may be reasonable to delay digging really deeply into the root cause and fix until we finish T199271: Upgrade the tools gridengine system and see if this problem persists.

So I take it the jobkill script isn't fixing this, then? It was definitely set to send a SIGKILL when anything went bad previously. Anything submitted before my change would likely still obey the old setting. I guess we just have to decide when we are convinced the new setting didn't do squat :)
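One way to double-check what freshly submitted jobs would now get, assuming the change was made through the queue's terminate_method (the queue name below is a guess):

# Show the configured kill signal and memory limits for the webgrid queue.
qconf -sq webgrid-lighttpd | grep -E 'terminate_method|h_vmem'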

Maybe fully resetting everything jembot-related?

If that was effectively done by @Andrew , and then @bd808 had to do a culling again after, then I'm convinced the new setting did nothing at all.

The round of kills I did today (T153281#4703488) did not find any orphan jembot processes. It did find orphans from croptool, wsexport, iabot, and a few other tools that are known by me to leak occasionally.

Oooh! Ok. That might mean that as things resubmit and restart over time, the problem might actually be fixed.

Looking again at top8 -1months highest average for tools.*webgrid-lighttpd*.cpu.total.user

toolshighcpu.png (600×800 px, 72 KB)

It seems that is mostly fixed?

There are a couple of sgewebgrid instances with consistently high CPU, which might indicate stalled processes. But that might be due to a different reason.

I am not sure what got it fixed. Was it to send SIGTERM instead of SIGKILL?

Removing task assignee as there has not been a reply for two years.
Not sure if this is still an issue, as per last comment?