tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly
Open, High, Public

Description

Graphite quick check: Top 8 webgrid-lighttpd instances
by average CPU over one hour

It seems the faulty webgrid jobs have piled up. If one could kill the stuck /usr/bin/php-cgi processes owned by tools.jembot, that would be nice :]

Confirmed; I see tons of leaked processes via: clush -w @all 'sudo pidstat -U tools.jembot' | grep jem

Culled things with:

clush -w @all 'sudo /usr/bin/pkill --signal 9 -u tools.jembot'

A recent snapshot of activity shows 1146 restarts of the webservice in the last 7 calendar days.

The error.log data seems to indicate that the main lighttpd process is being killed on a regular basis for exceeding its memory limit:

2018-03-05 21:30:23: (log.c.166) server started
2018-03-05 21:40:21: (server.c.1558) server stopped by UID = 0 PID = 25145
2018-03-05 21:40:24: (log.c.166) server started
2018-03-05 21:50:20: (server.c.1558) server stopped by UID = 0 PID = 30322
2018-03-05 21:50:23: (log.c.166) server started
2018-03-05 22:00:21: (server.c.1558) server stopped by UID = 0 PID = 7331
2018-03-05 22:00:23: (log.c.166) server started
2018-03-05 22:10:20: (server.c.1558) server stopped by UID = 0 PID = 19091
2018-03-05 22:10:22: (log.c.166) server started
2018-03-05 22:20:20: (server.c.1558) server stopped by UID = 0 PID = 7240
2018-03-05 22:20:22: (log.c.166) server started
2018-03-05 22:30:21: (server.c.1558) server stopped by UID = 0 PID = 10593
2018-03-05 22:30:23: (log.c.166) server started
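The ~10-minute stop/start cadence above can be quantified by counting "server started" lines in error.log. A minimal sketch; the sample log here just reuses entries quoted in this task, so point LOG at the tool's real error.log for an actual count:

```shell
# Sketch: count lighttpd restarts by counting "server started" lines.
# The sample log reuses entries quoted above; replace LOG with the
# tool's actual error.log path for a real count.
LOG=/tmp/jembot-error.log
cat > "$LOG" <<'EOF'
2018-03-05 21:30:23: (log.c.166) server started
2018-03-05 21:40:21: (server.c.1558) server stopped by UID = 0 PID = 25145
2018-03-05 21:40:24: (log.c.166) server started
EOF
grep -c 'server started' "$LOG"
```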

See also:

bd808 renamed this task from tools.jembot spawns many php-cgi processes in busy loop (?) to tools.jembot crashes and leaves orphan php-cgi processes regularly.Mar 5 2018, 10:46 PM
bd808 triaged this task as High priority.
bd808 reassigned this task from valhallasw to -jem-.
bd808 updated the task description.
bd808 added subscribers: hashar, chasemp.
bd808 added a subscriber: bd808.Mar 5 2018, 10:51 PM

There seem to be a large number of things that this single tool does, but the one that looks most likely to cause memory issues is the creation of annotated image thumbnails that are shared on Facebook (e.g. https://tools.wmflabs.org/jembot/ef/pub/20180305/1918-Joan-Josep%20Tharrats.png).

hashar awarded a token.Mar 7 2018, 9:25 AM
bd808 added a comment.Mar 11 2018, 3:15 PM
$ clush -w @all 'sudo pidstat -U tools.jembot' | grep jem
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10915    6.72    0.00    0.00    6.72     0  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10916    6.72    0.00    0.00    6.72     0  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10918    6.72    0.00    0.00    6.72     2  php-cgi
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: 03:14:08 PM tools.jembot     10919    6.72    0.00    0.00    6.72     2  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9192    3.44    0.00    0.00    3.45     3  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9193    3.44    0.00    0.00    3.44     0  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9196    3.44    0.00    0.00    3.45     2  php-cgi
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot      9197    3.45    0.00    0.00    3.45     1  php-cgi
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20794    0.00    0.00    0.00    0.00     0  lighttpd
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20799    0.00    0.00    0.00    0.00     1  php-cgi
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: 03:14:09 PM tools.jembot     20802    0.00    0.00    0.00    0.00     1  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5072    3.14    0.00    0.00    3.14     2  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5073    3.13    0.00    0.00    3.14     3  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5075    3.14    0.00    0.00    3.14     0  php-cgi
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      5076    3.14    0.00    0.00    3.14     1  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3821    7.33    0.01    0.00    7.34     2  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3822    7.33    0.01    0.00    7.33     1  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3824    7.33    0.01    0.00    7.34     0  php-cgi
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: 03:14:10 PM tools.jembot      3825    7.33    0.01    0.00    7.34     3  php-cgi
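The leaked workers show up as php-cgi processes reparented to init (PPID 1) once the grid SIGKILLs their lighttpd parent. A hedged sketch of spotting them from ps-style output; the here-doc lines are made up to match the pidstat listing above, standing in for something like `ps -o pid,ppid,user,comm -U tools.jembot --no-headers`:

```shell
# Sketch: orphaned php-cgi workers have PPID 1 after their lighttpd
# parent is killed. The here-doc is sample data mimicking:
#   ps -o pid,ppid,user,comm -U tools.jembot --no-headers
awk '$2 == 1 && $4 == "php-cgi" { print $1 }' <<'EOF'
10915 1 tools.jembot php-cgi
10916 1 tools.jembot php-cgi
20794 887 tools.jembot lighttpd
20799 20794 tools.jembot php-cgi
EOF
```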
Andrew added a subscriber: Andrew.Sep 4 2018, 2:32 PM

This bot was just now gobbling up CPU throughout the cluster.

I stopped the webservice, then ran 'sudo cumin --force --timeout 500 -o json "project:tools" "/usr/bin/pkill --signal 9 -u tools.jembot"' and then restarted the webservice.

Things seem better, for now... but we need a better fix.

Mentioned in SAL (#wikimedia-cloud) [2018-09-04T14:33:51Z] <andrewbogott> restarted webservice, purged stray processes. Details on T132880

Bstorm added a subscriber: Bstorm.Sep 4 2018, 2:49 PM
hashar updated the task description.Sep 21 2018, 7:59 PM
hashar added a comment (edited).Sep 21 2018, 8:02 PM

I have added a link to the task description to check the busiest webgrid-lighttpd instance: https://graphite-labs.wikimedia.org/render/?width=648&height=396&_salt=1537559933.891&hideLegend=false&target=cactiStyle(highestAverage(tools.*webgrid-lighttpd*.cpu.total.user%2C8))&from=-1hours

There are again a bunch of blocked processes, including new ones (several tools.wsexport processes stuck for 30-90 days):

webgrid-lighttpd  days-h:m:s   cpu%  process           account
1402              9-01:50:02   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:50:38   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:51:59   96.0  /usr/bin/php-cgi  tools.jembot
1402              9-01:52:37   96.0  /usr/bin/php-cgi  tools.jembot
1406              9-17:54:33   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:55:01   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:57:15   59.8  /usr/bin/php-cgi  tools.jembot
1406              9-17:57:33   59.8  /usr/bin/php-cgi  tools.jembot
1407              56-11:11:57  99.2  /usr/bin/php-cgi  tools.wsexport
1410              55-00:29:48  99.7  /usr/bin/php-cgi  tools.wsexport
1411              7-01:03:16   83.9  /usr/bin/php-cgi  tools.jembot
1411              7-01:06:02   84.0  /usr/bin/php-cgi  tools.jembot
1411              7-01:06:35   84.0  /usr/bin/php-cgi  tools.jembot
1411              7-01:07:25   84.0  /usr/bin/php-cgi  tools.jembot
1412              9-20:31:50   99.6  /usr/bin/php-cgi  tools.wsexport
1413              83-16:52:07  78.1  /usr/bin/php-cgi  tools.blockcalc
1415              30-14:00:15  99.6  /usr/bin/php-cgi  tools.wsexport
1416              59-00:18:33  99.7  /usr/bin/php-cgi  tools.dupdet
1416              62-20:26:47  98.7  /usr/bin/php-cgi  tools.dupdet
1416              63-08:27:51  99.8  /usr/bin/php-cgi  tools.wsexport
1417              5-15:52:49   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:52:54   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:53:46   94.0  /usr/bin/php-cgi  tools.jembot
1417              5-15:55:16   94.0  /usr/bin/php-cgi  tools.jembot
1418              36-15:52:20  99.7  /usr/bin/php-cgi  tools.wsexport
1419              1-10:14:14   99.6  /usr/bin/php-cgi  tools.wsexport
1419              43-03:15:37  97.8  /usr/bin/php-cgi  tools.sourcemd
1420              2-18:55:39   52.7  /usr/bin/php-cgi  tools.jembot
1420              4-19:06:20   90.7  /usr/bin/php-cgi  tools.jembot
1420              4-20:17:48   91.6  /usr/bin/php-cgi  tools.jembot
1420              4-21:03:20   92.2  /usr/bin/php-cgi  tools.jembot
1421              46-10:47:46  99.9  /usr/bin/php-cgi  tools.wsexport
1422              59-07:32:06  99.8  /usr/bin/php-cgi  tools.wsexport
1422              59-07:33:50  99.8  /usr/bin/php-cgi  tools.wsexport
1425              45-21:33:24  80.0  /usr/bin/php-cgi  tools.dupdet
1426              83-11:51:28  99.4  /usr/bin/php-cgi  tools.spellcheck
1427              22-20:55:54  99.3  /usr/bin/php-cgi  tools.sourcemd
1427              52-01:36:00  54.1  /usr/bin/php-cgi  tools.tools-info
1428              5-21:03:37   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:04:34   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:04:53   85.7  /usr/bin/php-cgi  tools.jembot
1428              5-21:05:30   85.7  /usr/bin/php-cgi  tools.jembot

The new parent task (T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind) describes the general problem that leads to leaked processes on the job grid. It also includes workaround instructions for clearing the orphan processes to free up CPU.

bd808 renamed this task from tools.jembot crashes and leaves orphan php-cgi processes regularly to tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly.Sep 21 2018, 9:00 PM
bd808 added a comment.Sep 21 2018, 9:05 PM

@-jem- have you ever tried moving your webservice from the grid engine to our Kubernetes cluster? I'm wondering if Kubernetes would be better at cleaning up when the memory limit kills things. We have both PHP 5.6 and PHP 7.2 available there:
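For reference, the migration itself is normally run from the tool account; the following is a sketch assuming the Toolforge webservice CLI of that era (verify the exact flags with `webservice --help`):

```shell
# Assumed invocation: move the tool's webservice off grid engine onto
# Kubernetes with a PHP runtime (check `webservice --help` for the
# current backend and type names).
webservice stop
webservice --backend=kubernetes php5.6 start
```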

Mentioned in SAL (#wikimedia-cloud) [2018-10-25T20:14:31Z] <andrewbogott> stopping/starting service in hopes of cleaning up stray processes, re: T132880

I just now killed off all jembot processes and restarted again.

I did do this: https://phabricator.wikimedia.org/T153281#4687515
Strictly speaking, it should not happen after a certain point. If it continues happening after that change, maybe we'll have to create a new job-killer script. However, this should be a long-term fix.

I confess that I didn't have a super strong case for killing things just now; one of the labvirts was under strain and I saw several (maybe 4-5?) jembot processes running there and ran straight for the hatchet. It would be useful to know how many processes is a normal amount.

bd808 updated the task description.Oct 29 2018, 4:54 PM
bd808 added a comment.Oct 29 2018, 4:58 PM

This is really the same problem as T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind from the point of view of the job grid. It would be awesome if the tool could be updated/changed/tweaked to avoid hitting the job grid bug, but in the larger scope of things we need to fix the grid or add a cleanup system. It may be reasonable to delay digging really deeply into the root cause and a fix until we finish T199271: Upgrade the tools gridengine system and see if this problem persists.

So I take it the jobkill script isn't fixing this, then? It was definitely set to send a SIGKILL when anything went bad previously. Anything submitted before my change would likely still obey the old setting. I guess we just have to decide when we are convinced the new setting didn't do squat :)

Maybe fully resetting everything jembot-related?

If that was effectively done by @Andrew , and then @bd808 had to do a culling again after, then I'm convinced the new setting did nothing at all.

bd808 added a comment.Oct 29 2018, 5:22 PM

If that was effectively done by @Andrew , and then @bd808 had to do a culling again after, then I'm convinced the new setting did nothing at all.

The round of kills I did today (T153281#4703488) did not find any orphan jembot processes. It did find orphans from croptool, wsexport, iabot, and a few other tools that are known by me to leak occasionally.

Oooh! Ok. That might mean that as things resubmit and restart over time, the problem might actually be fixed.