
tools-webgrid-lighttpd have ~ 90 procs stuck at 100% CPU time (mostly tools.jembot)
Closed, Duplicate | Public

Description

TLDR workaround

Here are the clush commands @bd808 has been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:

find-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "'
kill-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "|awk "{print \$3}"|xargs sudo kill -9'

While looking at the CPU usage of instances on labs, I found that the tools-webgrid-lighttpd-* instances report high user CPU usage. It seems to leak over time, as shown over five months (CPU usage is expressed as the number of CPUs kept busy):

lighttpd_cpu_usage.png (504×711 px, 89 KB)

Over a week:

lighttpd_cpu_usage_one_week.png (504×711 px, 108 KB)

It seems we have the equivalent of 90 CPUs stuck at 100%.

View of CPU usage for tools-webgrid-lighttpd-* instances.

Based on:

# List processes with their cputime, percent CPU, command, and username
# Keep only processes with a cputime of 1 day or more and a current CPU usage of at least 10%
cmd="ps -e -o cputime,pcpu,cmd,user|grep -P '^\d+-.* \d\d\.'|sort -n|sed -e 's%\(^\|\s\+\)% | %g'"
for i in $(seq 1401 1428); do
    ssh "tools-webgrid-lighttpd-$i.tools.eqiad.wmflabs" "$cmd" | sed -e "s%^%| $i %"
done

The top offender seems to be tools.jembot. As of January 4th, 2018:

| Instance | Cumulative CPU time (days-hh:mm:ss) | Utilization (%) | Command | User
| 1401 | 20-14:46:57 | 99.2 | /usr/bin/php-cgi | tools.jembot
| 1401 | 178-13:37:34 | 95.5 | /usr/bin/php-cgi | tools.blockcalc
| 1404 | 20-12:40:45 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1404 | 20-18:45:12 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1404 | 170-12:48:17 | 91.2 | /usr/bin/php-cgi | tools.wam
| 1405 | 20-22:20:24 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1405 | 174-15:13:09 | 93.4 | /usr/bin/php-cgi | tools.dupdet
| 1407 | 20-11:46:07 | 99.7 | /usr/bin/php-cgi | tools.jembot
| 1407 | 21-02:11:31 | 99.6 | /usr/bin/php-cgi | tools.jembot
| 1407 | 157-10:17:17 | 99.4 | /usr/bin/php-cgi | tools.wsexport
| 1408 | 65-23:01:42 | 90.2 | /usr/bin/php-cgi | tools.spellcheck
| 1408 | 69-02:01:07 | 90.5 | /usr/bin/php-cgi | tools.spellcheck
| 1408 | 77-08:11:17 | 98.3 | /usr/bin/php-cgi | tools.pinyin-wiki
| 1408 | 79-03:28:27 | 96.5 | /usr/bin/php-cgi | tools.spellcheck
| 1409 | 43-06:46:25 | 72.3 | /usr/bin/php-cgi | tools.jembot
| 1409 | 43-06:47:33 | 72.3 | /usr/bin/php-cgi | tools.jembot
| 1409 | 43-06:49:12 | 72.3 | /usr/bin/php-cgi | tools.jembot
| 1409 | 43-06:50:14 | 72.3 | /usr/bin/php-cgi | tools.jembot
| 1410 | 7-05:27:30 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1410 | 7-05:27:32 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1410 | 7-05:27:41 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1410 | 7-05:27:59 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1410 | 20-13:12:38 | 98.3 | /usr/bin/php-cgi | tools.jembot
| 1410 | 185-05:23:07 | 99.1 | /usr/bin/php-cgi | tools.supercount
| 1411 | 7-12:04:11 | 98.9 | /usr/bin/php-cgi | tools.dupdet
| 1411 | 21-00:40:14 | 99.9 | /usr/bin/php-cgi | tools.jembot
| 1411 | 39-01:19:03 | 98.1 | /usr/bin/php-cgi | tools.dupdet
| 1412 | 117-15:40:47 | 79.5 | /usr/bin/php-cgi | tools.sowhy
| 1412 | 117-16:36:43 | 79.5 | /usr/bin/php-cgi | tools.sowhy
| 1413 | 20-11:19:09 | 99.4 | /usr/bin/php-cgi | tools.jembot
| 1413 | 99-16:41:22 | 98.0 | /usr/bin/php-cgi | tools.croptool
| 1413 | 101-22:23:08 | 98.6 | /usr/bin/php-cgi | tools.wsexport
| 1414 | 6-01:45:25 | 49.1 | /usr/bin/php-cgi | tools.jembot
| 1414 | 6-01:46:21 | 49.1 | /usr/bin/php-cgi | tools.jembot
| 1414 | 6-01:47:22 | 49.1 | /usr/bin/php-cgi | tools.jembot
| 1414 | 6-01:47:32 | 49.1 | /usr/bin/php-cgi | tools.jembot
| 1414 | 20-14:59:49 | 98.9 | /usr/bin/php-cgi | tools.jembot
| 1414 | 20-23:36:58 | 98.9 | /usr/bin/php-cgi | tools.jembot
| 1415 | 12-06:45:37 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1415 | 110-05:29:54 | 95.8 | /usr/bin/php-cgi | tools.dupdet
| 1415 | 137-08:25:28 | 99.2 | /usr/bin/php-cgi | tools.wsexport
| 1416 | 20-09:10:48 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1416 | 20-09:12:58 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1416 | 41-10:19:46 | 90.5 | /usr/bin/php-cgi | tools.wsexport
| 1416 | 111-15:27:57 | 96.1 | /usr/bin/php-cgi | tools.jembot
| 1417 | 10-14:06:20 | 71.9 | /usr/bin/php-cgi | tools.jembot
| 1417 | 10-14:11:09 | 71.9 | /usr/bin/php-cgi | tools.jembot
| 1417 | 10-14:25:11 | 72.0 | /usr/bin/php-cgi | tools.jembot
| 1417 | 10-14:38:52 | 72.1 | /usr/bin/php-cgi | tools.jembot
| 1417 | 104-16:14:40 | 96.0 | /usr/bin/php-cgi | tools.wsexport
| 1418 | 26-22:28:14 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1418 | 26-22:29:15 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1418 | 26-22:30:01 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1418 | 26-22:30:06 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1419 | 20-14:39:46 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1419 | 21-03:42:07 | 99.7 | /usr/bin/php-cgi | tools.jembot
| 1420 | 20-14:47:52 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1420 | 20-14:54:44 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1420 | 21-03:49:59 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1421 | 22-09:58:37 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1421 | 22-10:00:35 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1421 | 22-10:03:00 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1421 | 22-10:03:32 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1422 | 25-05:58:24 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1422 | 25-06:00:02 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1422 | 25-06:00:47 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1422 | 25-06:02:09 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1425 | 23-21:45:04 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1425 | 23-21:46:35 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1425 | 23-21:46:58 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1425 | 23-21:48:50 | 98.4 | /usr/bin/php-cgi | tools.jembot
| 1426 | 24-18:01:55 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1426 | 24-18:02:38 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1426 | 24-18:03:03 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1426 | 24-18:03:22 | 98.8 | /usr/bin/php-cgi | tools.jembot
| 1427 | 24-18:30:27 | 99.6 | /usr/bin/php-cgi | tools.jembot
| 1427 | 24-18:31:11 | 99.6 | /usr/bin/php-cgi | tools.jembot
| 1427 | 24-19:33:26 | 99.8 | /usr/bin/php-cgi | tools.jembot
| 1428 | 35-13:33:16 | 93.3 | /usr/bin/php-cgi | tools.iabot
| 1428 | 134-02:15:20 | 97.8 | /usr/bin/php-cgi | tools.wsexport

Event Timeline

@-jem- if you get some time available, can you check tools.jembot on Toolforge? It seems to have a bunch of processes stuck at 100% CPU usage for quite a long time :]

I stopped the webservice running under jembot for now, as I'm unsure whether this is an issue with the tool itself, but it had indeed leaked processes all over the webgrid nodes. I then purged the processes running as the tools.jembot user.

Chase killed off all of jem's processes, and CPU usage is plummeting.

jembotdeath.png (764×2 px, 173 KB)

Awesome. Running the script I pasted earlier, there are far fewer rogue processes left:

| Instance | Cumulative CPU time (days-hh:mm:ss) | Utilization (%) | Command | User
| 1401 | 150-17:59:56 | 94.9 | /usr/bin/php-cgi | tools.blockcalc
| 1404 | 142-11:53:44 | 89.7 | /usr/bin/php-cgi | tools.wam
| 1405 | 146-13:55:05 | 92.3 | /usr/bin/php-cgi | tools.dupdet
| 1407 | 129-10:19:01 | 99.4 | /usr/bin/php-cgi | tools.wsexport
| 1408 | 38-14:04:16 | 85.7 | /usr/bin/php-cgi | tools.spellcheck
| 1408 | 41-17:03:19 | 86.4 | /usr/bin/php-cgi | tools.spellcheck
| 1408 | 49-06:42:05 | 97.4 | /usr/bin/php-cgi | tools.pinyin-wiki
| 1408 | 51-01:59:18 | 94.7 | /usr/bin/php-cgi | tools.spellcheck
| 1409 | 15-11:50:01 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1409 | 15-11:51:30 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1409 | 15-11:51:41 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1409 | 15-11:51:59 | 48.8 | /usr/bin/php-cgi | tools.jembot
| 1410 | 157-11:49:45 | 99.1 | /usr/bin/php-cgi | tools.supercount
| 1411 | 10-23:45:03 | 93.9 | /usr/bin/php-cgi | tools.dupdet
| 1412 | 89-14:12:51 | 74.7 | /usr/bin/php-cgi | tools.sowhy
| 1412 | 89-15:08:37 | 74.7 | /usr/bin/php-cgi | tools.sowhy
| 1413 | 71-18:12:53 | 97.5 | /usr/bin/php-cgi | tools.croptool
| 1413 | 73-23:54:30 | 98.3 | /usr/bin/php-cgi | tools.wsexport
| 1415 | 82-04:02:55 | 94.5 | /usr/bin/php-cgi | tools.dupdet
| 1415 | 109-06:58:47 | 99.1 | /usr/bin/php-cgi | tools.wsexport
| 1416 | 13-13:51:52 | 76.8 | /usr/bin/php-cgi | tools.wsexport
| 1416 | 83-19:04:48 | 95.1 | /usr/bin/php-cgi | tools.jembot
| 1417 | 77-04:03:09 | 95.4 | /usr/bin/php-cgi | tools.wsexport
| 1428 | 7-14:07:06 | 75.8 | /usr/bin/php-cgi | tools.iabot
| 1428 | 106-02:49:12 | 97.3 | /usr/bin/php-cgi | tools.wsexport

This was probably not the fault of the tool. We have seen this behavior from webservice occasionally in the past. For currently undiagnosed reasons, the internal watchdog process that checks whether a webservice job is running for a given tool can lose track of the running job(s) and begin spawning a new job on each pass through the status reconciliation loop. The "fix" has historically been to (see the rough shell sketch after this list):

  • Stop the webservice using webservice stop
  • Delete the tool's $HOME/service.manifest file
  • Wait a small amount of time for the watchdog process to notice this and stop trying to manage the job
  • Use qdel to stop all of the running jobs for the tool
  • Look across the job grid for "orphaned" processes owned by the tool and kill them
  • Restart the tool's webservice
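A rough shell sketch of those steps, assuming a hypothetical tool named exampletool and the standard Toolforge paths; the exact job id and wait time will vary:

$ become exampletool            # switch to the (hypothetical) tool account
$ webservice stop               # 1. stop the webservice
$ rm $HOME/service.manifest     # 2. delete the manifest the watchdog reads
$ sleep 600                     # 3. give the watchdog time to notice and back off
$ qstat                         # 4. list jobs still registered for the tool...
$ qdel <jobid>                  #    ...and qdel each one that remains
# 5. (admins) look across the exec/webgrid nodes for orphaned processes owned
#    by the tool and kill them; see the clush one-liners in the description
$ webservice start              # 6. restart the tool's webservice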

I'm sorry for the inconvenience and for the late answer; I was delayed in reading the mail. I don't know if my code is to blame, as I haven't made any recent changes, but I'll pay more attention in the future. Also, if needed, you can reach me much faster as jem on IRC; I have a permanent connection. Thanks for the intervention.

@-jem- it looks like an issue with the webservice system, though your bot definitely exacerbates it.

I went around during my evening hours last night and killed a bunch of processes on the grid that were owned by tools but had pid 1 as their parent process. This seems to be a reliable way to detect processes that have leaked out of Grid Engine's control. The orphan processes I killed belonged to jembot, iabot, dupdet, and wsexport. I also found that xlinkbot, linkwatcher, and coibot seem to create pid-1-parented processes by design somehow.
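On a single node, the same check can be done directly with ps and awk; a minimal sketch (the tools. prefix is just a convenient way to match tool accounts):

$ ps axwo user:20,ppid,pid,cmd | awk '$2 == 1 && $1 ~ /^tools\./'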


Just now I ran the command I dreamed up to search for these and found four new orphaned /usr/bin/php-cgi processes owned by jembot. The information I got from qstat -xml as tools.jembot showed that the grid job had been started recently, and the logged job was running on a different grid node. Looking at https://tools.wmflabs.org/grid-jobs/tool/jembot for the history, I see that the lighttpd-jembot process has been submitted to the grid 1010 times in the past 7 days. That is a lot of crashing for a PHP webservice.

I went poking around in the tool's $HOME to see if there was a sign of why. The $HOME/service.log did not show a pattern of many restarts. I did find a $HOME/.bigbrotherrc~ file showing that at some point bigbrother had been set up to run webservice. This won't work on a number of levels, but I think bigbrother would actually complain about that not being a watchable job. I checked /data/project/.system/bigbrother.scoreboard and do not see jembot listed as a watched job.

I then stopped the webservice, removed the service.manifest and service.log files, and started the webservice back up. I'll try to remember to check in again in a day or so and see if the pattern of restarts and orphan processes reappears or not.
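For reference, the individual checks described above reduce to a few commands run as the tool; a sketch (the grep patterns assume the log formats quoted elsewhere in this task):

$ grep -c 'attempting to start' $HOME/service.log          # watchdog restart attempts logged
$ grep jembot /data/project/.system/bigbrother.scoreboard  # is bigbrother watching the tool?
$ qstat -xml | grep -c 'lighttpd-jembot'                   # webservice jobs the grid currently knows about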

Maybe we should take a look at https://arc.liv.ac.uk/SGE/howto/remove_orphaned_processes.html and see if any of the techniques for automatically killing orphan processes described there will work on our grid.

I'll try to remember to check in again in a day or so and see if the pattern of restarts and orphan processes reappears or not.

There are another 4 orphan /usr/bin/php-cgi processes started by tools.jembot on tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs. The webservice is currently running on tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs.

tools.jembot@tools-bastion-02:~$ tail service.log
2018-02-10T09:00:08.001409 No running webservice job found, attempting to start it
2018-02-10T11:20:07.390117 No running webservice job found, attempting to start it
2018-02-10T12:10:07.528442 No running webservice job found, attempting to start it
2018-02-10T12:30:07.276491 No running webservice job found, attempting to start it
2018-02-10T12:40:08.987820 No running webservice job found, attempting to start it
2018-02-10T17:40:06.110219 No running webservice job found, attempting to start it
2018-02-10T17:40:23.206703 Timed out attempting to start webservice (15s)
2018-02-10T19:40:10.160176 No running webservice job found, attempting to start it
2018-02-11T01:00:07.525414 No running webservice job found, attempting to start it
2018-02-11T03:40:08.663015 No running webservice job found, attempting to start it

It looks like jembot's webservice continues to die on a regular basis, and sometimes when it dies grid engine / Linux does not properly clean up the PHP runners that were started by lighttpd.

The error.log file contains 74,943 lines like:

2018-02-11 01:40:06: (server.c.1558) server stopped by UID = 0 PID = 22028

These go back to 2016-10-17.
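One way to reproduce those numbers from the tool's $HOME (a sketch, assuming the lighttpd "server stopped" log format shown above):

$ grep -c 'server stopped by UID = 0' $HOME/error.log   # count of "server stopped" lines
$ head -n 1 $HOME/error.log                             # oldest entry, i.e. how far back the log goes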

It seems the faulty webgrid jobs have piled up. If someone could kill the stuck /usr/bin/php-cgi processes owned by tools.jembot, that would be nice :]

It seems the faulty webgrid jobs have piled up. If someone could kill the stuck /usr/bin/php-cgi processes owned by tools.jembot, that would be nice :]

Confirmed, I see tons of leakage via clush -w @all 'sudo pidstat -U tools.jembot' | grep jem

Culled things with:

clush -w @all 'sudo /usr/bin/pkill --signal 9 -u tools.jembot'

Awesome. Looks like most of them happened to be scheduled on labvirt1013, which now shows a nice drop in CPU/load.

In T192736, @Nemo_bis wrote:

6 instances and counting are listed on https://grafana-labs.wikimedia.org/dashboard/db/top-instances?orgId=1 as 100 % user CPU, seemingly for tools.jem processes (maybe jembot):

top - 16:34:23 up 95 days, 19:50,  1 user,  load average: 4.01, 4.11, 4.12
%Cpu(s): 36.2 us,  0.4 sy,  0.0 ni, 63.3 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14532 tools.j+  20   0  373064  19424   8820 R 100.0  0.2  34036:27 php-cgi
14533 tools.j+  20   0  371348  12708   3940 R  98.8  0.2  34034:45 php-cgi
14536 tools.j+  20   0  371508  18384   9148 R  98.8  0.2  34034:48 php-cgi
14535 tools.j+  20   0  372328  14824   6092 R  92.7  0.2  34035:35 php-cgi

Are those tasks doing something useful or are they just stuck?

Instance is the tools-webgrid-lighttpd-* instance number:

| Instance | Cumulative CPU time (days-hh:mm:ss) | Utilization (%) | Command | User
| 1417 | 3-23:30:00 | 95.8 | /usr/bin/php-cgi | tools.jembot
| 1417 | 3-23:30:29 | 95.8 | /usr/bin/php-cgi | tools.jembot
| 1417 | 3-23:30:31 | 95.8 | /usr/bin/php-cgi | tools.jembot
| 1417 | 3-23:32:53 | 95.9 | /usr/bin/php-cgi | tools.jembot
| 1420 | 10-14:23:02 | 95.0 | /usr/bin/php-cgi | tools.jembot
| 1420 | 10-14:23:10 | 95.0 | /usr/bin/php-cgi | tools.jembot
| 1420 | 10-14:23:43 | 95.0 | /usr/bin/php-cgi | tools.jembot
| 1420 | 10-14:25:20 | 95.0 | /usr/bin/php-cgi | tools.jembot
| 1425 | 5-22:25:51 | 96.4 | /usr/bin/php-cgi | tools.jembot
| 1425 | 5-22:26:53 | 96.4 | /usr/bin/php-cgi | tools.jembot
| 1425 | 5-22:27:46 | 96.5 | /usr/bin/php-cgi | tools.jembot
| 1425 | 5-22:28:26 | 96.5 | /usr/bin/php-cgi | tools.jembot

I also noticed a few stalled processes owned by tools.dupdet:

| Instance | Cumulative CPU time (days-hh:mm:ss) | Utilization (%) | Command | User
| 1411 | 5-20:49:29 | 42.7 | /usr/bin/php-cgi | tools.dupdet
| 1411 | 7-16:31:17 | 55.9 | /usr/bin/php-cgi | tools.dupdet
| 1411 | 10-17:41:05 | 78.1 | /usr/bin/php-cgi | tools.dupdet
| 1411 | 12-05:23:05 | 89.0 | /usr/bin/php-cgi | tools.dupdet
| 1422 | 5-18:13:19 | 99.3 | /usr/bin/php-cgi | tools.dupdet

Mentioned in SAL (#wikimedia-cloud) [2018-06-20T15:09:06Z] <bd808> Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool

Here are the clush commands I have been using to first check for and then kill processes that have leaked out of grid engine due to some cleanup failure:

find-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "'
kill-orphans
$ clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)"|grep -v perl|grep -E "     1 "|awk "{print \$3}"|xargs sudo kill -9'
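Broken out with comments, the find-orphans pipeline does the following on each node; this is just an expanded restatement of the one-liner above, not a new check:

#!/bin/bash
# Expanded restatement of the find-orphans one-liner, for one node:
#  - ps axwo user:20,ppid,pid,cmd : every process with owner, PPID, PID, command
#  - first grep                   : drop rows owned by system accounts (the whitelist)
#  - grep -v perl                 : skip the perl daemons that parent to pid 1 by design
#  - last grep                    : keep rows whose PPID column is exactly 1 (orphans)
# The kill variant appends: awk "{print \$3}" | xargs sudo kill -9
ps axwo user:20,ppid,pid,cmd \
  | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data|sgeadmin)" \
  | grep -v perl \
  | grep -E "     1 "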

As the author of Croptool, I see that the tool hangs from time to time and I don't really understand why. If a user requests a very large image to be cropped, it will naturally cause high CPU usage for some time, but I would expect the PHP process to eventually be killed or time out. Let me know if there are settings I should try changing!

Earlier there was the T104799 problem, but I'm not sure whether that's still an issue.

(Btw. I looked into using Kubernetes instead, but don't think I can use custom images yet: https://github.com/danmichaelo/croptool/issues/106)

As the author of Croptool, I see that the tool hangs from time to time and I don't really understand why. If a user requests a very large image to be cropped, it will naturally cause high CPU usage for some time, but I would expect the PHP process to eventually be killed or time out. Let me know if there are settings I should try changing!

What I think I am seeing on the job grid from a few tools is the PHP FastCGI server (/usr/bin/php-cgi) launched by lighttpd leaking child processes. The commands shown in T182070#4303107 look for running processes with a parent process id of 1. This is the hallmark of an orphan process, where the parent process has ended before the child. My hunch is that this is more likely to happen to tools whose PHP scripts are doing large or slow IO operations. PHP is not well known for good interrupt handling, which makes me suspect that the orphan process is in a polling loop of some kind, waiting to send or receive data, and did not respond to the SIGHUP sent by the parent process as it closed.
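The reparenting behaviour the ppid filter keys on is easy to demonstrate on any Linux host with no process subreaper configured (a generic illustration, not specific to the grid):

$ bash -c 'sleep 600 &'              # the parent shell exits immediately, the child keeps running
$ ps -o user,ppid,pid,cmd -C sleep   # the orphaned sleep now reports PPID 1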

When I have looked into this more closely in the past, I typically found that the webservice for the tool that had left orphans was not actively running on that same exec node in the job grid. This in turn makes me suspect that grid engine itself is part of the problem. A likely scenario goes something like this (a qstat check for the memory-limit step is sketched after the list):

  • grid spawns a lighttpd process as the webservice for tool X
  • tool X does some work and eventually exceeds the memory limit that the grid granted it
  • the grid notices the memory limit violation and terminates the lighttpd process
  • one or more PHP fcgi processes are busy and do not respond to the signal sent by lighttpd when the grid kills it
  • processes leak and the grid engine scheduler is unaware of them in its accounting to determine if more work can be sent to that grid exec node
  • the grid node gets overloaded and everything there suffers as a result

There are some things we might try to tune in the grid engine deployment itself to make it more aggressive about terminating child processes when shutting down a job. However, I have been hoping that we can hold off on digging too far into this until we have deployed a newer version of grid engine. In the next 9 months we will be updating to "Son of Grid Engine" as packaged by the Debian upstream, and this grid engine derivative has some improvements in how it tracks and manages job processes that may make this problem either go away or at least become less likely to occur.

(Btw. I looked into using Kubernetes instead, but don't think I can use custom images yet: https://github.com/danmichaelo/croptool/issues/106)

You are correct. We still do not have a mechanism to add packages to one of our Kubernetes images for a specific tool (for example, an apt-get install ... when first started), nor do we have a way for tools to upload their own custom Docker containers.

Mentioned in SAL (#wikimedia-cloud) [2018-06-29T16:46:08Z] <bd808> Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070