Page MenuHomePhabricator

osm4wiki generating around 300 perl processes wherever it runs, which overloads the server for purposes of gridengine
Open, NormalPublic0 Story Points

Description

I noticed a webgrid node with a load average over 300 today, it seems that no matter where it gets scheduled, osm4wiki is running enough simultaneous perl processes that it takes over the node, basically.

Restarting the process didn't help. It seems to hang when I try to access to tool via the web, so I think it is malfunctioning somehow.

I did migrate the osmdb service from labsdb1006/7 to cloud internal services, but restarting the service didn't seem to help. I still cannot get a response from it, but it generates lots of load.

Please take a look and let me know if I can help.

Event Timeline

Bstorm triaged this task as Normal priority.Apr 5 2019, 12:39 AM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 5 2019, 12:39 AM
bd808 added a subscriber: bd808.Apr 5 2019, 1:15 AM

As far as I can tell each request to https://tools.wmflabs.org/osm4wiki/cgi-bin/wiki/wiki-osm.pl is spawning a new copy of the script to handle the request, but then somehow just hangs. strace doesn't show anything happening in any of the processes I checked.

Mentioned in SAL (#wikimedia-cloud) [2019-04-05T01:52:47Z] <bd808> Stopped webservice. Perl CGI processes are hanging and consuming all resources on grid node. (T220164)

bd808 added a comment.Apr 5 2019, 1:56 AM

After watching it drive the system load on tools-sgewebgrid-lighttpd-0912 up to 200 I shut down the webservice. As far as I can tell no users were ever getting results and application was actively breaking other things by consuming all of the grid exec node resources.

This latter part was much more concerning to me. Why is the grid losing track of the child cgi processes and letting them continue running? They all cleaned up as expected when webservice stop ran so they are not orphan or zombie processes.

As far as I know, the osm4wiki use as backend the project kmlexport, which is not running after server move to stretch . So deactivating is in my eyes no problem.
For clarification I'm not a developer of osm4wiki, aim of co-maintenance was to be able to restating a server if necessary. I have no access to project kmlexport and both maintainer have a break or are inactive. It's not so nice as osm4wiki is linked inside different articles in German Wikipedia.

Bstorm added a comment.Apr 5 2019, 5:41 PM

Thanks for the context, @Kolossos. You were the only easily-found person in phabricator who has access to the project.
Is the project effectively abandoned and in need of a maintainer? I see plenz is on there, but I am not sure who they are in Phabricator, if they are on here.

For me it seems so that kmlexport needs new maintainer. I wrote @Dvorapa a mail but he makes a break until July, if his user-page is correct. With osm4wiki it looks better but it doesn't work without kmlexport.

My userpage is correct, but still I can be reached for any operational issues of kmlexport. kmlexport should be actively maintained by me. Also in my experience, writing to Plenz on his home wiki or on Osm4wiki wiki page was usually successful, it usually takes time to get a response from him, but he should be still active.

I linked him this task on both pages I mentioned

Aklapper added a subscriber: Plenz.Apr 6 2019, 6:17 PM
Plenz added a comment.Apr 6 2019, 6:20 PM

I made some changes until Apr 4 22:17 (the currect date of the file "wiki-osm.pl").
Can somebody tell me whether the mentioned problems occurred after my changes or already earlier?

Plenz added a comment.Apr 7 2019, 6:14 AM

OK. Now I turned (hopefully) all my last changes into comments, and my program should work as before. Can somebody restart the service?

Or better: how can I start the service, how can I check the load it is causing and how can I stop it?

Plenz added a comment.Apr 7 2019, 9:35 PM

And by the way: if this problem can not be solved in a short time, I would like my program to produce a message "sorry, temporary out of service" instead of this ugly error message.

bd808 added a comment.Apr 8 2019, 6:01 AM

Or better: how can I start the service, how can I check the load it is causing and how can I stop it?

You can do it via ssh to login.tools.wmflabs.org:

$ ssh login.tools.wmflabs.org
$ become osm4wiki
$ webservice --backend=gridengine lighttpd start

To find which host the webservice ends up running on, you can either use https://tools.wmflabs.org/sge-status/ and search the page with your browser for "osm4wiki" or you can use qstat-full from the shell at login.tools.wmflabs.org which will output something similar to this:

$ qstat-full
1501415  0.27395  lighttpd-sge-status  tools.sge-status  r  2019-04-04T15:08:36  webgrid-lighttpd@tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs  1

This particular output shows the tool I inspected running on the host tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs. You can then ssh from login.tools directly to the grid host as your tool account:

$ ssh tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs
Linux tools-sgewebgrid-lighttpd-0907 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64
Debian GNU/Linux 9.8 (stretch)
tools-sgewebgrid-lighttpd-0907 is a Toolforge lighttpd web exec node (role::wmcs::toolforge::grid::web::lighttpd)

     #######     This is an execution host of the Toolforge grid engine.
   ###     ###
  ##     ### ##  As a rule, and unless you are actively debugging a tool
  ##   ###   ##  that is currently running here, you probably should not
  ## ###     ##  be connecting to this server directly.
   ###     ###
     #######     Do not run your tools manually here!

The last Puppet run was at Mon Apr  8 02:41:05 UTC 2019 (24 minutes ago).
$

The command w will show you current system load and who else is logged into that server:

$ w
 03:05:56 up 21 days,  9:10,  2 users,  load average: 0.00, 0.02, 0.04
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     ttyS1    -                17Mar19 21days  0.04s  0.02s -bash
tools.sg pts/0    tools-sgebastion 03:05    1.00s  0.03s  0.00s w

The command ps uwwU $USER will show you all the processes on the grid engine exec host running as your tool:

$ ps uwwU $USER
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.s+  6970  0.0  0.0  61632  6568 ?        Ss   Apr04   0:12 /usr/sbin/lighttpd -f /var/run/lighttpd/sge-status -D
tools.s+  6980  0.0  0.4 525720 34152 ?        Ss   Apr04   0:00 /usr/bin/php-cgi
tools.s+  6981  0.0  0.6 560936 55848 ?        S    Apr04   0:01 /usr/bin/php-cgi
tools.s+  6982  0.0  0.6 556376 49576 ?        S    Apr04   0:01 /usr/bin/php-cgi
tools.s+  6983  0.0  0.4 525720 34112 ?        Ss   Apr04   0:00 /usr/bin/php-cgi
tools.s+  6984  0.0  0.6 560780 55788 ?        S    Apr04   2:53 /usr/bin/php-cgi
tools.s+  6985  0.0  0.7 564816 58136 ?        S    Apr04   2:55 /usr/bin/php-cgi
tools.s+ 25422  0.0  0.0 156928  6008 ?        Ss   03:05   0:00 /lib/systemd/systemd --user
tools.s+ 25423  0.0  0.0 435340  2076 ?        S    03:05   0:00 (sd-pam)
tools.s+ 25440  0.0  0.0 301104  4580 ?        R    03:05   0:00 sshd: tools.sge-status@pts/0
tools.s+ 25441  0.0  0.0 131616  5276 pts/0    Ss   03:05   0:00 -bash
tools.s+ 25941  0.0  0.0 148440  1568 pts/0    R+   03:10   0:00 ps uwwU tools.sge-status

The example above is pretty typical for a PHP webservice running on the grid. It shows a lighttpd process (the webservice itself) and 6 php-cgi processes (the fastcgi worker pool for lighttpd to hand off '*.php' requests to). For osm4wiki you should expect to see a few perl processes running instead of the php processes.

If you see a high system load and a lot of perl processes, that is the symptom we also saw before we turned the webservice off. You can turn it off yourself from the login.tools.wmflabs.org server:

$ ssh login.tools.wmflabs.org
$ become osm4wiki
$ webservice stop

And by the way: if this problem can not be solved in a short time, I would like my program to produce a message "sorry, temporary out of service" instead of this ugly error message.

To do this you will need to configure the tools webservice process to show a static page for all requests and add you message there. That should be possible by doing something like:

$ ssh login.tools.wmflabs.org
$ become osm4wiki
$ edit $HOME/.lighttpd.conf
# Contents of config file below
$ edit $HOME/public_html/offline.html
# This will be the static html page shown for all requests.
# Do whatever you would like.
$ webservice --backend=kubernetes php7.2 start
$HOME/.lighttpd.conf
# Respond to all requests with the contents of $HOME/public_html/offline.html
url.rewrite-once += ( "(.*)" => "/osm4wiki/offline.html" )
Plenz added a comment.Apr 9 2019, 8:24 PM

First a big THANKYOU for this long description!!!

I did not change $HOME/.lighttpd.conf, I simply changed my program and restarted the service, now it prints a "sorry" line and ends.

Before that, I copied the whole folder "wiki" to a new folder "work" and made some tests on given parameters.

I found a very strange behaviour:

The link https://tools.wmflabs.org/osm4wiki/cgi-bin/work/wiki-osm.pl?kml=x%26sektion%3DK returns "Parameter: x&sektion=K"
The link https://tools.wmflabs.org/osm4wiki/cgi-bin/work/wiki-osm.pl?kml=x%26section%3DK returns "Parameter: x§ion=K"

It is only "sect" which disappears and turns the & into §. All other combinations of letters stay correct.

My code is quite simple an can IMHO not be the reason:

$link = $cgi->param('kml');
print "Parameter: $link ";

Any explanation?

&sect (or better &sect;) is an HTML entity for § character. Perhaps Perl processes this automatically recursively, unless it is told not to do so?

Plenz added a comment.Apr 9 2019, 10:36 PM

This must be new. And it should not work without a semicolon. Anyway, my program worked all the years with "&section=..."

I found a very strange behaviour:
The link https://tools.wmflabs.org/osm4wiki/cgi-bin/work/wiki-osm.pl?kml=x%26sektion%3DK returns "Parameter: x&sektion=K"
The link https://tools.wmflabs.org/osm4wiki/cgi-bin/work/wiki-osm.pl?kml=x%26section%3DK returns "Parameter: x§ion=K"
It is only "sect" which disappears and turns the & into §. All other combinations of letters stay correct.
My code is quite simple an can IMHO not be the reason:

$link = $cgi->param('kml');
print "Parameter: $link ";

Any explanation?

This sounds like the application is applying HTML entity expansion to the query string rather than applying URL decoding. I'm having a bit of trouble following the query string handling logic generally. It appears to me that the script uses a mixture of calls to CGI->param() and manual processing of CGI->url(-query => 1) output to extract query params. There are not any comments in the code that help me understand why this would be desired.

The presence of %26 in the query string itself looks like problem of improperly applied URL encoding as well (& being encoded so that is treated as a literal ampersand during the parsing phase rather than as the query string pair separator token). This may obviously be by design to work around some other issues, but it is not obvious.

This must be new. And it should not work without a semicolon. Anyway, my program worked all the years with "&section=..."

The switch from the Trusty job grid to the Stretch job grid changed the Perl from 5.18.2 to 5.24.1 which does not seem like a large jump. It looks like some of the Perl modules may have change quite a bit however. See https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Language_runtime_and_library_versions for more details.

Dvorapa added a comment.EditedApr 10 2019, 9:34 AM

From my experiences with kmlexport:

  • Be sure you use correct method param or multi_param for your task. There are slight differences, which are sometimes crucial
  • Try using uri_escape and uri_unescape from URI::Escape or perhaps uri_escape_utf8 from CGI, they could help
Plenz added a comment.Apr 14 2019, 3:46 PM

I can not see any mistake in my program. Anyway, if my program calls kmlexport with "section" as first parameter, the problem with &sect is avoided.

But there seems to be a UTF8 problem in kmlexport. Try this link and save the provided kml file. Quite near the end you will see
"Gro%C3%83%C2%9Fe" which should be "Große". The German letter "ß" is coded in UTF8 with U+00C3 U+009F, but kmlexport codes both letters again: %C3%83 and %C2%9F.

Can you please check this?

This comment was removed by Dvorapa.

@Plenz It should be fixed in kmlexport.

Plenz added a comment.Apr 14 2019, 9:56 PM

Thanks Dvorapa, now it looks OK.

Today I made some changes about handling of URLs and made the program run again.

I did not yet check how many processes are now generated, and there is still a minor bug regarding coordinates from linked articles. If something is still seriously wrong, please delete wiki-osm.pl and rename wiki-osm.pl.error to wiki-osm.pl to make the error message visible again.

Plenz added a comment.Apr 15 2019, 6:01 PM

Hi Dvorapa,

there is something else with kmlexport. If it is invoked with the parameter "linksfrom", the link texts have a "left arrow" on the end. Try this link and you get (see line 10):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document>
	<name><![CDATA[1572]]></name>
	<open>1</open>
	<Folder>
		<name><![CDATA[Großbritannien/Spanisches Weltreich]]></name>
		<open>1</open>
		<Placemark>
			<name><![CDATA[Karibik ←]]></name>
			<Point>
				<coordinates>-75.82,14.53,0</coordinates>
			</Point>
			<Snippet></Snippet>
[rest deleted]

Bug or feature?

Hi, it is directly in kmlexport code, so it seems it is a feature:

printplacemark((latitude => $linksfrom{$name}[0], longitude => $linksfrom{$name}[1], name => "$name ←", source => $name));
Plenz added a comment.Apr 16 2019, 8:48 PM

Funny... I see no reason for this. Anyway, my program checks whether a name ends with this arrow, and if so, it deletes the arrow.

IMHO everything is now running fine. Thank you very much for good collaboration!

Dvorapa added a subscriber: Para.Apr 16 2019, 9:17 PM

The arrow was added perhaps by @Para. Perhaps to indicate this is not the coordinate from the original article, but from the linked?

No problem, if any other issue with kmlexport occurs in osm4wiki, feel free to write me again!

The arrow looks weird also in Bing and Windy:
https://tools.wmflabs.org/bing-maps/?kml=https%3A%2F%2Ftools.wmflabs.org%2Fkmlexport%2F%3Farticle%3D1571%26project%3Dde%26linksfrom%3D1
https://tools.wmflabs.org/mapycz/?kml=https%3A%2F%2Ftools.wmflabs.org%2Fkmlexport%2F%3Farticle%3D1571%26project%3Dde%26linksfrom%3D1
What could be the reason @Para introduced this arrow into the code except it makes clear it is linked article (except the same job does also the source)?

Dvorapa added a comment.EditedJun 13 2019, 10:39 AM

Hi, what's up? The ?kml= parameter is still not working (see OSM in comparison to Bing) and the encoding still has the issues (see OSM in comparison to Bing). The page is weirdly shown in western encoding instead of UTF-8 and if I manually switch to UTF-8 in my browser, I get broken some letters anyway. Are you sure you saved the code in correct encoding?

Man77 awarded a token.Jun 18 2019, 5:25 PM
Man77 added a subscriber: Man77.