Rise in "parent, LightProcess exiting" console spam
Closed, Resolved, Public

Description

From IRC:

2016-01-27 19:02:40	<marxarelli>	i do see loads on "parent, LightProcess exiting" on flourine but jynus (or someone), this is a known issue, right?
2016-01-27 19:03:17	<marxarelli>	Krenair: ^ ?
2016-01-27 19:03:29	<Krenair>	is it from mw1019 marxarelli?
2016-01-27 19:03:36	<jynus>	no, it happening on mira is known
2016-01-27 19:03:43	<jynus>	the other is 19 or something else
2016-01-27 19:04:15	<marxarelli>	ah, yes. it's just 19
2016-01-27 19:04:39	<Krenair>	yes, known
2016-01-27 19:05:00	<greg-g>	Krenair: jynus: known and OK I presume? :) Also, is there a task for it?
2016-01-27 19:05:11	<marxarelli>	alright then, will promote group1 shortly
2016-01-27 19:05:55	<jynus>	Krenair, not ok, but not causing issues
2016-01-27 19:06:00	<jynus>	^greg
2016-01-27 19:06:02	<Krenair>	greg-g, I asked the same thing earlier
2016-01-27 19:06:08	<Krenair>	well
2016-01-27 19:06:09	<Krenair>	sort of
2016-01-27 19:06:13	<Krenair>	<Krenair> bd808, is it time to make a ticket?
2016-01-27 19:06:22	<jynus>	but I am talking about mira, not the other
2016-01-27 19:06:32	<Krenair>	<bd808> Krenair: J.oe said mysteriously 2 days ago that he knew what the problem was and that it was a "red herring". Something about it having not been restarted in a year. Maybe that server is depooled and just puking due to health checks?
2016-01-27 19:06:46	<jynus>	that should be checked
2016-01-27 19:06:48	<greg-g>	let's get a task so we have more than irc logs

Event Timeline

dduvall raised the priority of this task from to Medium.
dduvall updated the task description.
dduvall added subscribers: dduvall, Joe, jcrespo, greg.

Does not coincide with the deploy:
https://logstash.wikimedia.org/#dashboard/temp/AVKJdYMNptxhN1XaDFUZ
https://logstash.wikimedia.org/#dashboard/temp/AVKJeuE0ptxhN1XaDXH4

The rise began around Monday 11:20 UTC and has been decreasing slowly but steadily since then. It consists of short spikes exactly seven minutes apart, with exactly 305 errors each time, so it is probably some kind of job.
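For anyone who wants to check that kind of periodicity against exported log timestamps, here is a minimal sketch. It is not how the observation above was made (that came from the Logstash dashboards); the function names and the synthetic data are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def spike_summary(timestamps, bucket=timedelta(minutes=1)):
    """Group event timestamps into fixed-width buckets.

    Returns a time-ordered list of (bucket_start, count) pairs.
    """
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of its bucket.
        offset = (ts - epoch) // bucket
        counts[epoch + offset * bucket] += 1
    return sorted(counts.items())


def spike_intervals(summary):
    """Return the gaps between consecutive non-empty buckets."""
    starts = [start for start, _ in summary]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Synthetic data only: three bursts of 305 events, seven minutes apart.
# The start time is arbitrary and not taken from the real logs.
base = datetime(2016, 1, 1, 0, 0)
events = [
    base + timedelta(minutes=7 * burst, seconds=n % 30)
    for burst in range(3)
    for n in range(305)
]
summary = spike_summary(events)
intervals = spike_intervals(summary)
```

With real data, a constant gap in `intervals` and a constant count in `summary` would support the "some kind of job" guess; irregular gaps would argue against it.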

I also get them on mira when deploying a file with scap sync-file:

[12:27 UTC] krinkle at mira.codfw.wmnet in /srv/mediawiki-staging (master%)
$ sync-file w/static.php 
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

No syntax errors detected in /srv/mediawiki-staging/w/static.php
[Wed Feb 10 12:33:01 2016] [hphp] [7387:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting
[Wed Feb 10 12:33:01 2016] [hphp] [7385:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting
[Wed Feb 10 12:33:01 2016] [hphp] [7384:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting
[Wed Feb 10 12:33:01 2016] [hphp] [7386:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting
[Wed Feb 10 12:33:01 2016] [hphp] [7388:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting
12:33:01 Started sync-masters
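
A rough way to tally these lines from a paste like the one above, assuming the bracketed layout seen here (timestamp, "hphp", an identifier field, an empty field, then the message). The regex and function name are illustrative, not part of scap or HHVM.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the lines pasted above:
# [timestamp] [hphp] [identifier] [] message
LINE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[hphp\] \[(?P<ident>[^\]]*)\] \[[^\]]*\] (?P<message>.*)$"
)


def count_messages(lines):
    """Count hphp log messages per (timestamp, message) pair.

    Lines that do not match the assumed layout are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("when"), match.group("message"))] += 1
    return counts


sample = [
    "[Wed Feb 10 12:33:01 2016] [hphp] [7387:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting",
    "[Wed Feb 10 12:33:01 2016] [hphp] [7385:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting",
    "[Wed Feb 10 12:33:01 2016] [hphp] [7384:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting",
    "[Wed Feb 10 12:33:01 2016] [hphp] [7386:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting",
    "[Wed Feb 10 12:33:01 2016] [hphp] [7388:7fdd0f6fcd00:0:000001] [] Lost parent, LightProcess exiting",
    "12:33:01 Started sync-masters",
]
counts = count_messages(sample)
```

For the paste above this yields a single key with a count of five, one per line emitted during the sync.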
greg renamed this task from Rise in "parent, LightProcess exiting" fatals on mw1019 since 1.27.0-wmf.11 deploy to Rise in "parent, LightProcess exiting" fatals. Feb 18 2016, 12:15 AM
greg set Security to None.
greg added a project: HHVM.

Soooooo, this is everywhere all the time. I've been told repeatedly that it's "known". This is the only task in Phabricator I can find about it. Can someone in the know tell us what's going on and if/when we should expect it to stop? Thanks.

Change 271714 had a related patch set uploaded (by Ori.livneh):
Override hhvm.server.light_process_count on deployment hosts

https://gerrit.wikimedia.org/r/271714

Change 271714 merged by Ori.livneh:
Override hhvm.server.light_process_count on deployment hosts

https://gerrit.wikimedia.org/r/271714

Ori made some patches that fixed this for Roan during a SWAT, but I'm still seeing this in our beta-scap-eqiad Jenkins job.

See e.g.: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/90857/console

That hiera data relies on the "role" magic in the production Puppet stack. In labs we would need to copy the same settings into a host-specific hiera file (or the Hiera: namespace on wikitech).

These aren't actually fatals. When HHVM is configured to have a nonzero number of LightProcess workers (pre-forked subprocesses it creates on startup to make shelling out cheaper), each worker prints out this message when it exits, even on normal termination.
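
To illustrate the general pattern (this is not HHVM's code, just a POSIX-only sketch with made-up function names): a parent pre-forks workers that block on a control pipe, and when the parent goes away, even through a normal exit, each worker sees end-of-file and logs a message before exiting.

```python
import os

MESSAGE = b"Lost parent, LightProcess exiting\n"


def worker(control_fd, log_fd):
    """Block on the control pipe; log and exit once the parent is gone."""
    while os.read(control_fd, 1024):
        pass  # a real worker would handle a command from the parent here
    # EOF: every write end is closed, so the parent has terminated.
    os.write(log_fd, MESSAGE)
    os._exit(0)


def server(worker_count, log_fd):
    """Pre-fork the workers, then terminate normally without notifying them."""
    control_r, control_w = os.pipe()
    for _ in range(worker_count):
        if os.fork() == 0:
            os.close(control_w)  # the worker keeps only the read end
            worker(control_r, log_fd)  # never returns
    os._exit(0)  # normal exit closes control_w, so the workers see EOF


def run_demo(worker_count=5):
    """Run a short-lived server and collect what its workers log."""
    log_r, log_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(log_r)
        server(worker_count, log_w)  # never returns
    os.close(log_w)
    os.waitpid(pid, 0)
    chunks = []
    while True:
        # Returns b"" once every worker has exited and closed log_w.
        chunk = os.read(log_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(log_r)
    return b"".join(chunks).decode().splitlines()
```

Calling `run_demo(5)` returns one message per worker even though the server exited cleanly, which is the sense in which the message is noise rather than a fatal. With a worker count of zero there is nothing to log.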

greg changed the task status from Invalid to Resolved. Mar 9 2016, 6:24 PM

Right, they weren't fatals in that sense, sorry for the bad title/initial report. It was still a valid bug report that was fixed :)

Now we need to backport it to Beta Cluster so that we aren't spammed with the same errors in our logs there. That's tracked in T129385.

greg renamed this task from Rise in "parent, LightProcess exiting" fatals to Rise in "parent, LightProcess exiting" console spam. Mar 9 2016, 6:25 PM

Mentioned in SAL (#wikimedia-operations) [2017-06-09T00:01:55Z] <mutante> seeing "php: Lost parent, LightProcess exiting" in syslog on mw1275 today (T124956)

mw1275 has them all over syslog, and its light_process_count is indeed set to 5, so a non-zero number as in Ori's comment T124956#2092151. However, mw1276 (for example) has the exact same setting, also 5, but does not show the error.