
Video transcode job runner on beta cluster runs 5 jobs even though configured for 2
Closed, Resolved · Public

Description

While fixing up the video scalers in beta cluster for the current job runner infrastructure, @bd808 configured it to run 2 job runners for transcodes, but we found it was running up to 5 jobs... This is OK on production boxes, but it overloads the relatively small VM assigned to video transcoding in beta cluster.

We couldn't quite figure out what was going on Friday; we need to look at it in more detail.
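A quick way to see the mismatch on the scaler VM is to compare the configured runner count with the transcode processes actually running. A minimal sketch, assuming the config path and process names that appear later in this task:

    # configured runner count for the transcode group
    grep -A 2 '"transcode"' /etc/jobrunner/jobrunner.conf | grep '"runners"'
    # transcode processes actually running right now
    pgrep -c -f 'ffmpeg|ffmpeg2theora'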

Event Timeline

brion raised the priority of this task from to Needs Triage.
brion updated the task description.
brion added subscribers: brion, bd808.
thcipriani triaged this task as Medium priority. Sep 14 2015, 7:29 PM
thcipriani moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.

Seems the new instance is now deployment-tmh01.deployment-prep.eqiad.wmflabs

/etc/jobrunner/jobrunner.conf has:

"groups":
       "transcode": {
            "runners": 2,   #  FIXED ???????????
            "include": [
                "webVideoTranscode"
            ]
        },

So it seems this is fixed now and this task is a leftover from the sprint.
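If the value was changed by hand, one way to double-check it actually took effect is to restart the runner and look at its startup line (a sketch, assuming the service is simply named jobrunner and logs to the path mentioned further down):

    sudo service jobrunner restart
    # the spawner logs its runner count when it starts a loop
    grep 'runner(s)' /var/log/mediawiki/jobrunner.log | tail -n 1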

So I'm still seeing this sort of behavior, including in MediaWiki-Vagrant, which now uses a local-web-server-based job runner infrastructure more similar to the production one.

In vagrant, even though only *one* job runner is configured, I see 2, 3, 4 or more ffmpeg or ffmpeg2theora processes running.

Nothing seems to actually break; it's just running more processes than I expect, which required me to up the memory limit on my Vagrant VM (long-running transcodes of high-resolution videos can eat a lot of memory, so four of them don't fit reliably in 1536 MB).

I haven't debugged it in detail, but I wonder if what's happening is this: the job runner shell sees no activity on the web request for a long time, times out, and moves on to another job, while the actual HHVM process doesn't close out; it runs ffmpeg/ffmpeg2theora to completion, saves its data, and exits cleanly.

Left out a sentence in a cut-and-paste accident! -- I don't see all 3-4 extra processes start at once; rather, they seem to accumulate over time. This fits my timeout theory, in that an extra new job only seems to spawn while another, longer-running job is already going.
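A rough way to watch for that accumulation is to keep an eye on the elapsed time and memory of each transcode process; under the timeout theory, entries older than the maxtime the runner passes to RunJobs.php (60 seconds in the logs further down) are jobs the runner has already given up on. A sketch:

    # PID, elapsed time, resident memory and command line of every transcoder
    watch -n 10 'ps -o pid,etime,rss,args -C ffmpeg,ffmpeg2theora'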

@brion maybe there are some useful details in the jobrunner stdout/stderr log at /var/log/mediawiki/jobrunner.log?
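For example, following the runner's own output, or pulling out the per-slot status lines after the fact (path as above):

    tail -f /var/log/mediawiki/jobrunner.log
    # each runner slot reports what its RunJobs.php request returned
    grep 'slot' /var/log/mediawiki/jobrunner.log | tail -n 20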

At least on beta it definitely has two runners:

2016-06-02T06:30:57+0000: Starting job spawner loop(s)...
2016-06-02T06:30:57+0000: Initialized loop 0 with 2 runner(s).

Ages ago I made the MediaWiki job runner log a debug message when a job is starting (a652d999fbb78aac9a01226b77f55cbea3f12cbe); that comes in addition to the info message when a job is completed. Thus for each job you get something like:

DEBUG refreshLinks STARTING
INFO refreshLinks t=300 good

In MediaWiki Vagrant, if you enable the debug log for runJobs (maybe it already is enabled), you should be able to confirm the timeout hypothesis: you would see STARTING lines not matched by good/bad ones. That debug log was added exactly for that kind of analysis, though it is very low level.
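Once that channel is written somewhere, a crude check for unmatched STARTING lines is just comparing counts. A sketch, where /vagrant/logs/runJobs.log stands in for wherever the runJobs channel is actually pointed:

    # jobs that started vs. jobs that finished (good or bad);
    # a growing gap would support the timeout theory
    grep -c 'webVideoTranscode STARTING' /vagrant/logs/runJobs.log
    grep -cE 'webVideoTranscode .*(good|bad)' /vagrant/logs/runJobs.log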

@hashar I get fun stuff like:

2016-06-01T06:41:32+0000: Starting job spawner loop(s)...
2016-06-01T06:41:32+0000: Initialized loop 0 with 1 runner(s).
Caught signal (15)
2016-06-01T20:53:42+0000: Starting job spawner loop(s)...
2016-06-01T20:53:42+0000: Initialized loop 0 with 1 runner(s).
2016-06-01T20:59:59+0000: Runner loop 0 process in slot 0 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1/w/RunJobs.php?wiki=wiki&type=webVideoTranscode&maxtime=60&maxmem=300M'
        <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
</body></html>

2016-06-01T21:01:59+0000: Runner loop 0 process in slot 0 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1/w/RunJobs.php?wiki=wiki&type=webVideoTranscode&maxtime=60&maxmem=300M'
        <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
</body></html>

2016-06-01T21:03:59+0000: Runner loop 0 process in slot 0 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1/w/RunJobs.php?wiki=wiki&type=webVideoTranscode&maxtime=60&maxmem=300M'
        <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your

and so on

I don't see any "DEBUG (anything) STARTING" lines; how would I enable that log?

In prod that is with $wmgMonologChannels['runJobs'] => 'debug'.

The jobrunner service hits the MediaWiki RPC entry point /w/RunJobs.php, and if that yields a 503 there must be some backend exception or fatal error straight in MediaWiki. In MediaWiki Vagrant I am not sure where the traces end up, though :(
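One way to dig further is to replay the request the runner logged and see what the backend says. A sketch reusing the URL from the log above (the real runner may also append a signature parameter that isn't shown there, and the HHVM error log path is an assumption):

    # replay the RPC call and show the response headers for the 503
    curl -sv -XPOST 'http://127.0.0.1/w/RunJobs.php?wiki=wiki&type=webVideoTranscode&maxtime=60&maxmem=300M'
    # then look for the matching fatal/exception in the HHVM logs
    sudo tail -n 50 /var/log/hhvm/error.log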

brion claimed this task.

I'm pretty sure we resolved this on the ops/hhvm end. Closing out.