Job runner not running in deployment-prep
Closed, Resolved (Public)

Description

krenair@deployment-jobrunner01:~$ mwscript showJobs.php --wiki=commonswiki
302
krenair@deployment-jobrunner01:~$ service jobrunner status
jobrunner stop/waiting
krenair@deployment-jobrunner01:~$ service jobrunner start
start: Rejected send message, 1 matched rules; type="method_call", sender=":1.7" (uid=2170 pid=6528 comm="start jobrunner ") interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)" requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init")

Judging by the timing in this graph, I'm guessing this could be caused by https://gerrit.wikimedia.org/r/185939/

Event Timeline

Krenair raised the priority of this task to High.
Krenair updated the task description.
Krenair subscribed.
Krenair set Security to None.
Krenair added a subscriber: yuvipanda.
krenair@deployment-jobrunner01:~$ sudo service jobrunner start
jobrunner start/running, process 549
krenair@deployment-jobrunner01:~$ service jobrunner status
jobrunner stop/waiting
hashar subscribed.

/var/log/upstart/jobrunner.log offers no help. The upstart file (/etc/init/jobrunner.conf) shows that the output is sent to /var/log/mediawiki/jobrunner.log, which contains:

Fatal error: Uncaught exception 'Exception' with message 'Invalid profiler address 'labmon1001.eqiad.wmnet'.' in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService:206
Stack trace:
#0 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(119): RedisJobRunnerService->__construct(Array)
#1 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(22): RedisJobRunnerService::init(Array)
#2 {main}
  thrown in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 206

I have no idea how it works, but it seems the profiler address was changed recently, maybe from an IP address to a DNS name, which would have broken it. We restarted the instance two or three days ago, so the issue may have been around for quite a while.

Also, I am not sure how the jobrunner code is updated on the job runner instances. Hopefully it is done by puppet on beta (via git::clone, with a notify to restart the service).

gerritbot subscribed.

Change 187875 had a related patch set uploaded (by Yuvipanda):
Make setting port number in statsd config optional

https://gerrit.wikimedia.org/r/187875

Patch-For-Review

The jobrunner code was being fussy about always requiring a port number for the statsd host, which I don't think is reasonable. Fixed it in the patch above.
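
For readers who have not opened the patch, here is a minimal sketch of what "optional port" parsing could look like. It is written in Python for illustration, not in the jobrunner's PHP, and it is not the code from change 187875. The function name `parse_statsd_address` and the default port 8125 (the conventional StatsD UDP port) are assumptions made for this example.

```python
# Illustrative sketch only; not the jobrunner's actual PHP code.
DEFAULT_STATSD_PORT = 8125  # assumption: the conventional StatsD UDP port


def parse_statsd_address(address, default_port=DEFAULT_STATSD_PORT):
    """Split 'host' or 'host:port' into (host, port); the port is optional."""
    host, sep, port_text = address.strip().partition(":")
    if not host:
        raise ValueError(f"Invalid profiler address {address!r}: missing host")
    if not sep:
        # No port given: fall back to the default instead of rejecting.
        return host, default_port
    # A colon was present, so a valid numeric port must follow it.
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError(f"Invalid profiler address {address!r}: bad port")
    port = int(port_text)
    if not 0 < port <= 65535:
        raise ValueError(f"Invalid profiler address {address!r}: bad port")
    return host, port


if __name__ == "__main__":
    # A bare host name, as in the log above, is accepted.
    print(parse_statsd_address("labmon1001.eqiad.wmnet"))
    # An explicit port is still honoured.
    print(parse_statsd_address("labmon1001.eqiad.wmnet:8125"))
```

With this approach a bare host such as `labmon1001.eqiad.wmnet` is accepted rather than raising, while a malformed value like `host:` is still rejected. IPv6 literals are deliberately not handled in this sketch.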

Change 187875 merged by jenkins-bot:
Make setting port number in statsd config optional

https://gerrit.wikimedia.org/r/187875

dan-nl claimed this task.
dan-nl added a subscriber: Legoktm.

thanks for taking care of this @yuvipanda!

the GWToolset logs show that several jobs started to run after the patch got @Legoktm’s +2 and merged.