
Tool Labs jobs locking up
Closed, ResolvedPublic

Description

Since around the beginning of August 2016, I've found that some of my bot's jobs are locking up. That is, the job is submitted and I can see it with qstat, but it never actually starts, and when cron tries to submit a new job it errors with `there is a job named 'my job' already active`.

I'll use my "perm_clerk" job as an example. Here are some notes:

  • Cron looks like */10 * * * * jsub -l release=trusty -mem 350m -once ~/perm_clerk.sh >/dev/null 2>&1 (my other jobs look similar)
  • perm_clerk.sh just exports Ruby stuff to PATH and runs the Ruby file
  • The Ruby script writes output to another log as soon as it starts running. I don't see this output when the job locks up
  • When in a locked state, qstat looks normal: 9897506 0.30204 perm_clerk tools.musikb r 08/17/2016 17:30:14 task@tools-exec-1407.eqiad.wmf 1
  • I have to manually qdel the job for it to get back in working order
  • Sometimes the job will lock up again on the very next run; i.e., the issue is recurring but not predictable
  • My stale_drafts job locked up too, even though it only runs once a week, so I don't think this happens because a new job is submitted while the old one hasn't finished

Surely others are experiencing this?

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 305616 had a related patch set uploaded (by Yuvipanda):
labs: Set timeout for ldap3 using scripts

https://gerrit.wikimedia.org/r/305616

Does this consistently occur on the same hosts? A list of tool names/job ids/hosts/times would help a lot. I'll try to work from 9897506 to see if there's anything obviously wrong on tools-exec-1407.

This one is currently stuck:
9948315 0.30056 copypatrol tools.musikb r 08/19/2016 06:45:10 task@tools-exec-1401.eqiad.wmf 1

It locked up around 6:50, August 19 UTC

strace -f -p shows `[pid 15733] restart_syscall(<... resuming interrupted call ...>`. The job seems to have been started correctly, and the VM it is on is running other ad-hoc jobs fine.

Using some horrible gdb magic:

First, redirect stdout and stderr to different files (via https://gist.github.com/zaius/782263):

valhallasw@tools-exec-1401:~$ sudo su tools.musikbot
tools.musikbot@tools-exec-1401:/home/valhallasw$ touch /tmp/stdout
tools.musikbot@tools-exec-1401:/home/valhallasw$ touch /tmp/stderr
tools.musikbot@tools-exec-1401:/home/valhallasw$ exit
valhallasw@tools-exec-1401:~$ sudo gdb /data/project/musikbot/.rbenv/versions/2.2.1/bin/ruby 15733
[...]
(gdb) p dup2(open("/tmp/stdout", 1), 1)
$1 = 1
(gdb) p dup2(open("/tmp/stderr", 1), 2)
$2 = 2

Then, print out a stack trace (via https://isotope11.com/blog/getting-a-ruby-backtrace-from-gnu-debugger)

(gdb) call rb_backtrace()
(gdb) quit

And print it:

valhallasw@tools-exec-1401:~$ cat /tmp/stderr
        from copypatrol_wikiprojects.rb:89:in `<main>'
        from copypatrol_wikiprojects.rb:8:in `run'
        from copypatrol_wikiprojects.rb:8:in `new'
        from /data/project/musikbot/MusikBot/musikbot.rb:88:in `initialize'
        from /data/project/musikbot/MusikBot/auth.rb:8:in `login'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway/users.rb:16:in `login'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:65:in `send_request'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:135:in `make_api_request'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:239:in `http_send'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:41:in `execute'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:176:in `execute'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:413:in `transmit'
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/net/http.rb:853:in `start'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:421:in `block in transmit'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:493:in `process_result'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:493:in `call'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:250:in `block in http_send'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:250:in `call'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:190:in `block in make_api_request'
        from /data/project/musikbot/.gem/ruby/2.2.0/gems/mediawiki-gateway-1.0.7/lib/media_wiki/gateway.rb:190:in `sleep'

There's a single connection open:

valhallasw@tools-exec-1401:~$ sudo netstat -np | grep 15733
tcp       32      0 10.68.17.202:39575      208.80.154.224:443      CLOSE_WAIT  15733/ruby

Or rather, a connection in the CLOSE_WAIT state, which means the remote end has closed the connection but the local process has not yet closed it on its side. That could point to a bug in some Ruby library, which perhaps tries to re-use a connection that no longer exists?

This is still happening regularly with my three jobs that run at the highest frequency: copypatrol, fix_pp and perm_clerk. I think I have confirmed it is repeatedly getting stuck in the sleep in MediaWiki::Gateway: I recently reverted my changes to the logging, and I now see the same backtrace in my .err files. I have just now changed the config to output all warnings, so hopefully I can find out whether this is specific to logging in. The perm_clerk task has verbose logging, and nothing gets output when that job gets "stuck"; if it were failing with any old API error (maxlag, etc.) I would expect to occasionally see some other output. I should mention I'm still using action=login to authenticate; could there be an intermittent issue there?

One easy thing to check is whether the job can still write to its output file. To do so, ssh to the exec host and run ls -l /proc/<pid>/fd, where <pid> is the process id (which you can get from ps ux). If any of the entries shows '(deleted)', something went wrong with that output file.
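The same check can be sketched in Ruby (this assumes a Linux /proc filesystem; inspecting our own pid here, but you would substitute the stuck job's pid):

```ruby
# Sketch: list a process's open file descriptors via Linux /proc,
# flagging any that point at files deleted out from under the process.
pid = Process.pid # substitute the stuck job's pid, e.g. 15733

Dir.glob("/proc/#{pid}/fd/*").each do |fd|
  target = File.readlink(fd) rescue next # fd may close while we iterate
  note = target.include?('(deleted)') ? '   <-- file was deleted' : ''
  puts "fd #{File.basename(fd)} -> #{target}#{note}"
end
```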

With ptrace and gdb I can only do so much: they can tell you where the code is, but not why it's there. Adding more logging to the script is likely to be a quicker way of figuring out what happens. Another option might be to repeatedly dump stack traces (e.g. once per second) from a second thread, but I'm not sure how easy that is in Ruby.
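For the second-thread idea, Ruby actually makes this fairly easy, since Thread#backtrace can be called on another thread. A minimal sketch (the interval and log destination are arbitrary choices, not anything the bot currently does):

```ruby
# Watchdog: a background thread that dumps the main thread's backtrace
# once per second, so a silent hang at least leaves a trail in the logs.
watchdog = Thread.new do
  loop do
    sleep 1
    bt = Thread.main.backtrace
    warn '---- main thread backtrace ----'
    warn bt.join("\n") if bt
  end
end
watchdog.abort_on_exception = true
```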

I think I've figured it out. The maxlag API response now includes fractional seconds (up to 12 decimal places!), so the regex that parses with (\d+) seconds lagged grabbed only the fractional part, making the bot sleep for BILLIONS of seconds :) Fixed with 8f4a47d, but I haven't deployed it yet. The fact that the bot kept hitting the maxlag error only when logging in is pure coincidence.
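To illustrate the failure mode (the message text below is a stand-in, not the exact API response, and the corrected pattern is one plausible fix, not necessarily what 8f4a47d does):

```ruby
msg = 'Waiting for a database server: 3.141592653589 seconds lagged'

# Old pattern: \d+ can only match the digits immediately before
# " seconds lagged", i.e. the fractional part -- ~141 billion "seconds".
old_wait = msg[/(\d+) seconds lagged/, 1].to_i
# old_wait == 141592653589

# Capturing the optional fraction and rounding up gives a sane wait time.
new_wait = msg[/(\d+(?:\.\d+)?) seconds lagged/, 1].to_f.ceil
# new_wait == 4
```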

Note this hack was put in place because enwiki doesn't include the maxlag value in the response headers, only in the response body. Any idea why this is?

MusikAnimal claimed this task.

I think that was it. Thank you very much for the help everyone (esp. @valhallasw)!! This one was not easy to figure out :)

Do we know of any other API frameworks that parse the maxlag response? I suspect they might need an update as well. My question about why enwiki doesn't include it in the headers is still open, if anyone has any ideas :) Thanks again!