Page MenuHomePhabricator

grid-jobs tool broken; loads forever with no actual response
Closed, ResolvedPublic

Description

The grid-jobs tool did not survive the recent rolling grid restart. I haven't dug into the problem at all yet, but it seems like something has gone wrong in the access to the OGE accounting files that it reads to gather historical job run data.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Mentioned in SAL (#wikimedia-cloud) [2017-06-22T16:01:07Z] <bd808> Disabled hourly purge cron job while debugging T168653

[2017-06-22 12:49:54,451] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flas
k/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/project/grid-jobs/www/python/src/app.py", line 45, in home
    return flask.render_template('home.html', **ctx)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/templating.py", line 134, in render_template
    context, ctx.app)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/templating.py", line 116, in _render
    rv = template.render(context)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/data/project/grid-jobs/www/python/src/templates/home.html", line 1, in top-level template code
    {% extends "layout.html" %}
  File "/data/project/grid-jobs/www/python/src/templates/layout.html", line 82, in top-level template code
    {% block body %}{% endblock %}
  File "/data/project/grid-jobs/www/python/src/templates/home.html", line 17, in block "body"
    {% for name, data in tools|dictsort %}
  File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/filters.py", line 222, in do_dictsort
    return sorted(value.items(), key=sort_func)
TypeError: unorderable types: NoneType() < str()
>>> sorted(['', None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: NoneType() < str()

One of the dict keys must be None for some reason.

[[https://phabricator.wikimedia.org/source/tool-grid-jobs/browse/master/grid_jobs/__init__.py;15634c3b2d5dcc20650c9cfd23bcec6da85d163a$109|# else None -- we ignore non-tool accounts like 'root']]

None is checked if the data is from the accounting logs, but not if it is from gridengine-status (based on the assumption that root won't start run-forever jobs on grid & gridengine-status shows the raw account name with tools.)

Test running the same code shows two instances of None:

  • (None, 'backlogupdater', 'tools-exec-1417')
  • (None, 'afreporter', 'tools-exec-1414')

Both jobs belong to @DatGuy

(I didn't know non-tool accounts are actually able to start jobs on grid)

Mentioned in SAL (#wikimedia-cloud) [2017-06-23T04:10:45Z] <bd808> Update to 8134b16 to fix None crash in T168653

Should I restart them, or will None be automatically changed to the toolname?

Should I restart them, or will None be automatically changed to the toolname?

@DatGuy I think your jobs are probably fine. You just helped us find a bug. :)

[17:22]  <madhuvishy>	bd808: i was watching shinken alerts for puppet errors over the weekend, and tools-worker-1007 was complaining non stop - i found that the load there was super high, and restarted this tool - grid-jobs which seemed to be doing a ton of IO. It then landed in 1017 and the load on 1007 dropped, and 1017 seemed okay ish, but I'm seeing puppet error emails for 1017 now. That tool may need some poking

I've shut the tool down for now.

bd808 gave me a log entry GET /grid-jobs/?purge => generated 83797 bytes in 49443413 msecs, which is approximately 13.73 hours. This script is supposed to run at most a few minutes.

I looked at the relevant code and don't see any potentially bad parts that broke the IO, so did a strace test on the last 1048576 * 16 = 16777216 bytes (normally the tool works on 400 * 45000 * 7 = 126000000 bytes), having a time.sleep(0.001) on each yield:

import os
import time

def tail_lines(filename, nbytes):
    """Get lines from last n bytes from the filename as an iterator."""
    with open(filename, 'r') as f:
        try:
            f.seek(-nbytes, os.SEEK_END)
        except IOError:
            # File got truncated? Start at the front
            f.seek(0, os.SEEK_SET)

        # Ignore first line as it may be only part of a line
        f.readline()

        # We can't simply `return f` as the returned f will be closed
        # Do all the IO within this function
        for line in f:
            yield line


for line in tail_lines('/data/project/.system/accounting', 16777216):
    time.sleep(0.001)
tools.zhuyifei1999-test@tools-bastion-02:~$ strace python2 gridaccountingtest_a.py 
execve("/usr/bin/python2", ["python2", "gridaccountingtest_a.py"], [/* 31 vars */]) = 0
[...]
open("/data/project/.system/accounting", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0219ff3000
fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0
lseek(3, 1781530624, SEEK_SET)          = 1781530624
read(3, "54:22496.000000:0:0:0:0:15057:0:"..., 923611) = 923611
read(3, ":0:0.000000:8:0:0:0:23:4:NONE:de"..., 1048576) = 1048576
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
[... countless]
^C
tools.zhuyifei1999-test@tools-bastion-02:~$ strace python3 gridaccountingtest_a.py 
execve("/usr/bin/python3", ["python3", "gridaccountingtest_a.py"], [/* 31 vars */]) = 0
[...]
open("/data/project/.system/accounting", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1799258096, ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffda8de40d0) = -1 ENOTTY (Inappropriate ioctl for device)
fstat(3, {st_mode=S_IFREG|0644, st_size=1799258096, ...}) = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f554c743000
lseek(3, 0, SEEK_CUR)                   = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffda8de3ff0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_SET)                   = 0
read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
[... 20 more]
read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
[... 22 more]
[...]
^C

There is a clear difference between the buffer sizes of python 2 and 3.

Then I replace time.sleep(0.001) with pass, divide 16777216 by 4 = 4194304 (because of the excess time usage in python3), and started a timing test:

tools.zhuyifei1999-test@tools-bastion-02:~$ time python2 gridaccountingtest_a.py

real    0m4.515s
user    0m0.019s
sys     0m0.022s
tools.zhuyifei1999-test@tools-bastion-02:~$ time python2 gridaccountingtest_a.py

real    0m4.466s
user    0m0.017s
sys     0m0.016s
tools.zhuyifei1999-test@tools-bastion-02:~$ time python3 gridaccountingtest_a.py

(takes forever....)
Edit: SIGINT'ed, ate IO for 16m2.934s and did not complete.

We aren't explicitly opening in binary mode, but https://docs.python.org/3/library/functions.html#open says:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.

I wonder if we can make the test program faster by passing a reasonably large read buffer size like 1048576 to the open() call?

Testing setting buffering for both python2 and 3 and stracing it:

buffering = 2097152, python2:

lseek(3, 1795162112, SEEK_SET)          = 1795162112
read(3, "c-1402.eqiad.wmflabs:tools.incol"..., 561056) = 561056
read(3, ":0:25616096.000000:25870688:0:0:"..., 2097152) = 2097152
read(3, "00:152:0:0:0:1528:265:NONE:defau"..., 2097152) = 2097152
read(3, "", 2097152)                    = 0
read(3, "", 2097152)                    = 0
close(3)                                = 0

buffering=2097152, python3: (did seemingly nothing...)

lseek(3, 0, SEEK_SET)                   = 0
read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192
read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192
read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192

buffering=65536, python2:

lseek(3, 1795686400, SEEK_SET)          = 1795686400
read(3, "00:856:0:0:0:131:1322:NONE:defau"..., 63104) = 63104
read(3, "00000:NONE:122773504.000000:0:0\n"..., 65536) = 65536
read(3, "02:0.000695:-u tools.pb -q task "..., 65536) = 65536
read(3, "524288k,release=trusty:0.000000:"..., 65536) = 65536
read(3, "NONE:1:0:0.257305:0.001735:0.000"..., 65536) = 65536

buffering=65536, python3: (still nothing)

lseek(3, 0, SEEK_SET)                   = 0
read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192
read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192
read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192
read(3, "4:0.116091:0.003420:-u tools.che"..., 8192) = 8192

buffering=128, python2: (bizarre)

lseek(3, 1795754624, SEEK_SET)          = 1795754624
read(3, ".eqiad.wmflabs:tools.sbot:tools."..., 52) = 52
read(3, "7119810:sge:0:1499760016:1499760"..., 128) = 128
read(3, "NE:defaultdepartment:NONE:1:0:0."..., 128) = 128
read(3, "965440.000000:0:0\ntask:tools-exe"..., 128) = 128
read(3, "99760014:1499760023:0:1:9:0.8877"..., 8064) = 8064
read(3, "424418:0.263634:18004.000000:0:0"..., 128) = 128
read(3, "070:-u tools.liangent-py -q task"..., 10112) = 10112
read(3, "ags:nlHashtagUpdate:7119853:sge:"..., 128) = 128
read(3, "00:24:0:0:0:739:50:NONE:defaultd"..., 10112) = 10112
read(3, "ols.avicbot:tools.avicbot:clean:"..., 128) = 128

buffering=128, python3: (nothing. why am I repeating myself...)

lseek(3, 0, SEEK_SET)                   = 0
read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192
read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192
read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192
read(3, "4:0.116091:0.003420:-u tools.che"..., 8192) = 8192

Another during this analysis, I notice this lseek behavior in python3 doesn't seem right at all: lseek(3, 0, SEEK_SET) is seeking to the start of the file. Where and why is python doing this?

Ah, all problems solved if I open the file in binary...

If the mode is set to 'rb' and keeping the buffering -1:

python2:

open("/data/project/.system/accounting", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fccccc84000
fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0
lseek(3, 1795162112, SEEK_SET)          = 1795162112
read(3, "c-1402.eqiad.wmflabs:tools.incol"..., 683396) = 683396
read(3, "ache:7120001:sge:10:1499760482:1"..., 1048576) = 1048576
read(3, "ls.cydebot-3:7122927:sge:0:14997"..., 1048576) = 1048576
read(3, "2705:0.277779:35176.000000:0:0:0"..., 1048576) = 1048576
read(3, "6:0:0:0.000000:24:0:0:0:688:106:"..., 1048576) = 1048576
read(3, "task:tools-exec-1415.tools.eqiad"..., 1048576) = 2231
read(3, "", 1048576)                    = 0
read(3, "", 1048576)                    = 0
close(3)                                = 0

python3:

open("/data/project/.system/accounting", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1800054535, ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffdddd74740) = -1 ENOTTY (Inappropriate ioctl for device)
fstat(3, {st_mode=S_IFREG|0644, st_size=1800054535, ...}) = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f95052f1000
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, -4194304, SEEK_END)            = 1795860231
read(3, ":24.000000:64:0:0:0:6250:233:NON"..., 1048576) = 1048576
read(3, "9767250:1499767251:1499767320:0:"..., 1048576) = 1048576
read(3, "=524288k,release=trusty:0.000000"..., 1048576) = 1048576
read(3, "\ntask:tools-exec-1427.tools.eqia"..., 1048576) = 1048576
read(3, "", 1048576)                    = 0
close(3)                                = 0

Mentioned in SAL (#wikimedia-cloud) [2017-07-27T23:17:22Z] <bd808> Re-enabled hourly cron after patching for T168653 with 3f1ea69

bd808 reassigned this task from bd808 to zhuyifei1999.

I pushed the patch, but @zhuyifei1999 did all the hard work. :)