The grid-jobs tool did not survive the recent rolling grid restart. I haven't dug into the problem at all yet, but it seems like something has gone wrong in the access to the OGE accounting files that it reads to gather historical job run data.
Description
Related Objects
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2017-06-22T16:01:07Z] <bd808> Disabled hourly purge cron job while debugging T168653
[2017-06-22 12:49:54,451] ERROR in app: Exception on / [GET] Traceback (most recent call last): File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flas k/app.py", line 1982, in wsgi_app response = self.full_dispatch_request() File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1614, in full_dispatch_request rv = self.handle_user_exception(e) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1517, in handle_user_exception reraise(exc_type, exc_value, tb) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise raise value File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1612, in full_dispatch_request rv = self.dispatch_request() File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/app.py", line 1598, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/data/project/grid-jobs/www/python/src/app.py", line 45, in home return flask.render_template('home.html', **ctx) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/templating.py", line 134, in render_template context, ctx.app) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/flask/templating.py", line 116, in _render rv = template.render(context) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/environment.py", line 1008, in render return self.environment.handle_exception(exc_info, True) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/environment.py", line 780, in handle_exception reraise(exc_type, exc_value, tb) File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/_compat.py", line 37, in reraise raise value.with_traceback(tb) File "/data/project/grid-jobs/www/python/src/templates/home.html", line 1, in top-level template code {% extends "layout.html" %} File "/data/project/grid-jobs/www/python/src/templates/layout.html", line 82, in top-level template code {% block body %}{% endblock %} File "/data/project/grid-jobs/www/python/src/templates/home.html", line 17, in block "body" {% for name, data in tools|dictsort %} File "/data/project/grid-jobs/www/python/venv/lib/python3.4/site-packages/jinja2/filters.py", line 222, in do_dictsort return sorted(value.items(), key=sort_func) TypeError: unorderable types: NoneType() < str()
>>> sorted(['', None]) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unorderable types: NoneType() < str()
One of the dict keys must be None for some reason.
[[https://phabricator.wikimedia.org/source/tool-grid-jobs/browse/master/grid_jobs/__init__.py;15634c3b2d5dcc20650c9cfd23bcec6da85d163a$109|# else None -- we ignore non-tool accounts like 'root']]
None is checked if the data is from the accounting logs, but not if it is from gridengine-status (based on the assumption that root won't start run-forever jobs on grid & gridengine-status shows the raw account name with tools.)
Test running the same code shows two instances of None:
- (None, 'backlogupdater', 'tools-exec-1417')
- (None, 'afreporter', 'tools-exec-1414')
Both jobs belong to @DatGuy
(I didn't know non-tool accounts are actually able to start jobs on grid)
Mentioned in SAL (#wikimedia-cloud) [2017-06-23T04:10:45Z] <bd808> Update to 8134b16 to fix None crash in T168653
[17:22] <madhuvishy> bd808: i was watching shinken alerts for puppet errors over the weekend, and tools-worker-1007 was complaining non stop - i found that the load there was super high, and restarted this tool - grid-jobs which seemed to be doing a ton of IO. It then landed in 1017 and the load on 1007 dropped, and 1017 seemed okay ish, but I'm seeing puppet error emails for 1017 now. That tool may need some poking
I've shut the tool down for now.
bd808 gave me a log entry GET /grid-jobs/?purge => generated 83797 bytes in 49443413 msecs, which is approximately 13.73 hours. This script is supposed to run at most a few minutes.
I looked at the relevant code and don't see any potentially bad parts that broke the IO, so did a strace test on the last 1048576 * 16 = 16777216 bytes (normally the tool works on 400 * 45000 * 7 = 126000000 bytes), having a time.sleep(0.001) on each yield:
import os import time def tail_lines(filename, nbytes): """Get lines from last n bytes from the filename as an iterator.""" with open(filename, 'r') as f: try: f.seek(-nbytes, os.SEEK_END) except IOError: # File got truncated? Start at the front f.seek(0, os.SEEK_SET) # Ignore first line as it may be only part of a line f.readline() # We can't simply `return f` as the returned f will be closed # Do all the IO within this function for line in f: yield line for line in tail_lines('/data/project/.system/accounting', 16777216): time.sleep(0.001)
tools.zhuyifei1999-test@tools-bastion-02:~$ strace python2 gridaccountingtest_a.py execve("/usr/bin/python2", ["python2", "gridaccountingtest_a.py"], [/* 31 vars */]) = 0 [...] open("/data/project/.system/accounting", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0 mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0219ff3000 fstat(3, {st_mode=S_IFREG|0644, st_size=1799231451, ...}) = 0 lseek(3, 1781530624, SEEK_SET) = 1781530624 read(3, "54:22496.000000:0:0:0:0:15057:0:"..., 923611) = 923611 read(3, ":0:0.000000:8:0:0:0:23:4:NONE:de"..., 1048576) = 1048576 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) [... countless] ^C tools.zhuyifei1999-test@tools-bastion-02:~$ strace python3 gridaccountingtest_a.py execve("/usr/bin/python3", ["python3", "gridaccountingtest_a.py"], [/* 31 vars */]) = 0 [...] open("/data/project/.system/accounting", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1799258096, ...}) = 0 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffda8de40d0) = -1 ENOTTY (Inappropriate ioctl for device) fstat(3, {st_mode=S_IFREG|0644, st_size=1799258096, ...}) = 0 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f554c743000 lseek(3, 0, SEEK_CUR) = 0 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffda8de3ff0) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_SET) = 0 read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) [... 20 more] read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) [... 22 more] [...] ^C
There is a clear difference between the buffer sizes of python 2 and 3.
Then I replace time.sleep(0.001) with pass, divide 16777216 by 4 = 4194304 (because of the excess time usage in python3), and started a timing test:
tools.zhuyifei1999-test@tools-bastion-02:~$ time python2 gridaccountingtest_a.py real 0m4.515s user 0m0.019s sys 0m0.022s tools.zhuyifei1999-test@tools-bastion-02:~$ time python2 gridaccountingtest_a.py real 0m4.466s user 0m0.017s sys 0m0.016s tools.zhuyifei1999-test@tools-bastion-02:~$ time python3 gridaccountingtest_a.py
(takes forever....)
Edit: SIGINT'ed, ate IO for 16m2.934s and did not complete.
We aren't explicitly opening in binary mode, but https://docs.python.org/3/library/functions.html#open says:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
I wonder if we can make the test program faster by passing a reasonably large read buffer size like 1048576 to the open() call?
Testing setting buffering for both python2 and 3 and stracing it:
buffering = 2097152, python2:
lseek(3, 1795162112, SEEK_SET) = 1795162112 read(3, "c-1402.eqiad.wmflabs:tools.incol"..., 561056) = 561056 read(3, ":0:25616096.000000:25870688:0:0:"..., 2097152) = 2097152 read(3, "00:152:0:0:0:1528:265:NONE:defau"..., 2097152) = 2097152 read(3, "", 2097152) = 0 read(3, "", 2097152) = 0 close(3) = 0
buffering=2097152, python3: (did seemingly nothing...)
lseek(3, 0, SEEK_SET) = 0 read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192 read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192 read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192
buffering=65536, python2:
lseek(3, 1795686400, SEEK_SET) = 1795686400 read(3, "00:856:0:0:0:131:1322:NONE:defau"..., 63104) = 63104 read(3, "00000:NONE:122773504.000000:0:0\n"..., 65536) = 65536 read(3, "02:0.000695:-u tools.pb -q task "..., 65536) = 65536 read(3, "524288k,release=trusty:0.000000:"..., 65536) = 65536 read(3, "NONE:1:0:0.257305:0.001735:0.000"..., 65536) = 65536
buffering=65536, python3: (still nothing)
lseek(3, 0, SEEK_SET) = 0 read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192 read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192 read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192 read(3, "4:0.116091:0.003420:-u tools.che"..., 8192) = 8192
buffering=128, python2: (bizarre)
lseek(3, 1795754624, SEEK_SET) = 1795754624 read(3, ".eqiad.wmflabs:tools.sbot:tools."..., 52) = 52 read(3, "7119810:sge:0:1499760016:1499760"..., 128) = 128 read(3, "NE:defaultdepartment:NONE:1:0:0."..., 128) = 128 read(3, "965440.000000:0:0\ntask:tools-exe"..., 128) = 128 read(3, "99760014:1499760023:0:1:9:0.8877"..., 8064) = 8064 read(3, "424418:0.263634:18004.000000:0:0"..., 128) = 128 read(3, "070:-u tools.liangent-py -q task"..., 10112) = 10112 read(3, "ags:nlHashtagUpdate:7119853:sge:"..., 128) = 128 read(3, "00:24:0:0:0:739:50:NONE:defaultd"..., 10112) = 10112 read(3, "ols.avicbot:tools.avicbot:clean:"..., 128) = 128
buffering=128, python3: (nothing. why am I repeating myself...)
lseek(3, 0, SEEK_SET) = 0 read(3, "task:tools-exec-1402.eqiad.wmfla"..., 8192) = 8192 read(3, "E:defaultdepartment:NONE:1:0:0.3"..., 8192) = 8192 read(3, ".127961:12196.000000:0:0:0:0:481"..., 8192) = 8192 read(3, "4:0.116091:0.003420:-u tools.che"..., 8192) = 8192
Another during this analysis, I notice this lseek behavior in python3 doesn't seem right at all: lseek(3, 0, SEEK_SET) is seeking to the start of the file. Where and why is python doing this?
If the mode is set to 'rb' and keeping the buffering -1:
python2:
open("/data/project/.system/accounting", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0 mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fccccc84000 fstat(3, {st_mode=S_IFREG|0644, st_size=1800039812, ...}) = 0 lseek(3, 1795162112, SEEK_SET) = 1795162112 read(3, "c-1402.eqiad.wmflabs:tools.incol"..., 683396) = 683396 read(3, "ache:7120001:sge:10:1499760482:1"..., 1048576) = 1048576 read(3, "ls.cydebot-3:7122927:sge:0:14997"..., 1048576) = 1048576 read(3, "2705:0.277779:35176.000000:0:0:0"..., 1048576) = 1048576 read(3, "6:0:0:0.000000:24:0:0:0:688:106:"..., 1048576) = 1048576 read(3, "task:tools-exec-1415.tools.eqiad"..., 1048576) = 2231 read(3, "", 1048576) = 0 read(3, "", 1048576) = 0 close(3) = 0
python3:
open("/data/project/.system/accounting", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1800054535, ...}) = 0 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffdddd74740) = -1 ENOTTY (Inappropriate ioctl for device) fstat(3, {st_mode=S_IFREG|0644, st_size=1800054535, ...}) = 0 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f95052f1000 lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 lseek(3, -4194304, SEEK_END) = 1795860231 read(3, ":24.000000:64:0:0:0:6250:233:NON"..., 1048576) = 1048576 read(3, "9767250:1499767251:1499767320:0:"..., 1048576) = 1048576 read(3, "=524288k,release=trusty:0.000000"..., 1048576) = 1048576 read(3, "\ntask:tools-exec-1427.tools.eqia"..., 1048576) = 1048576 read(3, "", 1048576) = 0 close(3) = 0
Mentioned in SAL (#wikimedia-cloud) [2017-07-27T23:17:22Z] <bd808> Re-enabled hourly cron after patching for T168653 with 3f1ea69