
tools-sgecron-01 virtual memory allocation error at midnight and noon UTC
Closed, Resolved · Public

Description

Many of my cron gridengine jobs have been failing for the last 14 days, throwing one of the following errors:

Traceback (most recent call last):
  File "/usr/bin/job", line 47, in <module>
    proc = subprocess.Popen(['/usr/bin/qstat', '-xml'], stdout=subprocess.PIPE)
  File "/usr/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.5/subprocess.py", line 1221, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

Traceback (most recent call last):
  File "/usr/bin/job", line 21, in <module>
    import xml.etree.ElementTree
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 669, in exec_module
  File "<frozen importlib._bootstrap_external>", line 773, in get_code
  File "<frozen importlib._bootstrap_external>", line 484, in _compile_bytecode
MemoryError

This never happened before. In recent days a majority of tasks (about 6 or 7 out of 10) have failed with one of these errors and even sent me emails with the stacktrace/traceback. It happened even for simple Python scripts that usually take milliseconds.

Event Timeline

aborrero subscribed.

Try allocating additional memory for your job: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Allocating_additional_memory

Please, let me know if this solves your issue.

I just re-ran all of them now (at a different part of the day, 10 AM UTC; the failures were usually at around 2 AM UTC) without allocating any additional memory, and none of these roughly 15 jobs failed again. That's even weirder; it seems to happen at specific times. Also, I'm not sure I should have to allocate more memory for jobs that usually need 50–150 MB at most.

even sent me emails with the stacktrace/traceback

I think this means the traceback is on the stderr of the cron entry, indicating a memory error during job submission.

What is the command you are using to submit the jobs? I don't recall jsub invoking /usr/bin/job.

Example crontab line:

0	0	1	*	*	jsub -N cron-28 -once -quiet cron 28

Preview of my ~/cron bash script:

~/cron
#!/bin/bash

cd pywikibot &&

if [ "$1" -eq "28" ]; then
    python3 list-category-cycles.py
elif [ "$1" -eq "29" ]; then
... 30, 31, ..., 89
fi

jsub does invoke job in its is_running() function, at line 281.

The graphite data for tools-sgecron-01 does not have any data points indicating that it was running out of memory, but dmesg shows:

[Apr19 11:59] qstat[28885]: segfault at 0 ip 00007fe2da758a8e sp 00007ffcb2f5fd40 error 4 in libjemalloc.so.1[7fe2da754000+34000]
[  +0.012682] qstat[28904]: segfault at 0 ip 00007f5b4076ea8e sp 00007fff98dd21f0 error 4 in libjemalloc.so.1[7f5b4076a000+34000]
[  +0.003268] qstat[28940]: segfault at 0 ip 00007f7e8435ca8e sp 00007fffb91aa380 error 4 in libjemalloc.so.1[7f7e84358000+34000]
[  +0.012101] qstat[28900]: segfault at 0 ip 00007fb3c502da8e sp 00007ffcb824c980 error 4 in libjemalloc.so.1[7fb3c5029000+34000]
[Apr19 23:59] qstat[12949]: segfault at 0 ip 00007f68f3a97a8e sp 00007ffe062c2860 error 4 in libjemalloc.so.1[7f68f3a93000+34000]
[  +0.015150] sendmail[12972]: segfault at 7ffec3f72848 ip 0000565159ff90ad sp 00007ffec3f72850 error 6 in exim4[565159fc1000+ef000]
[  +0.052991] qstat[12955]: segfault at 0 ip 00007fe143e45a8e sp 00007ffd1493cd80 error 4 in libjemalloc.so.1[7fe143e41000+34000]
[Apr20 06:24] Process accounting resumed
[Apr21 06:24] Process accounting resumed
[Apr22 06:24] Process accounting resumed
[Apr23 06:24] Process accounting resumed
[Apr24 06:24] Process accounting resumed
[Apr25 06:24] Process accounting resumed
[Apr25 23:59] qstat[32482]: segfault at 0 ip 00007fb650160a8e sp 00007ffe8cb93210 error 4 in libjemalloc.so.1[7fb65015c000+34000]
[  +0.024774] qstat[32520]: segfault at 0 ip 00007f05a6f8ba8e sp 00007ffe951f1c40 error 4 in libjemalloc.so.1[7f05a6f87000+34000]
[  +0.017353] qstat[32564]: segfault at 0 ip 00007faebd9b3a8e sp 00007ffc3d4eaa10 error 4 in libjemalloc.so.1[7faebd9af000+34000]
[  +0.009400] qsub[32468]: segfault at 0 ip 00007fc67e3faa8e sp 00007ffc7c00e860 error 4 in libjemalloc.so.1[7fc67e3f6000+34000]
[  +0.053438] qstat[32348]: segfault at 0 ip 00007fd8ee58fa8e sp 00007ffc3dbaecb0 error 4 in libjemalloc.so.1[7fd8ee58b000+34000]
[  +0.033767] qstat[32287]: segfault at 0 ip 00007ff898412a8e sp 00007ffd3a0c0900 error 4 in libjemalloc.so.1[7ff89840e000+34000]
[  +0.984610] qstat[32687]: segfault at 0 ip 00007f3a83845a8e sp 00007ffd3cab3ca0 error 4 in libjemalloc.so.1[7f3a83841000+34000]
[  +0.000185] qstat[32669]: segfault at 0 ip 00007f7f4ff68a8e sp 00007ffcf0e9d8a0 error 4 in libjemalloc.so.1[7f7f4ff64000+34000]
[  +0.021948] qstat[32695]: segfault at 0 ip 00007faa956aaa8e sp 00007ffc4b4f0b00 error 4 in libjemalloc.so.1[7faa956a6000+34000]
[Apr26 06:24] Process accounting resumed
[Apr27 06:24] Process accounting resumed
[Apr27 23:59] qstat[31481]: segfault at 0 ip 00007f02593d0a8e sp 00007fff67cbb350 error 4 in libjemalloc.so.1[7f02593cc000+34000]
[  +0.001142] qstat[31490]: segfault at 0 ip 00007fdeb2837a8e sp 00007fff3f3979b0 error 4 in libjemalloc.so.1[7fdeb2833000+34000]
[  +0.208803] exim4[31704]: segfault at 8c ip 00007f048d0fb650 sp 00007ffc858c2b40 error 6 in libtasn1.so.6.5.3[7f048d0ef000+11000]
[  +0.096076] exim4[31653]: segfault at 8c ip 00007fce4ab6a650 sp 00007fffa615d900 error 6 in libtasn1.so.6.5.3[7fce4ab5e000+11000]
[  +0.850732] qstat[31516]: segfault at 0 ip 00007f6d01959a8e sp 00007fff28dc4570 error 4 in libjemalloc.so.1[7f6d01955000+34000]
[Apr28 06:24] Process accounting resumed
[Apr29 06:24] Process accounting resumed
[Apr30 06:24] Process accounting resumed
[Apr30 23:59] qstat[21799]: segfault at 0 ip 00007fca45b51a8e sp 00007ffdcb0cc7e0 error 4 in libjemalloc.so.1[7fca45b4d000+34000]
[  +1.616116] qstat[22228]: segfault at 0 ip 00007f7f6f0d8a8e sp 00007ffdaed69aa0 error 4 in libjemalloc.so.1[7f7f6f0d4000+34000]
[  +0.039497] qstat[22236]: segfault at 0 ip 00007f89b3d23a8e sp 00007ffcd85e43f0 error 4 in libjemalloc.so.1[7f89b3d1f000+34000]

These segfaults could very well be failed malloc() calls whose NULL result was returned to a function that does not check it. The timestamps are all at either 11:59 or 23:59, so a short-term workaround would be to avoid running jobs at those times.

However, I do not understand why malloc() would fail, given that the processes have an unlimited address space limit:

08:13:48 0 ✓ zhuyifei1999@tools-sgecron-01: ~$ sudo cat /proc/704/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31860                31860                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31860                31860                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

(And no, that's not an OOM condition, since that would invoke the OOM killer, which would log in dmesg that it had been invoked.)

There are no calls to brk() after libjemalloc is mapped into memory:

08:19:14 0 ✓ zhuyifei1999@tools-sgecron-01: ~$ strace -e brk,mmap,open,openat qstat
brk(NULL)                               = 0x5566d1885000
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 105645, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4fcbcf7000
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4fcbcf5000
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4fcb74f000
mmap(0x7f4fcbae4000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f4fcbae4000
mmap(0x7f4fcbaea000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4fcbaea000
brk(NULL)                               = 0x5566d1885000
brk(0x5566d18a6000)                     = 0x5566d18a6000
open("/usr/bin/qstat", O_RDONLY)        = 3
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2863, si_uid=3306, si_status=0, si_utime=0, si_stime=0} ---
open("/etc/default/gridengine", O_RDONLY) = 3
brk(NULL)                               = 0x563f06955000
open("/usr/lib/gridengine/../../lib/lx-amd64/tls/x86_64/libmunge.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/gridengine/../../lib/lx-amd64/tls/libmunge.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/gridengine/../../lib/lx-amd64/x86_64/libmunge.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/gridengine/../../lib/lx-amd64/libmunge.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 105645, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f05a9987000
open("/usr/lib/x86_64-linux-gnu/libmunge.so.2", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a9985000
mmap(NULL, 2135336, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a9574000
mmap(0x7f05a977c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x8000) = 0x7f05a977c000
open("/usr/lib/x86_64-linux-gnu/libssl.so.1.0.2", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2526448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a930b000
mmap(0x7f05a956a000, 40960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5f000) = 0x7f05a956a000
open("/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 4609664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a8ea5000
mmap(0x7f05a92dc000, 176128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x237000) = 0x7f05a92dc000
mmap(0x7f05a9307000, 13952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f05a9307000
open("/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2109680, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a8ca1000
mmap(0x7f05a8ea3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f05a8ea3000
open("/lib/x86_64-linux-gnu/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 3158248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a899d000
mmap(0x7f05a8c9f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x102000) = 0x7f05a8c9f000
open("/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2212936, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a8780000
mmap(0x7f05a8997000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f05a8997000
mmap(0x7f05a8999000, 13384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f05a8999000
open("/usr/lib/x86_64-linux-gnu/libjemalloc.so.1", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a9983000
mmap(NULL, 2319504, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a8549000
mmap(0x7f05a877c000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x33000) = 0x7f05a877c000
mmap(0x7f05a877f000, 1168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f05a877f000
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f05a81aa000
mmap(0x7f05a853f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f05a853f000
mmap(0x7f05a8545000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f05a8545000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a9980000
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a7daa000
mmap(NULL, 8384512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a79ab000
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 217032, PROT_READ, MAP_SHARED, 4, 0) = 0x7f05a994b000
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05a7800000
mmap(NULL, 102803208, PROT_READ, MAP_SHARED, 4, 0) = 0x7f05a15f5000
mmap(NULL, 102803208, PROT_READ, MAP_SHARED, 4, 0) = 0x7f059b3ea000
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 1010648, PROT_READ, MAP_SHARED, 4, 0) = 0x7f05a9854000
open("/var/lib/gridengine/default/common/bootstrap", O_RDONLY) = 3
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
open("/var/lib/gridengine/default/common/act_qmaster", O_RDONLY) = 3
+++ exited with 0 +++

So, AFAICT, the kernel must have refused a call to mmap for some weird reason.

There are around 467 jobs that get submitted at midnight UTC:

05:32:44 0 ✓ zhuyifei1999@tools-sgecron-01: ~$ sudo bash -c 'cat /var/spool/cron/crontabs/*' | sed '/^ *#/d' | sed '/^ *$/d' | sed -r 's/\s+/ /g' | sed -r 's/^ //' | awk '(($1 ~ /\*/ || $1 ~ /(^|,)0($|,)/) && ($2 ~ /\*/ || $2 ~ /(^|,)0($|,)/)) { print $0 }' | wc -l
467
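
For reference, a rough Python equivalent of the awk filter above (an illustrative sketch only; it assumes the same crontab layout and needs root to read /var/spool/cron/crontabs):

import glob
import re

def fires_at_zero(field):
    """True if a crontab minute/hour field fires at 0 ('*' or a list containing 0)."""
    return '*' in field or '0' in field.split(',')

count = 0
for path in glob.glob('/var/spool/cron/crontabs/*'):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            fields = re.split(r'\s+', line)
            # Field 1 is the minute, field 2 the hour; both must match 0
            # for the entry to fire at midnight UTC.
            if len(fields) >= 2 and fires_at_zero(fields[0]) and fires_at_zero(fields[1]):
                count += 1
print(count)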
zhuyifei1999 renamed this task from "Cron gridengine jobs are failing for memory errors" to "tools-sgecron-01 virtual memory allocation error at midnight and noon UTC". May 5 2019, 5:38 PM
zhuyifei1999 raised the priority of this task from Low to High. May 5 2019, 6:20 PM

Missed my chance today to capture an strace of what is going on. Was using bashisms in my crontab :(

I got a different error at 2019-05-06T00:00. Not sure if it is related.

Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

@JJMC89 Does it have anything to do with this? Sorry if it is totally unrelated.

I got a different error at 2019-05-06T00:00. Not sure if it is related.

Likely yes. It does not seem to check the exit code of qstat.

There are 4 cases of qstat segfaults this time.
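
For illustration only (a sketch, not the actual /usr/bin/job code): checking qstat's exit status before parsing its XML output would turn the ParseError above into a clearer failure. Something like:

import subprocess
import sys
import xml.etree.ElementTree

# Sketch: run qstat -xml and refuse to parse its output if it crashed or
# exited non-zero (a negative returncode means it died from a signal,
# e.g. -11 for SIGSEGV).
proc = subprocess.run(['/usr/bin/qstat', '-xml'],
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if proc.returncode != 0:
    sys.exit('qstat failed with status %d: %s'
             % (proc.returncode, proc.stderr.decode(errors='replace')))
root = xml.etree.ElementTree.fromstring(proc.stdout)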

Quote T222429:

/usr/lib/gridengine/qstat: error while loading shared libraries: libm.so.6: failed to map segment from shared object

This is almost definitely an mmap() failure. It should also have almost nothing to do with RSS. @Bstorm Are there any limits on the cron daemon's VMS? I'm unable to find a limit in either rlimits or cgroups.

Are these older tasks related: T183216 and T218038?

No. Network issue.

T218038

No. NFS issue.

This task is about the virtual memory allocation failure on the cron host at 00:00 and 12:00 UTC, including any segfaults, job submission failures, and XML parsing errors that are a direct consequence of the memory allocation failure. This likely relates to the syscalls brk, mmap, mremap, and madvise; select, recvfrom, and other syscalls unrelated to memory management are highly unlikely to be involved.

As for the times:

$ sudo dmesg -H | grep segfault
[Apr 8 23:59] qstat[31987]: segfault at 0 ip 00007fb05483fa8e sp 00007fff050e8790 error 4 in libjemalloc.so.1[7fb05483b000+34000]
[Apr19 11:59] qstat[28885]: segfault at 0 ip 00007fe2da758a8e sp 00007ffcb2f5fd40 error 4 in libjemalloc.so.1[7fe2da754000+34000]
[  +0.012682] qstat[28904]: segfault at 0 ip 00007f5b4076ea8e sp 00007fff98dd21f0 error 4 in libjemalloc.so.1[7f5b4076a000+34000]
[  +0.003268] qstat[28940]: segfault at 0 ip 00007f7e8435ca8e sp 00007fffb91aa380 error 4 in libjemalloc.so.1[7f7e84358000+34000]
[  +0.012101] qstat[28900]: segfault at 0 ip 00007fb3c502da8e sp 00007ffcb824c980 error 4 in libjemalloc.so.1[7fb3c5029000+34000]
[Apr19 23:59] qstat[12949]: segfault at 0 ip 00007f68f3a97a8e sp 00007ffe062c2860 error 4 in libjemalloc.so.1[7f68f3a93000+34000]
[  +0.015150] sendmail[12972]: segfault at 7ffec3f72848 ip 0000565159ff90ad sp 00007ffec3f72850 error 6 in exim4[565159fc1000+ef000]
[  +0.052991] qstat[12955]: segfault at 0 ip 00007fe143e45a8e sp 00007ffd1493cd80 error 4 in libjemalloc.so.1[7fe143e41000+34000]
[Apr25 23:59] qstat[32482]: segfault at 0 ip 00007fb650160a8e sp 00007ffe8cb93210 error 4 in libjemalloc.so.1[7fb65015c000+34000]
[  +0.024774] qstat[32520]: segfault at 0 ip 00007f05a6f8ba8e sp 00007ffe951f1c40 error 4 in libjemalloc.so.1[7f05a6f87000+34000]
[  +0.017353] qstat[32564]: segfault at 0 ip 00007faebd9b3a8e sp 00007ffc3d4eaa10 error 4 in libjemalloc.so.1[7faebd9af000+34000]
[  +0.009400] qsub[32468]: segfault at 0 ip 00007fc67e3faa8e sp 00007ffc7c00e860 error 4 in libjemalloc.so.1[7fc67e3f6000+34000]
[  +0.053438] qstat[32348]: segfault at 0 ip 00007fd8ee58fa8e sp 00007ffc3dbaecb0 error 4 in libjemalloc.so.1[7fd8ee58b000+34000]
[  +0.033767] qstat[32287]: segfault at 0 ip 00007ff898412a8e sp 00007ffd3a0c0900 error 4 in libjemalloc.so.1[7ff89840e000+34000]
[  +0.984610] qstat[32687]: segfault at 0 ip 00007f3a83845a8e sp 00007ffd3cab3ca0 error 4 in libjemalloc.so.1[7f3a83841000+34000]
[  +0.000185] qstat[32669]: segfault at 0 ip 00007f7f4ff68a8e sp 00007ffcf0e9d8a0 error 4 in libjemalloc.so.1[7f7f4ff64000+34000]
[  +0.021948] qstat[32695]: segfault at 0 ip 00007faa956aaa8e sp 00007ffc4b4f0b00 error 4 in libjemalloc.so.1[7faa956a6000+34000]
[Apr27 23:59] qstat[31481]: segfault at 0 ip 00007f02593d0a8e sp 00007fff67cbb350 error 4 in libjemalloc.so.1[7f02593cc000+34000]
[  +0.001142] qstat[31490]: segfault at 0 ip 00007fdeb2837a8e sp 00007fff3f3979b0 error 4 in libjemalloc.so.1[7fdeb2833000+34000]
[  +0.208803] exim4[31704]: segfault at 8c ip 00007f048d0fb650 sp 00007ffc858c2b40 error 6 in libtasn1.so.6.5.3[7f048d0ef000+11000]
[  +0.096076] exim4[31653]: segfault at 8c ip 00007fce4ab6a650 sp 00007fffa615d900 error 6 in libtasn1.so.6.5.3[7fce4ab5e000+11000]
[  +0.850732] qstat[31516]: segfault at 0 ip 00007f6d01959a8e sp 00007fff28dc4570 error 4 in libjemalloc.so.1[7f6d01955000+34000]
[Apr30 23:59] qstat[21799]: segfault at 0 ip 00007fca45b51a8e sp 00007ffdcb0cc7e0 error 4 in libjemalloc.so.1[7fca45b4d000+34000]
[  +1.616116] qstat[22228]: segfault at 0 ip 00007f7f6f0d8a8e sp 00007ffdaed69aa0 error 4 in libjemalloc.so.1[7f7f6f0d4000+34000]
[  +0.039497] qstat[22236]: segfault at 0 ip 00007f89b3d23a8e sp 00007ffcd85e43f0 error 4 in libjemalloc.so.1[7f89b3d1f000+34000]
[May 2 23:59] qstat[31939]: segfault at 0 ip 00007f629ee35a8e sp 00007ffce25df370 error 4 in libjemalloc.so.1[7f629ee31000+34000]
[  +0.012823] qstat[31999]: segfault at 7ffd6ac12edc ip 00007f5e68cbbd4e sp 00007ffd6ac12e20 error 6 in libc-2.24.so[7f5e68c75000+195000]
[May 5 23:59] qstat[7476]: segfault at 0 ip 00007f47614f6a8e sp 00007ffcbe6a3340 error 4 in libjemalloc.so.1[7f47614f2000+34000]
[  +0.061838] qstat[7495]: segfault at 0 ip 00007fd66faefa8e sp 00007fff0bba5e50 error 4 in libjemalloc.so.1[7fd66faeb000+34000]
[  +0.030899] qstat[7502]: segfault at 0 ip 00007feb54518a8e sp 00007fff36b34c10 error 4 in libjemalloc.so.1[7feb54514000+34000]
[  +0.021800] qstat[7507]: segfault at 0 ip 00007fd483651a8e sp 00007fff47717cf0 error 4 in libjemalloc.so.1[7fd48364d000+34000]

Any time not listed here is highly unlikely to be related.

BTW another different error email today:

Traceback (most recent call last):
  File "/usr/bin/jsub", line 648, in <module>
    main()
  File "/usr/bin/jsub", line 585, in main
    sys.stderr = open(args.log_err, 'a')
MemoryError

That is likely the same issue; however, I don't think it adds anything new.

Sure, it seems the error just varies with the command that was run: different for a Python command, a bash command, etc.

Why is memory overcommitting disabled? This config is ages old without explanation.

The current global VMS limit (with overcommitting disabled) is:

06:50:08 0 ✓ zhuyifei1999@tools-sgecron-01: ~$ cat /proc/meminfo | grep CommitLimit
CommitLimit:     8293552 kB

8293552 kB / 467 ≈ 17 MiB of VMS per job... This doesn't sound like nearly enough to me.
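
A quick back-of-the-envelope check of that figure (a sketch; with vm.overcommit_memory=2 the kernel caps committed virtual memory at roughly CommitLimit = swap + RAM * overcommit_ratio / 100, and that budget is shared by everything on the host):

# CommitLimit from /proc/meminfo above, and the number of crontab entries
# counted earlier that fire at midnight UTC.
commit_limit_kb = 8293552
jobs_at_midnight = 467

per_job_kb = commit_limit_kb / jobs_at_midnight
print('%.1f MiB of committable virtual memory per job' % (per_job_kb / 1024))
# -> about 17.3 MiB per job, before counting cron, exim, sshd and
#    everything else already running on the host.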

I'm out-of-office, but I've been trying to poke at this problem while away.

It is apparent that many things are failing at the level of the cron server itself. The exec_environ doesn't apply there, since it isn't a bastion (but that bit is worth reviewing and I bet I can find more info buried in there somewhere). Have to run, but I will try to help more today. The jsub failures in particular are telling--since that's all on the cron server locally without making it to the exec_environ. There might be something useful in the grafana dashboards as well.

There might be something useful in the grafana dashboards as well.

Any hints?

https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?orgId=1&var-project=tools&var-server=tools-sgecron-01&from=1557095400000&to=1557106200000
From what I can see, the loadavg skyrocketed and there's a dip in disk space, but the memory usage is stable, and I can't explain why VMS allocation would fail just from these graphs.

Yeah, there isn't anything there.

Looking a bit deeper, overcommitting being limited to 95% of total memory sounds like Java and python self-defense--against OOM killers everywhere. I wouldn't recommend changing that.

I'm curious about what is happening here that is new, though. This server has had occasional issues since we moved everyone to stretch--it appeared mostly storage and ldap related previously, but that was likely hiding this. It is possible more jobs are launching successfully now than before, but I'm inclined to think it is something else. Looking around more. I'm vaguely afraid it's going to be more systemic (kernel or systemd problems), and we're going to have to implement something weird to fix it.

This is interesting. There is a series of qstat processes segfaulting at 23:59:<seconds> many (not all) nights (according to dmesg). Occasionally it also affects sendmail, exim and qsub. Rarely at other times (11:59:45 was one set). Frustrating problem to track down.

There is a series of qstat processes segfaulting at 23:59:<seconds>

I'm pretty sure it's actually 00:00. Using uptime to calculate the absolute time isn't exactly accurate.

I'm pretty sure the segfaults are caused by the use of NULL pointers returned by failed malloc() calls. This can be easily reproduced:

03:39:30 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ (ulimit -v 30000; strace qstat)
execve("/usr/bin/qstat", ["qstat"], [/* 26 vars */]) = 0
[...]
mmap(NULL, 8384512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
rt_sigprocmask(SIG_SETMASK, ~[INT QUIT ILL TRAP ABRT BUS FPE SEGV ALRM TERM TTIN TTOU URG VTALRM PROF IO RTMIN RT_1], [], 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x556157b76400, sa_mask=~[QUIT ILL ABRT RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f43b41020e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=0x556157b76430, sa_mask=~[ILL ABRT RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f43b41020e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x556157b76430, sa_mask=~[ILL ABRT RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f43b41020e0}, NULL, 8) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\22\0\0\0\t\0\0\0services\0", 21, MSG_NOSIGNAL, NULL, 0) = 21
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="services\0", iov_len=9}, {iov_base="\310O\3\0\0\0\0\0", iov_len=8}], msg_iovlen=2, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=20, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 17
mmap(NULL, 217032, PROT_READ, MAP_SHARED, 4, 0) = 0x7f43b52bb000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++

Python is good enough that when malloc() fails it reports a 'MemoryError' instead of segfaulting.
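
The same behaviour can be provoked from Python directly with RLIMIT_AS, which is what ulimit -v sets (a minimal sketch, with a higher limit than the 30000 KiB above so the interpreter itself keeps working):

import resource

# Cap this process's virtual address space at 512 MiB, the same mechanism
# as `ulimit -v`.
limit = 512 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    # Asking for more than the remaining budget makes mmap() fail with
    # ENOMEM; CPython checks the result and raises MemoryError instead of
    # dereferencing a NULL pointer like the C code in qstat does.
    data = bytearray(1024 * 1024 * 1024)
except MemoryError:
    print('MemoryError, as expected')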

against OOM killers everywhere. I wouldn't recommend changing that.

What's wrong with the OOM killer? It kills the worst memory eater when the system is really out of memory, rather than causing segfaults in badly-written code when some arbitrarily imposed limit is hit. Besides, the graph clearly shows stable RSS usage.

I'm curious about what is happening here that is new, though

I think it's just more and more people using cron. We have around 467 jobs at midnight UTC (T222255#5158245)!

Very well could be. I dug around when I was more available and noticed that the exec_environ is absolutely used on the cron server. The reason given there is that supposedly jsub checks everything before submitting a job, but this doesn't seem correct to me at this point--it probably was at another time. qsub might still, but I certainly see no such care taken in jsub's code. That said, if we abstract that overcommit limit away from exec_environ and pull it into exec/bastion nodes but not cron nodes, then we can see if that unbinds this or not. Since each jsub is a python, and webservice is a python...and python does like to overcommit.

On the OOM, it kills things, but it doesn't always kill exactly what you want, you know? Overcommit can easily block up the grid with some of the languages out there if it is too aggressive. That said, this is a cron server, not an exec node. It should be allowed to overcommit all over the place! I still don't want to mess with the setting on Toolforge, but I DO want to mess with it on the cron server, if you get my meaning :) Flying back today. I should be able to yank that out and see if that slows down the problem soon.

The reason given there is that supposedly jsub checks everything before submitting a job

It's an issue with both jsub and qsub :(

04:37:04 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ jsub foo
usage: jsub [options...] program [args...]
jsub: error: argument program: Program 'foo' not found.
04:37:07 1 ✗ zhuyifei1999@tools-sgebastion-08: ~$ qsub foo
Unable to read script file because of error: error opening foo: No such file or directory

T143458 describes how missing packages were an issue on the bigbrother hosts.

On the OOM, it kills things, but it doesn't always kill exactly what you want, you know?

Maybe I'm 'that guy' who overly trusts the kernel. Anyway, how about limiting per-job memory on the grid exec nodes by RSS (ulimit -m) rather than by VMS (ulimit -v)? Oh, that uses RLIMIT_RSS, which doesn't work. Need cgroups. Sigh. Really, Linux?
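
A small sketch of why ulimit -m is useless here: Linux accepts RLIMIT_RSS but has not enforced it since the 2.4 kernels, so the allocation below sails straight past the "limit":

import resource

# Pretend to cap resident memory at 50 MiB; the kernel silently ignores it.
limit = 50 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_RSS, (limit, limit))

data = bytearray(200 * 1024 * 1024)   # 200 MiB, happily exceeding the "limit"
max_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
print('max RSS: %.0f MiB' % (max_rss_kib / 1024))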

I should be able yank that out and see if that slows down the problem soon.

Thanks.

Change 509511 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove memory overcommit limitations on cron host

https://gerrit.wikimedia.org/r/509511

Change 509511 merged by Bstorm:
[operations/puppet@production] toolforge: remove memory overcommit limitations on cron host

https://gerrit.wikimedia.org/r/509511

The one danger I see in this is if the host actually hits OOM and kills the webservice monitor; then webservices on the grid will not restart. I do not *think* that will happen, because real memory usage isn't spiking from what I can tell, though if there were no resource contention at all, this problem wouldn't be happening in the first place.

So after deploying, I manually switched overcommit_memory on the cron runner to 0. We'll find out if that helps.
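
For anyone checking the host afterwards, the effective policy can be read back from procfs (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting against CommitLimit); a trivial sketch:

for name in ('overcommit_memory', 'overcommit_ratio'):
    with open('/proc/sys/vm/' + name) as f:
        print(name, '=', f.read().strip())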

I don't know the latest updates on this issue, but I would like to report that I haven't received any memory allocation error messages in the last 1 to 2 weeks.

That's kind of what I'm looking for. I was leaving this open for a while to see if we are still getting the issue. I deployed the changes last Friday, so it's fixed, I think, if there have been no messages since then. I'll poke around in the depths of my email filters to see if I have any.

I actually delete many of the messages that I get in my Gmail, but I don't delete them from the Trash. It was after searching the Trash as well as my inbox that I made that statement.

Bstorm claimed this task.

I'm not finding any on my end. I also see no logs or dmesg entries since the week prior to the change. I'm going to call this one fixed by freeing overcommit on the cron host.