Page MenuHomePhabricator

Make qacct less NFS-intensive
Closed, ResolvedPublic

Description

People have used qacct for various reasons, see eg. T181742. However, qacct is slowwwww:

107:09:38 0 ✓ zhuyifei1999@tools-bastion-05: ~$ time strace qacct -j 2793552
2execve("/usr/bin/qacct", ["qacct", "-j", "2793552"], [/* 34 vars */]) = 0
3brk(0) = 0x7fe68ebba000
4[...]
5stat("/var/lib/gridengine/default/common/accounting", {st_mode=S_IFREG|0644, st_size=3830925594, ...}) = 0
6open("/var/lib/gridengine/default/common/accounting", O_RDONLY) = 3
7fstat(3, {st_mode=S_IFREG|0644, st_size=3830925594, ...}) = 0
8mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcc84e1c000
9read(3, "task:tools-exec-1402.eqiad.wmfla"..., 1048576) = 1048576
10read(3, "ask -l h_vmem=524288k,release=tr"..., 1048576) = 1048576
11read(3, "ad.wmflabs:tools.theoslittlebot:"..., 1048576) = 1048576
12read(3, "-l h_vmem=524288k,release=trusty"..., 1048576) = 1048576
13read(3, "114298:0.065607:0.003496:-u tool"..., 1048576) = 1048576
14read(3, "870:47:NONE:defaultdepartment:NO"..., 1048576) = 1048576
15read(3, "89036193:1489036194:1489036195:0"..., 1048576) = 1048576
16read(3, "0:0:180:0.252655:0.954297:12072."..., 1048576) = 1048576
17read(3, "NONE:1:0:2.961540:0.200328:0.003"..., 1048576) = 1048576
18read(3, " task -l h_rt=3600,h_vmem=512000"..., 1048576) = 1048576
19read(3, ":0:0:0:9843:0:0:0.000000:8:0:0:0"..., 1048576) = 1048576
20read(3, "nt:NONE:1:0:1.381018:0.079499:0."..., 1048576) = 1048576
21read(3, ".000000:32:0:0:0:46309:1619:NONE"..., 1048576) = 1048576
22read(3, "6941:0.961955:12064.000000:0:0:0"..., 1048576) = 1048576
23read(3, ^CProcess 13837 detached
24 <detached ...>
25
26real 0m30.595s
27user 0m0.046s
28sys 0m0.189s
29error: aborted by user

... taking around 30 seconds to read 14 MiB from NFS, and the file is 3.6G:

$ ls -lh /var/lib/gridengine/default/common/accounting /data/project/.system/accounting
-rw-r--r-- 1 sgeadmin sgeadmin 3.6G Dec  8 19:14 /data/project/.system/accounting
lrwxrwxrwx 1 root     sgeadmin   32 Feb 22  2014 /var/lib/gridengine/default/common/accounting -> /data/project/.system/accounting

... which means a complete scan, just to find a single job exit status, might take two hours, assuming nobody notice the resource hog and ask to kill it.

We could:

  • Rotate the accounting logs (T168701), so qacct won't have to go through the whole file to find information about a job that recently exited, or
  • Use an alternative qacct parser such as that from tool grid-jobs, which will not, by default, scan through the file accounting logs.

Event Timeline

Or we could make a wrapper around qacct so it, by default, only reads last * bytes, an example implementation is

1#! /usr/bin/python3 -Es
2
3import io
4import os
5import subprocess
6import sys
7import tempfile
8
9sge_root = os.environ.get('SGE_ROOT', '/var/lib/gridengine')
10sge_cell = os.environ.get('SGE_CELL', 'default')
11account_file = 'common/accounting'
12
13logpath = os.path.join(sge_root, sge_cell, account_file)
14
15with open(logpath, 'rb') as logfile:
16 logfile.seek(-400 * 45000 * 7, os.SEEK_END)
17 # Ignore first line as it may be onlym part of a line
18 logfile.readline()
19 with tempfile.TemporaryDirectory() as new_sge_root:
20 new_logpath = os.path.join(new_sge_root, sge_cell, account_file)
21 os.makedirs(os.path.dirname(new_logpath))
22
23 for fname in os.listdir(os.path.dirname(logpath)):
24 os.symlink(
25 os.path.join(os.path.dirname(logpath), fname),
26 os.path.join(os.path.dirname(new_logpath), fname))
27
28 os.remove(new_logpath)
29 os.mkfifo(new_logpath)
30
31 real_qacct = subprocess.Popen(
32 ['/usr/bin/qacct'] + sys.argv[1:],
33 env=dict(os.environ, SGE_ROOT=new_sge_root))
34
35 with open(new_logpath, 'wb') as new_logfile:
36 while real_qacct.poll() is None:
37 data = logfile.read(io.DEFAULT_BUFFER_SIZE)
38 if not data:
39 break
40 new_logfile.write(data)
41
42 sys.exit(real_qacct.wait())
1tools.zhuyifei1999-test@tools-bastion-05:~$ python3 qacct.py
2Total System Usage
3 WALLCLOCK UTIME STIME CPU MEMORY IO IOW
4================================================================================================================
5 22846397 1268233.041 124225.258 3614818.371 2001583.739 4700.763 0.000
6tools.zhuyifei1999-test@tools-bastion-05:~$ python3 qacct.py -j 2793552
7==============================================================
8qname task
9hostname tools-exec-1439.tools.eqiad.wmflabs
10group tools.multichill
11owner tools.multichill
12project NONE
13department defaultdepartment
14jobname find_painting_images
15jobnumber 2793552
16taskid undefined
17account sge
18priority 0
19qsub_time Fri Dec 8 15:00:21 2017
20start_time Fri Dec 8 15:00:22 2017
21end_time Fri Dec 8 15:42:26 2017
22granted_pe NONE
23slots 1
24failed 0
25exit_status 139
26ru_wallclock 2524
27ru_utime 159.079
28ru_stime 46.221
29ru_maxrss 5945476
30ru_ixrss 0
31ru_ismrss 0
32ru_idrss 0
33ru_isrss 0
34ru_minflt 2717214
35ru_majflt 25
36ru_nswap 0
37ru_inblock 3560
38ru_oublock 12704792
39ru_msgsnd 0
40ru_msgrcv 0
41ru_nsignals 0
42ru_nvcsw 82696
43ru_nivcsw 54856
44cpu 205.301
45mem 277.180
46io 7.684
47iow 0.000
48maxvmem 3.933G
49arid undefined
50tools.zhuyifei1999-test@tools-bastion-05:~$ time python3 qacct.py > /dev/null
51
52real 0m28.820s
53user 0m0.518s
54sys 0m0.158s

... using 28 seconds to read last 17 MiB of the logs.

@madhuvishy has done some testing on moving the Grid Engine control files off of NFS and we talked about rotating the accounting file more often as part of the eventual implementation work for that. We have a few use cases for 14-30 days of audit data, but that would be much easier to manage in 24 hour chunks. The current "file grows until we manually truncate it" model is pretty horrible.

Maybe throw it into some kind of more database like system like logstash so we can just query that?

Maybe throw it into some kind of more database like system like logstash so we can just query that?

I don't see investing in custom Grid Engine accounting tooling having a useful long term return. I certainly would not stop anyone from working on it, but today we have many more pressing issues to deal with from the cloud-services-team side. I would be much more interested in examining slurm as replacement for the grid engine use cases that we can't move to Kubernetes.

It’s also possible to read the file in reverse, using tac (which uses a bunch of seeks under the hood). For example, I used this successfully to get the submission, start and end time of a recently terminated job:

tac /var/lib/gridengine/default/common/accounting |
while IFS=: read -r qname hostname group owner job_name job_number account priority submission_time start_time end_time rest; do
    if [[ $job_name == wd-shex-infer-6 ]]; then
        printf "Submission time: %(%Y-%m-%d %H:%M:%S)T\nStart time: %(%Y-%m-%d %H:%M:%S)T\nEnd time: %(%Y-%m-%d %H:%M:%S)T\n" "$submission_time" "$start_time" "$end_time";
        break;
    fi;
done

By breaking from the loop as soon as the first match is found, we effectively kill the tac with SIGPIPE, so it doesn’t read more data than necessary. (I wrapped the whole thing in timeout 10s bash -c '…' just to be sure, though, and I recommend anyone else experimenting with this to do the same.)


Edit: you can also achieve this with qacct -j JOB -f <(tac /var/lib/gridengine/default/common/accounting), but you have to terminate qacct manually once it’s printed the information you want, otherwise it’ll keep reading the accounting file (in reverse) forever.

Bstorm triaged this task as Medium priority.Feb 11 2020, 5:23 PM

Last year, we started rotating the accounting file. Is this still a problem?

bd808 assigned this task to Bstorm.

Last year, we started rotating the accounting file. Is this still a problem?

The files are much smaller these days:

$ ls -lh /var/lib/gridengine/default/common/accounting*
-rw-r--r-- 1 sgeadmin sgeadmin  67M Jun 11 23:59 /var/lib/gridengine/default/common/accounting
-rw-r--r-- 1 sgeadmin sgeadmin 149M Jun  8 06:27 /var/lib/gridengine/default/common/accounting.1
-rw-r--r-- 1 sgeadmin sgeadmin  21M May 31 06:28 /var/lib/gridengine/default/common/accounting.2.gz
-rw-r--r-- 1 sgeadmin sgeadmin  28M May 25 06:28 /var/lib/gridengine/default/common/accounting.3.gz
-rw-r--r-- 1 sgeadmin sgeadmin  25M May 17 06:28 /var/lib/gridengine/default/common/accounting.4.gz
-rw-r--r-- 1 sgeadmin sgeadmin  24M May 10 06:28 /var/lib/gridengine/default/common/accounting.5.gz
-rw-r--r-- 1 sgeadmin sgeadmin  31M May  4 06:28 /var/lib/gridengine/default/common/accounting.6.gz
-rw-r--r-- 1 sgeadmin sgeadmin  23M Apr 26 06:29 /var/lib/gridengine/default/common/accounting.7.gz
-rw-r--r-- 1 sgeadmin sgeadmin  32M Apr 20 06:27 /var/lib/gridengine/default/common/accounting.8.gz

Reads are still horribly slow, but I think T168701: Rotate Grid Engine accounting file is about the best we can do.